
A DSP Implementation of Source Location

Using Microphone Arrays


Daniel V. Rabinkin, Richard J. Renomeron, Arthur Dahl,
Joseph C. French, James L. Flanagan1, and Michael H. Bianchi2

Abstract
The design, implementation, and performance of a low-cost, real-time
DSP system for source location is discussed. The system consists of an
8-element electret microphone array connected to a Signalogic DSP
daughterboard hosted by a PC. The system determines the location of a
speaker in the audience in an irregularly shaped auditorium. The
auditorium presents a non-ideal acoustical environment; some of the
walls are acoustically treated, but there still exists significant
reverberation and a large amount of low-frequency noise from fans in
the ceiling.
The source location algorithm is implemented in a two-step process:
the first step determines the time delay of arrival (TDOA) for select
microphone pairs. A modified version of the Cross-Power Spectrum Phase
method is used to compute TDOAs and is implemented on the DSP
daughterboard. The second step uses the computed TDOAs in a least mean
squares gradient descent search algorithm implemented on the PC to
compute a location estimate.

1 Introduction
In a teleconferencing environment, it is often necessary to focus audio and
video sensors on different locations as participants contribute to the
discussion. One approach to this problem is to provide a hand-held or
body-worn microphone to each potential speaker to capture sound, and to use
a set of human-controlled cameras to provide video. However, this approach
requires extensive wiring and long setup time, and is potentially
inconvenient to the participants of the teleconference. An alternative
method is to use microphone array processing techniques to determine the
location of the active speaker. This information can be used to
automatically focus audio and video devices without the need for camera
operators and unwieldy sound equipment. An implementation of a source
location system using inexpensive and readily available hardware components
is described in this paper.

1 Ctr. for Computer Aids for Industrial Productivity, Rutgers Univ., P.O. Box 1390,
Piscataway, NJ 08855-1390
2 Bell Communications Research, 445 South St., Morristown, NJ 07960-6438
Brandstein [1] classifies approaches to the source location problem into
three classes: steered-beamformer-based locators, high-resolution spectral
estimation locators, and Time Delay of Arrival (TDOA) based locators.
Steered-beamformer-based locators scan the space of interest with an
acoustic beam and find the region which produces the highest beam output
energy. This method is very simple to implement and was used exclusively in
early analog array systems [2]. However, it suffers from extremely poor
spatial resolution and poor response time. It becomes fairly impractical
when a large number of discrete locations are to be scanned, or when
continuous spatial resolution is desired.

High-resolution spectral estimation locators [3] are based on a spectral
phase correlation matrix derived for all elements in the array. Such a
matrix is estimated from the captured data. The spectral-estimation-based
techniques attempt to perform a "best" statistical fit for the source
location using the above matrix. Although numerous techniques exist, most
are limited in application to narrowband signals and regularly spaced
sensor arrays. These techniques also tend to rely on a high degree of
signal statistical stationarity and an absence of reflections and
interfering sources. Most applications of the spectral estimation locators
occur in the radar domain, and are not general and robust enough to be
applicable to wideband acoustic signals. The techniques also tend to be
computationally intensive, and are thus not well suited for real-time
applications.

The TDOA technique has been the technique of choice in recent digital
systems [4, 5]. It is based on evaluating the delay of arrival of the sound
wavefront between a given microphone pair. A set of delays for chosen
microphone pairs is computed, and the source geometry is derived from these
delays. The delay-of-arrival technique is essentially the spectral
estimation technique reduced to a two-sensor matrix.

The system described in this paper uses a modification of the method
developed by Omologo and Svaizer [6] to determine delays between selected
microphone pairs. After the delay estimates have been computed, a
nondirected gradient descent search is performed over all possible
locations to find a best match for the sound source location based on these
delays. The system then directs a computer-controlled camera at the source
of sound.

2 Delay Estimation Algorithm


Given a single source of sound that produces a time-varying signal x(t),
each microphone in the array will receive the signal

    m_i(t) = \alpha_i x(t - t_i) + n_i(t)    (1)

where i is the microphone number, \alpha_i is the attenuation of the signal
at microphone i, t_i is the time it takes sound to propagate from the
source to microphone i, and n_i(t) is the noise signal present at
microphone i. The Time Delay of Arrival (TDOA) is defined for a given
microphone pair i, k as:

    D_{ik} = t_i - t_k    (2)

The goal of the first step of the source location process is to determine
D_{ik} for some subset of microphone pairs. The Fourier transform of the
captured signal can be expressed as:

    m_i(t) \leftrightarrow M_i(\omega) = \alpha_i X(\omega) e^{-j\omega t_i} + N_i(\omega)    (3)

where X(\omega) is the Fourier transform of the source signal x(t).

It can be assumed that the average energy of the captured signals is
significantly greater than that of the interfering noise, namely that

    \alpha_i^2 |X(\omega)|^2 \gg |N_k(\omega)|^2 \quad \text{for all } i, k    (4)

In addition, the signal-to-noise ratio (SNR) of the captured signal is
defined to be:

    \mathrm{SNR}_i(\omega) \triangleq 10 \log \frac{\alpha_i^2 |X(\omega)|^2}{|N_i(\omega)|^2}    (5)

If the condition in (4) is true, the cross-correlation of m_i(t) and
m_k(t),

    R_{ik}(\tau) = \int_{-\infty}^{\infty} m_i(t) m_k(t - \tau) \, dt    (6)

can be expected to be maximum at \tau = D_{ik}. The frequency-domain
representation of (6) is

    R_{ik}(\tau) \leftrightarrow S_{ik}(\omega) = M_i(\omega) M_k^*(\omega)    (7)

and may be expanded using (3):

    S_{ik}(\omega) = (\alpha_i X(\omega) e^{-j\omega t_i} + N_i(\omega)) (\alpha_k X^*(\omega) e^{j\omega t_k} + N_k^*(\omega))
                   = \alpha_i \alpha_k |X(\omega)|^2 e^{-j\omega (t_i - t_k)} + N_i(\omega) N_k^*(\omega)
                     + \alpha_i X(\omega) e^{-j\omega t_i} N_k^*(\omega) + \alpha_k X^*(\omega) e^{j\omega t_k} N_i(\omega)    (8)

The last three terms of (8) are negligible compared to the first term,
based on the assumption in (4). Expression (8) now reduces to:

    S_{ik}(\omega) \approx \alpha_i \alpha_k |X(\omega)|^2 e^{-j\omega D_{ik}}    (9)

Thus D_{ik} can be found by evaluating

    D_{ik} = \arg\max_\tau R_{ik}(\tau) = \arg\max_\tau \mathcal{F}^{-1}\{S_{ik}(\omega)\}    (10)

where \mathcal{F}^{-1} indicates the inverse Fourier transform.

2.1 Cross-Power Spectrum Phase (CPSP)


Given no a priori statistics about the source signal and the interfering
noise, it can be shown [7] that the optimal approach for estimating D_{ik}
is to whiten the cross-correlation by normalizing it by its magnitude:

    \frac{S_{ik}(\omega)}{\alpha_i \alpha_k |X(\omega)|^2} = e^{-j\omega D_{ik}} \leftrightarrow \delta(\tau - D_{ik})    (11)

Since \alpha_i \alpha_k |X(\omega)|^2 is not known in (11), the
approximation in (4) may again be applied, and it is possible to normalize
by the product of the magnitudes of the captured signals. The resulting
function is defined as the cross-power spectrum phase (CPSP) function:

    \mathrm{CPSP}_{ik}(\omega) \triangleq \frac{M_i(\omega) M_k^*(\omega)}{|M_i(\omega)| |M_k(\omega)|}    (12)

and its time-domain counterpart

    \mathrm{cpsp}_{ik}(\tau) \triangleq \mathcal{F}^{-1}\left\{ \frac{\mathcal{F}\{m_i(t)\} \, \mathcal{F}^*\{m_k(t)\}}{|\mathcal{F}\{m_i(t)\}| \, |\mathcal{F}\{m_k(t)\}|} \right\}    (13)

It is seen from (11) that the output of the CPSP function is delta-like,
with a peak at \tau = D_{ik}.

The analysis above is derived based on analog signals and stationary
\{D_{ik}\}. For digital processing, the \{m_i(t)\} are sampled and become
discrete sequences \{m_i(n)\}. Also, the \{D_{ik}\} are not stationary,
since the source of sound is liable to change location. Finite frames of
analysis are also required due to computational constraints. Hence the
usual windowing techniques are applied and the sampled signals are broken
into analysis frames. After conversion into the discrete finite-sequence
domain, (13) becomes:

    \mathrm{cpsp}_{ik}(n) = \mathrm{IDFT}\left\{ \frac{\mathrm{DFT}\{m_i(n)\} \, \mathrm{DFT}^*\{m_k(n)\}}{|\mathrm{DFT}\{m_i(n)\}| \, |\mathrm{DFT}\{m_k(n)\}|} \right\}    (14)
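The discrete CPSP of (14) can be sketched in a few lines of Python. The naive O(N^2) DFT below is a stand-in for the FFT used in the real-time system, and the frame length, test signal, and 3-sample delay are illustrative choices, not values from the paper:

```python
import cmath
import random

def dft(x):
    """Naive O(N^2) DFT; a stand-in for the FFT used in the real-time system."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def cpsp(mi, mk):
    """Discrete CPSP of (14): whitened cross-power spectrum, back in the lag domain."""
    Mi, Mk = dft(mi), dft(mk)
    S = [a * b.conjugate() for a, b in zip(Mi, Mk)]        # cross-power spectrum
    W = [s / abs(s) if abs(s) > 1e-12 else 0j for s in S]  # magnitude-normalized
    return [v.real for v in idft(W)]

# Synthetic frame: microphone k hears the same signal 3 samples earlier
# (circularly shifted), so the CPSP peak should land at lag D_ik = 3.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(64)]
mi = x
mk = x[3:] + x[:3]
c = cpsp(mi, mk)
lag = max(range(len(c)), key=c.__getitem__)  # recovered TDOA in samples: 3
```

Because the whitened spectrum has unit magnitude in every bin, the inverse transform is a near-ideal impulse at the true lag, which is what makes the peak easy to pick even for broadband speech.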
2.2 Modified CPSP

The whitening of the cross-correlation spectrum is based on the assumption
that the statistical behavior of both signal and noise is uniform across
the entire spectrum. More specifically, it is assumed that the
approximation in (4) is equally valid for all \omega. In practice, the SNR
varies with \omega. In untreated enclosures there are commonly large
amounts of acoustic noise below 200 Hz. It is therefore desirable to
discard the portion of the CPSP below that frequency.

Also, given that the source component x(t) provides the dominant amount of
overall energy in m_i(t), it can be expected that there is a higher SNR at
frequencies where the magnitude of M_i(\omega) is greater. It is therefore
desirable to weight the portions of S_{ik}(\omega) with greater magnitude
more heavily. An alternate expression to (12) is proposed that performs
only a partial whitening of the spectrum:

    \mathrm{CPSP}_{ik}(\omega) \triangleq \frac{M_i(\omega) M_k^*(\omega)}{(|M_i(\omega)| |M_k(\omega)|)^\rho}, \quad 0 \le \rho \le 1    (15)

A discrete equivalent for the modified CPSP is expressed as:

    \mathrm{cpsp}_{ik}(n) = \mathrm{IDFT}\left\{ \frac{\mathrm{DFT}\{m_i(n)\} \, \mathrm{DFT}^*\{m_k(n)\}}{(|\mathrm{DFT}\{m_i(n)\}| \, |\mathrm{DFT}\{m_k(n)\}|)^\rho} \right\}, \quad 0 \le \rho \le 1    (16)

Setting \rho to zero produces the unnormalized cross-correlation, while
setting \rho to one produces (14). A good value of \rho may be determined
experimentally, and varies with the characteristics of the room noise and
the acoustical reflectivity of the room walls. An optimal value for \rho
was determined to be about 0.75 for several different enclosures (see
Section 5).

[Figure 1: Sound propagation from source to microphones. Microphones M1 and
M2 receive sound from source S with propagation times t1 and t2; a second
point S' on the hyperboloid H has times t'1 and t'2 with the same
difference.]
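The effect of the exponent \rho can be seen directly on the spectrum: \rho = 0 leaves the raw cross-power spectrum untouched, while \rho = 1 strips all magnitude information and keeps only phase, as in (16). The helper below and its two-bin spectra are purely illustrative values, not data from the system:

```python
def modified_cps(Mi, Mk, rho):
    """Partially whitened cross-power spectrum of (16), one value per DFT bin."""
    out = []
    for a, b in zip(Mi, Mk):
        s = a * b.conjugate()          # cross-power spectrum bin
        mag = abs(a) * abs(b)
        out.append(s / mag ** rho if mag > 1e-12 else 0j)
    return out

# Two arbitrary spectral bins for a microphone pair.
Mi = [2 + 0j, 0 + 4j]
Mk = [1 + 1j, 2 + 0j]
S0 = modified_cps(Mi, Mk, 0.0)  # rho = 0: unnormalized cross-correlation spectrum
S1 = modified_cps(Mi, Mk, 1.0)  # rho = 1: full CPSP, every bin has unit magnitude
```

With an intermediate value such as \rho = 0.75, the louder bins (where the SNR is presumed higher) still carry more weight than the quiet ones, but the dynamic range of the weighting is compressed relative to \rho = 0.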

3 Search Algorithm
The goal of the search algorithm is to determine the location of the sound
source based on a selected set of TDOAs. Consider a sound source s with
coordinates s = \{x_s, y_s, z_s\}, and a microphone pair m_1, m_2 with
coordinates m_1 = \{x_{m_1}, y_{m_1}, z_{m_1}\} and
m_2 = \{x_{m_2}, y_{m_2}, z_{m_2}\}, as shown in Figure 1. It takes sound
time t_1 to propagate from s to m_1 and time t_2 to propagate from s to
m_2. A given propagation time t_i may be computed by the expression:

    t_i = \frac{d(m_i, s)}{V_{sound}} = \frac{\sqrt{(x_{m_i} - x_s)^2 + (y_{m_i} - y_s)^2 + (z_{m_i} - z_s)^2}}{V_{sound}}    (17)

where V_{sound} is the speed of sound, equal to about 340 meters/second at
normal temperature and pressure, and d(\cdot) is the measure of physical
distance.
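Equation (17) is a one-liner in practice. The coordinates below are arbitrary illustrative points, and 340 m/s is the nominal speed of sound quoted in the text:

```python
import math

V_SOUND = 340.0  # m/s, approximate speed of sound at normal temperature and pressure

def propagation_time(mic, src):
    """Propagation time t_i of (17); mic and src are (x, y, z) tuples in meters."""
    return math.dist(mic, src) / V_SOUND

# A 3-4-5 right triangle gives a 5 m path, i.e. about 14.7 ms of delay.
t1 = propagation_time((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```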

3.1 Estimation of Source Position
The TDOA estimate computed for the pair m_1, m_2 defines the difference
t_1 - t_2. This difference in turn defines a surface H. All points on H
have associated propagation times t'_1, t'_2 for which the difference
t'_1 - t'_2 is equal to the TDOA estimate D_{12}. All points lying on this
surface are potential locations of source s. The surface is one sheet of a
three-dimensional hyperboloid defined by the parametric equation:

    \frac{d(p, m_1) - d(p, m_2)}{V_{sound}} = t_1 - t_2 = D_{12}    (18)

where p = \{x_p, y_p, z_p\} defines a point on the surface H.

A given set of TDOA estimates \{D_{ik}\} has an associated set of surfaces
\{H_{ik}\}. The location of the sound source must be a point p that lies on
every H_{ik} and satisfies the set of associated parametric equations:

    \frac{d(p, m_1) - d(p, m_2)}{V_{sound}} = D_{12}
    \vdots
    \frac{d(p, m_i) - d(p, m_k)}{V_{sound}} = D_{ik}
    \vdots    (19)

Thus a set of three \{D_{ik}\} uniquely specifies the coordinates of the
source (barring a degenerate case). For sets of four or more \{D_{ik}\} a
solution may exist only in a least mean square (LMS) sense.

Approximate closed-form solutions of (19) exist in the two-dimensional
case^1 [6]. In the three-dimensional case an approximate solution may be
obtained when special restricted arrangements of the microphone pairs are
used [8]. If three-dimensional resolution is required for arbitrary
microphone arrangements, a closed-form solution does not exist, and
numerical methods must be used.

Given a set of computed TDOAs \{D_{ik}\}, we seek to find a point p' with
an associated set of TDOAs \{D'_{ik}\} such that the error between
\{D_{ik}\} and \{D'_{ik}\} is minimized:

    E = \sum_{\text{all } i,k} (D'_{ik} - D_{ik})^2    (20)

^1 "Two-dimensional" refers to the case where it is assumed that all
microphones, and the sound source, lie on a common plane.

[Figure 2: Search with an incorrect TDOA estimate. Microphones M1-M8 define
hyperboloids H12, H34, H56, and H78; H78 is incorrect based on a bad TDOA
estimate D78, displacing the search from the correct position estimate to
an incorrect point P'.]

p' is defined as:

    p' = \arg\min_p E    (21)

A nondirected gradient search is used to find p' [9]. A variable step size
was used to speed up the search. It was observed that the error function E
tends to be quite smooth. Virtually no instances of search failure due to
local minima were detected over the course of the work presented here.
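A minimal version of this search can be sketched as follows. The microphone layout, source position, step-size schedule, and iteration count are illustrative; the real system runs the equivalent search on the PC with its own empirically chosen parameters:

```python
import math

V_SOUND = 340.0  # m/s

def tdoa(p, mi, mk):
    """Model TDOA D'_ik for a candidate point p and a microphone pair, per (18)."""
    return (math.dist(p, mi) - math.dist(p, mk)) / V_SOUND

def error(p, pairs, meas):
    """Error sum E of (20) between measured TDOAs and those implied by p."""
    return sum((tdoa(p, mi, mk) - d) ** 2 for (mi, mk), d in zip(pairs, meas))

def bump(p, i, h):
    q = list(p)
    q[i] += h
    return q

def search(p0, pairs, meas, step=0.5, iters=100):
    """Variable-step descent on E along the numerical gradient direction."""
    p, e = list(p0), error(p0, pairs, meas)
    for _ in range(iters):
        # Central-difference estimate of the gradient of E at p.
        g = [(error(bump(p, i, 1e-4), pairs, meas)
              - error(bump(p, i, -1e-4), pairs, meas)) / 2e-4 for i in range(3)]
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break
        d = [-gi / norm for gi in g]       # unit descent direction
        while step > 1e-9:                 # backtrack until E decreases
            cand = [pi + step * di for pi, di in zip(p, d)]
            ec = error(cand, pairs, meas)
            if ec < e:
                p, e = cand, ec
                step *= 1.5                # grow the step after a success
                break
            step *= 0.5
    return p

# Room-scale layout (meters): six microphones, pairs all referenced to mic 0.
mics = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (0, 0, 2), (3, 3, 0), (3, 0, 2)]
pairs = [(mics[0], m) for m in mics[1:]]
true_src = (2.0, 1.5, 1.0)
meas = [tdoa(true_src, mi, mk) for mi, mk in pairs]  # noise-free TDOAs
est = search((1.0, 1.0, 1.0), pairs, meas)           # moves toward true_src
```

With exact, noise-free TDOAs the surfaces of (19) intersect at the true source, so the descent settles near it; the smoothness of E noted above is what makes this simple scheme reliable.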

3.2 False TDOA Rejection


It is found that the TDOA estimates described in Section 2 are produced
with a small but finite probability of having significant error. Figure 2
shows a simplified two-dimensional case where an incorrect TDOA estimate
has been computed.

Since (19) is overdetermined, we may prune equations from the system set
and still be able to find a suitable p'. We wish to develop a procedure to
remove a bad TDOA estimate D_{bad} from the set \{D_{ik}\}. After the
search reaches an error minimum, each D_{ik} has an associated contribution
to the error sum:

    E_{ik} = (D_{ik} - D'_{ik})^2    (22)

If all but one of the TDOA estimates are good, we may expect the position
estimate p' of the first search to be fairly close to the correct p. We can
therefore expect the error E_{bad} associated with D_{bad} to be bigger
than all the other \{E_{ik}\}. We can also expect the overall error for the
given set of TDOA estimates to be large. A flowchart of a strategy for
dealing with incorrect delay estimates is given in Figure 3.

[Figure 3: Flowchart of the procedure to reject incorrect TDOA estimates.
Starting from 12 TDOA estimates, the system of TDOA equations is evaluated;
if the error is below threshold, the position is output. Otherwise, if the
real-time constraint is exceeded the frame is rejected (time out), and if
too few TDOAs remain the frame is rejected (inconsistent TDOAs); else the
TDOA with the biggest error contribution is dropped and the reduced system
is re-evaluated.]

If, after the search converges, the overall error exceeds a set threshold,
the equation associated with the D_{ik} that contributes the most to the
error sum is removed from (19) and a new search is performed on the reduced
system. This procedure may be repeated several times to remove multiple bad
TDOA estimates. However, the reliability of the consequent position
estimates decreases with each iteration. The probability that a given
D_{ik} is correct must be high in order for the rejection procedure to be
used effectively.
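The error-attribution step at the heart of this procedure is simple to sketch. Below, the search is assumed to have already returned a position estimate p' essentially at the true source, and one of five synthetic TDOAs is corrupted; the contribution terms of (22) then single it out. The layout, corruption magnitude, and threshold are illustrative:

```python
import math

V_SOUND = 340.0  # m/s

def tdoa(p, mi, mk):
    """Model TDOA for point p and microphone pair (mi, mk)."""
    return (math.dist(p, mi) - math.dist(p, mk)) / V_SOUND

mics = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (0, 0, 2), (3, 3, 0), (3, 0, 2)]
pairs = [(mics[0], m) for m in mics[1:]]
p_est = (2.0, 1.5, 1.0)                     # assume the search returned p' here
meas = [tdoa(p_est, mi, mk) for mi, mk in pairs]
meas[2] += 5e-4                             # corrupt one TDOA by 0.5 ms

# Per-pair contributions E_ik of (22), evaluated at the position estimate.
contrib = [(d - tdoa(p_est, mi, mk)) ** 2 for (mi, mk), d in zip(pairs, meas)]
worst = max(range(len(contrib)), key=contrib.__getitem__)

# If sum(contrib) exceeds the threshold, the equation for pair `worst` would
# be dropped from (19) and the search re-run on the reduced system.
```

The corrupted pair dominates the error sum by construction, which is exactly the property the rejection loop of Figure 3 relies on.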

4 Implementation
The CAIP Source Location System, shown in Figure 4, consists of an
8-element microphone array connected to a Signalogic SIG32C-8 DSP board
which is hosted by a generic 66 MHz 486DX2-based PC, and a Canon VC-C1
camera which is connected to the serial port on the PC. The software for
the system consists of two parts. The first is a Microsoft Windows-based
program which controls the DSP, computes the source position from the TDOA
estimates, and controls the camera. The second is a DSP32C program which
performs A/D conversion and calculates the TDOA estimates in real time.
Communication between the DSP board and the host PC is handled via
Signalogic-developed Dynamic Link Library (DLL) routines.

[Block diagram: the 8-element microphone array feeds the Signalogic DSP
board (AT&T DSP32C), which calculates the TDOA estimates and connects over
the PC bus to the DSP host PC; the host PC calculates source coordinates
from the TDOA estimates, calculates pan/tilt values, and controls the
camera, which is connected via the serial port.]

Figure 4: The CAIP Source Location System.

Sound is captured using the 8 microphones and digitized by the DSP board at
16 kHz. The TDOA estimation is performed at most twice per second to
satisfy real-time computational constraints. A speech activity detector is
used to determine whether the incoming signal is speech and should be
processed. The detector analyzes the signal from one of the microphones in
the array. The signal is highpass filtered at 200 Hz to remove room-mode
noise. The output of the filter is then rectified and applied to three
separate integrators with different rates of integration. The slowest
integrator is used to determine the noise floor. If the outputs of both the
medium-rate integrator and the fast integrator exceed the noise floor, and
the output level of the fast integrator exceeds the level of the slow
integrator during intervals of time, the signal is deemed to be speech. The
variation of the fast integrator over the medium integrator is used to
indicate variation of signal energy, and indicates nonmechanical noise. The
weights with which the integrator outputs are compared, as well as the
coefficients of integration, have been determined empirically.
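The three-integrator test described above can be sketched as follows. The integrator coefficients and the comparison margin are illustrative guesses, not the empirically determined values used in the system, and the 200 Hz highpass stage is omitted for brevity:

```python
def leaky(xs, a):
    """One-pole leaky integrator: y[n] = a*y[n-1] + (1-a)*x[n]."""
    y, out = 0.0, []
    for x in xs:
        y = a * y + (1.0 - a) * x
        out.append(y)
    return out

def speech_active(samples, margin=2.0):
    """Three-integrator activity test; coefficients and margin are illustrative."""
    rect = [abs(x) for x in samples]  # rectified signal (highpass stage omitted)
    fast = leaky(rect, 0.90)          # fast integrator
    med = leaky(rect, 0.99)           # medium-rate integrator
    slow = leaky(rect, 0.999)         # slowest: tracks the noise floor
    n = len(samples) - 1
    return fast[n] > margin * slow[n] and med[n] > margin * slow[n]

# Steady low-level (fan-like) noise vs. the same noise followed by a loud burst.
quiet = [0.01 * (-1) ** n for n in range(2000)]
burst = quiet + [0.5 * (-1) ** n for n in range(500)]
```

With these settings `speech_active(quiet)` stays False while `speech_active(burst)` trips, because the fast and medium integrators outrun the slowly adapting noise-floor estimate, which is the behavior the detector exploits to ignore steady mechanical noise.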
Once speech activity is detected, a 1024-sample frame of data from the
eight channels is gathered and an in-place FFT is performed on all 8
channels. The spectrum is preemphasized to mitigate low-frequency noise
from ventilation systems and other ambient room noise, and then a
Cross-Power Spectrum estimate for each pair is computed according to (14).
The program running on the DSP32C is capable of providing a set of
12 TDOA estimates approximately twice per second. When it completes
calculation of a TDOA estimate set, the DSP32C sends an interrupt to the
controlling DLL, which in turn sends a Windows message to the PC-side
program to pick up the data. The estimated position of the source is then
calculated using the algorithm discussed in Section 3.
The TDOA estimates obtained by the DSP32C program will not always lead to a
valid position estimate, because reflections and noise will cause
inaccuracies in the estimation of the TDOA. To guard against pointing the
camera towards an incorrect location, the PC-side program assumes that the
position estimate is incorrect (and does not move the camera) if the
calculated error metric exceeds a certain threshold, if more than three
TDOA estimates are dropped, or if the search does not find an optimal
position estimate after 45 iterations. In addition, the last "good"
position estimate is used as the starting point for the search on the next
set of TDOA estimates. This allows the algorithm to "home in" on a source
if the initial location estimate is close to the actual position of the
source but is rejected because of excessive error or number of iterations,
and keeps incorrect position estimates from influencing future position
estimates.
If a correct position estimate is obtained, the camera is aimed toward the
position by sending pan and tilt values to the camera via the PC's serial
port. The pan and tilt values are computed from the position coordinates
using a simple geometric formula, and are sent to the camera via an
off-the-shelf driver for the VC-C1 camera. After this is completed, the
PC-side program waits for another data-ready message from the DSP board
communication DLL.
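One plausible form of the geometric formula is sketched below. The camera position, the convention that pan is measured in the horizontal plane and tilt from it, and the degree units are all assumptions, since the actual formula and the VC-C1 driver interface are not given in the text:

```python
import math

def pan_tilt(src, cam):
    """Pan/tilt angles in degrees aiming a camera at src; both arguments are
    (x, y, z) points in room coordinates, camera panning about the vertical axis."""
    dx, dy, dz = (s - c for s, c in zip(src, cam))
    pan = math.degrees(math.atan2(dy, dx))                   # horizontal bearing
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))  # elevation angle
    return pan, tilt

# A source 2 m out and 2 m across at camera height: pan 45 degrees, tilt 0.
pan, tilt = pan_tilt((2.0, 2.0, 0.0), (0.0, 0.0, 0.0))
```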

5 Performance
5.1 Offline Results
The algorithms described in Sections 2 and 3 were implemented offline and
run with data recorded in real rooms with the microphone array. This
enabled convenient evaluation of the various aspects of the source location
algorithm. The performance of the algorithm was evaluated in two acoustical
settings:
1. Regular room - partially foam-padded (around the array) but with most of
the wall area, as well as the floor and ceiling, untreated (Figure 5). The
room is about 3 × 4 meters in size.
2. Reverberant auditorium - a highly reverberant and completely untreated
auditorium. All walls are untreated cement. The ceiling has large metal
beams running down its length (Figure 6). The auditorium is about 10 × 15
meters in size.
The performance of the TDOA estimator was evaluated in both settings. The
sentence "I have a question. How much are tomatoes?" was uttered by several
speakers from 25 locations representing a fair coverage of the room floor
space (Figure 5). Frames from the captured signal that were detected to
contain speech were processed, and TDOA estimates were computed using the
previously described methods. The estimates were compared to those that
should have been obtained based on the known positions of the talkers. A
similar procedure was performed for the auditorium (Figure 6) with 20
positions covering the floor space.
[Figure 5: A 3 m × 4 m room used to test the source location system,
showing the array microphones and the recording locations.]

[Figure 6: A 10 m × 15 m room used to test the source location system,
showing the array microphones and the recording locations.]
[Figure 7: Error in TDOA estimates. Two panels, "Highly Reverberant
Auditorium, 10 × 15 Meters" (error rates of roughly 16-28%) and "Regular
Room, 3 × 4 Meters" (error rates under 2%), plot the percentage of
incorrect TDOAs against the whitening parameter ρ from 0.65 to 1, with
curves for errors greater than one sample (x) and greater than two samples
(o).]

A resulting TDOA estimate was deemed incorrect if it differed from the
expected TDOA by more than a preset error bound. The percentage of all
TDOAs that exceeded the allowed error is given in Figure 7. Plots are given
for the regular room and the auditorium. Error curves are given for error
bounds of one sample and two samples.^2 The error rate is plotted against
the whitening parameter ρ defined in equation (16). It is observed that the
error is minimized when ρ is set to 0.75 in the regular room, and 0.8 in
the reverberant room. The overall performance of the TDOA estimation
algorithm is excellent in the benign regular-room environment, and suffers
considerably in the harsh reverberant environment of the auditorium.

5.2 Real-Time Results


The system has been tested in the rooms shown in Figures 5 and 6, and in an
8 m × 5 m conference room at Bellcore which was significantly less
reverberant than the auditorium shown in Figure 6. The unmodified CPSP
expression (14) and simple energy detection were used in the
implementation.
^2 At a sampling rate of 16 kHz, a sample represents an estimate error of
62.5 µs. With a microphone pair distance of 30 cm, this represents a
broadside deviation of roughly 4°.
In both the small room (Figure 5) and the Bellcore conference room, the
system performed moderately well, taking at most 2 seconds to aim the
camera at a speaker, provided the speaker was talking at a moderate level.
The position estimates deemed correct by the search algorithm were within
30 cm of the actual position of the speaker. In addition, the camera was
never moved to point to an incorrect position (i.e., every location
estimate deemed correct by the system was truly correct). These results,
especially in the Bellcore conference room, were very encouraging.

However, the performance in the highly reverberant auditorium (Figure 6)
was quite poor. While the camera was never aimed erroneously, it often
failed to find a correct location estimate for sources, and the speaker
needed to shout to cause the camera to move. Examination of the TDOA
estimates produced by the program revealed an excessive number of incorrect
values. The probable cause is the considerable amount of reverberation in
the room, which makes the TDOA estimation algorithm less reliable, and is
consistent with the offline simulation described above.

6 Conclusion
A low-cost, real-time system for source location using microphone arrays
was presented. The system is capable of determining the location of the
source and controlling a camera in real time. Using the Cross-Power
Spectrum Phase (CPSP) algorithm, the TDOAs for the different microphone
pairs are estimated. These TDOA estimates are then transferred to the PC,
where they are used to calculate the position of the source. Once the
position of the source has been obtained, the appropriate parameters are
sent to the camera.

The real-time system has been tested successfully in a typical medium-sized
conference room with a moderate amount of reverberation, and in a small
acoustically treated room. It performed poorly in an extremely reverberant
auditorium, but it is believed that the performance can be improved
somewhat by implementing the speech detector described in Section 4 and
changing the CPSP algorithm to emphasize portions of the spectrum where
there is high speech energy.

Future work will include refining the current system for better performance
and extending the system to include spatially selective sound capture. The
speech detector and modified CPSP expression (16) will be added to the
real-time implementation, and should mitigate some of the problems caused
by reverberation. Furthermore, the output of the source location system can
be used to steer a beamformer towards the sound source, thus allowing
focused audio to accompany the focused video.

Acknowledgements
This work has been supported by a contract with Bellcore, and by NSF Grant
No. MIP-9314625 and ARPA Contract DABT63-93-C0037.

References
[1] M. Brandstein. A framework for speech source localization using sensor
arrays. Doctoral dissertation, Brown University, May 1995.

[2] J. Flanagan, J. Johnston, R. Zahn, and G. Elko. Computer-steered
microphone arrays for sound transduction in large rooms. Journal of the
Acoustical Society of America, pages 1508-1518, November 1985.

[3] D. Johnson and D. Dudgeon. Array Signal Processing: Concepts and
Techniques. Prentice Hall, first edition, 1993.

[4] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A practical
time-delay estimator for localizing speech sources with a microphone array.
Computer, Speech, and Language, 9(2):153-169, April 1995.

[5] M. Omologo and P. Svaizer. Use of the crosspower-spectrum phase in
acoustic event localization. Technical Report No. 9303-13, IRST, Povo di
Trento, Italy, March 1993.

[6] M. Omologo and P. Svaizer. Acoustic event localization using a
crosspower-spectrum phase based technique. In Proceedings of ICASSP-94,
Adelaide, Australia, 1994.

[7] C. Knapp and G. Carter. The generalized correlation method for
estimation of time delay. IEEE Transactions on Acoustics, Speech, and
Signal Processing, Vol. ASSP-24, No. 4, August 1976.

[8] M. S. Brandstein, J. E. Adcock, and H. F. Silverman. A closed-form
method for finding source locations from microphone-array time-delay
estimates. In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, pages 3019-3022, Detroit, MI,
May 8-12, 1995.

[9] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical
Recipes in C: The Art of Scientific Computing. 2nd ed., Cambridge
University Press, 1992.

