<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1033">
  <Title>An Algorithm for Determining Talker Location using a Linear Microphone Array and Optimal Hyperbolic Fit</Title>
  <Section position="1" start_page="0" end_page="155" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> One of the problems for all speech input is the necessity for the talker to be encumbered by a head-mounted, hand-held, or fixed position microphone.</Paragraph>
    <Paragraph position="1"> An intelligent, electronically-aimed unidirectional microphone would overcome this problem. Array techniques hold the best promise to bring such a system to practicality. The development of a robust algorithm to determine the location of a talker is a fundamental issue for a microphone-array system. Here, a two-step talker-location algorithm is introduced. Step 1 is a rather conventional filtered cross-correlation method; the cross-correlation between some pair of microphones is determined to high accuracy using a somewhat novel, fast interpolation on the sampled data.</Paragraph>
    <Paragraph position="2"> Then, using the fact that the delays for a point source should fit a hyperbola, a best hyperbolic fit is obtained using nonlinear optimization. A method which fits the hyperbola directly to peak-picked delays is shown to be far less robust than an algorithm which fits the hyperbola in the cross-correlation space. An efficient, global nonlinear optimization technique, Stochastic Region Contraction (SRC), is shown to yield highly accurate (&gt;90%), and computationally efficient, results for a normal ambient.</Paragraph>
    <Paragraph position="3"> Introduction: One of the problems for all speech input is the necessity for the talker to be encumbered by a head-mounted, hand-held, or fixed position microphone, or, perhaps, a technician-controlled mechanical unidirectional microphone. Whether for teleconferencing [1], speech recognition [2], or large-room recording or conferencing [3], an intelligent, electronically-aimed unidirectional microphone would overcome this problem. Array techniques hold the best promise to bring such a system to practicality.</Paragraph>
    <Paragraph position="4"> Algorithms for passive tracking -- the determination of range, bearing, speed, and signature as a function of time for a moving object -- have been studied for nearly 100 years, particularly for radar and sonar systems. While there is currently much activity involved with the tracking of multiple sources using variants of the eigenvalue-based decomposition MUSIC algorithm [4], [5], [6], [7], [8], most systems still use correlational techniques [9], [10], [11].</Paragraph>
    <Paragraph position="5"> The method presented here is also based on correlation. First, a coarse, normalized cross-correlation function is computed over the delay range of interest. It turns out that, even for the relatively high sampling rate of 20 kHz, the 50 µs resolution of the time-delay estimates causes derived locations to be unsatisfactory. However, the latter may be refined by nearly two orders of magnitude through accurate interpolation techniques, which can be attained at a relatively small computational cost using multirate filtering [12].</Paragraph>
    <Paragraph position="6"> For M microphones, one can estimate M-1 independent relative delays. As, theoretically, only two relative delays are needed to triangulate a source, for M &gt;3, the system is overspecified. However, since noise is always present in a real system, this extra information can be profitably used to overcome some of the effects of the noise. In fact, the geometry of the array constrains the vector of relative delays. For example, a simple linear array, with all the microphones on the axis, y=0, has delays constrained to be on a particular hyperbola with a focus on the target. Therefore, errors in the estimation of the delays may be corrected by fitting the best hyperbola. Two methods for doing so are presented here.</Paragraph>
    <Paragraph position="7"> In the first method, Time-Delay Estimation, Hyperbolic Fit (TDEHF), peak-picking is used on the results of the interpolated cross-correlations to estimate the individual time delays. Then, constrained nonlinear optimization is used to fit the best hyperbola through the sparse time-delay estimates. As the data turn out to be essentially unimodal, gradient techniques [13] were used to minimize a least-squares functional. TDEHF suffers when the original time-delay estimates exhibit large, and often "dumb", errors. TDEHF is introduced in Section 4. The second (and more robust) method, Interpolated Cross-correlation Hyperbolic Fit (ICHF), fits the best hyperbola to the actual output of the interpolated cross-correlations. As reasonable cross-correlations are always positive, the sum of the cross-correlations across all the microphones for a given hyperbola is used as a functional to maximize. As the functional surface is multimodal, results for a hierarchical grid search and for application of Stochastic Region Contraction (SRC) [14], [15], a new method for efficient global nonlinear optimization, are presented.</Paragraph>
    <Section position="1" start_page="151" end_page="151" type="sub_section">
      <SectionTitle>
Coarse Cross-Correlation
</SectionTitle>
      <Paragraph position="0"> Consider a linear microphone array having M microphones, each located on the line y = 0 at a distinct point (z_m, 0) in the xy plane. A simple case is considered in this paper, in which a single source (talker) is located at some point (x, y) in front of the array, although there will be ambient noise.</Paragraph>
      <Paragraph position="1"> Without loss of generality, microphone 1 is selected as the reference. It is assumed that the signal at each microphone is appropriately sampled at some reasonable rate, R, and that each microphone thus receives a signal of time (indexed by j), p_m(j).</Paragraph>
      <Paragraph position="2"> As sources might be separable in the frequency domain, one can, in general, filter each received signal using a zero-phase FIR filter; this is the only reasonable choice, as delay estimation is yet to be performed. This implies, where f_m(j) is a 2J+1 element symmetric FIR filter. It is advantageous, as will be seen later, to define rectangularly-windowed data, referenced to time index k', for the correlations as, where A_m(k') is a normalizing factor. A reasonable normalization is to make the autocorrelation of the unshifted reference signal have a value of unity for any particular time reference k'. Combining (2.3) and (2.4) gives, which generalizes to,</Paragraph>
      <Paragraph position="3"> Computational Considerations for the Cross-Correlations. An important consideration is the selection of L, the number of points in the cross-correlation. When autocorrelations are taken for LPC analysis, the length is limited by the assumption that the vocal tract is essentially stationary over the interval. As one is not doing this pseudo-stationary modeling of the vocal tract, this fact does not limit L here. Rather, the tradeoff between information content -- tending to make one increase L -- and computational load -- tending to make one decrease L -- governs this decision. For the typical human talker, computing a position about five times per second is sufficient. With no redundancy, selecting L to correspond to 100-200 ms of data is reasonable, as the experimental data show.</Paragraph>
      <Paragraph position="4"> The range of the correlations, [-K-, K+], may be determined from the sample rate and the geometry shown in Figure 1 for a one-dimensional array. For a symmetric arrangement in a room, K- = K+ and</Paragraph>
      <Paragraph position="6"> where c is the speed of sound, with a value of about 342 m/s.</Paragraph>
      <Paragraph position="7"> As an example, consider a one-dimensional array of length one meter, a room four meters wide, one-half meter of "block-out space" and a sampling rate of 20,000 samples per second. For this case, correlations will require 2000 multiplication-addition operations for 100 msec of data. As the maximum relative delay may be seen to</Paragraph>
      <Paragraph position="9"> ing the Analog Devices ADSP-2100A digital signal processor at 12.5 MHz clock rate [16]. For eight microphones, about 160 ms would be required, and the location could be computed in real-time for the required five updates per second.</Paragraph>
      <Paragraph position="10"> The relative delay between each microphone and its reference could be estimated by selecting the highest positive point in the correlation outputs, i.e.,</Paragraph>
      <Paragraph position="12"> where d_m[k'] is defined to be the delay, relative to microphone 1, for microphone m. Note that the accuracy is only to that of the sample rate, and that this simple peak-picking algorithm is subject to serious errors when real data are used! Interpolation for Higher Accuracy. Even for the relatively high (for speech) sampling rate of 20 kHz, estimation accuracy of the tracking position is inadequate; a variation of more than one meter in the y dimension is the norm for talkers two meters directly in front of the microphone. Experience has shown that an acceptable region of uncertainty may be achieved for a sampling interval of about 1 µs.</Paragraph>
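      <Paragraph> The coarse correlation and peak-picking step can be illustrated with a minimal Python sketch. It assumes a plain time-domain sum over a rectangular window of L samples starting at k', normalized so that the zero-lag autocorrelation of the reference window is 1, as the text describes. The function names normalized_xcorr and pick_delay and their parameters are hypothetical and do not come from the paper.

```python
import math


def normalized_xcorr(ref, sig, k_ref, length, max_lag):
    """Return a dict mapping each lag in [-max_lag, max_lag] to a correlation.

    ref, sig: sample sequences for the reference microphone and microphone m.
    k_ref: start index k' of the rectangular window; length: L samples.
    Values are scaled so the zero-lag autocorrelation of ref's window is 1.
    """
    window = ref[k_ref:k_ref + length]
    energy = sum(v * v for v in window)
    if energy == 0.0:
        raise ValueError("reference window is silent")
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for l in range(len(window)):
            j = k_ref + lag + l
            # Samples that fall outside sig are treated as zero.
            if j >= 0 and len(sig) > j:
                acc += window[l] * sig[j]
        out[lag] = acc / energy
    return out


def pick_delay(corr, sample_rate):
    """Peak-pick the largest correlation and return its delay in seconds."""
    best_lag = max(corr, key=corr.get)
    return best_lag / sample_rate
```

With R = 20000 samples per second the returned delay is quantized to multiples of 50 µs, which is the limitation that interpolation is meant to remove.</Paragraph>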
      <Paragraph position="13"> The most straightforward way to achieve the needed high resolution would be to sample at a much higher rate, R' -- around 1 MHz -- and perform the correlations on the data, i.e., $$C_m'[k,k'] = \frac{B_m(k')}{L_{R'}-|k|} \sum_{l=0}^{L_{R'}-1} r_1'(k'+l)\, r_m'(k'+k+l) \qquad (3.1)$$ where B_m(k') is a normalizing factor and L_{R'} is the number of high-resolution samples in L. Relative to 20 kHz sampling, this would force the computation to increase by a factor of 50² = 2500 (50 times as many samples, each correlated over 50 times as many lags), making the procedure impractical. For an appropriately anti-aliased speech signal, one would be dealing with greatly oversampled signals.</Paragraph>
      <Paragraph position="14"> Thus, with no loss in accuracy, one could generate the signal at sampling rate R' from the signal sampled at rate R by the simplest standard multirate method if R' = ZR, (3.2) where Z is an integer greater than 1.</Paragraph>
      <Paragraph position="15"> The proof for computationally efficient interpolation is given in [17]. The results for computation are:</Paragraph>
      <Paragraph position="17"> Computational Considerations for the Interpolation. One important aspect of the computation of Equation (3.3) is the storage requirement for O. Appropriate resolution is achieved for Z = 64, R = 20 kHz and a filter length of 641, implying QR = 5. Then the range of o1 and o2 is only 11. Thus (11)(11)(64) = 7744 storage locations are required.</Paragraph>
      <Paragraph position="18"> The number of multiplication-additions is (11)² = 121 to compute the cross-correlation for each interpolated point. One should note that this number is a far cry from the "direct" method in which, for L = 2000, (621)(64)(2000) ≈ 80,000,000 operations had to be done to get each interpolated signal and (64)(2000) = 128,000 operations had to be done for each interpolated cross-correlation!</Paragraph>
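      <Paragraph> As a rough illustration of refining a coarse peak, the sketch below evaluates a sampled, band-limited sequence at positions 1/Z of a sample apart using a Blackman-windowed sinc kernel. This is a generic stand-in, not the efficient multirate scheme of [17]; the kernel, its ten-sample support, and the names blackman_sinc, interpolate and refine_peak are assumptions made for the example.

```python
import math


def blackman_sinc(x, half_width):
    """Blackman-windowed sinc kernel; zero for |x| greater than half_width."""
    if abs(x) > half_width:
        return 0.0
    sinc = 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)
    window = (0.42 + 0.5 * math.cos(math.pi * x / half_width)
              + 0.08 * math.cos(2.0 * math.pi * x / half_width))
    return sinc * window


def interpolate(samples, t, half_width=5):
    """Estimate the sequence at fractional index t (t must lie inside it)."""
    base = math.floor(t)
    total = 0.0
    weight_sum = 0.0
    for n in range(base - half_width + 1, base + half_width + 1):
        if n >= 0 and len(samples) > n:
            weight = blackman_sinc(t - n, half_width)
            total += samples[n] * weight
            weight_sum += weight
    # Dividing by the weight sum keeps the gain at exactly 1 for constants.
    return total / weight_sum


def refine_peak(samples, coarse_peak, factor=64, half_width=5):
    """Search positions spaced 1/factor apart within one sample of a peak."""
    best_t, best_v = float(coarse_peak), samples[coarse_peak]
    for step in range(-factor + 1, factor):
        t = coarse_peak + step / factor
        v = interpolate(samples, t, half_width)
        if v > best_v:
            best_t, best_v = t, v
    return best_t
```

Here factor plays the role of Z = 64. Each interpolated value costs about ten multiply-adds on the coarse data, so the expensive step of producing a full high-rate signal is avoided, although the short kernel leaves a small bias in the refined peak position.</Paragraph>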
    </Section>
    <Section position="2" start_page="151" end_page="153" type="sub_section">
      <SectionTitle>
Best Hyperbolic Fit Algorithms
Triangulation
</SectionTitle>
      <Paragraph position="0"> In binaural hearing, both amplitude and phase information is fed to the brain and is used -- expertly -- to determine the location of a sound source. If the phase information -- the delay estimates -- alone were to be used to determine the location of a source, a minimum of three microphones is required for this "triangulation" procedure. If microphone 1 is considered to be the reference, and d2 and d3 the time delays for microphones 2 and 3 respectively, relative to the arrival at microphone 1, then the estimate of the source location (x0, y0) may be determined from,</Paragraph>
      <Paragraph position="2"> (One should note that these triangulation formulae are normally listed for polar coordinates.) These relatively ugly, nonlinear expressions tend to be very sensitive to variations due to noise in the estimates of d2 and d3.</Paragraph>
      <Paragraph position="3"> For the case of the linear array, where the microphones are all considered to be on y = 0, the locus of the relative delays for points along this line forms a hyperbola. This is clear from Figure 2, in which the relative delay loci are plotted for various point-source locations (x, y). At (z_m, 0), the absolute delay d_m may be computed from the Pythagorean Theorem as</Paragraph>
      <Paragraph position="5"> and, relative to microphone 1,</Paragraph>
      <Paragraph position="7"> The points (z_m, d_m) lie on a hyperbola parameterized by the speed of sound, c, and the location of the source, (x, y). Thus, there is a one-to-one relationship between a specific hyperbola and a source-point (x, y) located in front of the array -- there is a mirror image behind the array.</Paragraph>
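      <Paragraph> The two relations described above follow directly from the geometry and can be written out as a sketch. This is a reconstruction from the surrounding definitions (source at (x, y), microphone m at (z_m, 0), speed of sound c), not a copy of the original typeset equations, and the symbol \tau_m for the relative delay is introduced here for illustration only.

```latex
\[ d_m = \frac{1}{c}\sqrt{(x - z_m)^2 + y^2} \]
\[ \tau_m = d_m - d_1
   = \frac{1}{c}\left(\sqrt{(x - z_m)^2 + y^2} - \sqrt{(x - z_1)^2 + y^2}\right) \]
\[ \frac{(c\,d_m)^2}{y^2} - \frac{(z_m - x)^2}{y^2} = 1 \]
```

The last line is the first one squared and rearranged. It shows that the points (z_m, d_m) lie on a hyperbola centred at z = x whose shape is set by y and c, that subtracting the constant d_1 only shifts the curve along the delay axis, and that (x, -y) gives the same curve, which is the mirror ambiguity noted in the text.</Paragraph>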
      <Paragraph position="8"> The task, then, is to fit the best member of this class, the best hyperbola, to the set of relative delay estimates (z_m, d_m'[k']), where m ∈ [2, M].</Paragraph>
      <Paragraph position="11"> In TDEHF, an estimate of the relative delay for each microphone is obtained by peak-picking as indicated by Equations (2.10) and (2.11). Interpolation is done locally to get a higher resolution estimate, d_m'(k'). While many criteria are possible, a typical squared-error measure is defined as  When real data are used, it is often the case that the cross-correlation peak which must be determined in TDEHF is inappropriate. This is due to 1) periodicity in the signal, 2) room reverberations, and 3) noise. A more robust algorithm would clearly result if the specific determination of the delays did not have to be explicitly done. In ICHF, one tries to determine the "optimal-fit" hyperbola in the cross-correlation space itself; thus, no pattern recognition errors are made prior to the optimization.</Paragraph>
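      <Paragraph> The squared-error criterion of TDEHF can be sketched as follows, assuming the error is the sum over microphones of the squared difference between the peak-picked relative delays and those predicted for a candidate source (x, y). An exhaustive search over caller-supplied candidates replaces the gradient technique used in the paper, purely to keep the example short; model_delays, squared_error and fit_on_grid are hypothetical names.

```python
import math


def model_delays(mic_x, x, y, c=342.0):
    """Relative delays (s) to microphone 1 for a point source at (x, y).

    mic_x holds the x-coordinates z_m of microphones on the line y = 0.
    """
    dist = [math.hypot(x - z, y) for z in mic_x]
    return [(d - dist[0]) / c for d in dist]


def squared_error(mic_x, measured, x, y):
    """Sum of squared differences between measured and model delays."""
    model = model_delays(mic_x, x, y)
    return sum((a - b) ** 2 for a, b in zip(measured, model))


def fit_on_grid(mic_x, measured, xs, ys):
    """Return the candidate (x, y) with the smallest squared error."""
    return min(((x, y) for x in xs for y in ys),
               key=lambda p: squared_error(mic_x, measured, p[0], p[1]))
```

Because every measured delay enters the sum with equal weight, a single badly picked peak can pull the fitted hyperbola away from the true source, which is the weakness of TDEHF that the text describes.</Paragraph>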
      <Paragraph position="12"> Plots for real data are presented in Figures 3 and 4.</Paragraph>
      <Paragraph position="13"> In each case, the data are produced by a loud talker situated at (1 m, 2 m) with low ambient noise. In Figure 3, TDEHF worked well, as the peaks are relatively easy to pick correctly. In Figure 4, however, TDEHF yielded poor results, although it is evident that a hyperbolic fit in the cross-correlation space itself could give the right location.</Paragraph>
    </Section>
    <Section position="4" start_page="153" end_page="154" type="sub_section">
      <SectionTitle>
Succeeds
</SectionTitle>
      <Paragraph position="0"> In nonlinear optimization, one must develop a functional that measures "goodness (badness)" as a function of the set of variables over which one wants to optimize. In this case, one wants to develop a measure of the average "goodness" of a particular hyperbola parameterized by (x, y) over the space shown in Figures 3 and 4, having independent variables of x, the x spatial variable, and d, the relative delay. Points for the microphones (z_m, d_m) may be computed from Equations 4.3 and 4.4; this guarantees they all lie on a unique hyperbola. If a continuous cross-correlation function, C(x, d), were available, then a reasonable functional for maximization would be,</Paragraph>
      <Paragraph position="2"> E(k') represents a measure of the average height of the cross-correlation function measured over the points on the hyperbola taken by the set of microphones. One should note that it would be expected that the value should be positive for reasonable situations, and approaching unity for ideal ones, and thus E(k') could also be used to threshold decisions.</Paragraph>
      <Paragraph position="3">  v.. ta.,R'- o.* + 0.sj.</Paragraph>
      <Paragraph position="4"> Then, C_m(z_m, d_m) may be accurately approximated by</Paragraph>
      <Paragraph position="6"> which is exactly as derived previously. A three-dimensional plot of the surface for E(k') is given in Figure 5.</Paragraph>
      <Paragraph position="7"> Notice the strong peaking due to the hyperbolic-fit transformation.</Paragraph>
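      <Paragraph> A minimal sketch of the ICHF functional, under the assumption that each interpolated cross-correlation is available as a table indexed by integer lag at the high-resolution rate R': the score of a candidate (x, y) is the average of the table values at the lags predicted for that source. A plain grid search stands in for Stochastic Region Contraction, which is not reproduced here; hyperbola_score and grid_search are hypothetical names.

```python
import math


def hyperbola_score(mic_x, corr, x, y, rate, c=342.0):
    """Average cross-correlation height along the hyperbola for (x, y).

    corr[m] maps an integer lag (in samples at `rate`) to the correlation
    between microphone m + 1 and the reference; corr[0] is never read.
    Lags missing from a table contribute zero.
    """
    dist = [math.hypot(x - z, y) for z in mic_x]
    total = 0.0
    for m in range(1, len(mic_x)):
        delay = (dist[m] - dist[0]) / c
        lag = int(math.floor(delay * rate + 0.5))  # nearest sample index
        total += corr[m].get(lag, 0.0)
    return total / (len(mic_x) - 1)


def grid_search(mic_x, corr, rate, xs, ys):
    """Return the grid point (x, y) with the largest hyperbola_score."""
    return max(((x, y) for x in xs for y in ys),
               key=lambda p: hyperbola_score(mic_x, corr, p[0], p[1], rate))
```

No individual delay is ever committed to: a microphone whose largest correlation peak is spurious still contributes its correct, smaller peak to the score of the true hyperbola, and the score itself can be compared against a threshold before a location is accepted.</Paragraph>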
    </Section>
    <Section position="5" start_page="154" end_page="155" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> Some preliminary results for one loud talker standing at (1 m, 2 m) with a low ambient are shown in Figures 6 and 7. A linear array of eight microphones was used for all cases. For these Figures, an algorithm was assumed to have correctly located the talker if it indicated a location within the rectangular region from 1.9 m to 2.1 m in x and 1.5 m to 2.5 m in y. As algorithms have improved, the measure of "correctness" is also to be refined in further work. In both TDEHF and ICHF, the tendency is for better performance when larger-size cross-correlations are used, although there seems to be no reason to go beyond 3500 samples (175 ms). It is also clear that ICHF is far more robust than is TDEHF. Furthermore, as might be expected, one gets improved performance using bandpass-filtered data. (The filter used is a  There is high correlation between "correctness" and the resultant value of E[k'] for ICHF. Therefore, it is expected that, in regions where the algorithm fails -- perhaps in silence or a high-ambient interval -- the value of E[k'] would be low and the incorrect location would not be accepted. Given this thresholding, one would expect to almost always get an accurate prediction of a talker's location, providing no other talkers are competing acoustically, a case not yet studied.</Paragraph>
      <Paragraph position="1"> Computationally, ICHF is implementable in real-time due to the use of Stochastic Region Contraction [14] for the nonlinear optimization. Relative to a coarse-fine full search, SRC has provided an order-of-magnitude im-</Paragraph>
    </Section>
  </Section>
</Paper>