Audio scene monitoring using redundant ad-hoc microphone array networks

We present a system for localizing sound sources in a room with several ad-hoc microphone arrays. Each circular array performs direction of arrival (DOA) estimation independently using commercial software. The DOAs are fed to a fusion center, concatenated, and used to perform the localization based on two proposed methods, which require only a few labeled source locations (anchor points) for training. The first proposed method is based on principal component analysis (PCA) of the observed DOAs and does not require any knowledge of anchor points. The array cluster can then perform localization on a manifold defined by the PCA of concatenated DOAs over time. The second proposed method performs localization using an affine transformation between the DOA vectors and the room manifold. The PCA method has fewer requirements on the training sequence, but is less robust to missing DOAs from one of the arrays. The methods are demonstrated with five IoT 8-microphone circular arrays, placed at unspecified fixed locations in an office. Both the PCA and the affine method can easily map out a rectangle based on a few anchor points with similar accuracy. The proposed methods provide a step towards monitoring activities in a smart home and require little installation effort as the array locations are not needed.

Abstract-We present a system for localizing sound sources in a room with several microphone arrays. Unlike most existing approaches, the positions of the arrays in space are assumed to be unknown. Each circular array performs direction of arrival (DOA) estimation independently. The DOAs are then fed to a fusion center where they are concatenated and used to perform the localization based on two proposed methods, which require only a few labeled source locations for calibration. The first proposed method is based on principal component analysis (PCA) of the observed DOAs and does not require any calibration. The array cluster can then perform localization on a manifold defined by the PCA of concatenated DOAs over time. The second proposed method performs localization using an affine transformation between the DOA vectors and the room manifold. The PCA approach has fewer requirements on the training sequence, but is less robust to missing DOAs from one of the arrays. The approach is demonstrated with a set of five 8-microphone circular arrays, placed at unknown fixed locations in an office. Both the PCA approach and the direct approach can easily map out a rectangle based on a few calibration points, with similar accuracy. The methods demonstrated here provide a step towards monitoring activities in a smart home and require little installation effort as the array locations are not needed.
August 25, 2021
Index Terms-Smart homes, circular microphone arrays, sound localization, self-calibration.

I. INTRODUCTION
Microphone arrays, in the form of smart speakers, have become an affordable household item. As a result, these systems are ubiquitous, and there may be many microphone arrays in a single room. Most audio array systems require knowledge of the relative locations and orientations of the arrays. We instead use a few source calibration points with known relative locations, which are easier to implement. By using redundant arrays, we obtain higher accuracy and are less concerned with array placement.
In this paper we describe localization approaches that use several microphone arrays at unknown locations. Each array estimates DOAs locally, and the DOAs are collected over WiFi at a central fusion center, where they are concatenated to form a global DOA vector. Based on the global DOA vector, we perform localization in a room with two proposed methods: a subspace method based on PCA and an affine mapping-based approach.
We first consider localization using principal component analysis (PCA) to obtain a low-dimensional mapping of the global DOA vector.
We also consider a localization approach based on an affine mapping between the DOA vector and room coordinates. This mapping is more robust to missing array DOAs than the PCA approach, as it is based on a physical mapping to room coordinates. Training the mapping, however, requires the relative locations of the calibration points.
For our experiments, we have five arrays, placed at fixed unknown locations in a reverberant office environment. Each array is connected to a Raspberry Pi 3, which runs the Open embeddeD Audition System (ODAS) software [1], [2] and can track the directions of arrival (DOAs) of up to four sound sources simultaneously. This open-source framework is appealing as it allows on-board DOA estimation for each microphone array. We use this existing system to observe the sound sources and compute their DOAs relative to the array.
Such a setup could be useful for improving monitoring of sound events in a smart home, using existing arrays [3]. Since the DOA processing is performed on-board, privacy is better preserved, as audio is not shared over the network. The system could be used for fall monitoring [4] and daily event classification [5], and could also be combined with sound scene classification, e.g., [6], [7].

A. Related work
Alternative methods would combine the whole array stream of data at the fusion center by using maximum likelihood beamforming for multiple arrays [8], [9] and localize the arrays in an ad hoc network [10]. In our approach, where the DOAs are processed at the individual arrays, we benefit from transmitting just the DOA stream. This requires less bandwidth than using the raw array output. In terms of privacy this is a huge improvement as no signals are transmitted to the fusion center. In a fusion center multiple DOAs can be used for localization based on known array positions [11].
In the future, more advanced machine learning (ML) [30] could help with this task, as it already has in related localization problems. In particular, ML approaches might better address the inherent non-linearity in sound localization. Neural network classifiers [31], [32], [25], [33] or semi-supervised learning [34], [35], [36] would be of primary interest.
Many indoor localization systems have been proposed [37]. Sound source localization (SSL) also has widespread applications in human-robot interaction [38], ocean acoustics [39], teleconferencing [40], drone localization [41], and automatic speech recognition [42]. These systems have been designed for various applications such as indoor navigation, communication, and health. In general, they involve a first step where the receiver geometric configuration, such as distances and angles, is measured. In the second step, the target is located using the measured data. Our proposed system replaces the first step with a training/calibration step.
DOAs are the only features used here. Other systems also use the time difference of arrival between the arrays [43]. This requires a precise relative clock for each array system, possibly requiring more hardware. The actual system implemented here costs about 200 USD per array. The proposed system is demonstrated with active sources at well-known points. This serves to better validate the approach.

II. PROCESSING LOCALIZATION
We assume a room equipped with M time-synchronized circular microphone arrays, each of which provides DOAs in its own local coordinate system. The DOA for each array lies on a virtual unit half-sphere in the positive z half-space (see Fig. 1), and is defined as $d = [d_x, d_y, d_z]^T$ with $d_x^2 + d_y^2 + d_z^2 = 1$ and $d_z \geq 0$. Our development and processing use the output DOA vectors of the time-synchronized arrays as they arrive at the fusion center, where they are concatenated into a 3M-dimensional vector d. We seek a mapping between the 3M-dimensional d-vector and the local N = 3 dimensional room.
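As a minimal sketch of how the per-array DOAs could be stacked at the fusion center, the snippet below checks the unit half-sphere constraint and concatenates M local DOAs into one 3M-vector. The function name `global_doa_vector` and the example values are illustrative assumptions, not part of the system described here.

```python
import numpy as np

def global_doa_vector(doas):
    """Stack M per-array unit DOA vectors (each of length 3) into one 3M-vector."""
    doas = np.asarray(doas, dtype=float)           # shape (M, 3)
    norms = np.linalg.norm(doas, axis=1)
    if not np.allclose(norms, 1.0, atol=1e-6):
        raise ValueError("each DOA must have unit norm")
    if np.any(doas[:, 2] < 0):
        raise ValueError("each DOA must lie on the upper half-sphere (d_z >= 0)")
    return doas.reshape(-1)                        # shape (3M,)

# Hypothetical DOAs from M = 2 arrays, each in its own local coordinates
d = global_doa_vector([[0.0, 0.6, 0.8], [1.0, 0.0, 0.0]])
```

The ordering of the arrays inside the vector is arbitrary, but it must stay fixed between calibration and mapping.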
For the first approach we need all (or the same) arrays observing the sound source of interest, but no calibration sources are needed. For the second approach we need a set of calibration sources (more than 3).

A. PCA
Performing PCA on the observed DOA vectors gives $D = U \Lambda V^T$, with U and V the left and right singular vectors and $\Lambda$ the singular values. Assuming the relation between DOA and spatial location is approximately linear (3) with N (2 or 3) spatial coordinates, the first J (2 or 3) singular DOA vectors should be sufficient to describe the location. Throughout the rest of the paper, we use the first J = 2 components, as all sources are mainly in the 2-D horizontal plane.
Decomposing the ith DOA observation $d_i$ with only the first J = 2 singular vectors used, the reduced representation is
$d_i \approx U a_i = a_{i1} u_1 + a_{i2} u_2$, (1)
where $a_i = [a_{i1}\ a_{i2}]^T$ are the coefficients of the two singular vectors $u_1$ and $u_2$ (the first two columns of U) that define the DOA vector $d_i$. Since physically there should only be 2-3 large components, finding the decomposition in PCA space defines an unknown room position.
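The PCA step can be sketched as follows, assuming the DOA observations are stored as the columns of a 3M x L matrix and that no mean is removed before the SVD (the text does not mention centering). The data are synthetic stand-ins generated from a linear model, and the names `fit_pca` and `project` are hypothetical.

```python
import numpy as np

def fit_pca(D, J=2):
    """D: (3M, L) matrix with one DOA observation per column.
    Returns the first J left singular vectors (3M, J) and all singular values."""
    U, s, _ = np.linalg.svd(D, full_matrices=False)
    return U[:, :J], s

def project(U_J, d):
    """PCA coefficients a of one DOA vector d, so that d is approximately U_J @ a."""
    return U_J.T @ d

# Synthetic stand-in: 15-dimensional vectors depending linearly on a 2-D position
rng = np.random.default_rng(0)
A = rng.normal(size=(15, 2))
pos = rng.uniform(-1.0, 1.0, size=(2, 200))
D = A @ pos + 0.01 * rng.normal(size=(15, 200))

U_J, s = fit_pca(D, J=2)
a = project(U_J, D[:, 0])      # two PCA coefficients for the first observation
d_hat = U_J @ a                # rank-2 reconstruction of that observation
```

In this linear toy model the third singular value is far smaller than the first two, which is the behavior the text expects when the sources move in a horizontal plane.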
In the above processing, it is assumed that all arrays observe all the sources. However, in a real system there will always be missing observations due to malfunctioning arrays, nature of the room setup, source directivity or weak sources. This can happen in either the calibration or the mapping part of the experiments.
For mapping to PCA coordinates, omitting the missing array is easy, but the sound source will then map to a different location in PCA space. Thus, for sources evolving sequentially in time, it is preferable to use only the arrays that are continuously tracking the source.

B. Affine mapping to room manifold
In general, the mapping between the N-dimensional room coordinates $r \in \mathbb{R}^N$ (N is 2 or 3) and the 3M-dimensional DOA vector d is non-linear and unknown:
$r = f(d)$. (2)
The non-linear variation of the DOA vector d vs spatial location for one array is indicated in Fig. 2. We assume that a linear mapping (affine transformation) is sufficient for this mapping of a room. This was observed to work well for the DOA variation in the room used here, see Sec. IV. The affine transformation is
$r = r_0 + B d$, (3)
where $r_0 \in \mathbb{R}^N$ is the offset and $B \in \mathbb{R}^{N \times 3M}$ are the linear coefficients. Both $r_0$ and B are determined in Sec. II-C by performing recordings at K locations $r_k$.
Typically, a weak source is not observed on all arrays, giving a DOA observation d with non-active elements. We then retain only the active elements $d_a$, produced by the list of active arrays $I_a$. From the calibration DOA matrix D we likewise use only the active arrays $I_a$. For each DOA observation d and $I_a$, we determine $B(I_a)$ using only the entries of the calibration DOA matrix D corresponding to the active arrays $I_a$. This B matrix is then multiplied with the active DOAs $d_a$. The non-active elements are not used in the mapping. This gives
$r = r_0(I_a) + B(I_a) d_a$. (4)
For each case of active DOAs, the B matrix is determined as in Sec. II-C.
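The handling of active arrays can be sketched as below: the rows of the calibration matrix belonging to inactive arrays are discarded, the affine map is refit by least squares on the remaining rows, and the refit map is applied to the active part of the observation. The helper names, the pseudo-inverse cutoff `rcond=1e-10`, and the synthetic affine data model are assumptions made for illustration only.

```python
import numpy as np

def active_rows(active_arrays):
    """Indices into the 3M-vector for the listed arrays (3 components per array)."""
    return np.concatenate([np.arange(3 * m, 3 * m + 3) for m in active_arrays])

def map_active(d, active_arrays, D_cal, R_cal):
    """Map one DOA observation d (3M,) to room coordinates using active arrays only.
    D_cal: (3M, L) calibration DOAs, R_cal: (N, L) known calibration locations."""
    rows = active_rows(active_arrays)
    D_a = D_cal[rows]                                   # calibration DOAs, active rows
    d_mean = D_a.mean(axis=1, keepdims=True)
    r_mean = R_cal.mean(axis=1, keepdims=True)
    B_a = (R_cal - r_mean) @ np.linalg.pinv(D_a - d_mean, rcond=1e-10)
    # r = r_0 + B_a d_a, with r_0 = r_mean - B_a d_mean
    return (r_mean + B_a @ (d[rows, None] - d_mean)).ravel()

# Synthetic stand-in: M = 3 arrays whose DOA entries depend affinely on position
rng = np.random.default_rng(0)
M = 3
A, c = rng.normal(size=(3 * M, 2)), rng.normal(size=(3 * M, 1))
R_cal = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.0, 2.0]]).T   # (2, K)
D_cal = A @ R_cal + c
r_true = np.array([0.4, 1.5])
d_new = (A @ r_true[:, None] + c).ravel()

r_hat = map_active(d_new, [0, 2], D_cal, R_cal)         # array 1 is inactive
```

Because the toy data are exactly affine, the position is recovered even with one array missing; with real DOAs the recovery is only approximate.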

C. Linear mapping calibration
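We perform a sound source recording at K locations. For location j, with known room coordinates $r_j \in \mathbb{R}^N$, we record $L_j$ DOA vectors, collected in $D_j \in \mathbb{R}^{3M \times L_j}$, and define the corresponding room-coordinate matrix
$R_j = r_j i_{L_j}^T$, (5)
where $i_{L_j}$ is a vector of length $L_j$ with all ones. Concatenating all locations gives $D = [D_1 \cdots D_K]$ and $R = [R_1 \cdots R_K]$ with $L = \sum_j L_j$ observations, and the error of the affine mapping is
$E = R - r_0 i_L^T - B D$. (6)
Defining the cost function (using the trace operator Tr)
$\phi = \mathrm{Tr}(E E^T)$, (7)
and differentiating the cost function, with $\partial\phi / \partial E = 2E$, gives
$\partial\phi / \partial r_0 = -2 E i_L$, (8)
$\partial\phi / \partial B = -2 E D^T$. (9)
Forcing (8) to zero gives ($\bar{\cdot}$ is the mean operator over the L observations)
$r_0 = \bar{r} - B \bar{d}$. (10)
Forcing (9) to zero gives
$B = (R - \bar{r} i_L^T)(D - \bar{d} i_L^T)^+$, (11)
where + denotes the Moore-Penrose pseudo-inverse. Thus, the mapping is determined from (11).
The number of calibration points needed depends on the uniqueness of the points in spatial and DOA space. In general, for an N-dimensional room space, N + 1 calibration points spanning the space are sufficient. Using just 2 calibration points maps all observations to a line between the points, and 3 calibration points are sufficient to span a 2-D plane.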
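A small numerical sketch of the calibration step, under stated assumptions: the offset and the B matrix are obtained by least squares on mean-removed data via the pseudo-inverse, the calibration DOAs are random stand-ins for real recordings, and `fit_affine` is a hypothetical name. The example also checks the statement about the number of calibration points: with two points the fitted B has rank 1 (everything maps to a line), and with three non-collinear points it has rank 2.

```python
import numpy as np

def fit_affine(D, R):
    """Least-squares affine map r = r0 + B d.
    D: (3M, L) DOA observations, R: (N, L) room coordinates of each observation."""
    d_mean = D.mean(axis=1, keepdims=True)
    r_mean = R.mean(axis=1, keepdims=True)
    B = (R - r_mean) @ np.linalg.pinv(D - d_mean, rcond=1e-10)
    r0 = (r_mean - B @ d_mean).ravel()
    return r0, B

# Synthetic calibration: K = 3 points, 10 identical observations per point, M = 5
rng = np.random.default_rng(1)
d_cal = rng.normal(size=(15, 3))                            # one DOA vector per point
r_cal = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]).T    # (2, K) known locations
D = np.repeat(d_cal, 10, axis=1)
R = np.repeat(r_cal, 10, axis=1)

r0, B = fit_affine(D, R)
r_hat = r0[:, None] + B @ d_cal          # calibration points mapped back to the room

r0_2, B_2 = fit_affine(D[:, :20], R[:, :20])   # only the first two calibration points
rank_2pts = np.linalg.matrix_rank(B_2, tol=1e-8)
rank_3pts = np.linalg.matrix_rank(B, tol=1e-8)
```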

D. Disturbances from reference point
For a known reference pair of DOA vector and room location $(d_i, r_i)$, an approximate mapping from a measurement $d_n$ to a new room location $r_n$ is, for a given B matrix,
$r_n - r_i = B (d_n - d_i)$. (12)
For a given B matrix learned through the calibration points, we do the mapping by selecting a reference point $(d_i, r_i)$ close to the observed $d_n$ and using (12). Similarly, for known DOA PCA components $a_i$ (as in (1)) and room location $r_i$, and given a PCA observation $a_n$, the new room location $r_n$ is
$r_n - r_i = B (d_n - d_i) = B U (a_n - a_i) = C (a_n - a_i)$, (13)
where $C = BU$ describes the mapping from the DOA PCA components to spatial location. From (13) it is clear that there is an approximately linear relation between PCA space and location.
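The two equivalent forms of the reference-point mapping can be illustrated numerically. In this sketch B is a random stand-in for a calibrated matrix, U is an orthonormal stand-in for the first J = 2 singular vectors, and the DOA vectors are assumed to lie exactly in the span of U; all names and values are hypothetical.

```python
import numpy as np

def map_from_reference(B, d_ref, r_ref, d_new):
    """Room location of d_new relative to a known reference pair (d_ref, r_ref)."""
    return r_ref + B @ (d_new - d_ref)

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.normal(size=(15, 2)))   # orthonormal columns, shape (15, 2)
B = rng.normal(size=(2, 15))                    # stand-in for a calibrated B matrix
C = B @ U                                       # (2, 2) map from PCA to room space

a_ref, a_new = np.array([0.2, -0.1]), np.array([0.5, 0.3])
d_ref, d_new = U @ a_ref, U @ a_new             # DOA vectors inside the PCA subspace
r_ref = np.array([1.0, 2.0])

r_new = map_from_reference(B, d_ref, r_ref, d_new)   # mapping in DOA space
r_new_pca = r_ref + C @ (a_new - a_ref)              # same mapping in PCA space
```

Both routes give the same location here because the DOA vectors are exactly rank 2; for measured DOAs the PCA route discards the components outside the first J singular vectors.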

E. Unknown calibration source location
The PCA method comes in handy when it is impractical to measure the exact positions of sound sources for calibration. In that case, we perform the K source experiments with unknown locations and compute the singular vectors from these observations to enable PCA. For a new source, we then perform localization in this PCA space and describe the solution in terms of proximity to the K sources in PCA space.
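One way to express "proximity to the K sources in PCA space" is to compare a new observation with the centroid of each source's PCA coefficients. The sketch below assumes only that each observation is labeled with the source that produced it; the source positions appear solely to generate synthetic data, and the function names are hypothetical.

```python
import numpy as np

def fit_pca_centroids(D, labels, J=2):
    """PCA basis from all observations plus one centroid per source in PCA space."""
    U, _, _ = np.linalg.svd(D, full_matrices=False)
    U_J = U[:, :J]
    A = U_J.T @ D                                   # (J, L) PCA coefficients
    ids = np.unique(labels)
    centroids = np.stack([A[:, labels == k].mean(axis=1) for k in ids], axis=1)
    return U_J, ids, centroids

def closest_source(U_J, ids, centroids, d_new):
    """Label of the nearest source centroid and the distances to all centroids."""
    a_new = U_J.T @ d_new
    dist = np.linalg.norm(centroids - a_new[:, None], axis=0)
    return int(ids[np.argmin(dist)]), dist

# Synthetic stand-in: K = 3 sources, 20 noisy observations each, M = 5 arrays
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.normal(size=(15, 2)))
pos = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]).T   # used only to make the data
labels = np.repeat(np.arange(3), 20)
D = Q @ pos[:, labels] + 0.01 * rng.normal(size=(15, 60))

U_J, ids, centroids = fit_pca_centroids(D, labels)
d_new = Q @ np.array([0.1, 0.9])                # a new source close to source 1
best, dist = closest_source(U_J, ids, centroids, d_new)
```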

A. Circular microphone array
We used arrays that are part of a development kit called Matrix Creator, with software (ODAS) that computes the DOA using steered response power with phase transform (SRP-PHAT) [12], [1], summarized below as background information.
Each array consists of 8 microphones, uniformly placed on a circle with 10 cm diameter, and is connected to a Raspberry Pi 3 device. Each device runs its own instance of the ODAS framework, which outputs one DOA as a 3D unit vector in the array's local coordinate system, see Fig. 1.
To represent potential DOAs, the unit half-sphere is discretized using 1321 points obtained by subdividing a 20-sided convex polyhedron [1], giving an angular spacing of approximately 3° between grid points. A set of time differences of arrival (TDOAs) matches each DOA based on the microphone array geometry and speed of sound (assuming plane waves). For each hypothetical DOA i, ODAS computes SRP-PHAT, with power $E_i$ [12]:
$E_i = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} \sum_{k=0}^{N-1} \frac{X_p[k] X_q[k]^*}{|X_p[k]|\,|X_q[k]|} \exp\left(j 2\pi k \tau_{i,p,q} / N\right)$, (14)
where the superscript * denotes the complex conjugate, $X_p[k]$ is the short-time Fourier transform (STFT) frame, p is the microphone index (with P = 8 microphones per array), and k is the frequency bin index. Each frame consists of N samples, and $\tau_{i,p,q}$ is the TDOA at point i between microphones p and q. SRP-PHAT benefits from the low complexity of the Fast Fourier Transform (FFT) and makes sound source localization robust to indoor reverberation. At each time step, ODAS returns up to four potential DOAs corresponding to the points with maximum power, as shown in (14).
ODAS [1] relies on a tracking module to improve the stability and resolution of DOA estimation. Speech sources include silence periods, which should be accounted for while tracking sound sources over time. Robust detection and tracking therefore become an important feature. ODAS provides the DOAs of the four sources with highest SRP-PHAT power (14), and then uses Kalman filters to track the DOAs over time. For this application, we restrict ODAS to track one source.
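To make the SRP-PHAT computation concrete, the following is a simplified sketch and not the ODAS implementation. It sums the phase-normalized cross-spectra of all microphone pairs, steered by the candidate TDOAs, and picks the candidate with the highest power. The assumptions are: signed frequency-bin indices are used so that fractional TDOAs are handled, the candidate grid is 36 azimuths in the horizontal plane rather than the 1321-point half-sphere, and the test signal is synthesized directly in the frequency domain.

```python
import numpy as np

def srp_phat(X, tau):
    """X: (P, N) complex spectra of one frame, one row per microphone.
    tau: (I, P, P) candidate TDOAs in samples, tau[i, p, q] = delay_p - delay_q.
    Returns the steered response power for each of the I candidates."""
    P, N = X.shape
    k = np.fft.fftfreq(N) * N                       # signed bin indices
    W = X / np.maximum(np.abs(X), 1e-12)            # PHAT weighting: keep phase only
    E = np.zeros(tau.shape[0])
    for p in range(P - 1):
        for q in range(p + 1, P):
            cross = W[p] * np.conj(W[q])            # (N,) normalized cross-spectrum
            steer = np.exp(2j * np.pi * np.outer(tau[:, p, q], k) / N)   # (I, N)
            E += np.real(steer @ cross)
    return E

# 8 microphones on a circle of 10 cm diameter, 16 kHz sampling, plane waves
fs, c, P, N = 16000.0, 343.0, 8, 512
ang = 2 * np.pi * np.arange(P) / P
mics = 0.05 * np.stack([np.cos(ang), np.sin(ang), np.zeros(P)], axis=1)   # (P, 3)
az = np.deg2rad(np.arange(0, 360, 10))
dirs = np.stack([np.cos(az), np.sin(az), np.zeros_like(az)], axis=1)      # (I, 3)
delays = -(dirs @ mics.T) * fs / c                  # (I, P) arrival delay in samples
tau = delays[:, :, None] - delays[:, None, :]       # (I, P, P)

# Synthetic source at candidate index 6 (azimuth 60 degrees)
rng = np.random.default_rng(0)
S = np.fft.fft(rng.normal(size=N))
true_i = 6
f = np.fft.fftfreq(N)
X = S[None, :] * np.exp(-2j * np.pi * np.outer(delays[true_i], f))

E = srp_phat(X, tau)
best = int(np.argmax(E))
```

In this noise-free case every pair contributes N at the true direction, so the peak equals the number of pairs times N; reverberation and noise lower and broaden the peak in practice.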
The multi-channel raw audio is sampled at 44,100 samples/s by the Matrix Creator array and resampled by ODAS to 16,000 samples/s. ODAS then returns an updated DOA estimate with an 8 ms time resolution to the fusion center. Figure 2 shows the ODAS pipeline for the tracked DOA.

B. Array setup
An office room was equipped with M = 5 arrays, see

C. Processing chain
Each array processes the data independently on a Raspberry Pi 3, which is synced to a common clock. The Raspberry Pi then sends the DOAs to a fusion center, where the DOAs from all arrays are stored in a database after careful time alignment. All data in the next sections were extracted from this database.

IV. EXPERIMENTAL DEMONSTRATION
The approach consists of a calibration step and a mapping step. In the calibration step, either the B matrix is created based on sounds from known calibration locations, or the SVD is determined using available DOAs with no need for knowing the source locations. In the mapping step, we use the observed DOA vector to map to either room coordinates (3) or PCA coordinates (1).

A. Setup
In order to maximally activate DOA localization on each array, an omnidirectional loudspeaker (10 cm long) playing a female voice is used. The reverberation time in the office was measured as RT60=0.3 s. ODAS can track up to four simultaneous sound sources, but we limit it to one in these experiments. The ODAS sound source tracking module is also tuned to track both static and moving sources.
An office room is equipped with M = 5 arrays (see Fig. 3 for locations). Three arrays are mounted on the ceiling (array-2, array-3, and array-4), array-0 is placed on the south-facing wall, and array-1 is installed on the west-facing wall. Note that the processing does not depend on knowing the array locations or orientations. For better physical understanding, we orient the arrays with the local y-component pointing up for array-0 and array-1, and North for array-2, array-3, and array-4.
The global x-axis (East) corresponds to the negative local x-axis for array-0, array-2, array-3, and array-4, and to the z-axis for array-1. The global y-axis (North) corresponds to the positive local y-axis for array-1, array-2, array-3, and array-4, and to the z-axis for array-0. These relations are for interpretation of results only, and are not used for mapping.
Each Matrix Creator array is connected to a Raspberry Pi 3 (RP3) single-board computer, which provides a DOA estimate every 8 ms. The local clock of each RP3 is synchronized to a local time server at ntp.ucsd.edu over the office WiFi network using the Network Time Protocol (NTP) [44]. When left running continuously, each local NTP client on an RP3 was observed to maintain synchronization within 10 ms, and usually much better. Each array generates a time-stamped digital summary of detected DOAs once every 64 ms. These summaries are stored in the local memory of the array and are pushed to the cloud every five minutes.
Every five minutes, new data is downloaded from the cloud to our fusion center and loaded into a relational database. The data is then sorted and joined based on the time stamp. The result is a chronological table which contains the DOAs from all arrays, synchronized using the timestamps. This table can be queried to get the full information for any period of time.
ODAS tracks the loudest sound source and transfers the DOAs to the fusion center. At the fusion center, we bin the DOAs using a 64 ms time window. For each array, the average of all its DOAs within a 64 ms window is stored in the database. The 64 ms resolution is sufficient as the sound sources move slowly. Fig. 4 shows the handheld loudspeaker being moved manually along the table edge.
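The binning and joining described above could look like the following sketch, which replaces the relational database with plain dictionaries. Timestamps are assumed to be integer milliseconds, the averaged DOAs are not renormalized, and only windows observed by every array are kept; the function names are hypothetical.

```python
import numpy as np
from collections import defaultdict

WINDOW_MS = 64

def bin_doas(timestamps_ms, doas, window_ms=WINDOW_MS):
    """Average the DOAs of one array inside consecutive time windows.
    Returns a dict mapping window index to the mean DOA, shape (3,)."""
    bins = defaultdict(list)
    for t, d in zip(timestamps_ms, doas):
        bins[int(t) // window_ms].append(np.asarray(d, dtype=float))
    return {b: np.mean(v, axis=0) for b, v in bins.items()}

def join_arrays(per_array_bins):
    """Keep the windows seen by every array and stack one 3M-vector per window."""
    common = sorted(set.intersection(*(set(b) for b in per_array_bins)))
    D = np.stack(
        [np.concatenate([b[k] for b in per_array_bins]) for k in common], axis=1
    )
    return common, D

# Two hypothetical arrays reporting every 8 ms over partly overlapping intervals
t0 = np.arange(0, 128, 8)      # 0 ... 120 ms, windows 0 and 1
t1 = np.arange(64, 192, 8)     # 64 ... 184 ms, windows 1 and 2
bins0 = bin_doas(t0, [[0.0, 0.0, 1.0]] * len(t0))
bins1 = bin_doas(t1, [[1.0, 0.0, 0.0]] * len(t1))
common, D = join_arrays([bins0, bins1])
```

Windows in which an array reports nothing are dropped here; keeping them with a marker for the inactive array would be the starting point for the missing-array handling of Sec. II-B.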

B. Calibration
The loudspeaker is placed at each calibration point for 30 s. For these points we record a 3-component DOA from each of the 5 arrays, giving a 3×M = 15 element DOA vector d every 0.064 s (about 450 for each point). Based on these records in D, we extract singular vectors corresponding to measurement points on chair, table, and chair+table, see Fig. 5. Since all measurements were performed in nearly a horizontal plane, the first two PCA components carry most of the energy, see Fig. 5a. Focusing on chair+table in Fig. 5d, the first singular vector has large amplitudes, 0.5, for the x-component of array-0, array-2, array-3, and array-4, all pointing westward. Thus, an increase in the 1st PCA component corresponds to a more westerly source. The second singular vector has large amplitudes, 0.5, for the y-component of array-2, array-3, and array-4 and the x-component of array-1. An increase in the 2nd PCA component corresponds to a more northerly source.
From these point measurements, the B-matrix (11) is extracted, see Fig. 6, where we split the points into batches of chair, table, and chair+table. For all 3 mappings, the y-component of array-2 is strongest for the bottom row, corresponding to the y-component of the room. This indicates that an increase in the y-component for array-2 gives a larger global y-component. The same holds when focusing on chair+table in Fig. 6c.
Although the true mapping is non-linear, it is possible to find a linear mapping for the 5 points (Fig. 7a, table), 6 points (Fig. 7a, chair), and 11 points (Fig. 7a). For each point, 450 DOA vectors were recorded with small fluctuations around the true DOA values. The fluctuations come from the Kalman filter output for the fixed sources. These fluctuations cause a cloud around each point when applying the estimated B matrix. The noise cloud near each point increases as more calibration points are used in Fig. 7c. This is because, due to the non-linearity, the mapping of these points becomes noisier as the number of points increases.

C. Mapping
We slide the loudspeaker along the table edge (Fig. 4), ideally forming a rectangle (points 1-4) or a line (points 5-6) to demonstrate the method, see Fig. 8. First, a B-matrix computed using 2 calibration points was used, and we confirmed that in such a case the mapping was just a projection in the direction of these two points (not shown). All three B-matrices determined in Fig. 6 were used. The B matrix based on chair points (blue line) gives the largest error by drifting away from initial points 1 and 6. This is because the chair B matrix is determined at a different height and with a larger horizontal spread. However, for all mappings it is easy to recognize the rectangle and the line.

D. PCA mapping
In an actual room setup, it might be impractical to measure the calibration source locations. We could potentially learn a room from ambient noise DOAs, and then do a PCA mapping to the first J = 2 components. For comparison, here we use the DOAs from the calibration points, as illustrated in Fig. 9. The singular vectors are different for the 3 sets of observation points, see Fig. 5, but in each of the PCA projections the shapes of the rectangular table and the table edge are recognized. In Fig. 9, the PCA components change partly due to changes in the magnitude of the singular-vector entries, but the major change is due to sign changes of the singular vectors.
The rectangle on the

E. Missing arrays
A weak source can easily cause an array to miss a source. In this subsection, we assume that all sources for calibration are sufficiently strong. During the mapping step, however, the setup could be such that a sound source does not activate the DOA localization for all arrays. This will give an undefined DOA vector for that array. In this section the B-matrix and PCA are determined by the calibration points on the table (points 1-4).
For the PCA mapping, it is problematic to drop an array, as either the PCA has to be recalculated without the array or the missing DOA from that array has to be estimated. A simple solution is to omit that part of the DOA vector, but still use the same SVD vectors based on all arrays. Fig. 10 shows how this affects the estimation for the whole rectangle. Depending on which array is dropped, the rectangle projects to a quite different area of the PCA space. Thus, if an array is dropped for a single observation, the result is not clear. However, Fig. 10 shows that if we drop an entire array in computing the SVD with the calibration points and for the PCA, the metasurface shows a consistent result with the rectangle of sources clearly mapped.
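The effect of omitting an array while keeping the full-array singular vectors can be sketched numerically. Here the entries of the dropped array are set to zero before projecting, so the PCA coordinates shift by exactly the contribution that array would have made. The data are random stand-ins, not real DOAs, and `project_with_dropped` is a hypothetical helper.

```python
import numpy as np

def project_with_dropped(U_J, d, dropped):
    """Zero the 3 entries of each dropped array, then project on the full-array basis."""
    d0 = d.copy()
    for m in dropped:
        d0[3 * m:3 * m + 3] = 0.0
    return U_J.T @ d0

rng = np.random.default_rng(2)
M = 5
D = rng.normal(size=(3 * M, 100))        # synthetic stand-in for calibration DOAs
U, _, _ = np.linalg.svd(D, full_matrices=False)
U_J = U[:, :2]

d = D[:, 0]
a_full = U_J.T @ d                                   # all arrays present
a_drop = project_with_dropped(U_J, d, dropped=[3])   # array-3 missing
shift = U_J[9:12].T @ d[9:12]                        # contribution of array-3 alone
```

The shift depends on which array is missing, so a source observed with different array subsets lands in different regions of PCA space.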
The affine mapping (4) maps directly to the room domain and therefore shows good results as long as there are sufficient arrays to perform the mapping, see Fig. 11. Since all calibration points and measurements are in a horizontal plane, even just one array (Fig. 11e) can do well in the mapping, provided its DOAs are nearly perpendicular to the plane of motion of the sound source. Here, the ceiling arrays (array-2, array-3, array-4) have their DOAs approximately perpendicular to the plane of motion of the sound source. When only one array is used, the arrays on the walls (array-0, array-1) have DOAs too close to parallel with the source plane, and hence do not give good results. There is a significant stability improvement when increasing to two arrays (Fig. 11d).
Figure 12 illustrates the difference between (a) affine mapping (4) and (b) PCA mapping when, for each observation, there is a 50% chance that one array is dropped; which array is dropped is chosen uniformly at random. In this figure the B-matrix is determined by the calibration points on the table (points 1-4). Similar to the above results, the figure shows that simple PCA is noisy with large uncertainty.
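The random-dropout experiment can be mimicked with a small helper that, for each observation, removes one uniformly chosen array with probability 0.5. This is only a sketch of the sampling rule stated above; the name `random_active`, the seed, and the number of draws are assumptions.

```python
import numpy as np

def random_active(M, rng, p_drop=0.5):
    """List of active arrays: with probability p_drop, one array chosen
    uniformly at random is removed; otherwise all M arrays are active."""
    active = list(range(M))
    if rng.random() < p_drop:
        active.pop(int(rng.integers(M)))
    return active

rng = np.random.default_rng(3)
draws = [random_active(5, rng) for _ in range(2000)]
frac_dropped = np.mean([len(a) == 4 for a in draws])
```

Each resulting list plays the role of the active-array set for one observation, for which the affine map is refit on the corresponding rows of the calibration data.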

V. CONCLUSION
Sound source localization in rooms with redundant unlocalized arrays is discussed. Each array performs the direction of arrival (DOA) estimation independently and feeds it to a cluster where the DOAs from all arrays are concatenated. The DOAs can then be used for source localization, either in principal component space or by directly mapping to the room.
The method was demonstrated with five circular microphone arrays at unmeasured locations in an office. Both methods work well for tracking a source in a room. The direct mapping to room coordinates is more robust to array failures. The methods demonstrated here provide a step towards monitoring activities in a smart home with little installation effort.
Fig. 11: PCA for sliding source along the table edge (points 1-2-3-4) with all arrays (blue) or with one array dropped for the whole experiment, giving one rectangle for each combination of 4 arrays. When one array is dropped, the DOAs for that array are set to 0 and the PCA for the full array is used. It is seen that when one array is dropped the rectangle maps to another region in PCA space, but can still be identified.