A Geometric Blind Source Separation Method Based on Facet Component Analysis

Given a set of mixtures, blind source separation attempts to retrieve the source signals with little or no information about the mixing process. We present a geometric approach for blind separation of nonnegative linear mixtures termed {\em facet component analysis} (FCA). The approach is based on facet identification of the underlying cone structure of the data. Earlier works focus on recovering the cone by locating its vertices (vertex component analysis, or VCA) based on a mutual sparsity condition which requires each source signal to possess a stand-alone peak in its spectrum. We formulate alternative conditions under which enough data points fall on the facets of a cone instead of accumulating around its vertices. To find a regime of unique solvability, we make use of both the geometric and density properties of the data points, and develop an efficient facet identification method by combining data classification and linear regression. For noisy data, we show that denoising methods may be employed, such as the total variation technique from image processing and principal component analysis. We present computational results on nuclear magnetic resonance spectroscopic data to substantiate our method.


Introduction
Blind source separation (BSS) is a major area of research in signal and image processing [7]. It aims at recovering source signals from their mixtures with minimal knowledge of the mixing environment. The applications of BSS range from engineering to neuroscience. A recently emerging research direction of BSS is to identify chemical explosives and biological agents from their spectral sensing mixtures recorded by various spectroscopies such as Nuclear Magnetic Resonance (NMR), Raman spectroscopy, ion-mobility spectroscopy (IMS), and differential optical absorption spectroscopy (DOAS), etc. Spectral sensing is a critical area in national security and a vibrant scientific area. Advances in modern imaging and spectroscopic technology have made it possible to classify pure chemicals by their spectral features. However, mixtures of chemicals subject to changing background and environmental noise pose additional challenges. The goal of this paper is to develop a BSS method to process data in the presence of noise based on geometric spectral properties.
To separate the spectral mixtures, one needs to solve the following matrix decomposition problem:

X = A S + N,   (1.1)

where X ∈ R^{m×p} is the known mixture (data) matrix, A ∈ R^{m×n} is a full-rank unknown basis (dictionary) matrix, or the so-called mixing matrix in some applications, N ∈ R^{m×p} is an unknown noise matrix, and S = [s(1), . . . , s(p)] ∈ R^{n×p} is the unknown source matrix containing the signal spectra in its rows. Here p is the number of data samples, m is the number of observations, and n is the number of sources. Various BSS methods have been proposed based on a priori knowledge of the source signals, such as statistical independence, sparseness, and nonnegativity [3,6,7,11,12,15,16,17,20,23,24], among others. As a matrix factorization problem in the noise-free case (N = 0), BSS has permutation and scaling ambiguities in its solutions, similar to factorizing a large number into a product of primes: for any permutation matrix P and invertible diagonal matrix Λ, (A P Λ, Λ^{-1} P^{-1} S) is another pair equivalent to the solution (A, S), since

A S = (A P Λ)(Λ^{-1} P^{-1} S).

Recently there has been active research on BSS exploiting data geometry [1,2,5,13,20,21,22,23,24,25,26,27]. For simplicity, let N = 0 in (1.1). The geometric observation [1,20,27] is that if each row of S has a dominant peak at some location (column number) where the other rows have zero elements, then the problem of finding the columns of the mixing matrix A reduces to the identification of the edges of a minimal cone containing the columns of the mixture matrix X. In hyperspectral imaging (HSI), this condition is known as the pixel purity assumption (PPA, [5]): each pure material of interest exists by itself somewhere on the ground. The PPA-based convex cone method (known as N-findr [27]) is now a benchmark in HSI; see [5,20,21,22,23] for its more recent variants. The method termed vertex component analysis (VCA) proposed in [22] is worth mentioning here, being a fast unmixing algorithm for hyperspectral data.
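To make the scaling-permutation ambiguity concrete, the following small numpy check (all matrices here are made-up examples, not data from the paper) verifies that a permuted and rescaled pair reproduces the same mixtures:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3)) + 0.1        # a made-up mixing matrix
S = rng.random((3, 8))              # a made-up nonnegative source matrix
X = A @ S                           # noise-free mixtures (N = 0)

P = np.eye(3)[:, [2, 0, 1]]         # a permutation matrix
Lam = np.diag([2.0, 0.5, 3.0])      # an invertible diagonal scaling

A2 = A @ P @ Lam                            # permuted, rescaled mixing matrix
S2 = np.diag(1 / np.diag(Lam)) @ P.T @ S    # Lam^{-1} P^{-1} S  (P^{-1} = P^T)

# (A2, S2) generates exactly the same mixture matrix as (A, S)
assert np.allclose(A2 @ S2, X)
```

Any BSS method can therefore recover A and S only up to such a permutation and positive rescaling.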
In Nuclear Magnetic Resonance (NMR) spectroscopy, which motivates our work here, PPA was reformulated by Naanaa and Nuzillard in [20]. The source signals are only required to be non-overlapping at some locations of the acquisition variable (e.g. frequency), which leads to a dramatic mathematical simplification of the general nonnegative matrix factorization problem (1.1), which is nonconvex [14]. More precisely, the source matrix S ≥ 0 is assumed to satisfy the following:

Assumption 1 (NNA). For each i ∈ {1, 2, . . . , n} there exists a j_i ∈ {1, 2, . . . , p} such that s_{i,j_i} > 0 and s_{k,j_i} = 0 for k = 1, . . . , i − 1, i + 1, . . . , n.
Simply put, the stand-alone peaks possessed by each source allow formation of a convex cone enclosing all the columns of X, and the edges (or vertices) of the cone are the columns of the mixing matrix. To illustrate the idea, let us consider an example of three sources and three mixtures (A ∈ R 3×3 , X ∈ R 3×p , S ∈ R 3×p , and A is non-singular).
In matrix form,

[X(:, 1), X(:, 2), · · · , X(:, p)] = [A(:, 1), A(:, 2), A(:, 3)] · \begin{pmatrix} * & \cdots & * & 1 & 0 & 0 & * & \cdots & * \\ * & \cdots & * & 0 & 1 & 0 & * & \cdots & * \\ * & \cdots & * & 0 & 0 & 1 & * & \cdots & * \end{pmatrix},

where X(:, k) denotes the k-th column of X, each 1 represents a nonzero entry, and the three 1's are three stand-alone peaks. It can be seen that A's columns are actually among X's columns up to constants, and the other columns of X are nonnegative linear combinations of the columns of A. Geometrically, the columns of A span a convex cone enclosing all of X's columns, and the estimation of A is equivalent to the identification of this cone; as a matter of fact, the vertices of the cone are the columns of A. Fig. 1 shows this vertex-represented cone containing all data points. To find the vertices (or edges) of the cone, an optimization problem is solved for each column k; it is shown in [9] that X(:, k) is an edge of the convex cone if and only if the optimal objective function value c* is greater than 1.
If the data are contaminated by noise, the following optimization problem is solved for each k [20]:

min_{λ_j ≥ 0} || X(:, k) − Σ_{j ≠ k} λ_j X(:, j) ||_2,

and the optimal value serves as a score associated with the k-th column of X. Columns with low scores are unlikely to be columns of A, because such a column is roughly a nonnegative linear combination of the other columns of X. On the other hand, a high score means that the corresponding column is far from being a nonnegative linear combination of the other columns. The n columns of X with the highest scores are selected to form A, the mixing matrix.
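This scoring step can be sketched with nonnegative least squares; the function and variable names below are ours, and the toy data is constructed so that the first three columns are the true vertices:

```python
import numpy as np
from scipy.optimize import nnls

def vertex_scores(X):
    """Residual of the best nonnegative fit of each column by the others.

    A high score means the column is far from the cone of the remaining
    columns, hence a candidate column of the mixing matrix."""
    m, p = X.shape
    scores = np.empty(p)
    for k in range(p):
        others = np.delete(X, k, axis=1)
        _, scores[k] = nnls(others, X[:, k])   # rnorm of the NNLS fit
    return scores

# toy data: the columns of A, plus interior points of the cone they span
A = np.eye(3)
W = np.array([[0.5, 0.2], [0.3, 0.5], [0.2, 0.3]])  # interior mixing weights
X = np.hstack([A, A @ W])

scores = vertex_scores(X)
assert set(np.argsort(scores)[-3:]) == {0, 1, 2}   # vertices score highest
```

The interior columns (indices 3 and 4) are exact nonnegative combinations of the first three, so their scores vanish, while the vertex columns retain a strictly positive residual.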
Though the vertex-based convex cone method above is geometrically elegant, its working condition is still restrictive. The success of the cone method depends heavily on the recognition of the cone's vertices. If PPA or NNA is violated, the vertices are not in the data matrix X, and may not be the primary objects for identification. In this paper, we consider such a scenario, where the data points (scaled columns of X) lie either inside or on the facets of a convex cone, yet none of them are located on the edges (vertices); see Fig. 2 for an example. The cone structure is then reconstructed from its facets instead of its vertices; a facet component analysis (FCA) should be pursued for such data. The appropriate source condition for this case is that most columns of the source matrix S possess m − 1 nonzero entries. We shall pursue a more precise study of the source condition for solvability in the next section. The problem above calls for a method to identify flat submanifolds from a point cloud, as the facets of a convex cone in general lie on hyperplanes. The vertices are obtained from the intersections of the hyperplanes spanned by the facets. We shall study the identification of these flat manifolds and the subsequent recovery of the source signals. The extraction of meaningful geometric data structures helps the matrix factorization and reduces the computational cost.
Recently, a dual cone based approach to the BSS problem was proposed in [21]. The method first computes the dual of the data cone, then selects a subset of the source signals from a set of candidates by means of a greedy process. Specifically, the first step computes a generating matrix of the dual of the data cone by the double description method [19]. The second step extracts an estimate of the source matrix from a larger matrix, namely the product of the generating matrix with its transpose. Multiple solutions can be obtained in the second step, so the author imposed an additional orthogonality constraint on the source signals to get a sub-optimal solution. Overall, the method proposed in [21] appears indirect and computationally unwieldy, besides requiring the orthogonality of the source signals. Here we opt to solve the problem directly by identifying the facets of the data cone under a unique solvability condition.
The rest of the paper is organized as follows. In Section 2, we propose a new source condition for solving (1.2) based on the geometric structure of the data points, and develop an algorithm for facet identification and reconstruction of the associated convex cone. In Section 3, we present computational examples and results. For noisy data, we show that a denoising method may be needed to help remove or reduce the noise; a denoising method based on total variation and a distance function is discussed. Concluding remarks and future work are in Section 4.
The work was partially supported by NSF-ATD grant DMS-0911277.

Proposed Method
Let us consider the noiseless case in order to illustrate the basic idea behind our method. The first source condition on the problem is as follows:

Assumption 2. S contains at least m − 1 linearly independent column vectors orthogonal to each unit coordinate vector e_i, i = 1, . . . , m.
Let M be a matrix in R^{m×p}. The subset of R^m defined by

cone(M) = { M λ : λ ∈ R^p, λ ≥ 0 }

is a convex cone, and M is said to be a generating matrix of cone(M), since every element of cone(M) is a nonnegative linear combination of the columns of M. Let X = cone(X) and A = cone(A). Under Assumption 2, we have the following (cf. Lemma 3 and Lemma 5 of [21]):

Theorem 1. Under Assumption 2, X = A; moreover, each facet of A contains at least m − 1 linearly independent columns of X.

For the reader's convenience, a short proof is given below.
The second claim follows as we notice that: (1) A has m facets, and each one is spanned by m − 1 column vectors of A; (2) by Assumption 2, X = A S has at least m − 1 linearly independent column vectors located in each facet of A.

After the columns of X are scaled onto the plane x^T · 1 = 1, the resulting point set, together with the origin O, has an m-dimensional convex hull with l facets (l ≥ m + 1). Let us denote it by Conv(X). In fact, Conv(X) results from the cone X being truncated by the plane x^T · 1 = 1. We then acquire all the facets and the associated vertices of Conv(X). This can be done by means of the function convhulln from the Matlab library, which is based on Qhull, an implementation of the Quickhull algorithm for computing convex hulls.
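The same hull computation is available in Python through scipy.spatial.ConvexHull, which also wraps Qhull; the sketch below uses a made-up 3-source example (the matrix A is ours) and flags the facets that have the origin as a vertex:

```python
import numpy as np
from scipy.spatial import ConvexHull   # wraps Qhull, like Matlab's convhulln

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1, 0.2],
              [0.1, 1.0, 0.1],
              [0.1, 0.2, 1.0]])        # a made-up mixing matrix
X = A @ rng.random((3, 200))           # noise-free mixtures
X = X / X.sum(axis=0)                  # scale columns onto x^T 1 = 1

pts = np.vstack([np.zeros(3), X.T])    # prepend the origin O (index 0)
hull = ConvexHull(pts)

# facets having O as a vertex are the nontrivial ones; note that Qhull
# triangulates its output, so several simplices may lie in one plane
nontrivial = [f for f in hull.simplices if 0 in f]
assert len(nontrivial) >= 3
```

The remaining facets lie in the truncating plane x^T · 1 = 1 and correspond to the trivial facet discussed next.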
One of the facets is contained in the plane x^T · 1 = 1, and we call it the trivial facet. The other l − 1 facets, which have O as one of their vertices, are called nontrivial facets. By Theorem 1, X has at least as many nontrivial facets as A; in fact, it has more in most cases. If we randomly chose m nontrivial facets from those of X, the solutions would clearly be nonunique. Hence we need an additional source assumption to provide a selection criterion:

Assumption 3. Each of the m facets of A contains more data points (scaled columns of X) than any nontrivial facet of Conv(X) that is not contained in a facet of A.

By Assumption 3, we count and sort the numbers of data points in the nontrivial facets of Conv(X), then select the m facets with the largest numbers of data points. Each of these facets is contained in one facet of A, so the intersection of any m − 1 facets out of the acquired m facets is an edge of A. By intersecting all m edges with the hyperplane x^T · 1 = 1, we obtain the m column vectors of the mixing matrix A. Finally, the nonnegative least squares method yields the source matrix S.
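The intersection step (m − 1 facet hyperplanes meeting in an edge, then cut by x^T · 1 = 1) can be sketched with an SVD nullspace computation. The 3-source example below is made up, and for clarity we hand the routine the true facet normals rather than normals estimated from data:

```python
import numpy as np

def edges_from_facets(normals):
    """Given the m facet normals of the cone (rows of `normals`), return
    the m edge directions: each edge spans the 1-D nullspace of the other
    m-1 normals, scaled so its entries sum to one (cut by x^T 1 = 1)."""
    m = normals.shape[0]
    edges = []
    for i in range(m):
        others = np.delete(normals, i, axis=0)
        _, _, Vt = np.linalg.svd(others)   # last right singular vector
        e = Vt[-1]                         # spans the nullspace of `others`
        edges.append(e / e.sum())          # intersect with x^T 1 = 1
    return np.array(edges).T

# made-up 3-source mixing matrix; its facets are spanned by column pairs
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
normals = np.array([np.cross(A[:, i], A[:, j])
                    for i, j in [(0, 1), (1, 2), (2, 0)]])
A_hat = edges_from_facets(normals)
```

Each column of A_hat then matches a column of A up to scale and ordering, which is all BSS can ask for; in practice the normals come from planes fitted to the facet point groups.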

Remarks
1. We can denoise by increasing the threshold ρ in step 1. The effects of denoising with varied ρ are presented in Fig. 4. We add white Gaussian noise with signal-to-noise ratio (SNR) = 80 dB to three noiseless mixtures. In the left plot of Fig. 4, an outside triangle forms from the intersections of the plane x + y + z = 1 with the x-, y- and z-planes. When the noiseless signal is corrupted by additive white Gaussian noise, many data points with norms of almost 0 acquire negative entries. We then apply the threshold of 0 to X's entries in step 1, so these points fall into the x-, y- or z-plane. If we choose a very small ρ = 10^{-10}, as in the left plot, these points still remain (i.e., there is hardly any denoising effect), and projecting them onto the plane x + y + z = 1 forms the outside triangle. Such a geometric structure is undesirable because it makes the identification of A impossible. However, if ρ goes up to 2 × 10^{-3}, as in the right plot, the structure of A emerges. Clearly a better geometric structure of the data points is achieved by a larger ρ. However, we also need to avoid making ρ too large, lest we lose the structure of Conv(X) by keeping too few data points.
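The clamping and the ρ-threshold in this remark can be sketched as follows; the exact form of step 1 is our assumption based on the description above:

```python
import numpy as np

def preprocess(X, rho):
    """Clamp negative entries to zero, drop columns whose norm falls
    below rho, then scale the survivors onto the plane x^T 1 = 1."""
    Xc = np.maximum(X, 0.0)                       # threshold entries at 0
    Xk = Xc[:, np.linalg.norm(Xc, axis=0) > rho]  # discard near-zero columns
    return Xk / Xk.sum(axis=0)

rng = np.random.default_rng(2)
X = rng.random((3, 100)) + 0.05                   # well-separated data columns
X[:, :10] = 1e-6 * rng.standard_normal((3, 10))   # near-zero noisy columns
Y = preprocess(X, rho=2e-3)
assert Y.shape == (3, 90)                         # the 10 noise columns are gone
```

With ρ too small the near-zero columns survive and, after projection, form the spurious outside triangle described above.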
2. In the presence of noise, we modify the criterion for a point to belong to a facet as follows. The point is not required to lie exactly on the facet; instead, we only require that it is near the facet yet not near the vertices of the facet. The thresholds ε and σ in the following section of numerical experiments vary from 10^{-6} to 0.1 depending on the noise level. If there is almost no noise, we can achieve perfect recovery by setting these two thresholds to any values less than 10^{-5}. A higher noise level demands larger values of ε and σ. These two parameters are usually of the same order.
3. In step 4, we introduce the threshold δ to avoid selecting two nearly coplanar fitting planes which may actually correspond to the same facet of Conv(X). Normally we set δ to be 0.99.
4. We will introduce denoising methods based on smoothing filters, such as the box filter, the Gaussian filter, and total variation (TV) denoising, in the next section. The combination of these denoising methods with our FCA tends to perform better than FCA alone when the noise level is high. We embed the noise filter into FCA after step 3, where only the data points close to the facets of Conv(X) have been preserved. Applying the denoising methods then yields a more desirable geometric structure of the data points.

Numerical Experiments
We report the numerical results of our algorithm in this section. The data we have tested include real-world NMR spectra as well as synthetic mixtures. All the entries of the mixing matrices and the source matrices are positive.
As we pointed out in the last section, the NN assumption is a special case of ours, so our algorithm is expected to work for the separation of NN data. The first example is to recover three sources from three mixtures, where the source signals have stand-alone peaks. For the data, we used the true NMR spectra of three compounds, β-cyclodextrine, β-sitosterol, and menthol, as source signals. The NMR spectrum of a chemical compound is produced by the Fourier transformation of a time-domain signal which is a sum of sine functions with exponentially decaying envelopes [10]. The real part of the spectrum can be presented as the sum of symmetric, positive-valued, Lorentzian-shaped peaks. The NMR reference spectra of β-cyclodextrine, β-sitosterol, and menthol are shown in the top panel of Fig. 6 from left to right. For the parameters, we set ρ = 10^{-3}, ε = σ = 10^{-6}, δ = 0.99. Fig. 5 shows the geometric structure of the mixtures and the mixing matrix. The reference spectra and computational results are shown in Fig. 6. A_1 is the rescaled true mixing matrix, while Â_1 is the mixing matrix computed via our method. The separation results are excellent in that A_1 and Â_1 are identical. In a second example, there are three Lorentzian source signals which no longer satisfy the NN assumption. We use them to create three noisy mixtures by adding white Gaussian noise with SNR = 50 dB. A good result is achieved by setting ρ = 50, ε = 5 × 10^{-3}, σ = 6 × 10^{-3}, δ = 0.99. Fig. 7 and Fig. 8 show the geometry of the data points and the recovery results. The true mixing matrix A_2 and the computed Â_2 are given below (the first row of Â_2 is scaled to be the same as that of A_2). To provide further insight into how to choose the parameters, two results caused by inappropriate threshold values are presented as well in Fig. 9, Fig. 10 and Fig. 11. In the first experiment, some noisy data points destroy the geometric structure, as shown in Fig. 9, since ρ is not large enough to filter them out. In the second, ε and σ are too small for the noise level.
The last example concerns BSS in higher dimensions. We recover four sources from four mixtures. The data points are in 4-dimensional space, so they are difficult to visualize. We chose ρ = 1, ε = 2 × 10^{-5}, σ = 10^{-5}, δ = 0.99 here. The original sources and the recovered ones are presented in Fig. 12. The true and computed mixing matrices are shown below (the first row of Â_3 is scaled to be the same as that of A_3). To test the performance of our method, we compute Comon's index [8]. The index is defined as follows: let A and Â be two nonsingular matrices with L_2-normalized columns; the distance between A and Â, denoted by ε(A, Â), is defined in terms of the matrix D = A^{-1} Â with entries d_{ij}. In [8] Comon proved that A and Â are nearly equivalent in the sense of BSS (i.e., Â = A P Λ) if ε(A, Â) ≈ 0. Fig. 13 and Fig. 14 show Comon's indices between the true mixing matrices and the matrices computed by our method. For the result in Fig. 13, we compute Comon's indices using the four sources in Example 3 and 30 random 4 × 4 mixing matrices. The Comon's indices are very small, suggesting the equivalence, in the sense of BSS, of the true mixing matrices and the computed ones. Fig. 14 shows the performance of our method in the presence of noise. The three sources in Example 2 are combined to generate three noisy mixtures by adding white Gaussian noise with SNR varying from 16 dB to 50 dB. The reliability of our method is manifest from the plot.

Denoising
If there is considerable noise in the data, it is desirable to reduce or remove the noise before feeding the data to the proposed method. We propose to apply image denoising techniques within FCA for noise reduction or removal. These denoising methods may be used when the noise level is high, and they can be combined with FCA after step 3. In image processing, the simplest denoising method is the sliding mean, or box filter [18]: each pixel value is replaced by the mean of its local neighbors. The Gaussian filter is similar to the box filter, except that the neighboring pixels are given different weights, defined by a spatial Gaussian distribution. The Gaussian filter is probably the most widely used noise-reducing filter.
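Adapted to a point cloud, the box filter becomes a K-nearest-neighbor mean; a minimal sketch with scipy's KD-tree (the function name is ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_mean_filter(points, k=8):
    """Replace each point (row) by the mean of its k nearest
    neighbours (itself included) -- a box filter for point clouds."""
    _, idx = cKDTree(points).query(points, k=k)
    return points[idx].mean(axis=1)

# noisy samples along the segment y = 0: filtering shrinks the noise
rng = np.random.default_rng(5)
pts = np.column_stack([np.linspace(0.0, 10.0, 200),
                       0.1 * rng.standard_normal(200)])
smoothed = knn_mean_filter(pts, k=8)
assert np.abs(smoothed[:, 1]).mean() < np.abs(pts[:, 1]).mean()
```

Averaging k neighbors reduces the noise roughly by a factor of sqrt(k), at the cost of rounding off sharp features, which is the shortcoming addressed below.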
These local image smoothing filters can be applied to point cloud noise reduction with slight modification. They are incorporated into FCA after step 3, where the groups G_i have been obtained. Within each group, the K nearest neighbors are found based on Euclidean distance; the best choice of K depends upon the data and the noise level. For example, fewer neighbors should be selected when less noise is present. These denoising methods generalize to point clouds in any dimension and are easy to implement. A disadvantage is that they tend to smooth away edge or corner structures while reducing the noise. To overcome this shortcoming, we propose to apply the total variation idea of image denoising for noise removal. It is based on the principle that signals with excessive noise have high total variation; that is, the integral of the absolute gradient of the signal is high. According to this principle, reducing the total variation of the signal, subject to its being close to the original signal, removes the unwanted detail whilst preserving important details such as edges. The concept originated with Rudin, Osher, and Fatemi in 1992 [28], and it can be applied to point cloud noise removal. In the following, we use the example of a point cloud in xyz space to illustrate the idea of total variation denoising. We first preprocess the data by rescaling them onto the plane x + y + z = 1, so the projected data are two-dimensional. For each point (x, y), a distance function to the point cloud (data points) is defined as

d(x, y) = min_i \sqrt{ (x − x_i)^2 + (y − y_i)^2 },

where (x_i, y_i)^T corresponds to the i-th column of X; note that z_i is not included since z_i = 1 − x_i − y_i. Fig. 15 shows an example of the distance function to a unit circle: the left plot is the distance function with no noise, while the right plot is the function with noise.
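For instance, the distance function for a noisy circle (the setting of Fig. 15) can be evaluated on a grid with a KD-tree; the sketch below uses made-up sample sizes and noise levels:

```python
import numpy as np
from scipy.spatial import cKDTree

# noisy samples of a unit circle standing in for the projected data points
rng = np.random.default_rng(3)
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
cloud = np.column_stack([np.cos(theta), np.sin(theta)])
cloud += 0.02 * rng.standard_normal(cloud.shape)

# evaluate d(x, y) = min_i ||(x, y) - (x_i, y_i)|| on a rectangular grid
xs = np.linspace(-1.5, 1.5, 101)
gx, gy = np.meshgrid(xs, xs)
d = cKDTree(cloud).query(np.column_stack([gx.ravel(), gy.ravel()]))[0]
d = d.reshape(gx.shape)   # nonnegative, so it can be read as an image

assert d.min() >= 0.0     # the distance function is never negative
```

The nearest-neighbor query is what makes the grid evaluation fast; a brute-force double loop over grid points and cloud points would compute the same values.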
For computation, the distance function is restricted to a rectangular region containing the point cloud. Since d(x, y) ≥ 0, the distance function actually defines the intensities of an image. Clearly, the set S = {(x, y) : d(x, y) = 0} is the point cloud in the noiseless case.
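The total variation denoising of such a distance-function image can be carried out with Chambolle's projection algorithm [4]. The implementation below is our own minimal version; the fidelity parameter `weight` acts inversely to the λ of the ROF model, so a larger `weight` means stronger smoothing:

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary conditions."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad."""
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def tv_denoise(g, weight, n_iter=80, tau=0.125):
    """Chambolle's dual projection iteration for the ROF model."""
    px = np.zeros_like(g); py = np.zeros_like(g)
    for _ in range(n_iter):
        gx, gy = grad(div(px, py) - g / weight)
        norm = np.sqrt(gx ** 2 + gy ** 2)
        px = (px + tau * gx) / (1.0 + tau * norm)
        py = (py + tau * gy) / (1.0 + tau * norm)
    return g - weight * div(px, py)

# denoising a noisy blocky "image" lowers its total variation
rng = np.random.default_rng(4)
clean = np.zeros((32, 32)); clean[8:24, 8:24] = 1.0
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
u = tv_denoise(noisy, weight=0.3)
```

The step size tau = 0.125 is within Chambolle's convergence bound; the edges of the square survive the smoothing, which is precisely why TV denoising suits the facet structure here.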
We proceed to solve the Rudin-Osher-Fatemi (ROF) model to obtain a denoised distance function u(x, y):

min_u  TV(u) + (λ/2) ∫ (u − d)^2 dx dy,

where TV(u) is the total variation of u, defined for an image as TV(u) = ∫ |∇u| dx dy. We use the recent Chambolle algorithm [4], which works with a dual formulation for ease of minimization, to solve this problem. Note that the smaller the parameter λ, the stronger the denoising. The zero level set of the resulting minimizer u(x, y) is then taken as the denoised point cloud. In the actual calculation, we consider the thresholded set S = {(x, y) : u(x, y) ≤ τ}, where τ takes a tiny value. The noisy point cloud and the result after noise removal are depicted in Fig. 16. The planes detected from both of them are shown in Fig. 17, together with their intersections, i.e., the vertices of the cone. It can be noted that total variation denoising is very effective at preserving edges (thick lines in the figures) whilst smoothing away noise in flat regions. The idea of denoising the distance function by total variation extends to point clouds of any dimension. Experiments on the performance of FCA with and without TV denoising were conducted; a comparison is shown in Fig. 18. The mixtures are corrupted by additive white Gaussian noise with SNR varying from 16 dB to 25 dB. In general, TV denoising lowers Comon's indices at high noise levels, resulting in better separation.

Concluding Remarks
We presented facet component analysis (FCA), a geometric method for blind separation of nonnegative linear mixtures based on identifying the facets of the convex cone generated by the data. Computational experiments on NMR and synthetic data validated the solvability condition and showed satisfactory performance of the proposed algorithm. For noisy data, the total variation denoising method serves as a viable preprocessing step.
A line of future work is to separate more source signals than mixtures, known as underdetermined blind source separation (uBSS). This problem presents more challenges than the determined or over-determined BSS in that the mixing matrix is non-invertible. Some recent studies have been carried out by two of the authors based on a geometric approach to retrieve the mixing matrix under suitable solvability conditions [24].