Poisson hurdle model-based method for clustering microbiome features

Abstract Motivation High-throughput sequencing technologies have greatly facilitated microbiome research and have generated a large volume of microbiome data with the potential to answer key questions regarding microbiome assembly, structure and function. Cluster analysis aims to group features that behave similarly across treatments, and such grouping helps to highlight the functional relationships among features and may provide biological insights into microbiome networks. However, clustering microbiome data are challenging due to the sparsity and high dimensionality. Results We propose a model-based clustering method based on Poisson hurdle models for sparse microbiome count data. We describe an expectation–maximization algorithm and a modified version using simulated annealing to conduct the cluster analysis. Moreover, we provide algorithms for initialization and choosing the number of clusters. Simulation results demonstrate that our proposed methods provide better clustering results than alternative methods under a variety of settings. We also apply the proposed method to a sorghum rhizosphere microbiome dataset that results in interesting biological findings. Availability and implementation R package is freely available for download at https://cran.r-project.org/package=PHclust. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Over the past two decades, the number of microbiome studies has grown rapidly due to the advancement of next-generation sequencing (NGS) technologies. With lower cost and increasing computational power, we are able to obtain tremendous amounts of data regarding the diversity and function of the microbiome from a host or a habitat. One of the most popular NGS approaches is the amplicon-based sequencing (Poretsky et al., 2014) which generates data matrices of amplicon sequence variants (ASVs) or operational taxonomic units (OTUs), where ASVs and OTUs are unique taxonomic features. The ASV/OTU/taxa table is the starting point for most statistical analysis. However, these datasets present some challenges: they are high-dimensional and sparse (i.e. contain many zeroes), and there is high variability in sequencing depth across different samples (Cullen et al., 2020). These data characteristics make many classic and popular data analysis approaches not directly applicable to microbiome data and call for the development of new statistical methods.
Cluster analysis has been a popular method of multivariate data analysis that helps identify relationships among high-dimensional variables. It has been widely applied to high-dimensional gene expression data (Yeung et al., 2001). Applied to microbiome data, cluster analysis can help identify potential microbiome subcommunities, which give insights into how features (ASV/OTU/ taxa) with similar abundance levels are grouped together. With cluster analysis, researchers can more easily identify potential species patterns from highly diverse datasets. For example, clusters may represent taxa (or strains) that are functionally related to each other (i.e. guilds) or that share sensitivity to certain environmental conditions (i.e. niche selection), which can be further probed by downstream metagenomic techniques. When applied to time series data, clustering approaches could also help identify changes in microbial community states, which could provide important insights into microbiome assembly and manipulation.
Despite the increasing demand for methods for microbiome data analysis including cluster analysis, there are not many clustering algorithms developed specifically for microbiome data. To cluster microbial features, Gloor et al. (2016) applied the K-means clustering, Badri et al. (2020) used spectral clustering, while Casero et al. (2017) applied the model-based Poisson clustering developed by Si et al. (2014) for RNA-sequencing data. To cluster samples or microbial communities, Zhang et al. (2017) and Lonèar-Turukalo et al. (2019) applied spectral clustering while the latter paper proposed an implementation of kernel principal component analysis (PCA) for data reduction. None of the above-mentioned methods take into account the excessive zeros (sparsity) in microbiome data, which is a common issue especially after rarefaction (Gloor et al., 2017;McMurdie and Holmes, 2014). Such sparsity has made more and more researchers believe that the excessive zero counts in microbiome data need to be treated differently (Xu et al., 2015).
In this article, we propose a model-based algorithm for clustering microbiome features based on Poisson hurdle models. The hurdle models, introduced by Cragg (1971), separately model the zero part and the non-zero part of a random variable and hence naturally allow zero inflation that often occur in microbiome count data. Hurdle model also automatically deals with the issue of dropout events. Based on mixtures of Poisson hurdle models, we developed clustering methods to group microbial features sharing similar patterns of change across different treatments/conditions. Section 2 presents our method. We describe Poisson hurdle models for microbiome count data in Section 2.1 and propose our clustering algorithm based on mixture of Poisson hurdle models, including the expectation-maximization (EM) algorithm in Section 2.2, a stochastic modified EM algorithm in Section 2.3, an initialization method based on Kendall's s correlation in Section 2.4 and a hierarchical merging algorithm for determining number of clusters in Section 2.5. In Section 3, we compare the performance of our algorithms and other methods under a variety of simulation settings. In Section 4, we apply our proposed method to a sorghum microbiome dataset. We conclude this paper with some discussion in Section 5.

Poisson hurdle model-based clustering
Model-based clustering methods assume that data are generated by a mixture of probability distributions where each component corresponds to one cluster. Compared to traditional clustering methods such as K-means or hierarchical clustering, model-based clustering automatically offers a quantitative measure of the uncertainty of the clustering results, i.e. the probability of each feature belonging to each cluster. Extensive research has been done in model-based clustering with multivariate normal mixture distributions, see Fraley and Raftery (2002) for an excellent review. However, the count data with excessive zeros cannot be modeled directly using normal distributions. To handle the zero-inflated microbiome data, we propose a model-based clustering algorithm based on Poisson hurdle distribution.

Poisson hurdle distribution
Two types of statistical models have been commonly applied to modeling count data with extra zeros: zero-inflated models and hurdle models (also known as two-part models) (Hilbe, 2011). In fact, zero-inflated models are special cases of hurdle models: hurdle models can handle both zero inflation and zero deflation. Although many features of the microbiome have a lot of zeros, there are also features that are not zero-inflated and should not be modeled by zero-inflated distributions. In addition, estimates based on hurdle models tend to be more computationally stable, especially for data with small amounts of zeros (Xu et al., 2015). Hence, we propose to use Poisson hurdle models for microbiome count data.
Suppose, we have a microbiome dataset with G features and I treatment groups. Let N gij ; g ¼ 1; . . . ; G; i ¼ 1; . . . ; I; j ¼ 1; . . . ; n i denote the count data for feature g in replicate j of treatment i. The Poisson hurdle distribution models data by two parts separately: the zero part and the zero-truncated Poisson part. If N gij follows a Poisson hurdle distribution corresponding to a cluster k, then its probability mass function (pmf) is: where q kij is the probability of N gij in cluster k being positive (nonzero), and k kgij is the mean of the Poisson distribution before zerotruncation.
In expression (2) of the Poisson mean, s ij is a normalization factor that adjusts for technical variations in sequencing depth across samples. In this article, we use the log upper-quartile estimator. This normalization method, originally proposed for RNA-seq analysis (Bullard et al., 2010), has been shown to work well in microbiome datasets (Weiss et al., 2017). Once estimated, s ij is treated as known. The parameter a gk represents the geometric mean abundance level in the Poisson part across all treatments for feature g in cluster k, and l ki (with P I i¼1 l ki ¼ 0) represents the ith treatment effect in abundance level for features in cluster k.
In expression (3), we model q kij as a logistic function of the normalization factor s ij and allow different intercepts and slopes c 0ki ; c 1ki for different combinations of cluster and treatment (k, i). We further constraint c 1ki ! 0 because samples with larger sequencing depths tend to have larger non-zero proportions. Note that in model (1), features in the same cluster have the same treatment effects (l ki ) but we allow different geometric means (a gk ) across features in the same cluster. The reason is to cluster treatment effects, i.e. changes in abundance levels across treatments. Alternatively, we can also cluster features according to their abundance levels by assuming a reduced model with both the same geometric mean a k and the same treatment effects (l ki ) for all features in the same cluster. We present more details about this reduced model in Supplementary Section S1 and also have functions to implement it in our R package. The remaining part of the main text deals with the full model with a gk .
Assuming a total of K clusters, we model each cluster by Poisson hurdle models with cluster-specific parameter vectors l $ k and c $ k , where l $ k ¼ ðl k1 ; l k2 ; . . . ; l kI Þ, with P I i¼1 l ki ¼ 0, models the pattern of changes in abundance level across treatments and c $ k ¼ ðc 0k1 ; c 1k1 ; . . . ; c 0kI ; c 1kI Þ is the vector modeling q kij the probabilities of being positive counts for features in cluster k. Based on a mixture of Poisson hurdle models, the likelihood given observations of feature g can be expressed by represents the count vector for gth feature across all samples, f is the probability mass function for Poisson hurdle model (1), and p k is the mixing proportion corresponding to component k with p k ! 0 and P K k¼1 p k ¼ 1. The high dimensionality of microbiome data and complex relationships among microbial features makes it nearly impossible to model the dependency among features. In this article, we assume independence among features, but we evaluate the performance of our procedure under more realistic compositional structures among features. With independence assumption, the total likelihood can be expressed by

Poisson hurdle clustering via the EM algorithm
We apply an EM algorithm to obtain parameter estimates and clustering results. The EM algorithm for model-based clustering introduces a latent variable Z gk as the indicator that feature g belongs to cluster k for each combination of g and k. These indicators are treated as missing data, and the conditional expectation gives the conditional probability that feature g belongs to cluster k. EM algorithm proceeds by iteratively calculating conditional expectations (E step) and updating all unknown parameters by maximizing the likelihood function (M step) until convergence. Clustering results are obtained based on the final conditional expectations. EM algorithm is a common approach in model-based clustering (Fraley and Raftery, 2002). Si et al. (2014) and Rau et al. (2011) implemented EM algorithm in Poisson modelbased clustering. Compared to Poisson models, Poisson hurdle models are much more complicated due to the two-part structure. The total log-likelihood for mixture of Poisson hurdle models involve an extremely high dimension of parameters, and there are no closed-form solutions for the maximum likelihood estimator (MLE) of l ki and a gk in the M step. This poses extra challenges for the EM algorithm. In this article, we propose to utilize a numerical method based on coordinate descent in the M step.

Simulated annealing modification
As a strictly ascending algorithm, EM algorithm can be trapped in a local maximum. Various methods for adding randomness to help EM algorithm escape from local maximum have been introduced, and simulated annealing (SA) is one of them. The SA algorithm modifies the way to obtain conditional expectation in (3) by introducing a 'temperature' t ðmÞ > 0 and a 'cooling rate' c 2 ð0; 1Þ as follows:Z GivenZ ðmÞ gk , the SA algorithm clusters each feature g into class k with multinomial probabilityZ ðmÞ gk and generates an indicator matrix with entries 0 or 1 that replaces theẐ gk in the M step of the original EM algorithm. This clustering step of SA introduces more randomness (Celeux and Govaert, 1992) that is controlled by the temperature t, and larger t leads to larger randomness. SA usually starts with a relatively high temperature t ð0Þ and slowly reduces it to 0 as the algorithm proceeds, and the cooling rate c controls the speed of reduction. van Laarhoven and Aarts (1987) recommended t ð0Þ ¼ 2 and c ¼ 0.9 which is what we use. Simulation results in Section 3.3 show that SA algorithm yields competitive result compared with the original EM algorithm (Algorithm 1).

Initialization
EM algorithm is an iterative, strictly ascending algorithm whose convergence rate and final results are significantly influenced by the initialization (Biernacki et al., 2003;McLachlan et al., 2008;Melnykov and Melnykov, 2012). Commonly used approaches to initialize the EM algorithm start by picking K observations that are far from each other regarding some distance measure, such as (1 À Pearson's correlation), Euclidean distance (Arthur and Vassilvitskii, 2007), ranked Euclidean distance (Melnykov and Melnykov, 2012) and likelihood ratios (Si et al., 2014), etc. In this article, we propose to use ð1 À sÞ as the distance measure, where s is the Kendall's s correlation.
We gave a detailed description about the Kendall's s correlation and our proposed initialization algorithm in Supplementary Section S3. The main idea is that, we first select K observations that are well separated measured by ð1 À sÞ, and then we obtain MLEs of the model parameters based on the K selected observations and use these MLEs as the starting values for our EM algorithm.
In practice, we also recommend utilizing multiple sets of starting values to further avoid being trapped in local maximum, i.e. run the entire algorithm multiple times with different starting values and pick the result with the largest likelihood. Increasing the number of starts tends to provide a better performance at the cost of linearly increasing computation time. In the simulation section, we will show how this affect the clustering results.

Determining number of clusters
In real data analysis, the number of clusters K is typically not available and needs to be selected. There are existing methods developed for this purpose, but they do not fit into the scenario we are dealing with. In this article, we propose a hybrid clustering method with the application of likelihood ratio tests to select the number of clusters K. The likelihood ratio tests determine the optimal number of K by Algorithm 1: EM algorithm for Poisson hurdle clustering.
For each cluster k, obtain the initial values for parameters:  Maximizing the likelihood function is equivalent to maximizing the following log-likelihood for each cluster k, with the constraint 1þexp½Àðc 0ki þc 1ki sijÞ .We can maximize over the two set of parameters c 0ki ; c 1ki and a gk ; l ki separately. But still, there are no closed-form solutions for the maximum likelihood estimate (MLE) of those parameters. Hence, we propose to use a one-step coordinate descent algorithm to obtain numerical solutions for MLEs, which greatly reduce computation. Please see Supplementary Section S2 for more details. 4. Iterate the E step and M step until convergence, i.e. when the change in total likelihood is relatively small. 5. ObtainẐ gk from the last iteration, and assign feature g into cluster k where k ¼ argmax lẐgl .
testing, at each step of merging clusters, whether the merging will result in a reduced model that no longer fits data well. Please refer to Supplementary Section S4 for the discussion of existing methods, the reason why we do not use them, and our complete algorithm for choosing K.

Simulation studies
In this section, we present simulation studies to evaluate the performance of our proposed clustering methods and several existing clustering methods, including Poisson model-based clustering, negative binomial model-based clustering (Si et al., 2014), and the Kmeans clustering that has been applied to microbiome data (Gloor et al., 2016).

Simulation settings
We consider an experiment with I ¼ 3 treatment groups, J ¼ 5 replicates in each treatment group, and a total of G ¼ 1000 features that belong to K ¼ 7 true clusters. We assume equal mixing proportions of p k ¼ 1=7. A case of unequal mixing proportion is presented in Supplementary Section S5.
Data are simulated by a zero-inflated negative binomial model which introduces overdispersion, i.e. the variability is more than what is expected by our Poisson model. Each observation, N gij 2 class k, is a product of a Bernoulli ðq kgij Þ random variable and a negative binomial random variable with mean EðN gij Þ ¼ expðs ij þ a g þ l ki Þ and variance ð1 þ b expða g ÞÞEðN gij Þ For the negative binomial random variable: 1. Overdispersion rate is b expða g Þ, which depends on feature abundance level a g . 2. The geometric mean abundance levels a g' s and sequencing depth factors s ij 's are drawn from a Uniformð0:8; 1:2Þ distribution. 3. b controls the overdispersion and ranges from 0 to 0.5. This allows the overdispersion rate to range from 0 to 0:5 Ã e 1:2 ¼ 1:66, a reasonably large value for overdispersion. 4. The cluster-specific profile across treatment groups, l ki , is generated by l ki ¼ g l d ki where g l determines the magnitude of changes across treatments, and larger g l results in better separation of clusters. d $ k ¼ ðd k1 ; d k2 ; d k3 Þ characterizes the treatment effects in cluster k and is generated as follows: Note that the first six clusters cover different abundance profiles across treatments by including all six permutations of three different treatment effects: positive effect (1), no effect (0), and negative effect (À1). The last cluster corresponds to non-differential features whose abundance levels do not change across treatments. Such microbes are typically not of interest but exist in real data and affect the clustering performances.
Note that in the first scenario, all six permutations of high, medium and low zero inflation are present along with a group with equal mean zero-inflation rate ð0:5; 0:5; 0:5Þ across treatment groups. This is a case where the zero-inflation structure varies significantly among clusters and is more aligned with our model assumptions. In the second scenario, all clusters share the same mean zero-inflation rate with / ¼ 1 corresponding to no zero inflation, and the difference among clusters are only reflected through the Poisson mean structure. This is a less desirable circumstance for our method because the zero-inflation rate does not distinguish different clusters. In the main text, we will present simulation results for the second scenario that demonstrate our method performs better than others even in this unfavorable situation. Results for the first scenario are presented in the Supplementary Figure S2.
Finally, after a raw dataset is generated from the above procedure, we do an additional multinomial resampling on each column, with total counts C Ã G ¼ 1000C and probability vector proportional to each column of this raw dataset. This is to mimic the sequencing procedure and generate a compositional dataset with equal column sums. It brings in extra randomness and deviation from our assumed models thus can test the robustness of clustering methods.
We set the default simulation setting with b ¼ 0:02; g l ¼ 1; / ¼ 0:4, and average sequencing depth (total count/G) C ¼ 10. By varying each parameter at a time, we generate data for a variety of different simulation settings. For each simulation setting, 1000 datasets are simulated.

Simulation results
For each simulated dataset, we cluster the 1000 features into seven clusters using five different methods under comparison: (i) Poisson hurdle model-based clustering with EM algorithm (PH-EM), (ii) Poisson hurdle model-based clustering with simulated annealing (PH-SA), (iii) Poisson model-based clustering (MB-Poisson), (iv) negative binomial model-based clustering (MB-NB) and (v) K-means clustering with Euclidean distance (other popular non-model-based methods such as spectral clustering and hierarchical clustering produce similar or worse results and are not presented). The first four model-based methods are applied to the count data, while K-means is applied to the data after centered log ratio (clr) transformation (Aitchison, 1982), as was done by Gloor et al. (2016).
The clustering results are evaluated by three criteria: purity, adjusted Rand index (ARI) and normalized mutual information (NMI). All three criteria measure the agreement between clustering results and true clusters used to generate data, within the range of [0,1] with higher values indicating better performance. Purity measures how 'pure' the clusters are, i.e. to what extent each resulting cluster contains a single true cluster. Adjusted Rand index (Hubert and Arabie, 1985;Rand, 1971) measures similarity between two partitions (clustering results and true clusters) based on the proportion of pairs of features that are 'correctly' assigned. Mutual information (MI) measures the shared information between two partitions. The normalized MI (Strehl and Ghosh, 2002) adjusts the values so that NMI is in ½0; 1. See Supplementary Section S6 for definitions with mathematical expressions of all three criteria.
The results from all three criteria are consistent and show the same relative ranking of methods. In the main text, we present results based on NMI. Results on purity and ARI are presented in Supplementary Figure S1. We also evaluated the efficiency of the EM algorithm by checking the computational time and the number of iterations for convergence, which is described in Supplementary Section S7. Figure 1 presents results for a variety of settings of the second simulation scenario when the zero-inflation rates are constant across treatment groups. For all simulation settings, our proposed algorithms, PH-EM and PH-SA are almost indistinguishable from each other, and they are the best performers among all methods under comparison. Figure 1a shows that both PH-EM and PH-SA perform Cluster k 1 2 3 4 5 6 7

Clustering results
much better than the other methods when zero-inflation rate is between 0.2 and 0.8, a range commonly encountered in real data. When there is no zero inflation at all (/ ¼ 1), our algorithms still performs similarly to the other two model-based clustering methods that do not model zero inflation and better than K-means. As sequencing depth grows (Fig. 1b), the magnitude of treatment differences increases (Fig. 1c) or the level of fluctuation decreases (Fig. 1d), most methods tend to perform better. When sequencing depth is low, our methods are among the top-performing methods. For all other settings ( Fig. 1b-d), our methods are the best with obvious advantage over the other methods. These results show the consistency and robustness of the Poisson hurdle clustering algorithms. Results from the first simulation scenario (different zero-inflation structure among clusters) presented in Supplementary Figure S2 give the same conclusion.

Evaluation of initialization
We also compare three initialization methods: (i) using true parameters, (ii) using the MLEs based on K randomly selected observations and (iii) using the Kendall's s based initialization algorithm we propose in Section 2.4. As shown by the NMI results in Figure 2, our proposed initialization method uniformly outperforms random selection and is close to the result using true parameters which is the best we can get in simulation but not available in real data analysis.
Results of purity and ARI as performance measure are in Supplementary Figure S3 and give the same conclusion. In Supplementary Figure S4, we present the results of using different number of starts when we utilize multiple starting strategy. When we use 5 starting values, we get significant improvement in performance, while increasing to 10 starting values does not seem to have an additional notable effect. Therefore we choose 5 starts in all simulation studies.

Determining the number of clusters
In Section 2.5 and Supplementary Section 4, we describe a method for determining the number of clusters. Here, we evaluate this method using 2000 datasets generated from the default simulation setting where the true number of clusters is K ¼ 7. Table 1 shows the proportions of those datasets being identified with certain number clusters using four different methods: our proposed Hybrid method, Akaike information criterion, Bayesian information criterion and Gap statistic (Tibshirani et al., 2001). We can see that while those three methods tend to greatly underestimate the number of clusters, results from our hybrid method remain close to the true value K ¼ 7.

Real data analysis
A microbiome study was carried out in Nebraska where sorghum plants of the genotype Grassl were grown with two varying nitrogen levels (low/high). Rhizosphere microbiome samples were collected on four different dates throughout the growing season between June and September of 2017 and analyzed by 16S rRNA amplicon sequencing (Qi et al., 2021). Applying the persistence method (Shade and Handelsman, 2012) to identify core ASVs among the eight treatments, we obtained a subset of 449 ASVs with an average read count of 44.21, and 38% of counts in this subset are 0. Our proposed method for determining the number of clusters chose K ¼ 30.
We applied all four model-based clustering algorithms to the original count data and also applied the K-means clustering to the clrtransformed data to group the 449 ASVs into 30 clusters. Figure 3  plots the abundance profiles of all 30 clusters identified by our PH-EM method (Fig. 3a) and model-based NB clustering method (Fig. 3b), respectively. The Poisson model-based clustering produced worse results than the NB model-based method and is not shown here. Results for the PH-SA method and K-means are presented in Supplementary Figure S5 and yield similar conclusions. As shown in Figure 3a, the PH-EM method resulted in better separation of different clusters and more similar profiles within each cluster. In contrast, in Figure 3b, model-based NB clustering clusters over 80% of ASVs into 2 huge clusters, and 28 small clusters of which 15 are singletons. To parse finer-scale relationships between taxa in those two huge clusters, further clustering would be needed, complicating interpretation and placing additional bioinformatic burden on the researcher. This is not the case with our method, making it more useful for researchers looking to gain exploratory insight into the biological or ecological structure of microbial communities.
Our method revealed several bacterial clusters whose abundance was sensitive to plant developmental stage and nitrogen concentration. For example, PH-EM clustering revealed several clusters consisting of plant-growth-promoting taxa (e.g. Pseudomonas, Sphingomonas, Rhizobium, Arthrobacter and Streptomyces) whose abundance remained relatively stable under high nitrogen, but increased dramatically under low nitrogen (Fig. 3a, clusters 2 , 7, 9, 11, 14 and 27). These patterns suggest that under low nitrogen a putative guild of nitrogen-fixing taxa is selected for, at least partly driven by root exudation later in sorghum development. These results were corroborated by our analysis of amplicon sequences from 2016, as well as metagenomes assembled from both 2016 and 2017 samples that showed a significant increase in nitrogen-fixing genes under low nitrogen ( Supplementary Fig. S6). Similar patterns have been identified in other studies of the sorghum rhizosphere (Hara et al. 2019;Lopes et al. 2021;Yu et al. 2011;Wu et al., 2021). Interestingly, this nitrogen-fixing guild seems to be preceded by other putatively plant-growth-promoting taxa, such as Massilia and Bacillus (Clusters 12 and 17), whose abundances were also higher under low nitrogen. Importantly, these patterns were masked in the model-based NB clusters 10 and 18 whose average trends suggest the opposite: low abundance under low nitrogen and high abundance under high nitrogen. Based on these findings, we believe our clustering method could be particularly useful for parsing dynamic ecological relationships from datasets consisting of times series and/or many treatments.
To provide a quantitative evaluation of the clustering results, we measured the concordance between clustering results and taxonomic categories at the genus level. With the number of clusters, K, ranging from 2 to 50, we performed cluster analysis with different methods and calculated NMI based on the concordance between each clustering result and genera. Figure 4 shows that both PH-EM and PH-SA produced higher NMI values than K-means and model-based NB clustering for K larger than 7. When K is small (2-7), the modelbased NB method gave slightly higher NMI values than our algorithms. Thus, our method outperformed other clustering methods Each subplot corresponds to one cluster, with x-axis corresponding to the 8 treatment groups (two nitrogen levels with four chronically ordered dates, 1-4 correspond to high nitrogen, 5-8 correspond to low nitrogen) and y-axis corresponding to the abundance profile. Each grey line corresponds to the abundance profile estimated by the method of moments for an ASV, and the black line plots the geometric mean within each cluster. For each method, clusters will be referred to as Cluster No. 1-6 for the first row, 7-12 for the second row, . . ., and 25-30 for the last row Fig. 4. Clustering results for the rhizosphere microbiome dataset. Clustering results obtained by PH-EM, PH-SA, model-based negative binomial clustering (MB-NB), and K-means are compared with genera categories for each number of clusters K, ranging from 2 to 50. The NMI values shown are averages from 100 clustering results at each K when it came to clustering microbes based on microbial taxonomy, a proxy for ecological similarity. It is important to acknowledge that we did not expect our model to find perfect agreement between taxonomy and abundance (i.e. NMI values near 1). This is because microbiomes represent complex communities that can display large variation across individuals which cannot be explained by deterministic assembly mechanisms alone. The absence of perfect agreement between taxonomy and abundance could represent an alternative perspective: the combined influence of deterministic and stochastic assembly factors characteristic of competitive lottery models or priority effects (Sale, 1979). In situations where a particular niche space can support only a single species, stochastic dispersal followed by high competition among related species can result in only one 'winning' species-the identity of which can vary independently of any niche effect (Lee et al. 2013;Peay et al. 2012;Verster and Borenstein 2018). This could explain why ASVs of a particular genus group into more than one cluster.
For example, we found that Pseudomonas sp. from the sorghum rhizosphere grouped into 12 distinct clusters as compared to just 4 clusters with the model-based NB method. Our amplicon datasets from 2015 suggested that the sorghum rhizosphere exhibits a significant reduction in community diversity due to a 'bloom' of a diverse collection of Pseudomonas ASVs. Further analysis of isolate genomes and rhizosphere metagenomes from these samples revealed that this community represents several distinct Pseudomonas lineages that vary in their functional capacity especially regarding their commensal or pathogenic relationship with the host plant (Chiniquy et al., 2021). This diversity resulted in subtle but distinct abundance patterns between lineages. As mentioned above, our PH-EM method captured these abundance patterns in our 2017 dataset, while the model-based NB method failed to identify this heterogeneity, instead clustering most of these Pseudomonas sp. into a single large cluster, 18. Further, the PH-EM method confirmed that the abundances of many (although not all) of these Pseudomonas sp. are significantly higher under low nitrogen as compared to high nitrogen (Fig. 3a, clusters 2, 7, 14, 23 and 28). These patterns are undiscernible from the model-based NB results which clustered Pseudomonas ASVs with contrasting high and low nitrogen abundance patterns into the same cluster (Fig. 3b, cluster 18). Thus, in addition to identifying niche effects in microbial communities, our clustering method may also help researchers hone in on more complex ecological relationships that can be isolated, tested and confirmed via manipulative studies.

Discussion
Many features of microbiome data are zero-inflated while others do not exhibit zero inflation. Traditional clustering methods do not consider such characteristics. In this article, we model microbiome data with Poisson hurdle distributions that can fit features with or without zero inflation. To cluster features, we fit a mixture Poisson hurdle model with an EM algorithm (PH-EM) or another algorithm with a SA modification (PH-SA). Both algorithms have superior performances over other algorithms in both simulation studies and real data analysis. We also propose an initialization method and a method for determining optimal number of clusters, which are shown to be effective in our simulation studies. Compared with non-model-based clustering algorithms such as K-means method, our clustering algorithms also provide the uncertainties of the clustering results.
We also considered further extending our algorithms to negative binomial hurdle distributions to accommodate potential overdispersion that may exist in real dataset. We decided not to do so, due to the inefficiency in estimating the overdispersion parameter. Detailed explanations and simulation results are provided in Supplementary Section S8.