Cryo-electron tomography provides the opportunity for unsupervised discovery of endogenous complexes in situ. This process usually requires particle picking, clustering and alignment of subtomograms to produce an average structure of the complex. When applied to heterogeneous samples, template-free clustering and alignment of subtomograms can potentially lead to the discovery of structures for unknown endogenous complexes. However, such methods require scoring functions to measure and accurately rank the quality of aligned subtomogram clusters, which can be compromised by contaminations from misclassified complexes and alignment errors. Here, we provide the first study to assess the effectiveness of more than 15 scoring functions for evaluating the quality of subtomogram clusters, which differ in the amount of structural misalignments and contaminations due to misclassified complexes. We assessed both experimental and simulated subtomograms as ground truth data sets. Our analysis showed that the robustness of scoring functions varies largely. Most scores were sensitive to the signal-to-noise ratio of subtomograms and often required Gaussian filtering as preprocessing for improved performance. Two scoring functions, Spectral SNR-based Fourier Shell Correlation and Pearson Correlation in the Fourier domain with missing wedge correction, showed a robust ranking of subtomogram clusters without any preprocessing and irrespective of SNR levels of subtomograms. Of these two scoring functions, Spectral SNR-based Fourier Shell Correlation was fastest to compute and is a better choice for handling large numbers of subtomograms. Our results provide a guidance for choosing an accurate scoring function for template-free approaches to detect complexes from heterogeneous samples.