At a glance, observers can evaluate gist characteristics of crowds of faces, such as the average emotional tenor or the average family resemblance. Prior research suggests that high-level ensemble percepts rely on holistic and viewpoint-invariant information. However, it is also possible that feature-based analysis is sufficient to yield successful ensemble percepts in many situations. To confirm that ensemble percepts can be extracted holistically, we asked observers to report the average emotional valence of Mooney face crowds. Mooney faces are two-tone, shadow-defined images that cannot be recognized in a part-based manner: to recognize features in a Mooney face, one must first recognize the image as a face by processing it holistically. Across experiments, we demonstrated that observers successfully extracted the average emotional valence from crowds that were either spatially distributed or presented in a rapid temporal sequence. In a subsequent set of experiments, we maximized holistic processing by including only those Mooney faces that were difficult to recognize when inverted. Under these conditions, participants remained highly sensitive to the average emotional valence of Mooney face crowds. Taken together, these experiments provide evidence that ensemble perception can operate selectively on holistic representations of human faces, even when feature-based information is not readily available.