The task of automatic speaker recognition, wherein a system verifies or determines a speaker's identity using a sample of speech, has been studied for a few decades. In that time, a great deal of progress has been made in improving the accuracy of system decisions, through the use of more successful machine learning algorithms and the application of channel compensation techniques and other methodologies aimed at addressing sources of error such as noise or data mismatch. In general, errors can be expected to have one or more causes, involving both intrinsic and extrinsic factors. Extrinsic factors correspond to external influences, including reverberation, noise, and channel or microphone effects. Intrinsic factors relate inherently to the speaker, and include sex, age, dialect, accent, emotion, speaking style, and other voice characteristics. This dissertation focuses on the relatively unexplored issue of the dependence of system errors on intrinsic speaker characteristics. In particular, I investigate the phenomenon that some speakers within a given population tend to cause a large proportion of errors, and explore ways of finding such speakers.
There are two main components to this thesis. In the first, I establish the dependence of system performance on speakers, building upon and expanding previous work demonstrating the existence of speakers with tendencies to cause false alarm or false rejection errors. To this end, I explore two different data sets: an older collection of telephone-channel conversational speech, and a more recent collection of conversational speech recorded on a variety of channels, including the telephone and various types of microphones. Furthermore, in addition to considering a traditional speaker recognition system approach, for the second data set I utilize the outputs of a more contemporary approach that is better able to handle variations in channel. The results of this analysis repeatedly show variations in behavior across speakers, both for true speaker and impostor speaker cases. Variation occurs both at the level of speech utterances, wherein a given speaker's performance can depend on which of his speech utterances is used, and at the speaker level, wherein some speakers have overall tendencies to cause false rejection or false alarm errors. Additionally, lamb-ish speaker behavior (where the speaker tends to produce false alarms as the target) is correlated with wolf-ish behavior (where the speaker tends to produce false alarms as the impostor). On the more recent data set, 50% of the false rejection and false alarm errors are caused by only 15-25% of the speakers.
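As an illustration only, a concentration figure of this kind could be computed along the following lines. This is a minimal sketch, not the procedure used in the thesis: the function name `speakers_covering`, the error log, and the population size are all hypothetical, and each error is assumed to be attributed to a single speaker.

```python
from collections import Counter

def speakers_covering(error_speakers, fraction=0.5):
    """Return the most error-prone speakers whose combined errors
    reach `fraction` of all errors (smallest such set)."""
    counts = Counter(error_speakers)
    total = sum(counts.values())
    covered = 0
    chosen = []
    # Visit speakers from most to fewest errors.
    for speaker, n in counts.most_common():
        if covered >= fraction * total:
            break
        chosen.append(speaker)
        covered += n
    return chosen

# Hypothetical error log: one entry per error, labeled by the speaker involved.
errors = ["s1"] * 6 + ["s2"] * 3 + ["s3"] * 2 + ["s4"]
num_speakers = 10  # assumed population size, including error-free speakers

worst = speakers_covering(errors)
share_of_population = len(worst) / num_speakers
```

With this toy log, a single speaker accounts for half of the errors, so `share_of_population` is 0.1.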
The second component of this thesis investigates a straightforward approach to predicting which speakers will be difficult for a system to correctly recognize. I use a variety of features to calculate feature statistics that are then used to compute a measure of similarity between speaker pairs. By ranking these similarity measures for a set of impostor speaker pairs, I determine which speaker pairs are easy for a system to distinguish and which are difficult to distinguish. A variety of these simple distance measures could successfully select both easy- and difficult-to-distinguish speaker pairs, as evaluated by differences in detection cost and false alarm probability across a large number of systems. Of the simple measures tested, the best at finding the most and least difficult-to-distinguish speaker pairs was the Euclidean distance between vectors of the mean first, second, and third formant frequencies. Even greater success was attained by the Kullback-Leibler (KL) divergence between pairs of speaker-specific GMMs. Furthermore, an examination of the smallest and largest distances (as computed by the KL divergence) revealed individual speaker tendencies to consistently fall among the most (or least) difficult-to-distinguish speaker pairs.
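The formant-based ranking described above can be sketched roughly as follows. The speaker labels, the formant values, and the helper `rank_pairs` are hypothetical and chosen only for illustration; the sketch assumes that each speaker is summarized by a single vector of mean F1, F2, and F3 values in Hz, and that a smaller distance indicates a pair that is harder to distinguish. The KL divergence between GMMs is not shown.

```python
import math
from itertools import combinations

# Hypothetical per-speaker mean formant frequencies (F1, F2, F3) in Hz.
mean_formants = {
    "spk_a": (500.0, 1500.0, 2500.0),
    "spk_b": (510.0, 1490.0, 2520.0),
    "spk_c": (650.0, 1800.0, 2900.0),
}

def rank_pairs(features):
    """Return (distance, speaker, speaker) tuples for every speaker pair,
    sorted from smallest to largest Euclidean distance."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        pairs.append((math.dist(features[a], features[b]), a, b))
    return sorted(pairs)

ranked = rank_pairs(mean_formants)
most_similar = ranked[0]    # candidate difficult-to-distinguish pair
least_similar = ranked[-1]  # candidate easy-to-distinguish pair
```

In this toy example, `spk_a` and `spk_b` have nearly identical mean formants and therefore rank as the most similar pair.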
I then develop an approach for finding those individual speakers who will be difficult for the system, using a set of feature statistics calculated over regions of speech. In particular, a support vector machine (SVM) classifier is trained to distinguish between difficult and easy speaker examples, in order to produce an overall measure of speaker difficulty as a target or impostor. The resulting precision and recall measures were over 0.8 for difficult impostor speaker detection, and over 0.7 for difficult target speaker detection. Depending on the application, the detection threshold can be tuned to improve precision, recall, or specificity to best suit the needs of a particular task. The same approach applies to a single conversation side or to a set of conversation sides corresponding to the same speaker, since the input feature statistics can be calculated over any number of speech samples.
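The effect of tuning the detection threshold can be illustrated with a small sketch. The scores and labels below are invented, and `precision_recall` is a hypothetical helper; the sketch assumes only that a classifier (such as the SVM above) has already produced one difficulty score per speaker, with higher scores meaning more difficult.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when speakers scoring at or above
    `threshold` are flagged as difficult."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical classifier outputs and ground truth (True = difficult speaker).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

strict = precision_recall(scores, labels, 0.7)    # favors precision
lenient = precision_recall(scores, labels, 0.35)  # favors recall
```

On this toy data, the stricter threshold flags only truly difficult speakers but misses one, while the more lenient threshold finds all difficult speakers at the cost of one false detection.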