Sources of listener disagreement in voice quality assessment.
Published Web Location: http://scitation.aip.org/getpdf/servlet/GetPDFServlet?filetype=pdf&id=JASMAN000108000004001867000001&idtype=cvips
Traditional interval or ordinal rating scale protocols appear to be poorly suited to measuring vocal quality. To investigate why this might be so, listeners were asked to classify pathological voices as having or not having different voice qualities. It was reasoned that this simple task would allow listeners to focus on the kind of quality a voice had, rather than how much of a quality it possessed, and thus might provide evidence for the validity of traditional vocal qualities. In experiment 1, listeners judged whether natural pathological voice samples were or were not primarily breathy and rough. Listener agreement in both tasks was above chance, but listeners agreed poorly that individual voices belonged in particular perceptual classes. To determine whether these results reflect listeners' difficulty agreeing about single perceptual attributes of complex stimuli, listeners in experiment 2 classified natural pathological voices and synthetic stimuli (varying in f0 only) as low pitched or not low pitched. If disagreements derive from difficulties dividing an auditory continuum consistently, then patterns of agreement should be similar for both kinds of stimuli. In fact, listener agreement was significantly better for the synthetic stimuli than for the natural voices. Difficulty isolating single perceptual dimensions of complex stimuli thus appears to be one reason why traditional unidimensional rating protocols are unsuited to measuring pathologic voice quality. Listeners did agree that a few aphonic voices were breathy, and that a few voices with prominent vocal fry and/or interharmonics were rough. These few cases of agreement may have occurred because the acoustic characteristics of the voices in question corresponded to the limiting case of the quality being judged. Values of f0 that generated listener agreement in experiment 2 were more extreme for natural than for synthetic stimuli, consistent with this interpretation.