Document Similarity Misjudgment by LSA: Misses vs. False Positives

Abstract

Modeling text document similarity is an important yet challenging task. Even the most advanced computational linguistic models often misjudge document similarity relative to humans. Regarding the pattern of these misjudgments, Lee and colleagues (2005) suggested that the models' primary failure is occasional underestimation of strong similarity between documents. On this account, there should be more extreme misses (i.e., models failing to pick up on strong document similarity) than extreme false positives (i.e., models falsely detecting document similarity that does not exist). We tested this claim by comparing document similarity ratings generated by humans and by latent semantic analysis (LSA). Notably, we implemented LSA with 441 unique parameter settings, determined the optimal parameters that yielded high correlations with human ratings, and then identified misses and false positives under those optimal settings. The results showed that, as Lee et al. predicted, large errors were predominantly misses rather than false positives. Potential causes of the misses and false positives are discussed.
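
To make the pipeline concrete, below is a minimal sketch of LSA-based document similarity scored against human pair ratings. It uses scikit-learn's TfidfVectorizer and TruncatedSVD as a stand-in for the LSA implementation; the toy corpus, the placeholder human ratings, and the single parameter setting shown here are illustrative assumptions, not the study's actual 441-setting parameter search or its data.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus and placeholder human similarity ratings,
# one rating per unordered document pair (6 pairs for 4 documents).
documents = [
    "the cat sat on the mat",
    "a cat rested on a rug",
    "stock prices fell sharply today",
    "markets dropped as shares tumbled",
]
human_ratings = np.array([4.5, 1.0, 1.2, 1.1, 0.9, 4.2])

# Term-document matrix followed by truncated SVD: the core of LSA.
X = TfidfVectorizer().fit_transform(documents)
lsa = TruncatedSVD(n_components=2, random_state=0)  # dimensionality is one parameter a sweep would vary
doc_vectors = lsa.fit_transform(X)

# Cosine similarity between all document pairs in the reduced space.
sims = cosine_similarity(doc_vectors)
pairs = [(i, j) for i in range(len(documents))
         for j in range(i + 1, len(documents))]
model_scores = np.array([sims[i, j] for i, j in pairs])

# Agreement with human judgments; a parameter search would repeat
# this over many settings and keep those maximizing the correlation.
r, _ = pearsonr(model_scores, human_ratings)
print(f"Pearson r with human ratings: {r:.3f}")
```

Given such paired model and human scores, misses correspond to pairs humans rate high but the model scores low, and false positives to pairs humans rate low but the model scores high; comparing the extremes of those two sets is the comparison the abstract describes.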
