eScholarship
Open Access Publications from the University of California

Assessing and reducing the impact of LDA's non-determinism in software engineering

  • Author(s): Lopez Giraldo, Nicolas Francisco
  • Advisor(s): van der Hoek, Andre
  • et al.
Abstract

Latent Dirichlet Allocation is a generative technique, the application of which has recently gained traction in software engineering research. A particular focus has been the question of whether topic models, generated with a code base as input, can be used to support development activities. Out of this research has come a range of promising approaches in fault localization, code comprehension, and feature location, among others.

An essential step in using LDA is choosing a configuration of parameters for the underlying algorithm. The values chosen for the parameters k, alpha, and beta determine how many topics the algorithm produces, the distribution of topics over documents, and the distribution of topics over terms, respectively. Together, these choices determine how well a topic model fits its purpose. Typically, it is necessary to experiment with multiple configurations to find the topic model that performs best.

This dissertation explores how LDA's non-determinism impacts the process of selecting a configuration of parameters. To date, the research literature has largely been silent on this issue, yet knowing the severity of the impact is crucial because: (1) not knowing the effect makes replicability of results difficult, given that a published k, alpha, and beta may lead to different results if someone regenerates the corresponding topic model, and (2) knowing the extent of the effect has implications for how we should go about selecting the best k, alpha, and beta.
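To make the replicability concern concrete, the following is a minimal sketch, not the dissertation's experimental setup. It assumes a collapsed Gibbs sampler as the inference method (the abstract does not say which LDA implementation was studied), and the toy corpus, the function name `fit_lda`, the parameter values, and the seeds are all illustrative. It fits a model several times with the same k, alpha, and beta, changing only the random seed, so the resulting top terms per topic can be compared across runs.

```python
import random


def fit_lda(docs, k, alpha, beta, seed, iters=200):
    """Fit LDA by collapsed Gibbs sampling; return the top 3 terms per topic.

    Hypothetical helper for illustration only. `seed` is the sole source
    of randomness, so fixing it makes a run repeatable.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]      # tokens in doc d assigned to topic t
    topic_term = [[0] * v for _ in range(k)]  # times term w is assigned to topic t
    topic_total = [0] * k                     # tokens assigned to topic t
    assign = []                               # current topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(k)
            row.append(t)
            doc_topic[d][t] += 1
            topic_term[t][index[w]] += 1
            topic_total[t] += 1
        assign.append(row)

    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                t = assign[d][i]
                doc_topic[d][t] -= 1
                topic_term[t][wi] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_term[j][wi] + beta)
                    / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_term[t][wi] += 1
                topic_total[t] += 1

    # Rank terms per topic by count, breaking ties alphabetically.
    topics = []
    for t in range(k):
        ranked = sorted(range(v), key=lambda wi: (-topic_term[t][wi], vocab[wi]))
        topics.append([vocab[wi] for wi in ranked[:3]])
    return topics


# Toy "code base": each document is a bag of identifiers (illustrative only).
docs = [
    "parse token lexer grammar parse token".split(),
    "socket connect send packet socket retry".split(),
    "lexer grammar token parse error".split(),
    "packet send retry timeout connect".split(),
    "error timeout parse socket".split(),
]

# Same k, alpha, and beta in every run; only the seed changes.
for seed in (1, 2, 3):
    print(seed, fit_lda(docs, k=2, alpha=0.1, beta=0.01, seed=seed))
```

Runs that share a seed are identical, whereas runs with different seeds may order, split, or merge topics differently. Reporting the seed alongside k, alpha, and beta is one way to make a single model regenerable, although it does not by itself show that the chosen configuration is stable across seeds.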

This dissertation makes two primary contributions. First, it provides an assessment of the impact of LDA's non-determinism, both in terms of how much variation it produces in the models and in terms of how severe that impact is. Second, it introduces a new process for selecting parameter values that yields much more stable topic models when those parameters are used repeatedly.
