Multiword expressions (MWEs) play a critical role in language acquisition, production, and comprehension, and are integral to linguistic formulaicity. Despite their significance, the identification, analysis, and integration of MWEs have remained largely peripheral in corpus-driven and corpus-based research. Most advances in discovering, identifying, and incorporating MWEs into linguistic analyses have instead come from the computational branch of the field. As a result, most algorithms for MWE discovery and identification focus on the outcome itself, privileging performance over interpretability. This dissertation introduces mMERGE (multidimensional MWE Extraction through Recursive Grouping of Elements), a novel algorithm that integrates linguistic insights and computational methods to address these challenges.
At the core of mMERGE are five well-known but underutilized information-theoretic dimensions (frequency, type frequency, dispersion, entropy, and association), chosen to reflect key patterns of language use and cognitive processing. The algorithm implements these dimensions in a way that addresses two problems with existing measures. First, many measures, particularly dispersion, are correlated with frequency, which biases them against infrequent MWEs. For instance, traditional dispersion metrics often fail to account for infrequent MWEs that are nonetheless well distributed across a corpus, such as domain-specific MWEs or idiomatic expressions, and so underestimate their representativeness. Second, existing measures frequently conflate directionality by assuming that the attraction between words is mutual. This assumption overlooks the asymmetry of expressions such as "give up" or "in truth", where one element strongly attracts the other but not vice versa. mMERGE addresses both limitations explicitly: it uses a dispersion measure that partials out the effect of frequency, and it computes association and entropy in both directions rather than collapsing them into a single symmetric score.
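The asymmetry described above can be illustrated with a minimal sketch. It assumes ΔP as the directional association measure, which is one commonly used option and not necessarily the measure implemented in mMERGE; the function name `delta_p` and the toy token list are hypothetical, and the sketch is written in Python rather than Julia for brevity.

```python
from collections import Counter


def delta_p(tokens, w1, w2):
    """Return (forward, backward) delta P for the adjacent bigram w1 w2.

    forward  = P(w2 follows | w1)  - P(w2 follows | not w1)
    backward = P(w1 precedes | w2) - P(w1 precedes | not w2)
    Illustrative only: delta P is an assumed stand-in for a directional measure.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    a = bigrams[(w1, w2)]  # w1 followed by w2
    b = sum(k for (x, y), k in bigrams.items() if x == w1 and y != w2)
    c = sum(k for (x, y), k in bigrams.items() if x != w1 and y == w2)
    d = n - a - b - c      # neither w1 first nor w2 second

    def ratio(num, den):
        # Guard against empty rows or columns in the contingency table.
        return num / den if den else 0.0

    forward = ratio(a, a + b) - ratio(c, c + d)
    backward = ratio(a, a + c) - ratio(b, b + d)
    return forward, backward


# Toy data: "truth" is always preceded by "in", but "in" has many continuations.
tokens = ["in", "truth", "in", "fact", "in", "truth", "in", "general", "in", "time"]
forward, backward = delta_p(tokens, "in", "truth")
```

In this toy sample the backward score exceeds the forward score: "truth" predicts a preceding "in" better than "in" predicts a following "truth". A single symmetric score would hide that difference.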
Built on recursive grouping and implemented in the Julia programming language, mMERGE achieves reasonable computational efficiency, enabling its application to the Brown corpus over 20,000 iterations. Evaluated through a combination of human annotation and predictive modeling, the results demonstrate the algorithm's capability to uncover diverse MWE categories, such as idioms, compounds, and lexical bundles. They also reveal the distributional profiles of these categories and establish connections with the cognitive processes underlying language use.
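The general idea of recursive grouping can be sketched as a greedy loop that repeatedly merges the best-scoring adjacent pair into a single unit, so that longer expressions are built from shorter ones. The sketch below is an assumption-laden illustration, not the mMERGE implementation: it scores pairs by raw frequency alone, whereas the algorithm described here combines several dimensions, and the names `merge_pair`, `recursive_grouping`, and `min_count` are hypothetical.

```python
from collections import Counter


def merge_pair(sentence, pair):
    """Replace each non-overlapping occurrence of `pair` with one joined token."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) == pair:
            merged.append(sentence[i] + " " + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


def recursive_grouping(sentences, iterations, min_count=2):
    """Greedily merge the most frequent adjacent pair, once per iteration.

    Raw frequency is a placeholder score (an assumption of this sketch).
    Returns the merged units in order of discovery and the re-tokenized text.
    """
    sentences = [list(s) for s in sentences]
    found = []
    for _ in range(iterations):
        counts = Counter(p for s in sentences for p in zip(s, s[1:]))
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < min_count:
            break  # nothing left that recurs often enough to merge
        found.append(" ".join(pair))
        sentences = [merge_pair(s, pair) for s in sentences]
    return found, sentences


corpus = [
    ["as", "a", "matter", "of", "fact"],
    ["as", "a", "matter", "of", "fact"],
    ["a", "matter", "of", "time"],
]
found, grouped = recursive_grouping(corpus, iterations=10)
```

Because merged units re-enter the candidate pool, the loop can assemble "as a matter of fact" step by step from earlier two-word merges.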
Finally, a key application explored in this dissertation is MWE-augmented keyword analysis, which demonstrates how integrating MWEs into keyness metrics can enhance the interpretation of linguistic patterns in a given corpus. By examining changes in keyword rankings and distributions, this analysis underscores the impact of MWEs on corpus-based studies of register, domain specificity, and discourse organization.
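One way to picture MWE-augmented keyword analysis is to treat each extracted MWE as a single token before computing keyness. The following minimal sketch assumes the log-likelihood (G2) keyness statistic, which is a common choice in corpus linguistics; the source does not specify which keyness metric is used, and the function names and toy corpora are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-cell log-likelihood (G2) keyness score for one item."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero cell contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def keyness(target_tokens, ref_tokens):
    """Score items that are relatively more frequent in the target corpus."""
    t, r = Counter(target_tokens), Counter(ref_tokens)
    nt, nr = len(target_tokens), len(ref_tokens)
    return {w: log_likelihood(t[w], nt, r[w], nr)
            for w in t if t[w] / nt > r[w] / nr}


# Assumption: an earlier step has already merged the MWE into one token.
target = ["climate change", "is", "real", "climate change", "is", "here"]
reference = ["the", "weather", "is", "nice", "the", "change", "is", "small"]
scores = keyness(target, reference)
```

With the MWE kept as one unit, "climate change" can rank as a keyword in its own right, instead of its keyness being split across "climate" and "change".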
This research contributes to corpus and computational linguistics by offering a method that balances the demands of efficiency and model interpretability, while addressing the limitations of existing approaches. Beyond theoretical implications, mMERGE offers practical versatility, enabling adaptation to a wide variety of corpora, regardless of their size, domain specificity, or language.