Use of Lexical Statistics for Compound Word Recognition and Segmentation in Turkish
eScholarship
Open Access Publications from the University of California

Abstract

Compound words are a cross-linguistic morphological phenomenon, occurring in all languages. They are widely accepted to be stored in the lexicon, but their constituents need to be accessed during both language learning and production. In this study, the use of corpora was investigated for two tasks: differentiating single-stem words from single-word compounds, and segmenting compound words when no phonological information is available. Stems and morphs discovered in manual segmentations of the METU-Sabancı Turkish Treebank and CHILDES were employed in the compound word recognition task, and the results were compared. The METU Turkish Corpus (about 2 million words) and a web corpus (about 490 million Turkish words) were utilized in the segmentation task. The results indicate that the lexicon can be morpheme-based and that lexical frequencies are effective heuristics in compound word recognition and segmentation.
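To make the idea of frequency-based segmentation concrete, the following is a minimal sketch, not the procedure used in the study. It assumes a hypothetical lexicon `freq` mapping word forms to corpus counts, and it scores each two-way split by the sum of the log frequencies of its parts. Both the scoring rule and the toy counts are illustrative assumptions; the abstract does not specify either.

```python
from math import log

def best_split(word, freq, min_len=2):
    """Return (left, right, score) for the best two-way split, or None.

    Assumption: a split is admissible only if both parts occur in `freq`,
    and its score is the sum of the parts' log frequencies.
    """
    best = None
    # Try every split point that leaves at least `min_len` characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = log(freq[left]) + log(freq[right])
            if best is None or score > best[2]:
                best = (left, right, score)
    return best

# Toy counts, invented for illustration only (not taken from any corpus).
freq = {"baş": 900, "bakan": 400, "bilgi": 700, "sayar": 50}

split = best_split("başbakan", freq)   # "baş" + "bakan"
no_split = best_split("kitap", freq)   # no admissible split in this toy lexicon
```

A word with no admissible split is treated here as a single-stem word. A fuller treatment would also need to compare the split score against the frequency of the whole word and to handle suffixed constituents, both of which this sketch leaves out.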
