Grounding Code-Switching Evaluation to Community Speech Patterns
- Pattichis, Rebecca
- Advisor(s): Peng, Nanyun
Abstract
Code-switching (CS), broadly defined as switching between multiple languages in speech and text, is a common occurrence in multilingual communities. Yet CS has historically been disparaged in institutions of higher education, including in the research field of Natural Language Processing (NLP). This thesis contextualizes CS dataset collection, transcription, and analysis for better data quality. Specifically, I improve CS dataset analysis by adapting existing NLP metrics that are based on word-level units, which are misaligned with bilingual speech. Crucially, CS is not equally likely between any two words, but follows syntactic and prosodic rules. This work therefore adapts two metrics, multilinguality and CS probability, to use the Intonation Unit (IU), an established unit for speech transcription, as the basic token for NLP tasks. I also calculate these two metrics separately for distinct mixing types: alternating-language multi-word strings and single-word incorporations. Results indicate that bilinguals share a tendency for multi-word CS to occur across, rather than within, IU boundaries. That is, bilinguals tend to prosodically separate their two languages. This constraint is blurred when metric calculations do not distinguish multi-word from single-word items. By comparing against the same metrics and datasets using the word as a token, I also show that IUs help researchers distinguish between CS speaker patterns, whereas the word-based metrics homogenize and obscure these patterns. These results call for a reconsideration of units of analysis in the future development of CS datasets for NLP tasks.