Himalayan Linguistics is a free peer-reviewed web journal and archive devoted to the study of the languages of the Himalayas.
Volume 15, Issue 1, 2016
This introduction surveys research on Tibetan NLP, both in China and in the West, and contextualizes the articles contained in the special issue.
The use of advanced computational methods for the analysis of large corpora of electronic texts is becoming increasingly popular in humanities and social science research. Unfortunately, Tibetan Studies has lacked such a repository of electronic, searchable texts. The automated recognition of printed texts, known as Optical Character Recognition (OCR), offers a solution to this problem; however, until recently, robust OCR systems for the Tibetan language have not been available. In this paper, we introduce one new system, called Namsel, which uses OCR to support the production, review, and distribution of searchable Tibetan texts at a large scale. Namsel tackles a number of challenges unique to the recognition of complex scripts such as Tibetan uchen and has achieved high accuracy rates on a wide range of machine-printed works. We discuss the details of Tibetan OCR, how Namsel works, and the problems it is able to solve. We also discuss the collaborative work between Namsel and its partner libraries aimed at building a comprehensive database of historical and modern Tibetan works, a database that consists of more than one million pages of texts spanning over a thousand years of literary production.
This paper describes a recognition system for online handwritten Tibetan characters using advanced techniques in character recognition. To eliminate noise points in handwriting trajectories, we introduce a de-noising approach that uses the dilation, erosion, and thinning operators of mathematical morphology. By selecting appropriate structuring elements, we can clear up large amounts of noise in the glyphs of the character. To enhance recognition performance, we adopt a three-stage classification strategy, in which the top-ranked output classes of the baseline classifier are re-classified by a similar-character discrimination classifier. Experiments have been carried out on two databases, MRG-OHTC and IIP-OHTC. Test results show that the recognition algorithm is effective and can be applied to pen-based mobile devices.
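As a rough illustration of the morphological de-noising mentioned in this abstract, the sketch below applies an opening (erosion followed by dilation) to a small binary grid. The abstract does not state which structuring elements or operator order the authors use, so the horizontal three-pixel element, the rasterised grid, and the function names here are all assumptions made for the example.

```python
from typing import List, Tuple

Grid = List[List[int]]
# Assumed structuring element: a horizontal line of three pixels (dy, dx offsets).
SE: List[Tuple[int, int]] = [(0, -1), (0, 0), (0, 1)]


def _hits(img: Grid, y: int, x: int) -> List[bool]:
    """For each offset in SE, report whether the shifted pixel is in bounds and set."""
    h, w = len(img), len(img[0])
    return [0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx] == 1
            for dy, dx in SE]


def erode(img: Grid) -> Grid:
    """A pixel survives only if the whole structuring element fits inside the shape."""
    return [[int(all(_hits(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img: Grid) -> Grid:
    """A pixel is set if the structuring element touches the shape anywhere."""
    return [[int(any(_hits(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def opening(img: Grid) -> Grid:
    """Erosion then dilation: removes specks smaller than the structuring element."""
    return dilate(erode(img))


noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],  # a stroke five pixels long
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],  # an isolated noise pixel
]
cleaned = opening(noisy)
```

In this toy case the stroke in the second row is preserved and the isolated pixel in the last row is removed; strokes narrower than the structuring element would also be erased, which is why the choice of element matters.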
This paper represents a departure from traditional Tibetan grammar in terms of the classification of verbs, for it constructs verb types and relevant syntactic rules based on syntax and semantics. The paper further distinguishes twelve types of Tibetan verbs according to the numbers of different arguments and the requirements of different syntactic properties. The classification of the syntax and semantics of verbs therefore permits a detailed and overall reflection of the syntactic framework of all kinds of Tibetan clause constructions, including word orders, case markers and syntactic particles. All the findings of the syntactic and semantic classification of Tibetan verbs can be applied directly to build a Tibetan grammatical information dictionary, which is the infrastructure of Tibetan natural language processing.
This paper discusses the exocentric construction of adjectives NP+AP in Tibetan, and points out that this kind of construction is derived from the lexicalization of phrases. According to the semantic analysis of qualia construction of Generative Lexicon Theory, NP only refers to the source, scope, shape, material, dimension, etc. of the properties of AP. Type coercion together with metonymy frequently happens to NP based on AP, which causes the generalization of NP's original meaning and turns NP into the argument of AP, so that the whole construction is adjectivalized.
Functional chunks can reveal the skeleton of a sentence and the relations among chunks. Recognizing functional chunks is a sub-field of Natural Language Processing, and it can effectively improve the performance of syntactic parsing. This paper proposes a Tibetan functional chunk classification. To test the feasibility of the proposed theory, we observe the distribution of Tibetan functional chunks in our corpus. The statistics show that the classification can describe sentence structure comprehensively. We then establish a functional chunking model based on a sequence-tagging model. By introducing appropriate features, a couple of experiments have been conducted. The F1 score reaches 82.30 when extended features are employed.
Towards describing Tibetan syntax: From word segmentation to rewrite rules through a semi-automated workflow
The first task in Tibetan Natural Language Processing is word segmentation. We present our lightweight segmentation tool, which is based on lexical resources. It can be executed natively in InDesign, and the user can update it with manual corrections of its output. We then propose a semi-automated workflow aiming at syntactic analysis that uses utterance simplification and intonation cues to get precise information about the syntactic structure. Non-specialised native speakers are thus able to provide us with precise information about the structure of utterances. This will allow the scientific community to obtain the resources needed to initiate the study of Tibetan syntax. In this process, informants will obtain educational material generated from the utterances they will have processed.
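To make the idea of lexicon-based segmentation concrete, here is a minimal sketch using forward maximal matching over syllables. The abstract does not say which matching strategy the tool uses, so maximal matching is an assumption, as are the function name `segment`, the `max_len` parameter, and the toy lexicon. Syllables are given in Wylie transliteration and split on spaces for readability; on Unicode Tibetan text they would normally be split on the tsheg (U+0F0B).

```python
from typing import List, Set, Tuple

Lexicon = Set[Tuple[str, ...]]


def segment(syllables: List[str], lexicon: Lexicon, max_len: int = 4) -> List[str]:
    """Greedy forward maximal matching: at each position take the longest
    run of syllables found in the lexicon, falling back to one syllable."""
    words: List[str] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until something matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(" ".join(candidate))
                i += n
                break
    return words


# Toy lexicon (illustrative entries only).
lexicon: Lexicon = {("bkra", "shis"), ("bde", "legs")}
syllables = "bkra shis bde legs zhu".split()

before = segment(syllables, lexicon)

# A manual correction can be fed back by adding the corrected word to the lexicon.
lexicon.add(("bde", "legs", "zhu"))
after = segment(syllables, lexicon)
```

Here `before` is `['bkra shis', 'bde legs', 'zhu']` and `after` is `['bkra shis', 'bde legs zhu']`, which shows how a user correction changes later output without any retraining.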
Semantic role labeling is one of the most significant research fields of natural language processing. Researchers have already made many achievements in English and Chinese semantic role labeling. Until now, however, Tibetan semantic role labeling is still at an early stage due to the lack of a Tibetan corpus with semantic role annotation and relatively outdated research approaches. Tibetan is rich with syntactic markers that naturally divide a sentence into semantic chunks and indicate the semantic relationships between these chunks. Thus, in this paper, we propose a semantic role classification and an integrated strategy for Tibetan semantic role labeling. Transformation-Based Error-driven Learning and Conditional Random Fields have been employed in our study. Additionally, a number of linguistic rules have been introduced into our approach as well. Our integrated strategy achieves 83.91% in precision, 82.78% in recall, and an F-score of 85.71.
A Hybrid Approach Using Maximum Entropy Model and Conditional Random Fields to Identify Tibetan Person Names
Tibetan person name recognition is one of the most difficult tasks in the area of Tibetan information processing, and the quality of recognition directly affects the precision of Tibetan word segmentation and the performance of related application systems, including Tibetan-Chinese machine translation, Tibetan information retrieval, text categorization, etc. Based on an analysis of the wording rules and features of Tibetan person names, this paper proposes a method that combines maximum entropy and conditional random fields to identify Tibetan person names. The experiment shows that this approach works quite well, with the F1-measure reaching 93.29%.
The Tibetan trisyllabic light verb construction is a widely used type of verb phrase that is composed of a disyllabic noun or adjective and a light verb. Such constructions are very common in Tibetan. Successfully recognizing this type of phrase greatly contributes to Tibetan information processing; however, thorough and systematic academic research in this field has not yet been launched. Therefore, in this paper we propose a model for the recognition of Tibetan trisyllabic light verb constructions based on an integrated strategy. First, we extract all trisyllabic light verb construction candidates from a Tibetan corpus. In this step, light verbs are used as retrieval marks. Second, we filter the candidates using a statistics-based model, a rule-based model, and an integrated model separately. Experimental results show that the integrated model performs much better than the other strategies, which indicates that linguistic features contribute a great deal to the automatic recognition of Tibetan trisyllabic light verb constructions.
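The candidate-extraction step described in this abstract, in which light verbs serve as retrieval marks, might be sketched as follows. The light-verb list, the function name `extract_candidates`, and the Wylie-transliterated example sentence are illustrative assumptions, not the authors' actual inventory or code; the subsequent statistical and rule-based filtering is not shown.

```python
from typing import FrozenSet, List, Tuple

# Illustrative light verbs in Wylie transliteration (assumed, not exhaustive).
LIGHT_VERBS: FrozenSet[str] = frozenset({"byed", "gtong", "rgyag"})


def extract_candidates(
    syllables: List[str],
    light_verbs: FrozenSet[str] = LIGHT_VERBS,
) -> List[Tuple[str, str, str]]:
    """Return every three-syllable window whose last syllable is a light verb.

    The two preceding syllables stand in for the disyllabic noun or adjective;
    whether they really form one is left to a later filtering step.
    """
    candidates: List[Tuple[str, str, str]] = []
    for i in range(2, len(syllables)):
        if syllables[i] in light_verbs:
            candidates.append((syllables[i - 2], syllables[i - 1], syllables[i]))
    return candidates


sentence = "khong las ka byed kyi 'dug".split()
found = extract_candidates(sentence)
```

For this sentence the only candidate is `('las', 'ka', 'byed')`. Extraction of this kind deliberately over-generates, which is why a filtering stage is needed afterwards.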
This paper proposes a Chinese-to-Tibetan machine translation system with multiple translation strategies. The key corpora and technologies are explained in detail. Experiments show that the subsystems output the translation of each phrase in the order in which it appears in the Chinese sentence rather than in Tibetan word order, which leads to worse translation quality. An order-adjusting model is therefore essential to a Chinese-to-Tibetan translation system. The recall of translated phrases improves by 9.71% over the popular off-the-shelf, language-neutral statistical machine translation programme Moses. Our translation system achieves a speed of about 0.175s per sentence, which meets the requirements of a computer-aided translation system.
Practical Applications for Corpora: The Role of Research-based Linguistics in Literacy & Education for the Tibetan Language
Corpus Linguistics and NLP have many obvious applications for researchers, academics, and other specialists; what should not be overlooked, however, is their role in improving the mundane, everyday interactions between people and language, whether the person is a reader of a newspaper, a child with a storybook, or a student in a classroom. The language analyses that these linguistic tools provide have an important part to play in the feedback loop between authors, journalists, and pedagogists on the one hand and their audiences and students on the other.
While these sorts of research-based resources have already made splashes in majority languages like English, their ripples have yet to spill over into the smaller language markets. Within this paper we outline the ways in which corpus linguistics may inform Tibetan language literacy and education in both L1 & L2 contexts, while drawing from our own research into issues of readability and the development of a modern pedagogy for instruction in the Tibetan alphabet based on frequency data.