eScholarship
Open Access Publications from the University of California

Finding probabilistic context-free grammar in Chinese writing system

Creative Commons 'BY' version 4.0 license
Abstract

Writing systems play a very important role in human languages, but the mathematical nature of writing systems remains understudied. Here, we conduct a case study of an open-class writing system, Chinese characters, which consists of a set of expandable basic units, in contrast to most other writing systems, whose basic units form closed sets (closed-class systems). We demonstrate that probabilistic context-free grammars (PCFGs) underlie the representation of Chinese writing by formalizing Chinese characters as a grammar with character shapes as nonterminal rules and components as terminal nodes. Rule probabilities are estimated from a character treebank of the 3,500 most frequent characters. Exploratory analysis reveals Zipfian distributions of both shapes and components. Our experiments also demonstrate that the Chinese writing system shows generative power similar to that of a PCFG, with 78% of the noncharacters generated from our grammar judged acceptable, which suggests fundamental differences between open-class and closed-class writing systems.
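To make the idea in the abstract concrete, the following is a minimal sketch of how rule probabilities might be estimated from a character treebank and then used to generate new structures. It is not the authors' grammar or data: the six-character treebank, the single nonterminal, the shape labels `LR` (left-right) and `TB` (top-bottom), and the names `count_rules`, `estimate_pcfg`, and `sample` are all assumptions chosen for illustration. Probabilities are plain relative frequencies over all nodes in the toy treebank.

```python
import random
from collections import Counter

# Hypothetical toy treebank (an assumption, not the paper's data).
# A tree is either a component string (terminal) or a tuple
# (shape, child, child, ...) where the shape labels the layout.
TOY_TREEBANK = [
    ("LR", "木", "木"),                # 林
    ("LR", "日", "月"),                # 明
    ("TB", "木", ("LR", "木", "木")),  # 森
    ("TB", "日", "十"),                # 早
    ("LR", "女", "子"),                # 好
    ("TB", "宀", "子"),                # 字
]


def count_rules(tree, counts):
    """Count one rule per node: a shape expansion or a component."""
    if isinstance(tree, str):
        counts[("component", tree)] += 1
        return
    shape, *children = tree
    counts[("shape", shape, len(children))] += 1
    for child in children:
        count_rules(child, counts)


def estimate_pcfg(treebank):
    """Relative-frequency estimate for a single-nonterminal grammar."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()}


def sample(pcfg, rng, max_depth=3):
    """Draw one tree top-down; at depth 0 only components are allowed."""
    rules = [r for r in pcfg if max_depth > 0 or r[0] == "component"]
    weights = [pcfg[r] for r in rules]
    rule = rng.choices(rules, weights=weights, k=1)[0]
    if rule[0] == "component":
        return rule[1]
    _, shape, arity = rule
    children = [sample(pcfg, rng, max_depth - 1) for _ in range(arity)]
    return (shape, *children)


if __name__ == "__main__":
    pcfg = estimate_pcfg(TOY_TREEBANK)
    for rule, p in sorted(pcfg.items(), key=lambda kv: -kv[1]):
        print(rule, round(p, 3))
    rng = random.Random(0)
    print(sample(pcfg, rng))
```

Sampled trees that do not correspond to any character in the treebank play the role of the generated noncharacters mentioned in the abstract. A grammar closer to the one described would likely condition components on their position within a shape; the single nonterminal here is a simplification.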
