How to Measure Word Length in Spoken and Written Chinese

Authors: Chen, Heng; Liu, Haitao*
Source: Journal of Quantitative Linguistics, 2016, 23(1): 5-29.
DOI: 10.1080/09296174.2015.1071147

Abstract

Choosing an appropriate unit for measuring word length is a key prerequisite for word length distribution studies, since the suitable unit varies with the type of language or text. Taking Chinese as an example, this study explores the word length distributions of spoken and written Chinese using 20 dialogue texts (spoken language) and 20 prose texts (written language). Word length is measured in pinyin letters, phonemes, and syllables for spoken Chinese, and in strokes, components, and characters for written Chinese. To select the most appropriate unit, the study draws on empirical word length distribution models, synergetic linguistic theory, and Menzerath's law. Results show that the syllable is the most appropriate measurement unit for spoken Chinese, and the component the most appropriate measurement unit for written Chinese. Chinese word length distributions can be described with the Poisson or Binomial distribution families, among which Extended Logarithmic and Mixed Poisson are the most generally accepted models for spoken and written Chinese, respectively.
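
As a rough illustration of what measuring word length and fitting a distribution model involves, the following Python sketch counts word lengths in characters and fits a 1-displaced Poisson distribution. It is not the authors' procedure: the sample words are hypothetical, the text is assumed to be segmented into words already, and the 1-displaced Poisson is a simpler stand-in for the Extended Logarithmic and Mixed Poisson models named in the abstract. The function names are also illustrative.

```python
from collections import Counter
from math import exp, factorial


def length_distribution(words):
    """Count how many words have each length, measured in characters."""
    return Counter(len(w) for w in words)


def displaced_poisson_pmf(x, a):
    """P(X = x) for a Poisson distribution shifted to start at x = 1."""
    return exp(-a) * a ** (x - 1) / factorial(x - 1)


def fit_displaced_poisson(dist):
    """Estimate the parameter a as (mean length - 1), its maximum-likelihood value."""
    n = sum(dist.values())
    mean = sum(length * freq for length, freq in dist.items()) / n
    return mean - 1


# Hypothetical pre-segmented sample; not data from the study.
words = ["我", "今天", "去", "图书馆", "看", "书", "然后", "回家"]

dist = length_distribution(words)
a = fit_displaced_poisson(dist)
n = sum(dist.values())

# Print observed and expected frequencies for each observed word length.
for length in sorted(dist):
    expected = n * displaced_poisson_pmf(length, a)
    print(length, dist[length], round(expected, 2))
```

Measuring in another unit (strokes, components, phonemes) would only change `length_distribution`, for example by looking each character up in a stroke-count or component table, which this sketch does not include.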