A Quantitative Analysis and Sentence Alignment for Parallel Corpora of ShiJi

Authors: Liu Ying*; Wang Nan; Yuan Bo
Source: Journal of Quantitative Linguistics, 2016, 23(1): 71-108.
DOI: 10.1080/09296174.2015.1071150

Abstract

We conducted quantitative and qualitative analyses of ShiJi (Records of the Grand Historian) in parallel corpora of Ancient and Contemporary Chinese. Our research reveals that the basic word order in both texts remains similar. Long sentences in Ancient Chinese texts tend to be translated into long sentences in Contemporary Chinese versions, and short sentences tend to be translated into short sentences. The evaluation function of paragraph length and sentence length in both texts is consistent with a normal distribution. A considerable number of identical Chinese characters appear in both source sentences and target sentences. The alignment mode of sentences and clauses is mainly 1-to-1. A maximum entropy model combines sentence/clause length, alignment mode and co-occurring Chinese characters to align sentences and clauses in the parallel corpora of ShiJi. The precision and recall of clause alignment are higher than those of sentence alignment for ShiJi.
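
To make the combination of features concrete, the following is a minimal sketch of a log-linear (maximum-entropy-style) scorer over two of the cues the abstract names: length and co-occurring characters. It is not the authors' implementation. The function names, the expected length ratio, its spread, and the feature weights are all hypothetical values chosen for illustration; in a real maximum entropy model the weights would be estimated from aligned training data, and the alignment-mode feature (1-to-1, 1-to-2, and so on) is omitted here for brevity.

```python
import math


def char_overlap(src: str, tgt: str) -> float:
    """Fraction of distinct source characters that also occur in the target."""
    src_chars = set(src)
    if not src_chars:
        return 0.0
    return len(src_chars & set(tgt)) / len(src_chars)


def length_score(src: str, tgt: str,
                 mean_ratio: float = 1.5, std: float = 0.5) -> float:
    """Gaussian-shaped score of the target/source length ratio.

    mean_ratio and std are made-up illustration values, not figures
    reported in the paper.
    """
    if not src:
        return 0.0
    z = (len(tgt) / len(src) - mean_ratio) / std
    return math.exp(-0.5 * z * z)


def alignment_probs(src: str, candidates: list[str],
                    w_len: float = 1.0, w_char: float = 1.0) -> list[float]:
    """Log-linear model: p(candidate | src) is proportional to
    exp(w_len * length_score + w_char * char_overlap).

    The weights are hand-set assumptions; a trained model would learn them.
    """
    scores = [w_len * length_score(src, c) + w_char * char_overlap(src, c)
              for c in candidates]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    ancient = "学而时习之"
    modern_candidates = ["学习并且时常温习它", "今天天气很好"]
    for cand, p in zip(modern_candidates, alignment_probs(ancient, modern_candidates)):
        print(f"{p:.3f}  {cand}")
```

In this toy input both candidates have the same length score, so the shared characters decide the ranking, which mirrors the abstract's observation that source and target sentences share many identical characters.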