摘要

In the field of constituency parsing, there exist multiple human-labeled treebanks which are built on non-overlapping text samples and follow different annotation standards. Due to the extreme cost of annotating parse trees by human, it is desirable to automatically convert one treebank (called source treebank) to the standard of another treebank (called target treebank) which we are interested in. Conversion results can be manually corrected to obtain higher-quality annotations or can be directly used as additional training data for building syntactic parsers. To perform automatic treebank conversion, we divide constituency parses into two separate levels: the part-of-speech (POS) and syntactic structure (bracketing structures and constituent labels), and conduct conversion on these two levels respectively with a feature-based approach. The basic idea of the approach is to encode original annotations in a source treebank as guide features during the conversion process. Experiments on two Chinese treebanks show that our approach can convert POS tags and syntactic structures with the accuracy of 96.6 and 84.8 %, respectively, which are the best reported results on this task.

全文