Abstract

Objective: A difficulty with part-of-speech (POS) tagging of biomedical text is accessing and annotating appropriate training corpora. This difficulty may result in POS taggers being trained on corpora that differ from the tagger's target biomedical text (cross-domain tagging). In such cases, where training and target corpora differ, tagging accuracy decreases. This paper presents a POS tagger for cross-domain tagging called TcT. Methods and material: TcT estimates a tag's likelihood for a given token by combining token collocation probabilities with the token's tag probabilities, calculated using a Naive Bayes classifier. We compared TcT to three POS taggers used in the biomedical domain (mxpost, Brill and TnT). We trained each tagger on a non-biomedical corpus and evaluated it on biomedical corpora. Results: TcT was more accurate in cross-domain tagging than mxpost, Brill and TnT (respective averages of 83.9, 81.0, 79.5 and 78.8). Conclusion: Our analysis of tagger performance suggests that lexical differences between corpora have more effect on tagging accuracy than previous research has considered. Biomedical POS tagging algorithms may be modified to improve their cross-domain tagging accuracy without requiring extra training or large training data sets. Future work should reexamine POS tagging methods for biomedical text, in contrast to work to date, which has focused on retraining existing POS taggers.
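
The general idea of combining a token's tag probabilities with collocation evidence in a Naive Bayes style can be illustrated with a toy sketch. This is not the TcT algorithm, whose details the abstract does not give. It assumes, purely for illustration, that "collocation" means the immediately preceding token, that add-one smoothing is used, and that each token is tagged greedily. All function names here (`train`, `score`, `tag_sentence`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tags, (tag, token) pairs and (tag, preceding token) pairs."""
    tag_counts = Counter()
    token_given_tag = defaultdict(Counter)  # tag -> counts of tokens with that tag
    prev_given_tag = defaultdict(Counter)   # tag -> counts of preceding tokens
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed sentence-start marker
        for token, tag in sentence:
            tag_counts[tag] += 1
            token_given_tag[tag][token] += 1
            prev_given_tag[tag][prev] += 1
            vocab.update((token, prev))
            prev = token
    return tag_counts, token_given_tag, prev_given_tag, vocab


def score(tag, token, prev, model, alpha=1.0):
    """Naive Bayes log-score: log P(tag) + log P(token|tag) + log P(prev|tag)."""
    tag_counts, token_given_tag, prev_given_tag, vocab = model
    total = sum(tag_counts.values())
    v = len(vocab) + 1  # one extra slot for unseen tokens (assumed smoothing scheme)
    denom = tag_counts[tag] + alpha * v
    log_prior = math.log(tag_counts[tag] / total)
    log_token = math.log((token_given_tag[tag][token] + alpha) / denom)
    log_colloc = math.log((prev_given_tag[tag][prev] + alpha) / denom)
    return log_prior + log_token + log_colloc


def tag_sentence(tokens, model):
    """Greedily pick the highest-scoring tag for each token."""
    tags = list(model[0])
    result = []
    prev = "<s>"
    for token in tokens:
        result.append(max(tags, key=lambda t: score(t, token, prev, model)))
        prev = token
    return result


# Toy training data, invented for this example only.
corpus = [
    [("the", "DT"), ("cell", "NN"), ("divides", "VBZ")],
    [("a", "DT"), ("protein", "NN"), ("binds", "VBZ")],
]
model = train(corpus)
print(tag_sentence(["the", "kinase", "binds"], model))
```

In this sketch the token "kinase" never appears in the training data, so its tag is decided by the collocation term alone. That is the kind of situation cross-domain tagging has to handle when the target vocabulary differs from the training vocabulary.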

  • Publication date: 2014-5