A LARGE SPANISH-CATALAN PARALLEL CORPUS RELEASE FOR MACHINE TRANSLATION

Authors: Costa Jussa Marta R*; Fonollosa Jose A R; Marino Jose B; Poch Marc; Farrus Mireia
Source: Computing and Informatics, 2014, 33(4): 907-920.

Abstract

We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The resulting corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catalan corpus is partially available via ELDA (Evaluations and Language Resources Distribution Agency) under catalog number ELRA-W0053.

  • Publication date: 2014