A unified alignment algorithm for bilingual data

Tillmann Christoph<sup>*</sup>; Hewavitharana Sanjika

doi:10.1017/S135132491100026X

Abstract

The paper presents a novel unified algorithm for aligning sentences with their translations in bilingual data. With the help of ideas from a stack-based dynamic programming decoder for speech recognition (Ney 1984), the search is parametrized in a novel way such that the unified algorithm can be used on various types of data that have been previously handled by separate implementations: the extracted text chunk pairs can be either sub-sentential pairs, one-to-one, or many-to-many sentence-level pairs. The one-stage search algorithm is carried out in a single run over the data. Its memory requirements are independent of the length of the source document, and it is applicable to sentence-level parallel as well as comparable data. With the help of a unified beam-search candidate pruning, the algorithm is very efficient: it avoids any document-level pre-filtering and uses less restrictive sentence-level filtering. Results are presented on a Russian-English, a Spanish-English, and an Arabic-English extraction task. Based on simple word-based scoring features, text chunk pairs are extracted out of several trillion candidates, where the search is carried out on 300 processors in parallel.
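The general idea of a single beam-pruned dynamic-programming pass that yields one-to-one and many-to-one sentence pairs can be illustrated with a minimal sketch. This is not the authors' implementation. The move set, the `chunk_score` feature, the `beam` parameter, and the toy lexicon are all assumptions made for illustration, and the sketch keeps every visited state in memory, so it does not reproduce the memory behaviour described in the abstract.

```python
from typing import Dict, List, Optional, Set, Tuple

Lexicon = Dict[str, Set[str]]   # hypothetical: source word -> target translations
State = Tuple[int, int]         # (source sentences consumed, target sentences consumed)

# Assumed move set: 1-1, skip source, skip target, 2-1, 1-2.
MOVES = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]


def chunk_score(src_words: List[str], tgt_words: List[str], lexicon: Lexicon) -> float:
    """Toy word-based feature: covered words minus uncovered words on both sides."""
    tgt_set = set(tgt_words)
    translations: Set[str] = set()
    for w in src_words:
        translations |= lexicon.get(w, set())
    covered_src = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_set)
    covered_tgt = sum(1 for t in tgt_words if t in translations)
    uncovered = (len(src_words) - covered_src) + (len(tgt_words) - covered_tgt)
    return float(covered_src + covered_tgt - uncovered)


def align(src: List[str], tgt: List[str], lexicon: Lexicon,
          beam: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Monotone alignment of sentence lists; returns (source indices, target indices) chunks."""
    n, m = len(src), len(tgt)
    # best[state] = (best score so far, predecessor state)
    best: Dict[State, Tuple[float, Optional[State]]] = {(0, 0): (0.0, None)}
    for d in range(n + m):
        # All states on diagonal i + j == d are final once d is reached,
        # because every move strictly increases i + j.
        frontier = [s for s in best if s[0] + s[1] == d]
        frontier.sort(key=lambda s: best[s][0], reverse=True)
        for i, j in frontier[:beam]:  # beam pruning: expand only the top states
            base = best[(i, j)][0]
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_words = [w for sent in src[i:ni] for w in sent.split()]
                tgt_words = [w for sent in tgt[j:nj] for w in sent.split()]
                cand = base + chunk_score(src_words, tgt_words, lexicon)
                if (ni, nj) not in best or cand > best[(ni, nj)][0]:
                    best[(ni, nj)] = (cand, (i, j))
    # Follow back-pointers from the final state to recover the chunk pairs.
    pairs: List[Tuple[List[int], List[int]]] = []
    state: State = (n, m)
    while best[state][1] is not None:
        prev = best[state][1]
        pairs.append((list(range(prev[0], state[0])), list(range(prev[1], state[1]))))
        state = prev
    pairs.reverse()
    return pairs


# Toy data (invented for this sketch).
toy_lexicon: Lexicon = {
    "la": {"the"}, "casa": {"house"}, "es": {"is"}, "roja": {"red"},
    "el": {"the"}, "perro": {"dog"}, "ladra": {"barks"}, "fuerte": {"loudly"},
}
toy_src = ["la casa es roja", "el perro", "ladra fuerte"]
toy_tgt = ["the house is red", "the dog barks loudly"]
print(align(toy_src, toy_tgt, toy_lexicon))
```

On the toy data, the first source sentence pairs with the first target sentence, and the remaining two source sentences are merged into one chunk against the second target sentence. A smaller `beam` expands fewer states per diagonal and trades accuracy for speed.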

Publication date: 2013-1
Affiliation: IBM

