Automatic construction of cross-lingual networks of concepts from the Hong Kong SAR police department

作者:Li KW*; Yang CC
来源:INTELLIGENCE AND SECURITY INFORMATICS, PROCEEDINGS, SPRINGER-VERLAG BERLIN, HEIDELBERGER PLATZ 3, D-14197 BERLIN, GERMANY, 138-152, 2003.

摘要

The tragic event of September 11 has prompted the rapid growth of attention of national security and criminal analysis. In the national security world, very large volumes of data and information are generated and gathered. Much of this data and information Written in different languages and stored in different locations may be seemingly unconnected. Therefore, cross-lingual semantic interoperability is a major challenge to generate an overview of this disparate data and information so that it can be analysed, searched. The traditional information retrieval (IR) approaches normally require a document to share some keywords with the query. In reality, the users may use some keywords that are different from what used in the documents. There are then two different term spaces, one. for the users, and another for the documents. The problem can be viewed as the creation of a thesaurus. The creation of such relationships Would allow the system to match queries with relevant documents, even though they contain different terms. Apart from this, terrorists and criminals may communicate through letters, e-mails and faxes in languages other than English. The translation ambiguity significantly exacerbates the retrieval problem. To facilitate cross-lingual information retrieval, a corpus-based approach uses the term co-occurrence statistics in parallel or comparable corpora to construct a statistical translation model to cross the language boundary. However, collecting parallel corpora between European language and Oriental language is not an easy task due to the unique linguistics and grammar structures of oriental languages. In this paper, the-text-based approach to align English/Chinese Hong Kong Police press release documents from the Web is first presented. This article then reports an algorithmic approach to generate a robust knowledge base based on statistical correlation analysis of the semantics (knowledge) embedded in the bilingual press release corpus. The research output consisted of a thesaurus-like, semantic network knowledge base, which can aid in semantics-based cross-lingual information management and retrieval.