Automatic identification of light stop words for Persian information retrieval systems

Sadeghi Mohammad*; Vegas Jesus

doi:10.1177/0165551514530655

Abstract

Stop word identification is one of the most important tasks in many text processing applications, such as information retrieval. Stop words occur very frequently in the documents of a collection and contribute little to determining the context or content of those documents. They have no value as index terms and should be removed by an information retrieval system both during indexing and before querying. In this paper, we propose an automatic aggregated methodology, based on term frequency, normalized inverse document frequency and an information model, to extract the light stop words from Persian text. We define a 'light stop word' as a stop word that has few letters and is not a compound word. In the Persian language, a complete stop word list can be derived by combining the light stop words. The evaluation results, using a standard corpus, show a good percentage of coincidence between the Persian and English stop words and a significant improvement in the number of index terms. Specifically, the first 32 Persian light stop words have a great impact on index size reduction, and the set of stop words can reduce the number of index terms by about 27%.
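To make the idea of combining term frequency with normalized inverse document frequency concrete, the following is a minimal sketch and not the authors' method. The abstract does not give the paper's formulas or aggregation rule, so the scoring function, the `max_len` length threshold and the function name `light_stop_word_candidates` are all assumptions made for illustration. The information model component and the compound-word check are omitted, and the toy documents use English tokens only to keep the example easy to read.

```python
import math
from collections import Counter


def light_stop_word_candidates(docs, max_len=4, top_k=2):
    """Rank short terms that are frequent and spread across documents.

    docs: list of token lists (tokenization is assumed to be done already).
    max_len: hypothetical cutoff for "few letters"; not taken from the paper.
    """
    n_docs = len(docs)
    tf = Counter()  # collection-wide term frequency
    df = Counter()  # number of documents containing each term
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))

    total_tokens = sum(tf.values())
    max_idf = math.log(n_docs + 1)  # IDF of a term that occurs in one document

    scores = {}
    for term, freq in tf.items():
        if len(term) > max_len:
            continue  # keep only short terms as "light" candidates
        idf = math.log((n_docs + 1) / df[term])
        nidf = idf / max_idf  # normalized to the range (0, 1]
        # Assumed aggregation: high relative frequency, low normalized IDF.
        scores[term] = (freq / total_tokens) * (1.0 - nidf)

    ranked = sorted(scores, key=lambda t: scores[t], reverse=True)
    return ranked[:top_k]


toy_docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
    ["a", "bird", "saw", "the", "cat"],
]
candidates = light_stop_word_candidates(toy_docs)
```

On this toy collection, terms that appear in only one document receive a score of zero, so the ranking is driven by the terms that are both frequent and widely distributed. A real system would tune the length threshold and the aggregation against a reference corpus.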

Publication date: 2014-08
