Automatic indexing of health documents in French: Evaluating and analysing errors

Authors: Chebil W*; Soualmia L F; Dahamna B; Darmoni S J
Source: IRBM, 2012, 33(5-6): 316-329.
DOI: 10.1016/j.irbm.2012.10.002

Abstract

The Catalogue and Index of French Medical Sites (CISMeF) was developed to help health professionals, patients and medical students retrieve relevant medical information on the Internet. The gathered resources are indexed manually, semi-automatically or automatically. Currently, the indexing function of CISMeF indexes only the subset of resources judged least important.
Objectives. - The objective of this work is to evaluate the indexing function developed for CISMeF and to analyse the errors it generates.
Material and method. - We used 500 clinical guidelines to evaluate the indexing function, which has been based on the "bag of words" algorithm since its implementation. The automatically generated index is compared with the manual one, which is considered the "gold standard". We analyse the automatic indexing of short titles and their associated subtitles, of long titles and their associated subtitles, of long and short titles together with their associated subtitles, and of abstracts. The measures used for the evaluation are precision, recall and F-measure.
Results. - For the indexing of short titles and subtitles, the precision is 0.56 and the recall is 0.21. For long titles and subtitles, the precision is 0.39 and the recall is 0.27. For the indexing of abstracts, the precision is 0.23 and the recall is 0.61. Analysis of the indexing function identified thirteen categories of errors. The indexing of short titles and subtitles generated the fewest errors leading to the presence of wrong descriptors (0.97 errors per short title and subtitles). The indexing of long titles and subtitles generated the most errors leading to the absence of relevant descriptors (2.52 errors per long title and subtitles).
Conclusion. - The evaluation showed that the indexing function should be used only for short titles and subtitles. Following the identification of the causes of errors, we aim to improve the performance of the automatic indexing function, which will allow more medical documents to be indexed.
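
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `evaluate_indexing` and the descriptor sets are hypothetical, and the F-measure is assumed to be the balanced form (the harmonic mean of precision and recall), which the abstract does not specify.

```python
def evaluate_indexing(automatic, manual):
    """Compare automatically assigned descriptors with a manual gold standard.

    Returns (precision, recall, f_measure). The balanced F-measure is an
    assumption; the abstract does not state which weighting was used.
    """
    automatic, manual = set(automatic), set(manual)
    correct = automatic & manual  # descriptors present in both indexes

    # Precision: share of automatic descriptors that are correct.
    precision = len(correct) / len(automatic) if automatic else 0.0
    # Recall: share of gold-standard descriptors that were found.
    recall = len(correct) / len(manual) if manual else 0.0

    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical descriptors for one document (not taken from the study).
manual = {"asthma", "child", "practice guideline", "drug therapy"}
automatic = {"asthma", "child", "adult"}

p, r, f = evaluate_indexing(automatic, manual)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

In this made-up case, two of the three automatic descriptors are correct and two of the four manual descriptors are found. A wrong descriptor ("adult") lowers precision, while missing descriptors lower recall, which corresponds to the two families of errors the abstract distinguishes.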

  • Publication date: 2012-12
