Automatic thesaurus development: Term extraction from title metadata

作者:Wang J*
来源:Journal of the American Society for Information Science and Technology, 2006, 57(7): 907-920.
DOI:10.1002/asi.20352

摘要

The application of thesauri in networked environments is seriously hampered by the challenges of introducing new concepts and terminology into the formal controlled vocabulary, which is critical for enhancing its retrieval capability. The author describes an automated process of adding new terms to thesauri as entry vocabulary by analyzing the association between words/phrases extracted from bibliographic titles and subject descriptors in the metadata record (subject descriptors are terms assigned from controlled vocabularies of thesauri to describe the subjects of the objects [e.g., books, articles] represented by the metadata records). The investigated approach uses a corpus of metadata for scientific and technical (S&T) publications in which the titles contain substantive words for key topics. The three steps of the method are (a) extracting words and phrases from the title field of the metadata; (b) applying a method to identify and select the specific and meaningful keywords based on the associated controlled vocabulary terms from the thesaurus used to catalog the objects; and (c) inserting selected keywords into the thesaurus as new terms (most of them are in hierarchical relationships with the existing concepts), thereby updating the thesaurus with new terminology that is being used in the literature. The effectiveness of the method was demonstrated by an experiment with the Chinese Classification Thesaurus (CCT) and bibliographic data in China Machine-Readable Cataloging Record (MARC) format (CNMARC) provided by Peking University Library. This approach is equally effective in large-scale collections and in other languages.