Abstract

Automated classification is rarely adapted to specialized domains, both because suitable data collections are lacking and because the domain-specific content and its effect on the classification process are insufficiently characterized. This work describes an approach for the automated multiclass classification of content components used in technical communication, based on a vector space model. We show that differences in the form and substance of content components require an adaptation of document-based classification methods, and we validate our assumptions with multiple real-world data sets in two languages. As a result, we propose general adaptations to feature selection and token weighting, as well as new ideas for measuring classifier confidence and for the semantic weighting of XML-based training data. We introduce several potential applications of our method and provide a prototypical implementation. Our contribution beyond the state of the art is a dedicated procedure model for the automated classification of content components in technical communication, which outperforms current document-centered or domain-agnostic approaches.

  • Publication date: 2018-2