Automated classification of content components in technical communication

Oevermann, Jan*; Ziegler, Wolfgang

doi:10.1111/coin.12157

Abstract

Automated classification is usually not adjusted to specialized domains due to a lack of suitable data collections and insufficient characterization of the domain-specific content and its effect on the classification process. This work describes an approach for the automated multiclass classification of content components used in technical communication based on a vector space model. We show that differences in the form and substance of content components require an adaptation of document-based classification methods and validate our assumptions with multiple real-world data sets in two languages. As a result, we propose general adaptations to feature selection and token weighting, as well as new ideas for the measurement of classifier confidence and the semantic weighting of XML-based training data. We introduce several potential applications of our method and provide a prototypical implementation. Our contribution beyond the state of the art is a dedicated procedure model for the automated classification of content components in technical communication, which outperforms current document-centered or domain-agnostic approaches.
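To make the general idea of vector-space classification with a confidence measure concrete, the following is a minimal sketch and not the authors' procedure model. It assumes whitespace tokenization, smoothed TF-IDF token weights, one prototype vector per class, cosine similarity, and a confidence value defined as the margin between the two best-scoring classes. All function names, the class labels, and the sample texts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def train_centroids(samples):
    """Build one TF-IDF weighted prototype vector per class.

    samples: list of (text, label) pairs.
    Returns (centroids, idf); centroids maps label -> {token: weight}.
    """
    doc_freq = Counter()
    for text, _ in samples:
        doc_freq.update(set(tokenize(text)))
    n_docs = len(samples)
    # Smoothed inverse document frequency (illustrative choice of weighting).
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}

    centroids = defaultdict(Counter)
    for text, label in samples:
        for tok, tf in Counter(tokenize(text)).items():
            centroids[label][tok] += tf * idf[tok]
    return dict(centroids), idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, centroids, idf):
    """Return (label, confidence).

    Confidence is the gap between the best and second-best similarity,
    which is an assumption of this sketch, not a measure from the paper.
    """
    query = {tok: tf * idf[tok]
             for tok, tf in Counter(tokenize(text)).items() if tok in idf}
    scores = sorted(((cosine(query, vec), label)
                     for label, vec in centroids.items()), reverse=True)
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


# Hypothetical training data: short content components with a class label.
SAMPLES = [
    ("remove the cover and loosen the screws", "task"),
    ("press the button to start the pump", "task"),
    ("the pump consists of a motor and a housing", "concept"),
    ("the housing protects the motor from dust", "concept"),
]

centroids, idf = train_centroids(SAMPLES)
label, confidence = classify(
    "loosen the screws and remove the housing cover", centroids, idf)
print(label, round(confidence, 3))
```

A small margin signals that two classes are nearly equally similar to the component, so such results could be routed to manual review rather than accepted automatically.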

Publication date: 2018-02
