A Study of Damp-Heat Syndrome Classification Using Word2vec and TF-IDF

Authors: Zhu Wei; Zhang Wei; Li Guo Zheng; He Chong; Zhang Lei*
Source: IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM), 2016-12-15 to 2016-12-18.

Abstract

As people grow more concerned about their health, assessing health status from medical records is becoming a potential demand. Most previous disease-analysis studies were conducted on structured datasets, which usually ignore the relationships between different symptoms and are expensive to obtain. In this paper, we propose a novel model based on Word2vec and Term Frequency-Inverse Document Frequency (TF-IDF) that can detect damp-heat syndrome directly from unstructured records. First, we use the ICTCLAS system, combined with a corpus collected in the field of Traditional Chinese Medicine (TCM), to segment the clinical records into words. Second, the Word2vec tool is used to train word vectors. Then, we construct the record representation vector from the word vectors and TF-IDF; we name this record representation method Word2vec+TF-IDF. To verify the effectiveness of the proposed method, we compare it with other text representation methods under four different classifiers. The experiment was conducted on a dataset collected from more than 10 Chinese medicine hospitals. The experimental results show that our model performs better than state-of-the-art methods such as LSA and Doc2vec.
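
The abstract does not give the formula that combines word vectors with TF-IDF. As a rough illustration only, the sketch below assumes one common reading: each record vector is the TF-IDF-weighted average of the vectors of its words. The tokens, the two-dimensional word vectors, and the function names `tfidf_weights` and `record_vector` are hypothetical placeholders; in the pipeline described above, the tokens would come from ICTCLAS segmentation and the vectors from a trained Word2vec model.

```python
import math
from collections import Counter


def tfidf_weights(records):
    """Return one {word: tf-idf weight} dict per tokenized record."""
    n_docs = len(records)
    # Document frequency: number of records that contain each word.
    df = Counter(word for rec in records for word in set(rec))
    weights = []
    for rec in records:
        tf = Counter(rec)
        weights.append({
            word: (count / len(rec)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


def record_vector(record_weights, word_vectors, dim):
    """TF-IDF-weighted average of the word vectors in one record.

    This weighting scheme is an assumption, not the paper's stated formula.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in record_weights.items():
        vec = word_vectors.get(word)
        if vec is None:  # skip words without a trained vector
            continue
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no usable words: return the zero vector
    return [x / weight_sum for x in total]


# Hypothetical segmented records (placeholders for TCM clinical terms).
records = [
    ["fever", "thirst", "greasy_coating"],
    ["fever", "cough"],
    ["fatigue", "greasy_coating", "thirst"],
]

# Toy 2-d vectors standing in for trained Word2vec embeddings.
word_vectors = {
    "fever": [1.0, 0.0],
    "thirst": [0.5, 0.5],
    "greasy_coating": [0.0, 1.0],
    "cough": [0.2, 0.1],
    "fatigue": [0.3, 0.9],
}

weights = tfidf_weights(records)
vectors = [record_vector(w, word_vectors, dim=2) for w in weights]
```

Each entry of `vectors` could then be passed to any standard classifier. Under this weighting, words that appear in few records contribute more to the record vector than words that appear in most of them.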