Abstract

Aims: To develop a natural language processing (NLP)-based algorithm for extracting clinically useful information on patients with hepatocellular carcinoma (HCC) from Chinese electronic medical records (EMRs), and to use these data to assess HCC staging.

Materials and Methods: Clinical documents of 92 HCC patients, including operation notes, radiology reports and pathology reports, were collected from Chinese EMRs. The patients were randomly divided into a training dataset (n = 60) and a testing dataset (n = 32). Rule-based and hybrid information extraction methods were developed using the manually annotated operation notes in the training set. The better-performing method was then used to process the other document types. The performance of the algorithm was assessed by calculating the precision, recall and F-score under exact-boundary and partial-boundary matching strategies. The utility of the extracted information for HCC staging was assessed by comparison with staging based on manual review.

Results: For operation notes, the rule-based and hybrid methods had a precision, recall and F-score 80% when the exact-boundary and partial-boundary matching strategies were applied to the testing dataset. Using the rule-based method, which performed better than the hybrid method, good performance was also obtained on three other types of documents. When the extracted clinically useful information was applied to HCC staging, the concordance rate with the manual review was 75%.

Conclusion: An NLP system was developed for clinical information extraction and HCC staging based on EMRs, and the results indicate that Chinese NLP has potential utility in clinical research.
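The evaluation described in the Materials and Methods (precision, recall and F-score under exact-boundary and partial-boundary matching) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the study's implementation: entities are represented as hypothetical `(start, end, label)` character spans, partial-boundary matching is taken to mean "same label and overlapping spans", each gold entity may be matched at most once, and the F-score is taken to be the balanced F1. The labels and offsets in the example are invented.

```python
from typing import List, Tuple

# Hypothetical representation: (start offset, end offset (exclusive), entity label)
Span = Tuple[int, int, str]


def spans_match(pred: Span, gold: Span, partial: bool) -> bool:
    """Return True if a predicted span counts as a match for a gold span."""
    if pred[2] != gold[2]:
        return False  # labels must agree under both strategies
    if partial:
        # Partial-boundary (assumed definition): the two spans overlap
        return pred[0] < gold[1] and gold[0] < pred[1]
    # Exact-boundary: both offsets must be identical
    return pred[0] == gold[0] and pred[1] == gold[1]


def evaluate(predicted: List[Span], gold: List[Span], partial: bool = False):
    """Compute (precision, recall, F1) with greedy one-to-one matching."""
    matched_gold = set()
    true_positives = 0
    for p in predicted:
        for i, g in enumerate(gold):
            if i not in matched_gold and spans_match(p, g, partial):
                matched_gold.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example: one exact hit, one boundary error, one miss
gold_spans = [(0, 4, "tumor_size"), (10, 15, "tumor_number"), (20, 28, "vascular_invasion")]
pred_spans = [(0, 4, "tumor_size"), (11, 15, "tumor_number"), (30, 35, "vascular_invasion")]

exact_scores = evaluate(pred_spans, gold_spans, partial=False)
partial_scores = evaluate(pred_spans, gold_spans, partial=True)
print("exact-boundary:  ", exact_scores)
print("partial-boundary:", partial_scores)
```

In this invented example the second prediction has the right label but a shifted start offset, so it counts as a match only under the partial-boundary strategy. This is why partial-boundary scores are never lower than exact-boundary scores for the same predictions.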