Abstract

Unstructured narratives in medicine are increasingly targeted for content extraction using natural language processing (NLP) techniques. In most cases, these efforts rely on a manually annotated set of narratives containing the ground truth, commonly referred to as a gold standard corpus. This corpus is used for modeling, fine-tuning, and testing NLP software, and it provides the basis for training in machine learning. Determining the number of annotated documents (the size) of this corpus is important but rarely described; instead, cost and time appear to dominate decisions about corpus size. This report outlines a method for determining gold standard size based on the capture probabilities of the unique words within a target corpus. To demonstrate the method, a corpus of dictation letters from the Michigan Pain Consultant (MPC) pain management clinics is described and analyzed. A well-formed working corpus of 10,000 dictations was first constructed to provide a representative subset of the total, with no more than one dictation letter per patient. Each dictation was divided into words, and common words were removed. The Poisson function was used to determine the probabilities of word capture within samples taken from the working corpus; these probabilities were then integrated over word length to give a single capture probability as a function of sample size. For these MPC dictations, a sample size of 500 documents is predicted to give a capture probability of approximately 0.95. Continuing the demonstration of sample selection, a provisional gold standard corpus of 500 documents was selected and examined for its similarity to the MPC structured coding and demographic data available for each patient. The results show that a representative sample of justifiable size can be selected for use as a gold standard.
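
The Poisson capture calculation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so this sketch assumes each word occurs at a Poisson rate equal to its count in the working corpus divided by the number of documents, and it replaces the paper's integration over word length with a simple token-weighted average. The function names and the toy counts are hypothetical.

```python
import math

def capture_probability(word_counts, corpus_size, sample_size):
    """Token-weighted probability that a word is seen at least once in a sample.

    Assumption: a word with `count` occurrences in a working corpus of
    `corpus_size` documents occurs at Poisson rate count / corpus_size per
    document, so P(captured in n documents) = 1 - exp(-rate * n).
    """
    total_tokens = sum(word_counts.values())
    probability = 0.0
    for count in word_counts.values():
        rate = count / corpus_size                        # expected occurrences per document
        p_capture = 1.0 - math.exp(-rate * sample_size)   # Poisson P(at least one occurrence)
        probability += (count / total_tokens) * p_capture  # weight by share of all tokens
    return probability

def smallest_sample_size(word_counts, corpus_size, target=0.95):
    """Return the smallest sample size whose capture probability reaches `target`."""
    for n in range(1, corpus_size + 1):
        if capture_probability(word_counts, corpus_size, n) >= target:
            return n
    return None  # target not reachable within the working corpus

# Toy counts (hypothetical, not MPC data): word -> occurrences in the working corpus.
toy_counts = {"lumbar": 100, "radiculopathy": 10, "sacroiliitis": 1}
n_needed = smallest_sample_size(toy_counts, corpus_size=100, target=0.95)
print(n_needed, capture_probability(toy_counts, 100, n_needed))
```

With real data, `word_counts` would hold the counts of the unique words that remain after common words are removed, and `corpus_size` would be the size of the working corpus (10,000 dictations in the report). The weighting scheme chosen here affects the resulting sample size, so the numbers from this sketch should not be expected to reproduce the reported figure of 500 documents.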

  • Publication date: 2012-6