摘要

Rationale and Objectives: The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI) is the largest publicly available computed tomography (CT) image reference data set of lung nodules. In this article, a comprehensive data analysis of the data set and a uniform data model are presented with the purpose of facilitating potential researchers to have an in-depth understanding to and efficient use of the data set in their lung cancer-related investigations. Materials and Methods: A uniform data model was designed for representation and organization of various types of information contained in different source data files. A software tool was developed for the processing and analysis of the database, which 1) automatically aligns and graphically displays the nodule outlines marked manually by radiologists onto the corresponding CT images; 2) extracts diagnostic nodule characteristics annotated by radiologists; 3) calculates a variety of nodule image features based on the outlines of nodules, including diameter, volume, and degree of roundness, and so forth; 4) integrates all the extracted nodule information into the uniform data model and stores it in a common and easy-to-access data format; and 5) analyzes and summarizes various feature distributions of nodules in several different categories. Using this data processing and analysis tool, all 1018 CT scans from the data set were processed and analyzed for their statistical distribution. Results: The information contained in different source data files with different formats was extracted and integrated into a new and uniform data model. Based on the new data model, the statistical distributions of nodules in terms of nodule geometric features and diagnostic characteristics were-summarized. In the LIDC/IDRI data set, 2655 nodules >= 3 mm, 5875 nodules <3 mm, and 7411 non-nodules are identified, respectively. Among the 2655 nodules, 1) 775, 488, 481, and 911 were marked by one, two, three, or four radiologists, respectively; 2) most of nodules >= 3 mm (85.7%) have a diameter <10.0 mm with the mean value of 6.72 mm; and 3) 10.87%, 31.4%, 38.8%, 16.4%, and 2.6% of nodules were assessed with a malignancy score of 1, 2, 3, 4, and 5, respectively. Conclusions: This study demonstrates the usefulness of the proposed software tool to the potential users for an in-depth understanding of the LIDC/IDRI data set, therefore likely to be beneficial to their future investigations. The analysis results also demonstrate the distribution diversity of nodules characteristics, therefore being useful as a reference resource for assessing the performance of a new and existing nodule detection and/or segmentation schemes.