Abstract

As populations age rapidly worldwide, diabetes mellitus and its many complications have become major public health issues. Considerable evidence suggests that patients with diabetes mellitus have a higher risk of breast cancer. However, the relationships between the complications of diabetes mellitus and the occurrence of breast cancer have not been well characterized. Despite this higher risk, patients with breast cancer constitute only a relatively small proportion of diabetes mellitus data, which leads to an imbalanced data set. This study proposes a hybrid machine learning scheme to cope with imbalanced data in the analysis of risk factors for breast cancer in patients with diabetes mellitus. The scheme combines clustering-based undersampling, performed with the k-means algorithm, and the extreme gradient boosting algorithm. The results identify occlusion stroke, diabetes with peripheral circulatory disorders, peripheral angiopathy in diseases classified elsewhere, and other forms of chronic ischemic heart disease as risk factors. This study provides an application of advanced methods in health care and demonstrates the epidemiologic and informatics value of the proposed hybrid machine learning scheme.
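
The undersampling step of the scheme can be illustrated with a minimal sketch. The abstract does not say how clusters are reduced to samples, so the choices here are assumptions: the majority class is clustered into as many groups as there are minority samples, and each cluster is represented by the member closest to its centroid. The function names, the synthetic two-feature data, and the parameter values are hypothetical and use only the Python standard library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assignments = [nearest_centroid(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    # Recompute assignments so they match the final centroids.
    return centroids, [nearest_centroid(p, centroids) for p in points]


def undersample_majority(majority, n_keep, seed=0):
    """Keep one representative per cluster (assumed reduction rule)."""
    centroids, assignments = kmeans(majority, n_keep, seed=seed)
    kept = []
    for c, centroid in enumerate(centroids):
        members = [p for p, a in zip(majority, assignments) if a == c]
        if members:
            kept.append(min(members,
                            key=lambda p: squared_distance(p, centroid)))
    return kept


# Synthetic, illustrative data only: 40 majority and 5 minority samples.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(5)]

kept = undersample_majority(majority, n_keep=len(minority))
print(len(majority), "->", len(kept), "majority samples;",
      len(minority), "minority samples")
```

In a full pipeline, the reduced majority samples and all minority samples would then be used together to train a gradient boosting classifier, for example `xgboost.XGBClassifier`, whose feature importances could be inspected to rank candidate risk factors. That second stage is omitted here to keep the sketch free of external dependencies.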