Abstract

This research proposes a framework for preliminary data analysis when the labels of data instances are not available in the early stage of the process. Preliminary data analysis usually starts by exploring the "target interest" features, which can be performance measures or decision attributes. The major analysis task is then to investigate the factors that are highly correlated with these "target interest" features. Because no exact labels are provided, exploration and investigation are iterative and time-consuming, especially when the dataset is large. The proposed framework, named NSGAII-SCC, formulates a multi-objective problem that sequentially combines clustering, for "target interest" exploration, with classification, for factor investigation. The fast and elitist non-dominated sorting genetic algorithm (NSGAII), integrated with a feature selection mechanism, searches for better clustering and classification solutions. This sequential clustering and classification process aims not only to reveal the hidden patterns of the "target interest" features but also to identify the features that are highly correlated with the discovered patterns. Two public transactional datasets from Kaggle were used to evaluate NSGAII-SCC. The experimental results show that NSGAII-SCC performs promisingly in finding solutions that balance the clustering and classification objectives. Additionally, encoding feature selection in the chromosome helps to identify the features relevant to both clustering and classification.
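
The abstract does not spell out how candidate solutions are compared, so the following is only a minimal sketch of the Pareto-dominance ranking on which NSGAII-style search is built. It assumes, hypothetically, that each chromosome is a bit mask over features and that each candidate has two scores to maximize (a clustering quality score and a classification score). The names `dominates` and `non_dominated_sort`, the masks, and the score values are illustrative placeholders, not results or code from the study. The loop is a simple quadratic ranking, not the fast non-dominated sorting procedure of NSGAII itself.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b on every objective and better on at least one."""
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def non_dominated_sort(scores: List[Sequence[float]]) -> List[List[int]]:
    """Group solution indices into Pareto fronts; front 0 holds the non-dominated solutions."""
    remaining = set(range(len(scores)))
    fronts: List[List[int]] = []
    while remaining:
        # A solution belongs to the current front if no other remaining solution dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Hypothetical feature masks (1 = feature selected) and made-up objective values:
# (clustering score, classification score), both to be maximized.
masks = [(1, 0, 1, 1), (1, 1, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1)]
scores = [(0.62, 0.80), (0.70, 0.75), (0.55, 0.70), (0.60, 0.78)]

fronts = non_dominated_sort(scores)
for rank, front in enumerate(fronts):
    print(rank, [masks[i] for i in front])
```

In this toy population the first two masks trade off the two objectives against each other, so neither dominates the other and both land in front 0. A multi-objective search would retain such a set of compromise solutions instead of collapsing the two objectives into a single number.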

  • Publication date: 2018-8