摘要

Anomaly intrusion detection in big data environments calls for lightweight models that are able to achieve real-time performance during detection. Abstracting audit data provides a solution to improve the efficiency of data processing in intrusion detection. Data abstraction refers to abstract or extract the most relevant information from the massive dataset. In this work, we propose three strategies of data abstraction, namely, exemplar extraction, attribute selection and attribute abstraction. We first propose an effective method called exemplar extraction to extract representative subsets from the original massive data prior to building the detection models. Two clustering algorithms, Affinity Propagation (AP) and traditional k-means, are employed to find the exemplars from the audit data. k-Nearest Neighbor (k-NN), Principal Component Analysis (PCA) and one-class Support Vector Machine (SVM) are used for the detection. We then employ another two strategies, attribute selection and attribute extraction, to abstract audit data for anomaly intrusion detection. Two http streams collected from a real computing environment as well as the KDD'99 benchmark data set are used to validate these three strategies of data abstraction. The comprehensive experimental results show that while all the three strategies improve the detection efficiency, the AP-based exemplar extraction achieves the best performance of data abstraction.