Abstract

Data storage capacity and computational power are both increasing rapidly. On the one hand, this makes it possible to generate and store large amounts of temporal (time-oriented) data for future querying, analysis, and discovery of new knowledge. On the other hand, systems and experts encounter new problems in processing this growing volume of data, which necessitates new methods for handling, processing, and interpreting large amounts of temporal data. One approach is to use an automatic summarization process based on predefined knowledge, such as the Knowledge-Based Temporal-Abstraction (KBTA) method. This method summarizes and reduces the amount of raw data by creating higher-level interpretations based on predefined domain knowledge. Unfortunately, the temporal-abstraction task is inherently computationally expensive, especially when an enormous volume of multivariate data must be handled and complex patterns must be considered. In this research, we address the scalability problem of a temporal-abstraction task that involves processing very large amounts of raw data. We propose a new computational framework, the Distributed KBTA (DKBTA), which efficiently distributes the abstraction process among several parallel computational nodes in order to achieve an acceptable computation time. The DKBTA framework distributes the temporal-abstraction process along one or more computational axes, such as subject groups, concept types, or abstraction types. Each axis enables parallelization of one or more of the temporal-abstraction subtasks into which the main temporal-abstraction task is decomposed. We have implemented the DKBTA framework and evaluated it in a preliminary fashion in the medical and information security domains, with encouraging results.
In our small-scale evaluation, only distribution along the subjects axis, and sometimes along the concept-type axis, seemed to consistently enhance performance, and only for computations involving individual subjects rather than functions of sets of subjects; this observation, however, might depend on the number of processing units. Additionally, because communication between the processing units was based on TCP, we could not observe any speedup even when using two processing units on the same machine. In further evaluations, we plan to use a shared-memory architecture to exchange data between processing units.
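
As an illustration only, distribution along the subjects axis might be sketched as follows. This is a minimal sketch under stated assumptions, not the DKBTA implementation: the names `classify`, `abstract_subject`, `partition_by_subject`, and `abstract_all`, the temperature-like thresholds, and the `(subject, time, value)` record layout are all hypothetical. The state abstraction here simply merges consecutive points with the same state and ignores the temporal-gap knowledge a full KBTA would apply. A thread pool stands in for the parallel computational nodes; an actual speedup would require separate processes or machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def classify(value, low=36.0, high=38.0):
    """Hypothetical domain knowledge: map a raw value to a qualitative state."""
    if value < low:
        return "LOW"
    if value > high:
        return "HIGH"
    return "NORMAL"


def abstract_subject(points):
    """Merge one subject's (time, value) points into (start, end, state) intervals."""
    intervals = []
    for time, value in sorted(points):
        state = classify(value)
        if intervals and intervals[-1][2] == state:
            intervals[-1][1] = time  # same state: extend the current interval
        else:
            intervals.append([time, time, state])  # state changed: open a new interval
    return [tuple(interval) for interval in intervals]


def partition_by_subject(records):
    """Split (subject, time, value) records into one point list per subject."""
    groups = defaultdict(list)
    for subject, time, value in records:
        groups[subject].append((time, value))
    return dict(groups)


def abstract_all(records, workers=2):
    """Abstract each subject independently; subjects share no state, so they can run in parallel."""
    groups = partition_by_subject(records)
    subjects = list(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(abstract_subject, (groups[s] for s in subjects))
        return dict(zip(subjects, results))
```

Because each subject's abstraction depends only on that subject's own data, the partitions need no communication until the results are collected. A function of a set of subjects would require combining data across partitions, which is the case where the evaluation above did not see a consistent benefit.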

  • Publication date: 2012-8