Abstract

The rapid growth of unstructured data has become a key driver of enterprise development. Several problems must be addressed to obtain effective access to massive amounts of unstructured data, such as scattered storage locations, heterogeneous access interfaces, and inconsistent data formats. In this article, we use Hadoop to build a distributed computing platform that stores unstructured data, and we improve the Hadoop scheduling algorithm on the basis of the end time of slow tasks. The improved algorithm avoids the batches of slow tasks caused by nodes of uneven speed in a heterogeneous Hadoop environment, thereby improving operating efficiency and stability. Furthermore, we propose a method for constructing a classification index without training sets, improving the term frequency-inverse document frequency (TF-IDF) weight formula by introducing timeliness and entropy. On this basis, we propose a classification algorithm that follows the principle of document similarity and does not require training sets. Finally, we describe how this classification index is constructed by combining Hadoop and Lucene. As a proof of concept, we implement a prototype system on the Hadoop platform with our improved scheduling algorithm and conduct experimental studies to demonstrate the feasibility and performance of our approach.
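To make the weighting idea concrete, the following is a minimal illustrative sketch in Java of a TF-IDF score extended with timeliness and entropy factors. The abstract does not give the exact formula, so the exponential time decay, the entropy-based discrimination factor, and their multiplicative combination are assumptions for illustration, not the paper's actual method.

```java
/**
 * Illustrative sketch only: a TF-IDF term weight extended with a timeliness
 * decay and an entropy factor, in the spirit of the abstract's proposal to
 * introduce "timeliness and entropy" into the weight formula. The specific
 * functional forms below are assumptions.
 */
public final class WeightedTfIdf {

    /** Classic TF-IDF: tf * log(N / df). */
    static double tfIdf(int tf, int df, int totalDocs) {
        return tf * Math.log((double) totalDocs / df);
    }

    /** Assumed exponential timeliness decay: newer documents weigh more. */
    static double timeliness(double ageInDays, double halfLifeDays) {
        return Math.exp(-Math.log(2) * ageInDays / halfLifeDays);
    }

    /**
     * Shannon entropy (in bits) of a term's distribution over categories;
     * a term concentrated in few categories (low entropy) discriminates
     * between classes better than one spread evenly across them.
     */
    static double entropy(double[] categoryProbabilities) {
        double h = 0.0;
        for (double p : categoryProbabilities) {
            if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Combined weight; the multiplicative combination is an assumption. */
    static double weight(int tf, int df, int totalDocs,
                         double ageInDays, double halfLifeDays,
                         double[] categoryProbabilities) {
        double maxEntropy = Math.log(categoryProbabilities.length) / Math.log(2);
        // Normalize so that low entropy (more discriminative) boosts the weight.
        double discrimination = maxEntropy > 0
                ? 1.0 - entropy(categoryProbabilities) / maxEntropy
                : 1.0;
        return tfIdf(tf, df, totalDocs)
                * timeliness(ageInDays, halfLifeDays)
                * discrimination;
    }

    public static void main(String[] args) {
        // Hypothetical values: a term appearing 5 times in a 30-day-old
        // document, found in 100 of 10,000 documents, distributed over
        // 4 categories with most mass in one of them.
        double[] dist = {0.7, 0.1, 0.1, 0.1};
        System.out.printf("weight = %.4f%n",
                weight(5, 100, 10_000, 30.0, 90.0, dist));
    }
}
```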