Abstract

The rapid growth of unstructured data has become a key driver of enterprise development. Several problems must be addressed to obtain effective access to massive amounts of unstructured data, such as scattered storage locations, heterogeneous access interfaces, and inconsistent data formats. In this article, we use Hadoop to build a distributed computing platform that stores unstructured data, and we improve the Hadoop scheduling algorithm on the basis of the end time of slow tasks. The improved algorithm avoids the batches of slow tasks caused by nodes of uneven speed in a heterogeneous Hadoop environment, thereby improving operating efficiency and stability. Furthermore, we propose a method for constructing a classification index without training sets, improving the term frequency-inverse document frequency (TF-IDF) weight formula by introducing timeliness and entropy. On this basis, we propose a classification algorithm that follows the principle of document similarity and does not require training sets. Finally, we describe how this classification index is constructed by combining Hadoop and Lucene. As a proof of concept, we implement a prototype system on the Hadoop platform with our improved scheduling algorithm and conduct experimental studies to demonstrate the feasibility and performance of our approach.
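To make the weighting idea concrete, the following is a minimal illustrative sketch in Java of a TF-IDF score extended with timeliness and entropy factors. The abstract does not give the exact formula, so the exponential time decay, the entropy-based discrimination factor, and their multiplicative combination are assumptions for illustration, not the paper's actual method.

```java
/**
 * Illustrative sketch only: a TF-IDF term weight extended with a timeliness
 * decay and an entropy factor, in the spirit of the abstract's proposal to
 * introduce "timeliness and entropy" into the weight formula. The specific
 * functional forms below are assumptions.
 */
public final class WeightedTfIdf {

    /** Classic TF-IDF: tf * log(N / df). */
    static double tfIdf(int tf, int df, int totalDocs) {
        return tf * Math.log((double) totalDocs / df);
    }

    /** Assumed exponential timeliness decay: newer documents weigh more. */
    static double timeliness(double ageInDays, double halfLifeDays) {
        return Math.exp(-Math.log(2) * ageInDays / halfLifeDays);
    }

    /**
     * Shannon entropy (in bits) of a term's distribution over categories;
     * a term concentrated in few categories (low entropy) discriminates
     * between classes better than one spread evenly across them.
     */
    static double entropy(double[] categoryProbabilities) {
        double h = 0.0;
        for (double p : categoryProbabilities) {
            if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Combined weight; the multiplicative combination is an assumption. */
    static double weight(int tf, int df, int totalDocs,
                         double ageInDays, double halfLifeDays,
                         double[] categoryProbabilities) {
        double maxEntropy = Math.log(categoryProbabilities.length) / Math.log(2);
        // Normalize so that low entropy (more discriminative) boosts the weight.
        double discrimination = maxEntropy > 0
                ? 1.0 - entropy(categoryProbabilities) / maxEntropy
                : 1.0;
        return tfIdf(tf, df, totalDocs)
                * timeliness(ageInDays, halfLifeDays)
                * discrimination;
    }

    public static void main(String[] args) {
        // Hypothetical values: a term appearing 5 times in a 30-day-old
        // document, found in 100 of 10,000 documents, distributed over
        // 4 categories with most mass in one of them.
        double[] dist = {0.7, 0.1, 0.1, 0.1};
        System.out.printf("weight = %.4f%n",
                weight(5, 100, 10_000, 30.0, 90.0, dist));
    }
}
```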