A novel approach to record file correlation and reduce mapping frequency on HDFS based on ExtendHDFS

Authors: Xiao, Chang; Li, Qiang; Zheng, Dong
Source: 2013 3rd International Conference on Computer Science and Network Technology, ICCSNT 2013, China, Shaanxi, Weinan, Dali, 2013-10-12 to 2013-10-13.
DOI:10.1109/ICCSNT.2013.6967105

Abstract

Hadoop Distributed File System (HDFS) is widely deployed in large data storage facilities and is very efficient at managing very large files. However, it performs poorly when handling large numbers of small files, mainly because of its master-slave architecture: access requests for too many small files place a heavy burden on the NameNode, the master machine of Hadoop. In previous studies, Dong paid attention to file correlation, and Chandrasekar S proposed a general prefetching method. Neither, however, gives a specific approach to recording file correlation; both assume that files in the same merged block have higher correlation. In this paper, we propose a new way to record file correlations, based on Chandrasekar's EHDFS. From the recorded data, an optimal file request chain is obtained, which represents the order of the most strongly correlated files. Blocks that contain small files can then be reconstructed according to this order. According to our theoretical analysis, the reconstructed blocks have higher prefetching efficiency and significantly reduce the number of requests sent to the Hadoop NameNode.
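
The abstract does not specify how correlations are recorded or how the request chain is derived, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that correlation is measured by counting how often one file is requested immediately after another, that the chain is built greedily from a chosen starting file, and that blocks hold a fixed number of files. The function names `record_correlations`, `build_request_chain`, and `pack_blocks` are hypothetical and are not part of HDFS or EHDFS.

```python
from collections import defaultdict


def record_correlations(access_log):
    """Count how often each file is requested immediately after another.

    Assumption: successive-request counts stand in for "file correlation";
    the paper's actual recording scheme may differ.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(access_log, access_log[1:]):
        if prev != nxt:
            counts[prev][nxt] += 1
    return counts


def build_request_chain(counts, start):
    """Greedily follow the most frequent unvisited successor from `start`."""
    chain = [start]
    visited = {start}
    current = start
    while True:
        candidates = {
            name: n
            for name, n in counts.get(current, {}).items()
            if name not in visited
        }
        if not candidates:
            break
        # Highest count wins; ties are broken by file name for determinism.
        current = max(sorted(candidates), key=candidates.get)
        chain.append(current)
        visited.add(current)
    return chain


def pack_blocks(chain, files_per_block):
    """Split the chain into consecutive groups, one group per merged block."""
    return [
        chain[i:i + files_per_block]
        for i in range(0, len(chain), files_per_block)
    ]


# Hypothetical access log of small-file requests.
log = ["a", "b", "c", "a", "b", "d", "a", "b", "c"]
chain = build_request_chain(record_correlations(log), "a")
blocks = pack_blocks(chain, 2)
```

In this sketch, files that tend to be requested one after another end up adjacent in the chain and therefore in the same block, which is the property that would let a single block fetch serve several subsequent requests without another NameNode lookup.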

Full text