A distributed join algorithm on separated data storage

Fan, Qiu Shi; Zhou, Min Qi<sup>*</sup>; Zhou, Ao Ying

doi:10.11897/SP.J.1016.2016.02102

摘要

In this big data era, the efficiency of join operator is needed to be optimized imperatively, especially for database systems with separated baseline and incremental data. In this database system architecture, the baseline data is stored in the disk as usual, while the incremental data is stored in the memory to achieve both higher transactional processing efficiency and scalability. HBase, BigTable, OceanBase are typical database systems deployed with such separated baseline and incremental data architecture, but they provided join operator with very low efficiency only. The main reasons are as follows: they have to merge the baseline data and incremental data at first;and the network overhead is very heavy because of the complex data model they used. This paper proposes an algorithm for efficient join operator based on separated baseline data and incremental data. It partitions the join attributes into specified ranges first and merges each range on different nodes in parallel. The key point of this algorithm is that it partitions, sorts the baseline data and incremental data separately to achieve even higher parallelism before merge join and avoids the cost of merge of the baseline and incremental data tuples which will not be appeared in the result set. We implement this algorithm based on OceanBase, an open sourced distributed database system. The experimental results confirm that our algorithm can improve the join performance of OceanBase database by a large margin.

出版日期2016
单位华东师范大学

全文

访问全文

收藏分享被引浏览

更新时间：2023-10-19 15:01

A distributed join algorithm on separated data storage

摘要

全文

产品服务

站内浏览

服务支持

联系方式

科研之友