Balancing push and pull in Confuga, an active storage cluster file system for scientific workflows

Donnelly Patrick<sup>*</sup>; Thain Douglas

doi:10.1002/cpe.3834

Abstract

Most big-data analysis systems require users to adopt restricted abstractions to achieve scaling and system stability. While highly effective at establishing data locality and eliminating interdependencies, this approach is not easily incorporated into scientific workflows, which are often complex and irregular graphs of sequential programs with multiple dependencies. To address this, we have developed an active storage cluster file system named Confuga, which harnesses the file information already available in the workflow to enable efficient and controlled distribution of dependencies across active storage nodes. Confuga is built upon the idea of leveraging a job's namespace to eliminate unknown transfers and to plan the replication of all job dependencies. Replication is carried out through two opposing transfer methodologies: centrally managed push transfers and distributed pulls. We evaluate the effectiveness of the two transfer mechanisms using workflows that stress the ability of the cluster to replicate dependencies. Ultimately, we show that a balance of the two approaches achieves optimal file distribution. This is shown in two bioinformatics workflows where a careful balance of the two mechanisms leads to 48% and 77% improvements over only push or pull.

Publication date: 2017-02
