Abstract

Most big-data analysis systems require users to adopt restricted abstractions to achieve scaling and system stability. While highly effective at establishing data locality and eliminating interdependencies, this approach is not easily incorporated into scientific workflows, which are often complex and irregular graphs of sequential programs with multiple dependencies. To address this, we have developed an active storage cluster file system named Confuga, which harnesses the file information already available in the workflow to enable efficient and controlled distribution of dependencies across active storage nodes. Confuga is built upon the idea of leveraging a job's namespace to eliminate unknown transfers and to plan the replication of all job dependencies. Replication is carried out through two opposing transfer methodologies: centrally managed push transfers and distributed pull transfers. We evaluate the effectiveness of the two transfer mechanisms using workflows that stress the ability of the cluster to replicate dependencies. Ultimately, we show that a balance of the two approaches achieves optimal file distribution: in two bioinformatics workflows, a careful balance of the two mechanisms leads to 48% and 77% improvements over using push or pull alone.

  • Publication date: 2017-02
