Kariz Cache Prefetching and Management

To ease management and allow flexible scaling of storage and compute clusters, typical deployments of analytic clusters such as Microsoft HDInsight and Amazon EMR strongly separate storage and compute services. Data is stored in highly available, durable, and scalable multi-tenant data lakes, such as Amazon S3 and Azure Data Lake Store. Despite their advantages, these configurations often provide lower performance than local storage. For example, recent studies show that Spark-SQL performance can degrade by 8x on an EC2 remote-HDD configuration compared to local NVMe.

Kariz is a cluster caching and prefetching system that focuses on improving the performance of these configurations. We assert that prior work has made the following mistaken assumptions: (a) hit ratio and performance are directly related, (b) input files should be cached in their entirety or not at all (i.e., the all-or-nothing property of PacMan), and (c) back-end storage bandwidth is unlimited.

Kariz takes advantage of I/O access information that is already disclosed for use in scheduling and coordinating distributed workers. Spark-SQL, Hive, and PIG all collect this information in the form of a dependency Directed Acyclic Graph (DAG) identifying the inputs and outputs of each individual computation. Given this future access information (e.g. job DAGs), Kariz determines which data sets to make available in cache, deciding what to prefetch or evict, whether to retain cached data, and when each operation should occur.
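To make the DAG-driven approach concrete, the sketch below shows one plausible in-memory representation of an exported job DAG and a breadth-first walk that emits prefetch candidates stage by stage. The class and function names (`Job`, `plan_prefetches`) are illustrative assumptions, not Kariz's actual data model or API.

```python
# Hypothetical sketch: a job DAG as a framework like PIG might export it,
# plus a planner that walks it level by level to find prefetch candidates.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    inputs: list                                   # datasets this job reads
    outputs: list                                  # datasets this job writes
    children: list = field(default_factory=list)   # downstream jobs

def plan_prefetches(roots):
    """Traverse the DAG breadth-first and emit (stage, dataset) pairs,
    so each job's inputs can be fetched before the stage that reads them."""
    plan, seen, frontier, stage = [], set(), list(roots), 0
    while frontier:
        next_frontier = []
        for job in frontier:
            for dataset in job.inputs:
                if dataset not in seen:      # each dataset planned once
                    seen.add(dataset)
                    plan.append((stage, dataset))
            next_frontier.extend(job.children)
        frontier, stage = next_frontier, stage + 1
    return plan
```

For a two-stage DAG where a scan of `lineitem` feeds a join with `orders`, the planner would schedule `lineitem` at stage 0 and the join's inputs at stage 1.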

Kariz includes the following functions:

  • Runtime prediction: Using data from past executions, we estimate the run time of jobs within a DAG as a function of the fraction of input data present in cache.
  • Partial caching: Prefetching and eviction are performed at fine granularity, allowing partial caching of job inputs without a significant increase in stragglers.
  • DAG scheduling: Given a DAG with a set of inputs, the Kariz offline scheduler constructs a schedule of plans, each specifying what fraction of each input to prefetch (or evict), and the time when the operation should be performed.
  • Multi-DAG scheduling: Given a set of these DAG schedules and knowledge of the scheduler state, the Mirab online planner determines which of these plans to execute at each step, using a heuristic that combines efficiency and fairness.
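The runtime-prediction function above can be sketched as a simple piecewise-linear interpolation over observed past executions. The sample points and function name are illustrative assumptions; Kariz's actual estimator may use a different model.

```python
# Minimal sketch: predict a job's runtime from the fraction of its input
# already in cache, by linearly interpolating past (fraction, runtime)
# observations. Purely illustrative, not Kariz's actual predictor.
import bisect

def predict_runtime(history, frac_cached):
    """history: list of (fraction_in_cache, runtime_secs) pairs,
    sorted by fraction. Returns an interpolated runtime estimate."""
    xs = [x for x, _ in history]
    i = bisect.bisect_left(xs, frac_cached)
    if i == 0:                      # at or below the smallest sample
        return history[0][1]
    if i == len(history):           # above the largest sample
        return history[-1][1]
    (x0, y0), (x1, y1) = history[i - 1], history[i]
    if x1 == x0:
        return y1
    return y0 + (y1 - y0) * (frac_cached - x0) / (x1 - x0)
```

With past runs of 100 s at 0% cached, 60 s at 50%, and 40 s fully cached, a job with a quarter of its input cached would be estimated at 80 s.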

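The Mirab step, picking one plan per scheduling decision, could be sketched as a weighted score over each candidate plan, trading predicted efficiency (runtime saved per byte fetched) against fairness (favoring DAGs whose owners currently hold little cache). The scoring formula, field names, and weight `alpha` are assumptions for illustration; the actual heuristic may differ.

```python
# Hypothetical sketch of an online planner in the spirit of Mirab:
# score each DAG's candidate plan and execute the highest-scoring one.
def choose_plan(plans, alpha=0.5):
    """plans: list of dicts with keys
       'saved_secs'  - predicted runtime reduction if executed,
       'bytes'       - bytes the plan would prefetch,
       'owner_share' - fraction of cache the owner already holds (0..1).
    Returns the index of the plan with the best combined score."""
    def score(p):
        efficiency = p['saved_secs'] / max(p['bytes'], 1)  # secs saved/byte
        fairness = 1.0 - p['owner_share']                  # favor low share
        return alpha * efficiency + (1 - alpha) * fairness
    return max(range(len(plans)), key=lambda i: score(plans[i]))
```

Under this toy scoring, a moderately efficient plan from a tenant with no cache can beat a slightly more efficient plan from a tenant already holding most of the cache.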
Our implementation combines a caching layer embedded within the Ceph Rados Gateway (RGW) that allows fine-grained prefetching and eviction, modifications to PIG for DAG export, and an external planner. This project evaluates Kariz with TPC-H, TPC-DS, and other benchmarks on a 16-server cluster.

This project is related to D3N and GUSS.