Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams
Antonis Manousis, Carnegie Mellon University; Zhuo Cheng, Carnegie Mellon University; Ran Ben Basat, University College London; Zaoxing (Alan) Liu, Boston University; Vyas Sekar, Carnegie Mellon University
Many large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable cost. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics that must be monitored simultaneously. We present Hydra, an efficient framework for multidimensional analytics built on a novel combination of a “sketch of sketches,” which avoids the overhead of monitoring exponentially many subpopulations, and universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in operational cost and memory footprint than existing frameworks (e.g., Spark, Druid) while ensuring interactive estimation times.