Building a better pipeline: Analyzing data flows to improve efficiency and retain provenance
Session Recording and Materials
Join Red Hat Research for the next Research Days event, ‘Building a better pipeline: Analyzing data flows to improve efficiency and retain provenance’ on April 28, 2022 from 2PM to 3:30PM CEST (8AM EDT, 3PM IDT).
When the relevance of your analytics comes into question due to changes in the underlying data, you are faced with an expensive question: is it worth the cost to recompute? In this presentation and discussion, Paolo Missier, Professor of Big Data Analytics at Newcastle University, will discuss advanced decision making and the role of data provenance in data analytics, present an experimental framework with potential to benefit data science infrastructure, and discuss broader research on data science pipelines. Ivan Nečas, Senior Principal Engineer, Red Hat, will guide the discussion as the Conversation Leader.
Analytics that are generated from “big data” may be valuable but also short-lived, specifically when some of the underpinning data changes over time. When the processing is computationally expensive, it is desirable to be able to assess the need for re-computation in reaction to changes, i.e., in terms of marginal benefits relative to the current results, without actually executing the process. This capability is underpinned by data-diff functions, but we argue that these alone are not enough, and that a re-compute yes/no decision requires a deeper understanding of the process itself.
In the first part of the talk we suggest that histories of past executions can be used to inform such decisions, and articulate the role of data provenance specifically. We then present ReComp, a framework that we have used to experiment with these ideas, which we believe can be a beneficial addition to generic Data Science infrastructure, specifically in organizations where analytics are central, expensive, and repetitive.
In the second part, we broaden the scope of our research and present a provenance capture, storage, and query facility for generic Data Science pipelines.
Paolo Missier, Professor of Big Data Analytics at Newcastle University
Ivan Nečas, Senior Principal Software Engineer at Red Hat