Building a better pipeline: Analyzing data flows to improve efficiency and retain provenance


Session Recording and Materials

Publication: Capturing and querying fine-grained provenance of preprocessing pipelines in data science

Slides

GitHub

Join Red Hat Research for the next Research Days event, ‘Building a better pipeline: Analyzing data flows to improve efficiency and retain provenance’ on April 28, 2022 from 2PM to 3:30PM CEST (8AM EDT, 3PM IDT).  

When the relevance of your analytics comes into question due to changes in the underlying data, you are faced with an expensive question: is it worth the cost to recompute? In this presentation and discussion, Paolo Missier, Professor of Big Data Analytics at Newcastle University, will discuss advanced decision making and the role of data provenance in data analytics, present an experimental framework with the potential to benefit data science infrastructure, and discuss broader research on data science pipelines. Ivan Nečas, Senior Principal Engineer, Red Hat, will guide the discussion as the Conversation Leader.

Abstract
Analytics that are generated from “big data” may be valuable but also short-lived, namely when some of the underpinning data changes over time. When the processing is computationally expensive, it is desirable to be able to assess the need for re-computation in reaction to changes, i.e., in terms of marginal benefits relative to the current results, without actually executing the process.  This capability is underpinned by data-diff functions, but we argue that this is not enough, and that a re-compute yes/no decision requires a deeper understanding of the process itself.

In the first part of the talk we suggest that histories of past executions can be used to inform such decisions, and articulate the role of data provenance specifically. We then present ReComp, a framework that we have used to experiment with these ideas, which we believe can be a beneficial addition to generic Data Science infrastructure, specifically in organizations where analytics are central, expensive, and repetitive. 

In the second part, we broaden the scope of our research and present a provenance capture, storage, and query facility for generic Data Science pipelines.

Speaker
Paolo Missier, Professor of Big Data Analytics at Newcastle University

Conversation Leader
Ivan Nečas, Senior Principal Software Engineer at Red Hat

Date

Apr 28 2022

Time

EDT
8:00 am - 9:30 am

Location

Virtual

Organizer

Brno Research
Email
research-brno@redhat.com
