Building a better pipeline: Analyzing data flows to improve efficiency and retain provenance

Name: Building a better pipeline: Analyzing data flows to improve efficiency and retain provenance
Start: 2022-04-28T04:00:00-04:00
End: 2022-04-28T05:30:00-04:00
Location: Virtual

When the relevance of your analytics comes into question due to changes in the underlying data, you are faced with an expensive question: is it worth the cost to recompute? In this presentation and discussion, Paolo Missier, Professor of Big Data Analytics at Newcastle University, will discuss advanced decision making and the role of data provenance in data analytics, present an experimental framework with the potential to benefit data science infrastructure, and discuss broader research on data science pipelines. Ivan Nečas, Senior Principal Engineer, Red Hat, will guide the discussion as the Conversation Leader.

Abstract
Analytics that are generated from “big data” may be valuable but also short-lived, namely when some of the underpinning data changes over time. When the processing is computationally expensive, it is desirable to be able to assess the need for re-computation in reaction to changes, i.e., in terms of marginal benefits relative to the current results, without actually executing the process. This capability is underpinned by data-diff functions, but we argue that this is not enough, and that a re-compute yes/no decision requires a deeper understanding of the process itself.

In the first part of the talk we suggest that histories of past executions can be used to inform such decisions, and articulate the role of data provenance specifically. We then present ReComp, a framework that we have used to experiment with these ideas, which we believe can be a beneficial addition to generic Data Science infrastructure, specifically in organizations where analytics are central, expensive, and repetitive.

In the second part, we broaden the scope of our research and present a provenance capture, storage, and query facility for generic Data Science pipelines.
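To make the re-computation question in the abstract concrete, here is a minimal Python sketch. It is not ReComp's implementation or API. Every name in it (`PastRun`, `data_diff`, `estimate_impact`, `should_recompute`) is hypothetical, and the linear sensitivity estimate is an assumption chosen only to show how a data-diff and a history of past executions could be combined into a yes/no decision without running the expensive process.

```python
from dataclasses import dataclass
from statistics import mean

_MISSING = object()  # sentinel so a missing key never compares equal to a stored None


@dataclass
class PastRun:
    """Hypothetical history record from an earlier re-execution."""
    input_change: float   # fraction of input records that differed (0..1)
    output_change: float  # observed change in the result after re-running (0..1)


def data_diff(old: dict, new: dict) -> float:
    """Return the fraction of keys that were added, removed, or modified."""
    keys = old.keys() | new.keys()
    if not keys:
        return 0.0
    changed = sum(1 for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING))
    return changed / len(keys)


def estimate_impact(change: float, history: list[PastRun]) -> float:
    """Scale the input change by the mean output/input sensitivity of past runs.

    Assumption: impact grows linearly with the size of the input change.
    """
    ratios = [r.output_change / r.input_change for r in history if r.input_change > 0]
    if not ratios:
        return 1.0  # no usable history: assume the worst case
    return min(1.0, change * mean(ratios))


def should_recompute(old: dict, new: dict, history: list[PastRun],
                     cost: float, value_per_unit_impact: float) -> bool:
    """Recompute only if the estimated benefit exceeds the cost of re-running."""
    impact = estimate_impact(data_diff(old, new), history)
    return impact * value_per_unit_impact > cost


if __name__ == "__main__":
    old = {"a": 1, "b": 2, "c": 3, "d": 4}
    new = {"a": 1, "b": 2, "c": 3, "d": 5}
    history = [PastRun(0.5, 0.1), PastRun(0.2, 0.04)]
    print(should_recompute(old, new, history, cost=10, value_per_unit_impact=100))
    print(should_recompute(old, new, history, cost=10, value_per_unit_impact=1000))
```

The data-diff alone only says how much the input changed. The history supplies the missing piece: how strongly this particular process has reacted to changes in the past. That mirrors the abstract's argument that data-diff functions are necessary but not sufficient.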

Speaker
Paolo Missier, Professor of Big Data Analytics at Newcastle University

Conversation Leader
Ivan Nečas, Senior Principal Software Engineer at Red Hat

Session Recording and Materials

Publication: Capturing and querying fine-grained provenance of preprocessing pipelines in data science

Slides

GitHub

Date: Apr 28 2022
Time: 8:00 am - 9:30 am EDT (America/New_York)
Labels: Research Days
Location: Virtual
Organizer: Brno Research
Email: research-brno@redhat.com
