Workflow-Centric Tracing for Cloud Applications

Three collaboratory projects focus on improving the observability and diagnosability of the Red Hat product portfolio. They build on workflow-centric tracing, an increasingly used technology for capturing data about the behavior of distributed applications. Unlike traditional machine-centric data sources, such as logs, workflow-centric data coherently captures the work done to process requests within and among the nodes of a distributed application; that is, it captures the requests' workflows. At the most basic level, workflow-centric tracing works by propagating context (e.g., request IDs) with requests and tagging the records of the trace points that each request executes with that context. (Trace points are identical to logging points.) Traces (i.e., graphs) of requests' workflows can then be constructed by stitching together the trace-point records that share the same context. Sampling techniques, in which a random decision is made whether to capture all or none of the trace points executed on behalf of a request, keep overhead low enough for tracing to be used in production systems.
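
The mechanism described above (context propagation, tagging of trace-point records, stitching, and a per-request sampling decision) can be illustrated with a minimal Python sketch. It is not taken from any of the projects or from an existing tracing system such as Jaeger. The names Context, start_request, trace_point, frontend, backend, and stitch are hypothetical, the two "nodes" are ordinary functions in one process, and a trace is simplified to an ordered list of records rather than a graph.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical context that travels with a request."""
    request_id: str
    sampled: bool  # decided once per request, then propagated unchanged


def start_request(request_id, sample_rate):
    # One random decision per request: capture all of its trace points or none.
    return Context(request_id, random.random() < sample_rate)


def trace_point(ctx, sink, node, name):
    # Tag the record with the propagated context; skip unsampled requests.
    if ctx.sampled:
        sink.append((ctx.request_id, node, name))


def backend(ctx, sink):
    trace_point(ctx, sink, "backend", "query")


def frontend(ctx, sink):
    trace_point(ctx, sink, "frontend", "recv")
    backend(ctx, sink)  # the context is passed along with the call
    trace_point(ctx, sink, "frontend", "reply")


def stitch(records):
    # Group records that share a request ID into one trace per request.
    traces = defaultdict(list)
    for request_id, node, name in records:
        traces[request_id].append((node, name))
    return dict(traces)
```

For example, calling frontend(start_request("r1", 0.01), sink) for every incoming request and later running stitch(sink) would yield complete traces for roughly 1% of requests and no records for the rest. Real systems propagate the context across process boundaries (for instance in RPC metadata) and record causal relationships between trace points so that the result is a graph; both are omitted here for brevity.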

The first project, which is engineering-focused, aims to embed workflow-centric tracing within important Red Hat products, with the goal of providing strong testbeds for the two research projects. The two research projects explore advanced ways in which tracing could automate or inform engineers' diagnosis efforts. Combined, these projects will provide important guidance on: 1) where existing open-source tracing systems, such as Jaeger, fall short; 2) how future tracing systems must be architected to support complex distributed-application behaviors and advanced use cases; and 3) which diagnosis tools and techniques that build on tracing are most useful.

Combined, these projects provide a path toward a grand goal of creating a management plane that collects tracing data across all Red Hat products and uses it to inform diagnosis efforts. Significant potential exists to use this management plane for other distributed-application management tasks as well, such as resource accounting or modeling.