Coldpress: An extensible optimization and orchestration framework for managing complex AI workloads

Abstract

Coldpress is an extensible optimization and orchestration framework designed to manage complex AI workloads by systematically analyzing and configuring hardware and software states. It functions as a high-level job manager that executes scalable, lifecycle-managed experiments. Users can deploy parallel workloads while Coldpress discovers detailed system state (e.g., NUMA topology, network configuration) and dynamically adjusts system configurations and workload parameters. By abstracting the underlying runtime environment, Coldpress provides a unified interface for executing complex testing workflows across diverse infrastructures, including OpenShift, bare metal, and Slurm, which facilitates reproducible research and holistic optimization of the AI technology stack.


The overarching goal of Coldpress is to address the complexities of infrastructure configuration and runtime optimization. For infrastructure configuration, Coldpress codifies the expertise needed to identify which system and network characteristics (e.g., PCIe topology, buffer sizes, sleep states) must be discovered, how to discover them, and which modifications are valid when tuning them. For runtime optimization, Coldpress codifies the expertise needed to manage the lifecycle of AI experiments at scale: launching experiments, monitoring execution, managing storage, collecting results and logs, cleaning up resources upon completion, and maintaining the experiment records needed for future repeatability. This approach supports a range of research objectives, from performance optimization to power-efficiency analysis, across a complex, multi-layered stack.
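The lifecycle described above (launch, monitor, collect, clean up, record) can be illustrated with a minimal Python sketch. This is not Coldpress's actual API, which this page does not describe: `Backend`, `LocalBackend`, `run_experiment`, and every parameter name here are hypothetical and exist only to show how one lifecycle routine could run unchanged over different runtime environments.

```python
import time
from dataclasses import dataclass, field
from typing import Protocol


class Backend(Protocol):
    """Hypothetical runtime abstraction (e.g. OpenShift, bare metal, Slurm)."""

    def launch(self, name: str, params: dict) -> str: ...
    def is_done(self, job_id: str) -> bool: ...
    def collect(self, job_id: str) -> dict: ...
    def cleanup(self, job_id: str) -> None: ...


@dataclass
class LocalBackend:
    """Toy in-memory backend, used only to make the sketch runnable."""

    jobs: dict = field(default_factory=dict)

    def launch(self, name: str, params: dict) -> str:
        job_id = f"{name}-{len(self.jobs)}"
        self.jobs[job_id] = params
        return job_id

    def is_done(self, job_id: str) -> bool:
        return True  # assumption: toy jobs finish immediately

    def collect(self, job_id: str) -> dict:
        return {"job_id": job_id, "params": self.jobs[job_id], "status": "ok"}

    def cleanup(self, job_id: str) -> None:
        self.jobs.pop(job_id, None)


def run_experiment(backend: Backend, name: str, params: dict,
                   records: list, poll_seconds: float = 1.0) -> dict:
    """Run one experiment through launch, monitor, collect, and cleanup."""
    job_id = backend.launch(name, params)
    try:
        # Monitor: poll until the backend reports completion.
        while not backend.is_done(job_id):
            time.sleep(poll_seconds)
        result = backend.collect(job_id)
        # Keep a record so the experiment can be repeated later.
        records.append(result)
        return result
    finally:
        # Release resources even if monitoring or collection fails.
        backend.cleanup(job_id)
```

A call such as `run_experiment(LocalBackend(), "demo", {"batch_size": 8}, records)` exercises the whole sequence. Placing cleanup in a `finally` block is a design choice in this sketch so that resources are released even when an experiment fails partway through.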

Core Project Team

  • Ahmed Sanaullah, Red Hat Research
  • Lars Kellogg-Stedman, Red Hat Research
  • Taj Salawu, Red Hat Research
  • Jason Schlessman, Red Hat Research
