AI workload optimizations for different models, data, algorithms, and hardware
Abstract
The Mass Open Cloud Alliance (MOC) provides significant computational resources, including GPUs, for research and open-source development. The goal of this project is to deploy both large-scale AI training and distributed inference workloads on the MOC and to optimize the underlying infrastructure across the full hardware and systems-software stack (e.g., the PCIe subsystem, networking and RDMA, GPU kernels, storage, and the operating system), providing a competitive platform for academic AI researchers.
Core Project Team
- Sanjay Arora, Red Hat Research
- Ulrich (Uli) Drepper, Red Hat Research
- Jason Schlessman, Red Hat Research
- Ahmed Sanaullah, Red Hat Research