AI workload optimizations for different models, data, algorithms, and hardware
Abstract
The Mass Open Cloud Alliance (MOC) provides significant computational resources, including GPUs, for research and open-source development. The goal of this project is to deploy both large-scale AI training and distributed inference workloads on the MOC and to optimize the underlying infrastructure across the full hardware and systems-software stack (e.g., the PCIe subsystem, networking and RDMA, GPU kernels, storage, and the operating system), providing a competitive platform for academic AI researchers.
Core Project Team
- Sanjay Arora, Red Hat Research
- Ulrich (Uli) Drepper, Red Hat Research
- Jason Schlessman, Red Hat Research
- Ahmed Sanaullah, Red Hat Research