Mass Open Cloud (MOC): An open, distributed platform enabling AI/ML workloads

Red Hat has participated in and supported the Mass Open Cloud Alliance (MOC-A) for many years. With the rising importance of AI, the MOC-A now provides Red Hat and our partners and collaborators with a platform to improve open source AI products, increase mindshare for open source AI technologies, and form catalyzing partnerships around AI solutions and services. While the MOC-A has always been an important part of our work, Red Hat Research is now focused on further developing this opportunity, expanding the MOC-A ecosystem, and helping our AI business units and partners take full advantage of it. At this pivotal moment in the evolution of AI, we contend that this long-running project is now a uniquely powerful platform for driving innovation and advancing open source AI.

Read the shared strategy here: MOC-A: driving innovation and advancing AI

Background

The Mass Open Cloud (MOC) Alliance is a long-standing collaboration between academic institutions, government, and industry that has created an open cloud. It was designed to provide researchers and students with access to the large-scale compute resources, large and diverse data sets, AI tools, and AI models that are critical to addressing problems in healthcare, climate change, education, and many other global challenges.

The demands of AI have made the cloud critical for research and education; however, today’s public clouds have several disadvantages for academic users: 

  • Cloud services lock users into a specific vendor, with only a subset of the tools and functionality that open collaboration can provide. 
  • Commercial clouds lack the human facilitators common to successful HPC services that allow domain specialists to use the cloud for complex projects without getting bogged down in the technology.
  • Commercial clouds impose large, unpredictable costs that can be several times (e.g., 4x) higher than the costs of the campus compute resources that many top academic institutions make available to their faculty.

The situation is even worse for systems researchers who want to innovate and improve the way cloud and AI platforms are implemented. Without access to the scale, demands, and users of a real cloud, and without bare-metal access to diverse hardware, many areas of systems research are impossible.

The MOC-A addresses these needs by providing an open cloud for research and education whose fundamental goal is to maximize impact rather than revenue. It offers users facilitator-supported services at a fraction of the cost of the public cloud, and it offers systems researchers, open source developers, research IT, and industry a shared large-scale environment in which to rapidly advance AI tools and infrastructure.

The MOC-A is providing value to Red Hat and partners in a number of ways, and we expect this value to increase tremendously as the MOC-A scales. 

  • The MOC-A hosts open source development.
    • Provides an open testbed for product development
    • Allows access to GPUs and services for multiple use cases at a fraction of the price of public clouds or internally owned deployments
    • Enables scale- and security-related experiments with open source products that would be prohibitively expensive otherwise
  • The MOC-A provides a shared environment for collaborations to advance AI.
    • A shared environment reduces the barriers to technology evaluation and collaboration
    • Integration and interoperability with hardware partners, including silicon vendors (e.g., NVIDIA, AMD, Intel), OEMs (e.g., Dell, Lenovo), storage (e.g., Pure, Weka, Dell), and networking
    • ISVs can deliver products to research and educational customers, including university-based startups
      • This also provides an incentive to support and integrate Red Hat’s open source platforms with other open software and systems
  • The MOC-A provides an attractive environment for developing, testing, and demonstrating potential solutions.
    • AI cloud-in-a-box solutions
    • Services that can span data centers and burst to the public cloud

MOC Resources

At a high level, the MOC provides services for domain research, education, and startups, along with facilitators to help get your project running on the MOC easily:

  • VMs and containers (OpenShift)
  • Volume and object storage (see the example after this list)
  • AI-as-a-service (OpenShift AI)
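
For example, object storage can generally be reached with any S3-compatible client. The snippet below is a minimal sketch in Python, assuming an S3-compatible endpoint; the endpoint URL, credentials, bucket, and object names are illustrative placeholders, not real MOC values.

  # Minimal sketch: reading from object storage with boto3, assuming an
  # S3-compatible endpoint. All names below are hypothetical placeholders.
  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="https://objects.example.org",  # hypothetical endpoint
      aws_access_key_id="YOUR_ACCESS_KEY",
      aws_secret_access_key="YOUR_SECRET_KEY",
  )

  # List the objects in a hypothetical project bucket, then fetch one.
  for obj in s3.list_objects_v2(Bucket="my-project-data").get("Contents", []):
      print(obj["Key"], obj["Size"])

  s3.download_file("my-project-data", "datasets/train.csv", "train.csv")

Because the same client works against any S3-compatible store, scripts written this way stay portable between the MOC and public clouds.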

It’s also possible to build your own testbed, applications, and services with components provided through the Elastic Secure Infrastructure (ESI) tools.

  • 1,000 servers with over 30,000 cores, 5 TB memory, ~100 GPUs
  • 64 available NVIDIA A100 GPUs
  • Currently installing 192 H100 GPUs
  • Planned accelerators from AMD and Intel

The MOC also offers special capabilities that are critical to systems research and innovation, including access to all layers of the stack for developers and researchers who need it:

  • Can allocate a bare-metal cluster using ESI (see the sketch after this list)
  • Can deploy experiments to the MOC, the Open Cloud Testbed (OCT), or the Colosseum mobile testbed and interconnect them (this work is supported by NSF grants)
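
ESI is built on OpenStack, so nodes in a bare-metal allocation can be inspected and scripted with the standard OpenStack SDK. The sketch below is illustrative only: it assumes a cloud entry named "moc" in your clouds.yaml, and the ESI leasing workflow itself has its own tooling not shown here.

  # Minimal sketch: enumerating bare-metal (Ironic) nodes with the
  # OpenStack SDK. The "moc" cloud name is a hypothetical clouds.yaml
  # entry; obtaining the lease itself is done through ESI's own tooling.
  import openstack

  conn = openstack.connect(cloud="moc")

  # Print each node visible to this project along with its state.
  for node in conn.baremetal.nodes(details=True):
      print(node.name, node.power_state, node.provision_state)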

Storage/Data: 

  • 50 PB of disk and 100 PB of tape storage through the Northeast Storage Exchange (NESE)
  • Planned modified instance of Harvard’s Dataverse for hosting very large data sets

Red Hatters Interested in Using the MOC

Read “Accessing and Paying for the Mass Open Cloud Public Cloud” on The Source (accessible only to Red Hatters)

See “How does pricing work?” for more general information on pricing.

Opportunities for Collaboration – Technical Priorities

There are many challenges that we would love to collaborate on with partners to enhance the MOC-A. Email Heidi Dempsey, hdempsey@redhat.com, to discuss getting involved.

Follow MOC Jira issues here. [Access limited to Red Hatters and partners with a Red Hat Jira account]

Key technical priorities that will enable the MOC-A to leverage expected state and federal investments and grow rapidly are:

  • Supporting scalable multi-cluster operations
  • Improving observability
  • Improving user experience for AI and domain-expert developers
  • Technical and business innovation to enable the full ecosystem, e.g.:
    • Building a marketplace where a diverse community of developers can offer solutions for a diverse community of users
    • Enabling GPUDirect RDMA for storage and networking
    • Supporting mechanisms to cache and prefetch data to optimize GPU usage (see the sketch after this list)
    • Expanding the SRE community and available tools for cloud operations
    • Supporting compliance regimes (e.g., HIPAA) required for many use cases
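
As one concrete illustration of the caching and prefetching priority above, keeping GPUs busy usually means overlapping data movement with compute. The sketch below shows the generic pattern in PyTorch; it is not an MOC-specific mechanism, just the kind of optimization these priorities aim to make routine at cluster scale.

  # Minimal sketch: overlapping host-to-GPU data transfer with compute,
  # a generic PyTorch pattern (not an MOC-specific mechanism).
  import torch
  from torch.utils.data import DataLoader, TensorDataset

  dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))

  loader = DataLoader(
      dataset,
      batch_size=256,
      num_workers=4,      # background workers prepare batches in parallel
      pin_memory=True,    # page-locked host memory enables async copies
      prefetch_factor=2,  # each worker keeps two batches staged ahead
  )

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  for x, y in loader:
      # non_blocking=True lets the copy overlap with earlier GPU work
      x = x.to(device, non_blocking=True)
      y = y.to(device, non_blocking=True)
      # ... model forward/backward pass would go here ...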

Learn More

Visit massopen.cloud

Read the strategy doc: MOC-A: driving innovation and advancing AI
