NERC Multi-Cluster Observability Platform
Abstract
This project provides a customizable observability infrastructure for the New England Research Cloud (NERC), a multi-tenant OpenShift environment serving universities, researchers, and various projects.
The multi-tenant observability patterns developed here are applicable to the Open Sovereign AI Cloud (O-SAC) initiative, which deploys AI-enabled OpenShift clusters on bare metal with similar multi-tenancy requirements.
While Red Hat Advanced Cluster Management (ACM) offers built-in MultiClusterObservability with Prometheus, Thanos, and Grafana, it operates at an “admin-has-all-access” level that doesn’t meet the fine-grained access requirements of a multi-tenant environment.
The solution introduces a dedicated Observability (OBS) cluster that:
- Offloads storage and query workloads from the infrastructure cluster
- Provides access to observability data outside the VPN for non-admin users
- Implements fine-grained, data-level access control through a custom Prometheus Keycloak Proxy (developed by Christopher Tate)
- Enables full Grafana capabilities beyond ACM’s limited default setup
- Centralizes logs from all managed clusters via Loki with ClusterLogForwarder
This architecture addresses the gap between what ACM provides out-of-the-box and what multi-tenant research environments require: secure, role-based access to metrics, logs, and dashboards across multiple clusters (prod, edu, test, sandbox) without granting users access to the underlying infrastructure.
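To illustrate what data-level access control can mean in practice, the following is a minimal sketch and not the Prometheus Keycloak Proxy's actual implementation. It assumes that a user's Keycloak token has already been validated and mapped to a list of hypothetical "grants", each pairing a cluster pattern with a namespace (project) pattern, and that metrics carry `cluster` and `namespace` labels. The names `Grant`, `is_allowed`, and `filter_series` are illustrative only.

```python
from fnmatch import fnmatchcase

# Hypothetical grants, as they might be derived from a user's Keycloak token.
# Each grant pairs a cluster pattern with a namespace (project) pattern.
Grant = tuple[str, str]


def is_allowed(grants: list[Grant], cluster: str, namespace: str) -> bool:
    """Return True if any grant covers the given cluster and namespace."""
    return any(
        fnmatchcase(cluster, cluster_pat) and fnmatchcase(namespace, ns_pat)
        for cluster_pat, ns_pat in grants
    )


def filter_series(grants: list[Grant], series: list[dict]) -> list[dict]:
    """Keep only series whose labels fall under one of the user's grants.

    Each item is assumed to look like an entry of a Prometheus query result:
    {"metric": {<labels>}, "value": [<timestamp>, "<value>"]}.
    Series without a cluster or namespace label are dropped (deny by default).
    """
    visible = []
    for item in series:
        labels = item.get("metric", {})
        cluster = labels.get("cluster")
        namespace = labels.get("namespace")
        if cluster is None or namespace is None:
            continue
        if is_allowed(grants, cluster, namespace):
            visible.append(item)
    return visible
```

With grants such as `[("prod", "ope-*"), ("test", "*")]`, a user would see series from any `ope-` prefixed project on `prod` and from every project on `test`, and nothing else. Denying unlabeled series by default is a design choice of this sketch; a real proxy might instead rewrite the query so that only permitted label values are ever requested.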
Learn more
- nerc-ocp-config — Kubernetes manifests for NERC OpenShift clusters (ACM Observability, Grafana, Loki, Prometheus Keycloak Proxy configurations)
- nerc-ocp-apps — ArgoCD application definitions for cluster deployments
- OCP-on-NERC/docs — Documentation/Design
- NERC Operations Issues — Operational tasks and tracking
- Grafana Dashboards on OBS — Production dashboards for running projects and POC dashboards for work-in-progress onboarding and testing
- AI Telemetry — Fine-grained access control to Observability dashboards at the multi-tenant, hub, cluster, and project level in the Mass Open Cloud
- Prometheus Keycloak Proxy — Fine-grained access control to Prometheus Observability metrics API in the Mass Open Cloud
- AI Telemetry Workbench — An OpenShift AI image running VS Code for Java and Go development of the AI Telemetry and Prometheus Keycloak Proxy software and their dependencies
Team & Collaboration
Core Team
- Christopher Tate – Lead developer, Prometheus Keycloak Proxy, architecture, operations, security, project onboarding
- Thorsten Schwesig – Infrastructure, operations, metrics/statistics/functions, dashboards, coordination, project onboarding
- Isaiah Stapleton – Infrastructure, operations, LLM and GPU metrics, dashboards, OpenTelemetry
Contributors
- Isaiah Stapleton
- Meera Malhorta
Red Hat Internal Gig Participants
| Name | Role | Contribution |
|---|---|---|
| Harshil Sabhnani | Architect (US) | ACM 2.8 upgrade fixes, ACM Observability configuration |
| Jeet Basu | Architect (US) | External Grafana, OBS cluster infrastructure |
| Banashri Mandal | Senior Engineer (Germany) | Loki RBAC, storage, backup strategy, bug fixes |
| Cristiano Saggin | Support Engineer (Spain) | Grafana upgrades, microservices deployment, cluster support |
| Dheeraj Jodha | Engineer (India) | OpenTelemetry collectors and authentication in OpenShift |
Related RHRQ Articles
Open Education Project tackles GPU scheduling and metrics visibility
Enhancements to the education project highlight how research work on OPE drives advancements for many kinds of multitenant environments. The Open Education Project (OPE) continues to develop solutions for optimizing GPU resource usage in a multitenant environment. OPE, a project of Red Hat Collaboratory at Boston University, has long been a pioneer in making high-quality, open source education accessible to all.
Danni Shi
Observability cluster added to the MOC Alliance’s New England Research Cloud
Observability data provides essential insights for optimizing performance, troubleshooting, and using resources sustainably. For users of the New England Research Cloud (NERC), part of the Mass Open Cloud (MOC) Alliance, this data also provides critical information for innovative research projects.
Thorsten Schwesig
Christopher Tate
Telemetry Working Group looks at observability
A new working group is tackling observability in production. Observability has become an increasingly hot topic given the challenges of reliably operating distributed systems such as those in Kubernetes environments.