NERC Multi-Cluster Observability Platform
Abstract
This project provides a customizable observability infrastructure for the New England Research Cloud (NERC), a multi-tenant OpenShift environment serving universities, researchers, and various projects.
The multi-tenant observability patterns developed here are applicable to the Open Sovereign AI Cloud (O-SAC) initiative, which deploys AI-enabled OpenShift clusters on bare metal with similar multi-tenancy requirements.
While Red Hat Advanced Cluster Management (ACM) offers built-in MultiClusterObservability with Prometheus, Thanos, and Grafana, it operates on an “admin-has-all-access” model, which does not meet the fine-grained access requirements of a multi-tenant environment.
The solution introduces a dedicated Observability (OBS) cluster that:
- Offloads storage and query workloads from the infrastructure cluster
- Gives non-admin users access to observability data from outside the VPN
- Implements fine-grained, data-level access control through a custom Prometheus Keycloak Proxy (developed by Christopher Tate)
- Enables full Grafana capabilities beyond ACM’s limited default setup
- Centralizes logs from all managed clusters via Loki with ClusterLogForwarder
This architecture addresses the gap between what ACM provides out-of-the-box and what multi-tenant research environments require: secure, role-based access to metrics, logs, and dashboards across multiple clusters (prod, edu, test, sandbox) without exposing infrastructure access.
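To make the idea of data-level access control concrete, here is a minimal Python sketch. It is not the Prometheus Keycloak Proxy's actual implementation: the `Grant` type, the `filter_series` helper, and the choice of `cluster` and `namespace` as the labels to check are all assumptions introduced for illustration. The sketch only shows the general principle of returning to a user just those series whose labels match what that user is permitted to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One permitted (cluster, namespace) pair. Hypothetical shape."""
    cluster: str
    namespace: str


def filter_series(series, grants):
    """Keep only series whose labels match one of the user's grants.

    `series` follows the shape of the "result" array in a Prometheus
    instant-query response: a list of {"metric": {...}, "value": [...]}.
    The label names checked here are assumptions for this sketch.
    """
    allowed = {(g.cluster, g.namespace) for g in grants}
    return [
        s for s in series
        if (s["metric"].get("cluster"), s["metric"].get("namespace")) in allowed
    ]


# Example data: three series across two clusters and two namespaces.
series = [
    {"metric": {"cluster": "prod", "namespace": "team-a"}, "value": [0, "1"]},
    {"metric": {"cluster": "prod", "namespace": "team-b"}, "value": [0, "2"]},
    {"metric": {"cluster": "test", "namespace": "team-a"}, "value": [0, "3"]},
]

# A user who may only see namespace "team-a" on the "prod" cluster.
grants = [Grant("prod", "team-a")]
visible = filter_series(series, grants)
```

In this sketch a series that lacks either label is dropped rather than shown, so missing metadata fails closed. A real proxy could equally enforce the restriction by rewriting the query before it reaches the backend; filtering the response is used here only because it is the simplest form to read.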
Learn more
- nerc-ocp-config — Kubernetes manifests for NERC OpenShift clusters (ACM Observability, Grafana, Loki, Prometheus Keycloak Proxy configurations)
- nerc-ocp-apps — ArgoCD application definitions for cluster deployments
- OCP-on-NERC/docs — Documentation/Design
- NERC Operations Issues — Operational tasks and tracking
- Grafana Dashboards on OBS — Production dashboards for running projects and POC dashboards for work-in-progress onboarding and testing
- AI Telemetry — Fine-grained access control to Observability metrics in the Mass Open Cloud
Team & Collaboration
Core Team
- Christopher Tate – Lead developer, Prometheus Keycloak Proxy, architecture, operations, security, project onboarding
- Thorsten Schwesig – Infrastructure, operations, metrics/statistics/functions, dashboards, coordination, project onboarding
Contributors
- Isaiah Stapleton
- Meera Malhorta
Red Hat Internal Gig Participants
| Name | Role | Contribution |
|---|---|---|
| Harshil Sabhnani | Architect (US) | ACM 2.8 upgrade fixes, ACM Observability configuration |
| Jeet Basu | Architect (US) | External Grafana, OBS cluster infrastructure |
| Banashri Mandal | Senior Engineer (Germany) | Loki RBAC, storage, backup strategy, bug fixes |
| Cristiano Saggin | Support Engineer (Spain) | Grafana upgrades, microservices deployment, cluster support |
| Dheeraj Jodha | Engineer (India) | AI Telemetry application in OpenShift AI |