NERC Multi-Cluster Observability Platform

Abstract

This project provides a customizable observability infrastructure for the New England Research Cloud (NERC), a multi-tenant OpenShift environment serving universities, researchers, and various projects.

The multi-tenant observability patterns developed here are applicable to the Open Sovereign AI Cloud (O-SAC) initiative, which deploys AI-enabled OpenShift clusters on bare metal with similar multi-tenancy requirements.

While Red Hat Advanced Cluster Management (ACM) offers built-in MultiClusterObservability with Prometheus, Thanos, and Grafana, its access model is “admin-has-all-access,” which does not meet the fine-grained access requirements of a multi-tenant environment.

The solution introduces a dedicated Observability (OBS) cluster that:

  • Offloads storage and query workloads from the infrastructure cluster
  • Provides access to observability data outside the VPN for non-admin users
  • Implements fine-grained, data-level access control through a custom Prometheus Keycloak Proxy (developed by Christopher Tate)
  • Enables full Grafana capabilities beyond ACM’s limited default setup
  • Centralizes logs from all managed clusters via Loki with ClusterLogForwarder

This architecture addresses the gap between what ACM provides out-of-the-box and what multi-tenant research environments require: secure, role-based access to metrics, logs, and dashboards across multiple clusters (prod, edu, test, sandbox) without exposing infrastructure access.
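
As an illustration of what data-level access control can mean for metrics queries, the following is a minimal sketch and not the actual Prometheus Keycloak Proxy implementation. It assumes a user's permissions have already been resolved (for example from identity-provider token claims) into a list of hypothetical Grant objects, and that metrics carry cluster and namespace labels. The names Grant and scoped_selector are invented for this example.

```python
import re
from dataclasses import dataclass

WILDCARD = "*"

# Prometheus metric names and Kubernetes-style resource names.
METRIC_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
NAME_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


@dataclass(frozen=True)
class Grant:
    """Hypothetical permission: one (cluster, namespace) pair a user may read.

    A value of "*" matches any cluster or any namespace.
    """
    cluster: str
    namespace: str

    def allows(self, cluster: str, namespace: str) -> bool:
        return (self.cluster in (WILDCARD, cluster)
                and self.namespace in (WILDCARD, namespace))


def scoped_selector(metric: str, cluster: str, namespace: str,
                    grants: list[Grant]) -> str:
    """Return a PromQL selector pinned to one cluster and namespace.

    Raises ValueError for malformed names (so nothing unexpected ends up
    inside the selector) and PermissionError if no grant covers the pair.
    """
    if not METRIC_RE.fullmatch(metric):
        raise ValueError(f"invalid metric name: {metric!r}")
    for value in (cluster, namespace):
        if not NAME_RE.fullmatch(value):
            raise ValueError(f"invalid label value: {value!r}")
    if not any(g.allows(cluster, namespace) for g in grants):
        raise PermissionError(f"no grant for {cluster}/{namespace}")
    return f'{metric}{{cluster="{cluster}",namespace="{namespace}"}}'


if __name__ == "__main__":
    # Assumed example grants: one project on prod, everything on sandbox.
    grants = [Grant("prod", "proj-a"), Grant("sandbox", WILDCARD)]
    print(scoped_selector("up", "prod", "proj-a", grants))
```

In this sketch the caller never supplies raw PromQL; it names a metric, a cluster, and a namespace, and the selector is built only after the grant check passes. A real proxy would also need to handle arbitrary queries, which requires parsing PromQL rather than string formatting.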

Learn more

  • nerc-ocp-config — Kubernetes manifests for NERC OpenShift clusters (ACM Observability, Grafana, Loki, Prometheus Keycloak Proxy configurations)
  • nerc-ocp-apps — ArgoCD application definitions for cluster deployments
  • OCP-on-NERC/docs — Documentation/Design
  • NERC Operations Issues — Operational tasks and tracking
  • Grafana Dashboards on OBS — Production dashboards for running projects and POC dashboards for work-in-progress onboarding and testing
  • AI Telemetry — Fine-grained access control to Observability dashboards at the multi-tenant, hub, cluster, and project level in the Mass Open Cloud
  • Prometheus Keycloak Proxy — Fine-grained access control to Prometheus Observability metrics API in the Mass Open Cloud
  • AI Telemetry Workbench — An OpenShift AI image running VS Code for Java and Go development of the AI Telemetry and Prometheus Keycloak Proxy software and their dependencies

Team & Collaboration

Core Team

  • Christopher Tate – Lead developer, Prometheus Keycloak Proxy, architecture, operations, security, project onboarding
  • Thorsten Schwesig – Infrastructure, operations, metrics/statistics/functions, dashboards, coordination, project onboarding
  • Isaiah Stapleton – Infrastructure, operations, LLM and GPU metrics, dashboards, OpenTelemetry

Contributors

  • Isaiah Stapleton
  • Meera Malhorta

Red Hat Internal Gig Participants

Name              Role                       Contribution
Harshil Sabhnani  Architect (US)             ACM 2.8 upgrade fixes, ACM Observability configuration
Jeet Basu         Architect (US)             External Grafana, OBS cluster infrastructure
Banashri Mandal   Senior Engineer (Germany)  Loki RBAC, storage, backup strategy, bug fixes
Cristiano Saggin  Support Engineer (Spain)   Grafana upgrades, microservices deployment, cluster support
Dheeraj Jodha     Engineer (India)           OpenTelemetry collectors and authentication in OpenShift
