NERC Multi-Cluster Observability Platform

Abstract

This project provides a customizable observability infrastructure for the New England Research Cloud (NERC), a multi-tenant OpenShift environment serving universities, researchers, and various projects.

The multi-tenant observability patterns developed here are applicable to the Open Sovereign AI Cloud (O-SAC) initiative, which deploys AI-enabled OpenShift clusters on bare metal with similar multi-tenancy requirements.

Red Hat Advanced Cluster Management (ACM) offers built-in MultiClusterObservability with Prometheus, Thanos, and Grafana, but it assumes an “admin-has-all-access” model: anyone who can reach the observability stack can see all of its data. That does not meet the fine-grained access requirements of a multi-tenant environment.

The solution introduces a dedicated Observability (OBS) cluster that:

  • Offloads storage and query workloads from the infrastructure cluster
  • Provides access to observability data outside the VPN for non-admin users
  • Implements fine-grained, data-level access control through a custom Prometheus Keycloak Proxy (developed by Christopher Tate)
  • Enables full Grafana capabilities beyond ACM’s limited default setup
  • Centralizes logs from all managed clusters via Loki with ClusterLogForwarder

This architecture addresses the gap between what ACM provides out-of-the-box and what multi-tenant research environments require: secure, role-based access to metrics, logs, and dashboards across multiple clusters (prod, edu, test, sandbox) without exposing infrastructure access.
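To make the idea of data-level access control concrete, the following is a minimal Python sketch, not the actual Prometheus Keycloak Proxy implementation. It assumes each metric series carries `cluster` and `namespace` labels and that a user's permissions can be expressed as a set of (cluster, namespace) pairs. The `Grant` class, the function names, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical permission: read access to one namespace on one cluster."""
    cluster: str
    namespace: str


def is_allowed(labels: dict[str, str], grants: set[Grant]) -> bool:
    """Return True if a series' labels match at least one of the user's grants."""
    key = Grant(labels.get("cluster", ""), labels.get("namespace", ""))
    return key in grants


def filter_series(series: list[dict[str, str]], grants: set[Grant]) -> list[dict[str, str]]:
    """Keep only the series the caller is entitled to see."""
    return [s for s in series if is_allowed(s, grants)]


# Illustrative label sets (assumed shape, not real NERC data).
series = [
    {"__name__": "container_cpu_usage_seconds_total", "cluster": "prod", "namespace": "team-a"},
    {"__name__": "container_cpu_usage_seconds_total", "cluster": "prod", "namespace": "team-b"},
    {"__name__": "container_cpu_usage_seconds_total", "cluster": "test", "namespace": "team-a"},
]

# A user granted access to a single namespace on the prod cluster.
grants = {Grant("prod", "team-a")}
visible = filter_series(series, grants)
```

In this sketch a user with one grant sees only the matching series, and a user with no grants sees nothing. A real proxy would presumably derive the grants from the user's identity-provider token and enforce them before the query reaches the metrics backend, but those details are outside what this page describes.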

Learn more

  • nerc-ocp-config — Kubernetes manifests for NERC OpenShift clusters (ACM Observability, Grafana, Loki, Prometheus Keycloak Proxy configurations)
  • nerc-ocp-apps — ArgoCD application definitions for cluster deployments
  • OCP-on-NERC/docs — Documentation/Design
  • NERC Operations Issues — Operational tasks and tracking
  • Grafana Dashboards on OBS — Production dashboards for running projects and POC dashboards for work-in-progress onboarding and testing
  • AI Telemetry — Fine-grained access control to Observability metrics in the Mass Open Cloud

Team & Collaboration

Core Team

  • Christopher Tate – Lead developer, Prometheus Keycloak Proxy, architecture, operations, security, project onboarding
  • Thorsten Schwesig – Infrastructure, operations, metrics/statistics/functions, dashboards, coordination, project onboarding

Contributors

  • Isaiah Stapleton
  • Meera Malhorta

Red Hat Internal Gig Participants

Name             | Role                      | Contribution
Harshil Sabhnani | Architect (US)            | ACM 2.8 upgrade fixes, ACM Observability configuration
Jeet Basu        | Architect (US)            | External Grafana, OBS cluster infrastructure
Banashri Mandal  | Senior Engineer (Germany) | Loki RBAC, storage, backup strategy, bug fixes
Cristiano Saggin | Support Engineer (Spain)  | Grafana upgrades, microservices deployment, cluster support
Dheeraj Jodha    | Engineer (India)          | AI Telemetry application in OpenShift AI
