NERC Multi-Cluster Observability Platform

Abstract

This project provides a customizable observability infrastructure for the New England Research Cloud (NERC), a multi-tenant OpenShift environment serving universities, researchers, and various projects.

The multi-tenant observability patterns developed here are applicable to the Open Sovereign AI Cloud (O-SAC) initiative, which deploys AI-enabled OpenShift clusters on bare metal with similar multi-tenancy requirements.

While Red Hat Advanced Cluster Management (ACM) offers built-in MultiClusterObservability with Prometheus, Thanos, and Grafana, it operates at an “admin-has-all-access” level that doesn’t meet the fine-grained access requirements of a multi-tenant environment.

The solution introduces a dedicated Observability (OBS) cluster that:

Offloads storage and query workloads from the infrastructure cluster
Provides access to observability data outside the VPN for non-admin users
Implements fine-grained, data-level access control through a custom Prometheus Keycloak Proxy (developed by Christopher Tate)
Enables full Grafana capabilities beyond ACM’s limited default setup
Centralizes logs from all managed clusters via Loki with ClusterLogForwarder

This architecture addresses the gap between what ACM provides out of the box and what multi-tenant research environments require: secure, role-based access to metrics, logs, and dashboards across multiple clusters (prod, edu, test, sandbox) without granting users access to the underlying infrastructure.
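To make the idea of data-level access control concrete, the following is a minimal sketch in Python. It is not the Prometheus Keycloak Proxy implementation, and the real proxy may enforce access differently (for example, by rewriting queries before they reach the backend). The names Grant and filter_series, the tenant namespaces, and the assumption that each series carries "cluster" and "namespace" labels are all hypothetical and chosen for illustration. The series layout follows the result entries returned by the Prometheus HTTP query API, where each entry has a "metric" label map and a "value" pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Hypothetical permission: one cluster plus one namespace ("*" = any)."""
    cluster: str
    namespace: str = "*"

    def allows(self, labels: dict) -> bool:
        # A missing label only matches a wildcard grant, so series without
        # a "cluster" or "namespace" label are denied by default.
        cluster_ok = self.cluster in ("*", labels.get("cluster"))
        namespace_ok = self.namespace in ("*", labels.get("namespace"))
        return cluster_ok and namespace_ok


def filter_series(series: list, grants: list) -> list:
    """Keep only the series whose labels match at least one grant."""
    return [s for s in series if any(g.allows(s["metric"]) for g in grants)]


# Example data shaped like entries in a Prometheus instant-query result.
series = [
    {"metric": {"__name__": "up", "cluster": "prod", "namespace": "team-a"},
     "value": [0, "1"]},
    {"metric": {"__name__": "up", "cluster": "prod", "namespace": "team-b"},
     "value": [0, "1"]},
    {"metric": {"__name__": "up", "cluster": "edu", "namespace": "class-1"},
     "value": [0, "1"]},
]

# Assumed to be derived from the user's roles; the mapping from an identity
# provider such as Keycloak to grants is outside the scope of this sketch.
grants = [Grant("prod", "team-a"), Grant("edu")]

visible = filter_series(series, grants)
visible_namespaces = [s["metric"]["namespace"] for s in visible]
```

In this sketch a user with access to one namespace on prod and to the whole edu cluster sees the team-a and class-1 series but not team-b. Denying series that lack the expected labels keeps cluster-level infrastructure metrics hidden unless a wildcard grant explicitly allows them.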

Learn more

nerc-ocp-config — Kubernetes manifests for NERC OpenShift clusters (ACM Observability, Grafana, Loki, Prometheus Keycloak Proxy configurations)
nerc-ocp-apps — ArgoCD application definitions for cluster deployments
OCP-on-NERC/docs — Documentation/Design
NERC Operations Issues — Operational tasks and tracking
Grafana Dashboards on OBS — Production dashboards for running projects and POC dashboards for work-in-progress onboarding and testing
AI Telemetry — Fine-grained access control to Observability dashboards at the multi-tenant, hub, cluster, and project level in the Mass Open Cloud
Prometheus Keycloak Proxy — Fine-grained access control to Prometheus Observability metrics API in the Mass Open Cloud
AI Telemetry Workbench — An OpenShift AI image running VS Code for Java and Go development of the AI Telemetry and Prometheus Keycloak Proxy software and their dependencies

Team & Collaboration

Core Team

Christopher Tate – Lead developer, Prometheus Keycloak Proxy, architecture, operations, security, projects onboarding
Thorsten Schwesig – Infrastructure, operations, metrics/statistics/functions, dashboards, coordination, projects onboarding
Isaiah Stapleton – Infrastructure, operations, LLM and GPU metrics, dashboards, OpenTelemetry

Contributors

Isaiah Stapleton
Meera Malhorta

Red Hat Internal Gig Participants

Name	Role	Contribution
Harshil Sabhnani	Architect (US)	ACM 2.8 upgrade fixes, ACM Observability configuration
Jeet Basu	Architect (US)	External Grafana, OBS cluster infrastructure
Banashri Mandal	Senior Engineer (Germany)	Loki RBAC, storage, backup strategy, bug fixes
Cristiano Saggin	Support Engineer (Spain)	Grafana upgrades, microservices deployment, cluster support
Dheeraj Jodha	Engineer (India)	OpenTelemetry collectors and authentication in OpenShift

Project Team

Christopher Tate

Principal Software Engineer

May 2023

Thorsten Schwesig

Principal Software Engineer

November 2023

Isaiah Stapleton

Software Engineer

July 2023

Publications

Cloud computing, observability, and security research featured at SYSTOR 2023

The 16th ACM International System and Storage Conference (SYSTOR), held June 5-7 in Haifa, Israel, featured five posters highlighting Red Hat-sponsored research projects that target challenges in cloud implementation strategy, storage performance, network observability, and AI/ML-enhanced cybersecurity analysis.

Anastasia Braginsky

June 16, 2023

Related RHRQ Articles

Open Education Project tackles GPU scheduling and metrics visibility

Enhancements to the education project highlight how research work on OPE drives advancements for many kinds of multitenant environments. The Open Education Project (OPE) continues to develop solutions for optimizing GPU resource usage in a multitenant environment. OPE, a project of the Red Hat Collaboratory at Boston University, has long been a pioneer in making high-quality, open source education accessible to all.

Danni Shi

November 2025

Observability cluster added to the MOC Alliance’s New England Research Cloud – Red Hat Research

Observability data provides essential insights for optimizing performance, troubleshooting, and using resources sustainably. For users of the New England Research Cloud (NERC), part of the Mass Open Cloud (MOC) Alliance, this data also provides critical information for innovative research projects.

Thorsten Schwesig

Christopher Tate

November 2024

Telemetry Working Group looks at observability

A new working group is tackling observability in production. Observability has become an increasingly hot topic given the challenges of reliably operating distributed systems such as those in Kubernetes environments.

Gordon Haff
