Red Hat Research Quarterly

Observability cluster added to the MOC Alliance’s New England Research Cloud

about the author

Thorsten Schwesig

Thorsten Schwesig is a Principal Software Engineer on the Red Hat Research team and part of MOC Alliance leadership for Red Hat. Thor is enthusiastic about collaborating with others to automate tasks and improve workflows.

about the author

Christopher Tate

Christopher Tate is a Principal Software Engineer on the Red Hat Research team and a lead software engineer for logging, metrics, alerts, and AI/ML smart data research in the New England Research Cloud environment.

Updates to NERC infrastructure enable fine-grained resource permissions for observability data.

Observability data provides essential insights for optimizing performance, troubleshooting, and using resources sustainably. For users of the New England Research Cloud (NERC), part of the Mass Open Cloud (MOC) Alliance, this data also provides critical information for innovative research projects. Until recently, access to this data was restricted for most users.

A standalone cluster

NERC container infrastructure is based on OpenShift and includes several clusters (e.g., an infra cluster, prod cluster, and test cluster) operated within a VPN. Access to these clusters is therefore limited. This restriction especially affects observability data, such as metrics, logs, and traces. As the amount of observability data continues to grow, it becomes increasingly useful for research and teaching, independent of the applications, models, and data that generate it.

Initially, the observability data and systems in NERC, such as Thanos, Prometheus, Grafana, and Loki, ran on the infra cluster. This placed higher demands on that cluster and, in extreme cases, could affect its operation. To enable access to observability data outside the VPN, and to relieve the infra cluster by separating these tasks, we developed and implemented a standalone observability cluster.

Since March 2024, the NERC Observability Cluster has been running in its base version and has already successfully met several requirements. The cluster captures and stores metrics and logs with an increased retention rate and is accessible outside the VPN, which makes it much easier for researchers and educators to use. Additionally, we have made static dashboards for NERC data available in Grafana, providing a first basic visualization of the collected data to support analysis and monitoring, along with the ability to develop new dashboards.

Controlling data access

With the NERC Observability Cluster in place, our next step was implementing fine-grained access control. With multiple research projects and classes hosted on NERC, maintaining data privacy compliance is essential. We needed to ensure that specific user groups, such as admins, researchers, professors, students, and apps (via API access), can access the data they need, and only the data they need. 

Our primary challenges were ensuring seamless integration and maintaining high security standards. We accomplished this in May 2024 by publishing a new keycloak-permissions-operator to both operatorhub.io and Red Hat OpenShift, automating a previously missing feature of the Red Hat build of the Keycloak Operator. The operator exposes Keycloak's advanced authorization features and makes it easy to configure user, group, and application access to resources. We configure Keycloak with resource definitions, scopes, and permissions and set up a secure proxy to validate access tokens. We initially built these resources for the AI for Cloud Ops project team to give them access to specific metrics on the prod OpenShift cluster only. The operator has since proven reusable for other customers and projects as well.
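To illustrate the shape of such a configuration, the fragment below sketches a Keycloak authorization settings export that defines a resource, a group-based policy, and a scope permission tying them together. The resource, group, and policy names here are illustrative assumptions, not the actual NERC configuration:

```json
{
  "resources": [
    { "name": "nerc-ocp-prod-metrics", "scopes": [{ "name": "query" }] }
  ],
  "policies": [
    {
      "name": "ai4cloudops-group",
      "type": "group",
      "config": { "groups": "[{\"path\":\"/ai4cloudops\"}]" }
    },
    {
      "name": "prod-metrics-query",
      "type": "scope",
      "config": {
        "resources": "[\"nerc-ocp-prod-metrics\"]",
        "scopes": "[\"query\"]",
        "applyPolicies": "[\"ai4cloudops-group\"]"
      }
    }
  ]
}
```

In this model, only members of the named group receive the `query` scope on the prod metrics resource; all other users and applications are denied by default.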

The next step was to deploy a reverse proxy (prom-keycloak-proxy) that performs authentication and fine-grained, resource-level authorization between applications on NERC and Red Hat Advanced Cluster Management (ACM) observability metrics. We've also shared this work with the ACM Observability team, which has fine-grained access to metrics on its roadmap.
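The core decision such a proxy makes can be sketched in a few lines: given the permissions carried in a Keycloak-issued token, does the caller hold the required scope on the requested resource? The cluster and scope names below are illustrative assumptions, and this is a minimal sketch of the idea, not the actual prom-keycloak-proxy implementation:

```python
def may_query(token_claims: dict, resource: str) -> bool:
    """Return True if the token grants the 'query' scope on the resource.

    Keycloak requesting-party tokens (RPTs) carry granted permissions
    under authorization.permissions; each entry names a resource
    ("rsname") and the scopes granted on it.
    """
    permissions = token_claims.get("authorization", {}).get("permissions", [])
    for perm in permissions:
        if perm.get("rsname") == resource and "query" in perm.get("scopes", []):
            return True
    return False

# Example claims: a token granting query access to prod metrics only
# (resource names here are hypothetical).
claims = {
    "authorization": {
        "permissions": [
            {"rsname": "nerc-ocp-prod-metrics", "scopes": ["query"]}
        ]
    }
}

print(may_query(claims, "nerc-ocp-prod-metrics"))   # True
print(may_query(claims, "nerc-ocp-infra-metrics"))  # False
```

A real proxy would first verify the token's signature and expiry, then apply a check like this before forwarding the PromQL query upstream.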

Future enhancements and long-term goals

In the next phase of this project, we will develop and implement mechanisms for data anonymization to ensure both privacy and usability of the data for research. We also plan to implement traces and develop interactive, dynamic dashboards that allow personalized and detailed data analysis. Additionally, the retention rate will be further optimized to support long-term analyses.

The NERC Observability Cluster represents a significant improvement in the accessibility and usability of observability data for research and education.

In time, we aim to introduce a proactive alerting and optimization system that captures event-based logs and provides targeted recommendations and optimizations. Additionally, we will continuously optimize the cluster’s scalability and performance. We plan to promote use of the cluster by more research projects and institutions and integrate additional observability systems and data sources for a more comprehensive analysis of system performance.

The NERC Observability Cluster represents a significant improvement in the accessibility and usability of observability data for research and education. With ongoing development, it will meet growing demands and provide a solid foundation for innovative research projects. The key ideas and tools we've used can also be applied to other kinds of data that require fine-grained access control.

Keep up with our work on NERC Observability on GitHub.
