NERC GPU Telemetry & Profiling

Abstract: With the rise of AI/ML workloads and increasing demand for GPU resources, detailed GPU profiling has become critical for the New England Research Cloud (NERC). GPUs represent both a significant cost factor and a bottleneck resource, making efficient...

OpenTelemetry

Abstract: This project is evaluating the adoption of OpenTelemetry as a standardized observability framework within the MOC clusters. As the MOC gains more customers and our clusters grow in scale and complexity, we need a more robust solution for collecting not just...