NERC GPU Telemetry & Profiling
Abstract
With the rise of AI/ML workloads and increasing demand for GPU resources, detailed GPU profiling has become critical for the New England Research Cloud (NERC). GPUs represent both a significant cost factor and a bottleneck resource, making efficient utilization and performance monitoring essential.
The GPU Profiling project extends the NERC Observability platform to address GPU-specific challenges across multiple hardware vendors (NVIDIA, AMD) and use cases (training, inference, research).
Key challenges addressed:
- Multi-vendor support: Different GPU types (NVIDIA A100, H100, V100; future AMD) with proprietary drivers and varying metrics capabilities
- Metrics granularity gap: NVIDIA DCGM provides ops/admin-level metrics (seconds granularity), while researchers need microsecond-precise profiling data
- DCGM vs Profiler conflict: NVIDIA’s profiler mode disables DCGM metrics, requiring custom solutions to capture both simultaneously
- High-frequency data: GPU workloads require metrics collection at sub-second intervals, increasing network and storage demands
- Hardware telemetry: Integration of switch and network-speed metrics for clusters performing high-frequency GPU computation
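To make the high-frequency collection challenge concrete, the sketch below shows a Prometheus scrape job for NVIDIA's dcgm-exporter with a 1-second interval, far below the typical 15-30s default. The job name, target address, and interval are illustrative assumptions, not NERC's actual configuration; truly sub-second sampling is beyond a standard Prometheus scrape loop, which is part of why custom tooling is needed.

```yaml
# Hypothetical Prometheus scrape job for dcgm-exporter.
# Names, namespace, and interval are assumptions for illustration.
scrape_configs:
  - job_name: gpu-dcgm
    scrape_interval: 1s   # aggressive vs. the common 15-30s default
    static_configs:
      - targets: ['dcgm-exporter.nerc-monitoring.svc:9400']  # 9400 is dcgm-exporter's default port
```

Even at 1s, per-GPU metric cardinality multiplies storage and network load quickly, which motivates the project's attention to sub-second data volumes.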
This project creates specialized metrics solutions and dashboards to bridge the gap between standard OpenShift observability and the specialized needs of GPU-intensive research workloads.