NERC GPU Telemetry & Profiling

Abstract

With the rise of AI/ML workloads and increasing demand for GPU resources, detailed GPU profiling has become critical for the New England Research Cloud (NERC). GPUs represent both a significant cost factor and a bottleneck resource, making efficient utilization and performance monitoring essential.

The GPU Profiling project extends the NERC Observability platform to address GPU-specific challenges across multiple hardware vendors (NVIDIA, AMD) and use cases (training, inference, research).

Key challenges addressed:

  • Multi-vendor support: Different GPU types (NVIDIA A100, H100, V100; future AMD) with proprietary drivers and varying metrics capabilities
  • Metrics granularity gap: NVIDIA DCGM provides operations- and admin-level metrics at a granularity of seconds, while researchers need profiling data with microsecond precision
  • DCGM vs Profiler conflict: NVIDIA’s profiler mode disables DCGM metrics, requiring custom solutions to capture both simultaneously
  • High-frequency data: GPU workloads require metrics collection at sub-second intervals, increasing network and storage demands
  • Hardware telemetry: Integration of switch and network-speed metrics for high-frequency GPU computation clusters
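
To make the high-frequency data challenge concrete, the following is a rough back-of-the-envelope sketch of how raw sample volume grows as the collection interval shrinks. The function names, the GPU and metric counts, and the 16-byte sample size are illustrative assumptions, not figures from the NERC deployment, and real time-series stores compress samples, so actual storage use would differ.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(num_gpus: int, metrics_per_gpu: int, interval_s: float) -> int:
    """Count samples produced per day at a fixed collection interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return round(num_gpus * metrics_per_gpu * SECONDS_PER_DAY / interval_s)


def daily_volume_bytes(
    num_gpus: int,
    metrics_per_gpu: int,
    interval_s: float,
    bytes_per_sample: int = 16,  # assumed uncompressed size: timestamp + value
) -> int:
    """Estimate uncompressed bytes per day for the given collection setup."""
    return samples_per_day(num_gpus, metrics_per_gpu, interval_s) * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical cluster: 64 GPUs, 20 metrics each.
    for interval in (10.0, 1.0, 0.1):
        volume_mb = daily_volume_bytes(64, 20, interval) / 1e6
        print(f"interval={interval:>5}s  ~{volume_mb:,.0f} MB/day uncompressed")
```

Under these assumptions, volume scales inversely with the interval: moving from 10-second to 100-millisecond collection multiplies the raw sample count by 100.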

This project creates specialized metrics solutions and dashboards to bridge the gap between standard OpenShift observability and the specialized needs of GPU-intensive research workloads.
