More metrics and more dashboards mean more ways for researchers to identify actionable improvements.
Optimizing the performance, stability, and resource utilization of large language model (LLM) deployments is a challenge for both users and cluster administrators. The Mass Open Cloud (MOC) can now collect inference performance metrics for LLMs deployed in our clusters. By integrating the collection of these metrics into our observability setup, we enable researchers and other MOC users to gain insight into their model deployments and how to optimize them.
We collect these metrics primarily from models deployed with the vLLM ServingRuntime, which natively exposes a rich set of metrics through a Prometheus-compatible endpoint, supporting observability of LLMs deployed in an OpenShift environment. Once collected, we visualize these metrics with dashboards (see Figure 1 and Figure 2).
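To make the collection path concrete, here is a minimal sketch of parsing a Prometheus text-format scrape like the one the vLLM endpoint serves. The sample scrape and its metric names are illustrative; exact names and labels vary by vLLM version.

```python
# Illustrative sample of a Prometheus text-format scrape; metric names
# and labels are representative, not guaranteed for any vLLM version.
SAMPLE_SCRAPE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen"} 3.0
vllm:num_requests_waiting{model_name="qwen"} 7.0
vllm:gpu_cache_usage_perc{model_name="qwen"} 0.42
"""

def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name_labels, value = line.rsplit(" ", 1)
        name = name_labels.split("{", 1)[0]  # drop the label set
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(SAMPLE_SCRAPE)
print(metrics["vllm:num_requests_waiting"])  # 7.0
```

In practice Prometheus scrapes this endpoint on a schedule; the parsing above is what any text-exposition consumer does under the hood.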


Most users serve models with vLLM, but for those using other serving runtimes, we have tools that can run benchmarks against deployed models and collect similar metrics. One such tool is llm-load-test-exporter. I built this tool to extend llm-load-test (created by the performance and scale team at Red Hat) so that the metrics collected during a benchmark run can be exported. We can run this tool against models deployed in a cluster and export the results to a Prometheus-compatible metrics endpoint. These are essentially the same metrics gathered through the vLLM ServingRuntime, but they can be gathered from models deployed with any serving runtime. We also have dashboards that visualize these metrics (see Figure 3).
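The export step can be sketched as formatting benchmark results into Prometheus exposition text for a scrape endpoint to serve. The metric names below are hypothetical placeholders, not llm-load-test-exporter's actual names.

```python
# Sketch: turn one benchmark run's results into Prometheus exposition text.
def to_exposition(results):
    """Render {metric_name: value} as Prometheus text format gauges."""
    lines = []
    for name, value in sorted(results.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical metric names and values for illustration only.
run = {
    "llm_load_test_ttft_seconds": 0.21,   # time to first token
    "llm_load_test_tpot_seconds": 0.034,  # time per output token
    "llm_load_test_throughput_tokens_per_second": 412.5,
}
print(to_exposition(run))
```

Serving this text from an HTTP endpoint is all Prometheus needs to pick the benchmark results up on its next scrape.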

These metrics are most valuable when they lead to actionable improvements for both users and administrators. For example, if a cluster administrator notices that models in the cluster are experiencing high latency and low throughput, that may indicate a networking bottleneck. An increase in waiting requests within the Queue Time and Scheduler State metrics may point to a scheduling issue, possibly due to a lack of available GPUs or an issue with the GPU scheduler itself. For researchers and other users, these metrics can provide insight into fine-tuning their specific model configurations. For example, if a user observes a high Time per Output Token (TPOT), that indicates the model is struggling to generate the full response quickly. By analyzing cache utilization alongside this latency data, the user might realize their request batches are too large for the available GPU memory. To improve performance, they can then adjust their Max Generation Tokens or reconfigure their batching strategy. Applying metrics this way lets users move beyond guesswork and directly optimize their model setups.
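The diagnostic reasoning above can be sketched as a small triage function that maps metric readings to likely causes. The thresholds are made-up illustrations, not MOC operating policy.

```python
# Hedged heuristic sketch: map metric readings to likely bottlenecks.
# Thresholds below are illustrative only; tune them for real workloads.
def triage(queue_time_s, tpot_s, kv_cache_util):
    findings = []
    if queue_time_s > 5.0:
        findings.append("requests queueing: check GPU availability and scheduler")
    if tpot_s > 0.1 and kv_cache_util > 0.9:
        findings.append("slow decode with near-full KV cache: "
                        "reduce batch size or Max Generation Tokens")
    elif tpot_s > 0.1:
        findings.append("slow decode: inspect GPU utilization and model size")
    return findings or ["no obvious bottleneck"]

print(triage(queue_time_s=12.0, tpot_s=0.15, kv_cache_util=0.95))
```

A real alerting rule would live in Prometheus rather than application code, but the logic is the same: combine queue, latency, and cache signals before pointing at a cause.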
We also have access to many GPU metrics through the NVIDIA GPU Operator. This tool manages all the NVIDIA drivers and software needed to utilize GPUs in our clusters, and it also includes a health monitor, the DCGM Exporter. This exporter tracks real-time GPU data such as temperature, utilization, power usage, memory consumption, and fan speed. By watching GPU utilization, we can see exactly how heavily our GPUs are being used, helping us spot bottlenecks or wasted resources. We then have a dashboard to visualize all of these metrics (see Figure 4), giving us a clear view of our infrastructure's performance and helping us keep the MOC's hardware stable and efficient.
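Spotting wasted resources from DCGM data can be as simple as averaging utilization samples per GPU. `DCGM_FI_DEV_GPU_UTIL` is the DCGM exporter's utilization field; the sample values below are made up for illustration.

```python
# Illustrative DCGM_FI_DEV_GPU_UTIL samples (percent) per GPU over time.
samples = {
    "gpu-0": [95, 97, 93, 96],
    "gpu-1": [12, 8, 15, 10],   # mostly idle: a reclaim candidate
    "gpu-2": [70, 75, 68, 72],
}

def underutilized(samples, threshold=30):
    """Return GPUs whose mean utilization falls below the threshold."""
    return sorted(
        gpu for gpu, utils in samples.items()
        if sum(utils) / len(utils) < threshold
    )

print(underutilized(samples))  # ['gpu-1']
```

In production this averaging is a one-line PromQL query over the scraped series; the sketch just shows the underlying arithmetic.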

For future work, we are investigating gathering not just metrics for LLMs, but also traces. While metrics provide a high-level overview of LLM performance, tracing will allow us to follow a single request's entire journey through the system. This will allow us to pinpoint exactly where a delay occurs, whether it's in networking, the prefill phase, or the hardware itself, giving us even greater insight into model performance and how to optimize it.
Through Red Hat OpenShift AI (RHOAI), we are also able to utilize TrustyAI, an open source toolkit for responsible AI development and deployment. By using TrustyAI, we can ensure our models are fair, safe, and transparent. It acts like a safety inspector by tracking key metrics: it monitors accuracy to ensure the model stays reliable over time, checks for bias to ensure fair treatment across all groups, and detects data drift to alert us if real-world data has changed too much. It also sets up guardrails to prevent harmful or incorrect answers and provides explainability scores to show exactly why a model made a certain decision. Together, these tools make our AI projects more reliable, accountable, and trustworthy.
Collaboration drives breakthroughs
The true advantage for researchers working with the MOC isn’t just access to a cluster; it’s the collaborative partnership between researchers and Red Hat engineers. We work directly with research teams to understand their specific requirements and implement the tools they need to succeed, whether it’s providing access to specific metrics or developing new infrastructure capabilities. A great example of this is one of our collaborations with the NAIRR pilot project Multi-Modal Semantic Routing for vLLM. This group was interested in deploying the Qwen3 Text-to-Speech model. They wanted to utilize the vLLM-Omni serving runtime that allows for multi-modal model deployments (such as image, video, and audio), but this serving runtime was not yet available in our RHOAI setup.
Before this, users and researchers had worked only with text-based LLMs, so there had been no need to support multi-modal model deployments. Working directly with Red Hat engineers, the MOC was able to implement and deploy the necessary runtime, allowing experimenters to move their research and our infrastructure forward. Working with the MOC and having a Red Hat engineer involved means your technical needs are heard and integrated into the MOC, ensuring the environment evolves to support your specific research. Now, any researcher who would like to deploy a multi-modal model can utilize this serving runtime.
The successful deployment of this new capability exemplifies how responsive support drives the MOC platform forward, translating specific project requirements for one user into enhanced functionality for the entire user base. This dynamic approach to development is the platform’s core benefit, enabling experimenters to dedicate their full attention to accelerating research breakthroughs.
A complete list of metrics available to MOC users
vLLM Metrics
- End-to-End Request Latency: Total time from request receipt to the final token return
- Token Throughput: Tokens processed per second, separated by Prompt and Generation tokens
- Time Per Output Token (TPOT): Average time to generate each successive token after the first
- Time to First Token (TTFT): Latency before the first generated token is emitted
- Inter-Token Latency (ITL): Time between consecutive output tokens during streaming generation
- Queue Time: Duration a request spends waiting in the scheduler queue before execution begins
- Prefill & Decode Time: Average time spent processing the input prompt (prefill) versus generating output tokens (decode) per request
- Scheduler State: Current count of requests in Running, Waiting, or Swapped states
- KV Cache Utilization: Percentage of KV Cache blocks currently in use by active sequences
- Finish Reason: Breakdown of completed requests by reason: end-of-sequence token generated (stop) or max token limit reached (length)
- Max Generation Tokens: The configured upper limit for output tokens per sequence group
- Token Distribution: Distribution of input prompt lengths and output generation lengths across requests
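The latency metrics above are related: a request's decode time is its end-to-end latency minus TTFT, and TPOT averages that over the output tokens generated after the first. A small sketch of that relationship, with illustrative numbers:

```python
# Derive TPOT from end-to-end latency, TTFT, and output token count.
def tpot(e2e_latency_s, ttft_s, output_tokens):
    """Average time per output token after the first one."""
    assert output_tokens > 1, "TPOT needs at least two output tokens"
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# 4.2 s end-to-end, 0.2 s to first token, 101 output tokens
print(round(tpot(4.2, 0.2, 101), 3))  # 0.04
```

vLLM reports these as separate histograms, but the identity is handy for sanity-checking one metric against the others on a dashboard.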
LLM Load Test Metrics
- Response Time: End-to-end request latency from send to final response, including all network and inference time
- Time Per Output Token (TPOT): Time to generate each output token after the first
- Time to First Token (TTFT): Time from sending the request to receiving the first generated token
- Inter-Token Latency (ITL): Time between consecutive output tokens during streaming
- Time to Acknowledge (TT_ACK): Time from sending a request to the server's first HTTP acknowledgement, before any tokens are generated; isolates network and routing latency from model inference time
- Throughput: Total output tokens generated per second across all concurrent requests
- Total Failures: Count of requests that failed during each load test run
- Total Requests: Number of requests sent in each load test run
- Mean Input Tokens: Average number of input (prompt) tokens per request
- Mean Output Tokens: Average number of output (generated) tokens per request
Responsible AI & Model Health
- Model Accuracy: The model’s correctness against benchmark data
- Fairness (SPD & DIR): Statistical metrics (Statistical Parity Difference and Disparate Impact Ratio) that ensure equitable outcomes across different demographics
- Data Drift: Detection of shifts in input data distribution relative to a reference, which may degrade model reliability
- Explainability Scores: Weights (SHAP/LIME) that show which input features most heavily influenced a decision
- Guardrail Orchestration: Tracking of blocked prompts or responses that breached safety, toxicity, or Personally Identifiable Information filters
Infrastructure & Hardware Metrics
- Utilization & Performance:
- GPU Utilization: The percentage of processing power currently in use
- Memory Bandwidth Utilization: The percentage of the memory interface currently in use
- Tensor Core Activity: Ratio of cycles the Tensor (HMMA) pipes are active, a key measure of AI-specific workload efficiency
- SM Occupancy: The ratio of active warps resident on a Streaming Multiprocessor (SM)
- Hardware Health & Constraints:
- GPU Temperature: Measure of the GPU heat level
- Memory Temperature: Measure of the VRAM heat level
- Fan Speed: Measure of the GPU fan speeds
- Power Usage: Current electric draw in watts
- Energy Consumption: Total energy used since boot
- Clock Frequencies: Current speeds for both the SM and Memory clocks (in MHz)
- Communications & Errors:
- NVLink Bandwidth: Data transfer rates between GPUs, essential for multi-GPU model deployments
- PCIe Throughput: Rate of data received and transmitted over the PCIe bus
- XID Errors: Tracking of specific hardware error codes for rapid troubleshooting
- ECC & Retired Pages: Monitoring for memory errors (single-bit or double-bit) and the health of the GPU’s physical memory cells