More metrics and more dashboards mean more ways for researchers to identify actionable improvements.
Optimizing the performance, stability, and resource utilization of large language model (LLM) deployments is a challenge for both users and cluster administrators. The Mass Open Cloud (MOC) can now collect inference performance metrics for LLMs deployed in our clusters. By integrating the collection of these metrics into our observability setup, we enable researchers and other MOC users to gain insight into their model deployments and how to optimize them.
We collect these metrics primarily from models deployed with the vLLM ServingRuntime, which natively exposes a rich set of metrics through a Prometheus-compatible endpoint, supporting observability of LLMs deployed in an OpenShift environment. Once collected, we visualize these metrics with dashboards (see Figure 1 and Figure 2).
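To make the collection path concrete, here is a minimal sketch of parsing a Prometheus text-format scrape like the one the vLLM endpoint serves. The sample scrape and its metric names are illustrative; exact names and labels vary by vLLM version.

```python
# Illustrative sample of a Prometheus text-format scrape; metric names
# and labels are representative, not guaranteed for any vLLM version.
SAMPLE_SCRAPE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="qwen"} 3.0
vllm:num_requests_waiting{model_name="qwen"} 7.0
vllm:gpu_cache_usage_perc{model_name="qwen"} 0.42
"""

def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        name_labels, value = line.rsplit(" ", 1)
        name = name_labels.split("{", 1)[0]  # drop the label set
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics(SAMPLE_SCRAPE)
print(metrics["vllm:num_requests_waiting"])  # 7.0
```

In practice Prometheus scrapes this endpoint on a schedule; the parsing above is what any text-exposition consumer does under the hood.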


Most users serve models with vLLM, but for those using other serving runtimes, we have tools that can run benchmarks against deployed models and collect similar metrics. One such tool is llm-load-test-exporter. I built this tool to extend llm-load-test (created by the performance and scale team at Red Hat) so that the metrics collected during a benchmark run can be exported. We can run this tool against models deployed in a cluster and export the results to a Prometheus-compatible metrics endpoint. These are essentially the same metrics gathered through the vLLM ServingRuntime, but they can be gathered from models deployed with any serving runtime. We also have dashboards that visualize these metrics (see Figure 3).
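The export step can be sketched as formatting benchmark results into Prometheus exposition text for a scrape endpoint to serve. The metric names below are hypothetical placeholders, not llm-load-test-exporter's actual names.

```python
# Sketch: turn one benchmark run's results into Prometheus exposition text.
def to_exposition(results):
    """Render {metric_name: value} as Prometheus text format gauges."""
    lines = []
    for name, value in sorted(results.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical metric names and values for illustration only.
run = {
    "llm_load_test_ttft_seconds": 0.21,   # time to first token
    "llm_load_test_tpot_seconds": 0.034,  # time per output token
    "llm_load_test_throughput_tokens_per_second": 412.5,
}
print(to_exposition(run))
```

Serving this text from an HTTP endpoint is all Prometheus needs to pick the benchmark results up on its next scrape.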

These metrics are most valuable when they lead to actionable improvements for both users and administrators. For example, if a cluster administrator notices that models in the cluster are experiencing high latency and low throughput, that may indicate a networking bottleneck. An increase in waiting requests within the Queue Time and Scheduler State metrics may point to a scheduling issue, possibly due to a lack of available GPUs or an issue with the GPU scheduler itself. For researchers and other users, these metrics can provide insight into fine-tuning their specific model configurations. For example, if a user observes a high Time per Output Token (TPOT), that indicates the model is struggling to generate the full response quickly. By analyzing cache utilization alongside this latency data, the user might realize their request batches are too large for the available GPU memory. To improve performance, they can then adjust their Max Generation Tokens or reconfigure their batching strategy. Applying metrics this way lets users move beyond guesswork and directly optimize their model setups.
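The diagnostic reasoning above can be sketched as a small triage function that maps metric readings to likely causes. The thresholds are made-up illustrations, not MOC operating policy.

```python
# Hedged heuristic sketch: map metric readings to likely bottlenecks.
# Thresholds below are illustrative only; tune them for real workloads.
def triage(queue_time_s, tpot_s, kv_cache_util):
    findings = []
    if queue_time_s > 5.0:
        findings.append("requests queueing: check GPU availability and scheduler")
    if tpot_s > 0.1 and kv_cache_util > 0.9:
        findings.append("slow decode with near-full KV cache: "
                        "reduce batch size or Max Generation Tokens")
    elif tpot_s > 0.1:
        findings.append("slow decode: inspect GPU utilization and model size")
    return findings or ["no obvious bottleneck"]

print(triage(queue_time_s=12.0, tpot_s=0.15, kv_cache_util=0.95))
```

A real alerting rule would live in Prometheus rather than application code, but the logic is the same: combine queue, latency, and cache signals before pointing at a cause.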
We also have access to many GPU metrics through the NVIDIA GPU Operator. This tool manages all the NVIDIA drivers and software needed to utilize GPUs in our clusters, and it also includes a health monitor, the DCGM Exporter. This exporter tracks real-time GPU data such as temperature, utilization, power usage, memory consumption, and fan speed. By watching GPU utilization, we can see exactly how heavily our GPUs are being used, helping us spot bottlenecks or wasted resources. We then have a dashboard to visualize all of these metrics (see Figure 4), giving us a clear view of our infrastructure's performance and helping us keep the MOC's hardware stable and efficient.
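Spotting wasted resources from DCGM data can be as simple as averaging utilization samples per GPU. `DCGM_FI_DEV_GPU_UTIL` is the DCGM exporter's utilization field; the sample values below are made up for illustration.

```python
# Illustrative DCGM_FI_DEV_GPU_UTIL samples (percent) per GPU over time.
samples = {
    "gpu-0": [95, 97, 93, 96],
    "gpu-1": [12, 8, 15, 10],   # mostly idle: a reclaim candidate
    "gpu-2": [70, 75, 68, 72],
}

def underutilized(samples, threshold=30):
    """Return GPUs whose mean utilization falls below the threshold."""
    return sorted(
        gpu for gpu, utils in samples.items()
        if sum(utils) / len(utils) < threshold
    )

print(underutilized(samples))  # ['gpu-1']
```

In production this averaging is a one-line PromQL query over the scraped series; the sketch just shows the underlying arithmetic.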

For future work, we are investigating gathering not just metrics for LLMs, but also traces. While metrics provide a high-level overview of LLM performance, tracing will allow us to follow a single request's entire journey through the system. This will allow us to pinpoint exactly where a delay occurs, whether it's in networking, the prefill phase, or the hardware itself, giving us even greater insight into model performance and how to optimize it.
Through Red Hat OpenShift AI (RHOAI), we are also able to utilize TrustyAI, an open source toolkit for responsible AI development and deployment. By using TrustyAI, we can ensure our models are fair, safe, and transparent. It acts like a safety inspector by tracking key metrics: it monitors accuracy to ensure the model stays reliable over time, checks for bias to ensure fair treatment across all groups, and detects data drift to alert us if real-world data has changed too much. It also sets up guardrails to prevent harmful or incorrect answers and provides explainability scores to show exactly why a model made a certain decision. Together, these tools make our AI projects more reliable, accountable, and trustworthy.
Collaboration drives breakthroughs
The true advantage for researchers working with the MOC isn’t just access to a cluster; it’s the collaborative partnership between researchers and Red Hat engineers. We work directly with research teams to understand their specific requirements and implement the tools they need to succeed, whether it’s providing access to specific metrics or developing new infrastructure capabilities. A great example of this is one of our collaborations with the NAIRR pilot project Multi-Modal Semantic Routing for vLLM. This group was interested in deploying the Qwen3 Text-to-Speech model. They wanted to utilize the vLLM-Omni serving runtime that allows for multi-modal model deployments (such as image, video, and audio), but this serving runtime was not yet available in our RHOAI setup.
Before this, users and researchers had worked only with text-based LLMs, so there had been no need to support multi-modal model deployments. Working directly with Red Hat engineers, the MOC was able to implement and deploy the necessary runtime, allowing experimenters to move their research and our infrastructure forward. Working with the MOC and having a Red Hat engineer involved means your technical needs are heard and integrated into the MOC, ensuring the environment evolves to support your specific research. Now, any researcher who would like to deploy a multi-modal model can utilize this serving runtime.
The successful deployment of this new capability exemplifies how responsive support drives the MOC platform forward, translating specific project requirements for one user into enhanced functionality for the entire user base. This dynamic approach to development is the platform’s core benefit, enabling experimenters to dedicate their full attention to accelerating research breakthroughs.
A complete list of metrics available to MOC users
vLLM Metrics
- End-to-End Request Latency: Total time from request receipt to the final token return
- Token Throughput: Tokens processed per second, separated by Prompt and Generation tokens
- Time Per Output Token (TPOT): Average time to generate each successive token after the first
- Time to First Token (TTFT): Latency before the first generated token is emitted
- Inter-Token Latency (ITL): Time between consecutive output tokens during streaming generation
- Queue Time: Duration a request spends waiting in the scheduler queue before execution begins
- Prefill & Decode Time: Average time spent processing the input prompt (prefill) versus generating output tokens (decode) per request
- Scheduler State: Current count of requests in Running, Waiting, or Swapped states
- KV Cache Utilization: Percentage of KV Cache blocks currently in use by active sequences
- Finish Reason: Breakdown of completed requests by reason: end-of-sequence token generated (stop) or max token limit reached (length)
- Max Generation Tokens: The configured upper limit for output tokens per sequence group
- Token Distribution: Distribution of input prompt lengths and output generation lengths across requests
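The latency metrics above are related: a request's decode time is its end-to-end latency minus TTFT, and TPOT averages that over the output tokens generated after the first. A small sketch of that relationship, with illustrative numbers:

```python
# Derive TPOT from end-to-end latency, TTFT, and output token count.
def tpot(e2e_latency_s, ttft_s, output_tokens):
    """Average time per output token after the first one."""
    assert output_tokens > 1, "TPOT needs at least two output tokens"
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# 4.2 s end-to-end, 0.2 s to first token, 101 output tokens
print(round(tpot(4.2, 0.2, 101), 3))  # 0.04
```

vLLM reports these as separate histograms, but the identity is handy for sanity-checking one metric against the others on a dashboard.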
LLM Load Test Metrics
- Response Time: End-to-end request latency from send to final response, including all network and inference time
- Time Per Output Token (TPOT): Time to generate each output token after the first
- Time to First Token (TTFT): Time from sending the request to receiving the first generated token
- Inter-Token Latency (ITL): Time between consecutive output tokens during streaming
- Time to Acknowledge (TT_ACK): Time from sending a request to the server's first HTTP acknowledgement, before any tokens are generated; isolates network and routing latency from model inference time
- Throughput: Total output tokens generated per second across all concurrent requests
- Total Failures: Count of requests that failed during each load test run
- Total Requests: Number of requests sent in each load test run
- Mean Input Tokens: Average number of input (prompt) tokens per request
- Mean Output Tokens: Average number of output (generated) tokens per request
Responsible AI & Model Health
- Model Accuracy: The model’s correctness against benchmark data
- Fairness (SPD & DIR): Statistical metrics (Statistical Parity Difference and Disparate Impact Ratio) that ensure equitable outcomes across different demographics
- Data Drift: Detection of shifts in input data distribution relative to a reference, which may degrade model reliability
- Explainability Scores: Weights (SHAP/LIME) that show which input features most heavily influenced a decision
- Guardrail Orchestration: Tracking of blocked prompts or responses that breached safety, toxicity, or Personally Identifiable Information filters
Infrastructure & Hardware Metrics
- Utilization & Performance:
- GPU Utilization: The percentage of processing power currently in use
- Memory Bandwidth Utilization: The percentage of the memory interface currently in use
- Tensor Core Activity: Ratio of cycles the Tensor (HMMA) pipes are active, a key measure of AI-specific workload efficiency
- SM Occupancy: The ratio of active warps resident on a Streaming Multiprocessor (SM)
- Hardware Health & Constraints:
- GPU Temperature: Measure of the GPU heat level
- Memory Temperature: Measure of the VRAM heat level
- Fan Speed: Measure of the GPU fan speeds
- Power Usage: Current electric draw in watts
- Energy Consumption: Total energy used since boot
- Clock Frequencies: Current speeds for both the SM and Memory clocks (in MHz)
- Communications & Errors:
- NVLink Bandwidth: Data transfer rates between GPUs, essential for multi-GPU model deployments
- PCIe Throughput: Rate of data received and transmitted over the PCIe bus
- XID Errors: Tracking of specific hardware error codes for rapid troubleshooting
- ECC & Retired Pages: Monitoring for memory errors (single-bit or double-bit) and the health of the GPU’s physical memory cells