Red Hat Research Quarterly

Smarter AI, fewer resources: bringing cloud AI into real-time edge devices to unlock performance


About the author

Eshed Ohn-Bar

Dr. Eshed Ohn-Bar, an Assistant Professor in the Electrical and Computer Engineering Department at Boston University, is passionate about building robust, efficient, and safe AI at scale.

A new AI framework for edge systems overcomes the communication and energy obstacles that limit their use in real-time applications by integrating local and cloud decision-making while maintaining strong performance.

Artificial intelligence (AI) models with vast and generalized knowledge are increasingly being integrated into everyday devices, from smartphones that provide personalized assistance to mobile robots and vehicles that continuously monitor and interact with their surroundings. Yet these powerful AI models are currently constrained by the limited resources of these edge devices. 

Running a large, accurate AI model on a smartphone or a mobile robot can drain its battery within minutes and requires significant energy and hardware resources. As these models continue to grow in size and computational demands (e.g., requiring expensive GPUs), deploying them across millions of everyday devices becomes increasingly difficult, expensive, and environmentally unsustainable. As part of the collaborative project Minimal Mobile Systems via Cloud-based Adaptive Task Processing, researchers at Red Hat and Boston University developed a new framework that optimizes computation to enable more efficient real-time AI applications without sacrificing model accuracy.

Motivation

Traditionally, AI computations are offloaded to remote servers. This can save on-device resources, as local image and text data are sent to models in the cloud. Smart assistants often use this approach to offload as much computation as possible to the cloud, helping to preserve energy and local device resources. While this method is widely used today in systems like ChatGPT, relying on the cloud can introduce delays, making it unsuitable for real-time or safety-critical applications. For a robot, even a brief delay can be dangerous—for example, causing a mobile system to collide with a nearby pedestrian. As a result, latency-constrained edge systems often depend on expensive local hardware and resources to ensure quick responses. Can we design edge systems that seamlessly balance cloud and local resources to optimize for real-time accuracy, efficiency, and safety across different situations?

To address urgent societal and sustainability needs with existing systems and models, engineers today resort to various ad hoc strategies. Developers may try lightweight, compressed models, but these smaller models suffer from degraded accuracy and unreliable performance, such as failing to detect that nearby pedestrian. Models can also be carefully tuned for specific devices and scenarios, but they struggle when faced with diverse operational tasks that may need more computational power. One promising alternative is systems that automatically adapt on the fly, adjusting when, where, and how computations are performed as needed.

In work presented at the European Conference on Computer Vision 2024, researchers from Red Hat and Boston University collaborated to develop a novel framework that dynamically learns to balance shared computation across various devices and operational settings. The proposed system, UniLCD (Unified Local-Cloud Decision-Making), introduces a new approach based on a field in machine learning called reinforcement learning (RL), where the system learns by trial and error, receiving rewards or penalties based on its actions. This method trains a flexible model to decide, based on the current scenario and task, whether to offload computation to the cloud or process it locally.

Our method—UniLCD

UniLCD is a dynamic approach that empowers resource-constrained devices—such as smartphones, autonomous vehicles, and mobile robots—with the ability to leverage both local processing power and cloud resources. 

At its core, UniLCD comprises a context-dependent routing module, which takes as input an embedding, that is, a compressed representation of the current state together with a history of past system decisions. This routing module is trained using RL to determine a decision policy: whether to take a local action using a lightweight but less accurate model, or to transmit local information to the cloud server model, which is larger and more accurate but incurs latency. While this approach can be applied to any real-time AI application and edge device, Figure 1 illustrates an example system for a camera-based mobile robot navigation task.
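Conceptually, the routing module is just a small classifier over the current embedding and a summary of the decision history. The sketch below is a minimal, illustrative PyTorch rendering of this idea; the class name, layer sizes, and dimensions are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RoutingModule(nn.Module):
    """Illustrative two-way router: score 'act locally' vs. 'offload to
    the cloud' from the current embedding and a summary of past decisions."""

    def __init__(self, embed_dim=256, history_dim=256, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + history_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # logits: [local, cloud]
        )

    def forward(self, embedding, history):
        x = torch.cat([embedding, history], dim=-1)
        return self.net(x)
```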

Figure 1. Overview of UniLCD for a robot navigation task. The framework learns to offload tasks to the cloud while maintaining real-time performance.

The primary goal of our system is to learn when to offload computations to the cloud while meeting safety and real-time requirements. As shown in Figure 1, the local decision-making model (also referred to as the local policy) consists of a truncated neural network designed to rapidly process image and goal observations. The extracted features, or embedding, are then combined with a memory buffer that stores a history of past observations, providing additional context for the system. This historical data enables the system to observe latency dynamics and adapt to various constraints, such as limited communication settings. The memory is passed to a multi-layer perceptron (MLP) routing module, which determines whether to offload the current embedding to the cloud for further processing with a subsequent neural network or to classify a navigation action—such as steering, braking, or accelerating—locally. The complete algorithm for training the routing policy is shown in Figure 2.
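To make the data flow concrete, a single control step might look like the hypothetical loop below. The encoder, router, local head, and send_to_cloud names are illustrative stand-ins for the components in Figure 1, and the mean-pooled history is just one simple way to summarize the memory buffer.

```python
from collections import deque

import torch

history = deque(maxlen=8)  # memory buffer of recent embeddings and decisions

def summarize(buffer, dim=256):
    """Collapse the buffer into a fixed-size context vector (here, a mean)."""
    if not buffer:
        return torch.zeros(1, dim)
    return torch.stack([emb for emb, _ in buffer]).mean(dim=0)

def control_step(image, goal, encoder, router, local_head, send_to_cloud):
    """One decision step: encode the observations, consult the router, then
    either classify an action locally or offload to the larger cloud model."""
    with torch.no_grad():
        embedding = encoder(image, goal)       # truncated local network
        context = summarize(history)           # context from past decisions
        logits = router(embedding, context)
        offload = logits.argmax(dim=-1).item() == 1

        if offload:
            action = send_to_cloud(embedding)  # more accurate, but adds latency
        else:
            action = local_head(embedding)     # lightweight local classifier
    history.append((embedding, offload))
    return action
```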

Figure 2. Training a generalized routing policy with reinforcement learning. The algorithm continuously updates a minimal local neural network that classifies between local and cloud operations.

As shown in the algorithm, UniLCD learns by receiving a reinforcement signal, or reward, based on the outcomes of its decisions. For example, a mobile system should learn to strategically interleave cloud computation, particularly when encountering challenging scenarios, to improve the accuracy of the lightweight, lower-accuracy local model. In the case of our navigation task, if the mobile robot successfully moves closer to the goal, reduces energy consumption, or selects effective action ranges and speeds, it receives a positive reward. If it comes close to colliding with an object, which is undesirable, it receives a negative reward. The complete reward at each time step is computed as:
R_t = α · r_goal · r_speed · r_energy · r_collision

Here, alpha is a scaling factor that adjusts the overall reward to fall within the range [0, 1]. This reward ensures that the resulting policy optimizes task performance as well as energy and communication constraints. In general, designing a multi-objective reward function can be complex, even for relatively simple tasks (e.g., robot navigation without dynamic objects, as often explored in prior work), and RL typically requires extensive iteration in training. One key finding is that the design of the reward function significantly impacts training efficiency and convergence. Because the reward terms are multiplied, the need for extensive tuning of individual components is reduced: if one term is low, it diminishes the overall reward, and an effective policy emerges within just a few minutes of operation. Once this initial training is complete, the policy can be deployed without additional training, though the model can be continually updated on incoming observations (e.g., for further efficiency gains) or adapt automatically to novel scenarios, platforms, and communication modes.
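To make the multiplicative structure concrete, here is a minimal sketch of how such a step reward might be computed. The term names and value ranges are simplifying assumptions for illustration; the paper's exact term definitions differ.

```python
def step_reward(r_goal, r_speed, r_energy, r_collision, alpha=1.0):
    """Illustrative multiplicative reward: the shaping terms lie in [0, 1],
    so any weak term drags the whole product down, and a near-collision
    makes r_collision negative, turning the total reward into a penalty."""
    return alpha * r_goal * r_speed * r_energy * r_collision

# Example: strong goal progress but wasteful energy use yields a low reward.
print(step_reward(r_goal=0.9, r_speed=0.8, r_energy=0.1, r_collision=1.0))  # 0.072
```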

Results 

To rigorously validate the system, a simulation environment was developed for sidewalk robot navigation in crowded outdoor settings. This environment captures complex scenarios that require frequent switching and high responsiveness, thus showcasing UniLCD’s robust capabilities in handling challenging, dynamic tasks that demand seamless cloud-edge integration. To realistically model real-world constraints, the simulation also introduces stochastic delays in data transmission between the local device and the cloud server, effectively capturing the impact of latency.
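One simple way to emulate such stochastic transmission delays is to sample a round-trip time for every offload request. The distribution and parameters below are illustrative assumptions, not the simulator's actual model.

```python
import random

def sample_cloud_latency(base_ms=50.0, jitter_ms=30.0, drop_prob=0.01):
    """Illustrative round-trip delay: a fixed base latency plus exponentially
    distributed jitter, with an occasional dropped response."""
    if random.random() < drop_prob:
        return None  # response lost; the local policy must act on its own
    return base_ms + random.expovariate(1.0 / jitter_ms)
```

Training the routing policy under delays like these is what teaches it to fall back on the local model when the network is slow or unreliable.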

In the most difficult and dense settings, UniLCD outperformed all prior baselines by over 35% on the newly introduced Ecological Navigation Score, a metric that combines task performance (e.g., collisions, route completion, overall task time) with overall energy cost. In these intricate settings, baselines relying on naive model splitting or pruning produced poor navigation and frequent collisions, as their design does not holistically consider environmental, communication, and safety contexts. The strong performance persisted across environmental conditions and different models, including very small local models for resource-limited use cases. This remarkable generalizability marks a significant step toward broad, ultra-low-cost deployments, which are currently being explored in follow-up research. Real-time, cloud-integrated systems with lightweight local models and minimal hardware requirements—such as smartphones—could be deployed in broader and more diverse settings, delivering high-performance operation with minimal degradation.

Future applications

UniLCD has the potential to reshape the future of edge computing by seamlessly integrating local and cloud-based decision-making. This novel framework is currently being integrated into Red Hat OpenShift, providing a flexible solution for enabling large-scale, real-world deployments across various communication and modeling configurations. While challenges remain, including accelerating RL model training to solve for an optimal local-cloud policy within just a handful of interactions, there are several exciting future opportunities. Given the generalized nature of the routing mechanism, a potential approach to speeding up training further could be collaborative training over data from different platforms and tasks. 

By significantly reducing the energy consumption and cost of powerful AI models, UniLCD could unlock transformative possibilities to address societal needs across a range of domains, including transportation, healthcare, and disaster response, where real-time and efficient processing is essential. For example, autonomous vehicles could offload tasks to cloud models to conserve energy and enhance safety. Lower-cost assistive robots could operate with precision and energy efficiency in various home environments, minimizing failures associated with low-accuracy edge models or delays from waiting for cloud-based predictions. In disaster zones, robots could manage resources efficiently, adapting to different communication infrastructures and operating for extended periods without sacrificing accuracy during the most crucial moments. Handheld smartphones could provide continual and reliable support when assisting users without rapidly depleting battery life. As researchers continue to push the boundaries of what’s possible, UniLCD brings us one step closer to a future where smarter, faster, and more sustainable AI systems are seamlessly integrated into our daily lives.
