Discovering Opportunities for Optimizing OpenShift Energy Consumption
Abstract
Drawing from our collective experience, we believe a wide array of opportunities for energy optimization exists within OpenShift. However, we've learned that the most effective approach begins with a meticulous quantification of a system's energy and performance characteristics [1]. Once this quantification is accomplished, it becomes possible to pinpoint and prioritize optimization opportunities based on their generality and potential impact. While both of our initiatives hold valuable insights into potential optimizations, we currently lack such a quantified foundation.
Our collaborative team is uniquely positioned to undertake a comprehensive study of performance and energy dynamics within the OpenShift environment. Leveraging Kepler and the expertise of the Red Hat PEAKS team, we possess a well-integrated source of energy data specific to OpenShift. Furthermore, our past work in fine-grain energy investigations provides a robust methodological framework for conducting this study. Our primary objective is to construct a dataset that establishes baseline performance and energy consumption profiles. This dataset will use various synthetic and benchmark containers to generate loads, all meticulously measured using Kepler. As the study progresses, we will augment this dataset with configurations designed to explore the PEAKS project's proposed utilization optimizations and the potential for network-oriented container optimization, as explored by Drs. Dong and Arora. This rich dataset will also facilitate the development and comparison of machine-learning strategies for control.
Accomplishments
- Established a hardware environment for controlled experiments that allows us to explore the effectiveness of proposed Kepler-driven Kubernetes scheduling changes.
- Worked as a team across BU, Red Hat and IBM to:
- Create a reproducible deployment environment on the Massachusetts Open Cloud (MOC) for running a distributed Kubernetes cluster with Kepler-exported metrics enabled (https://github.com/SustainableOpenShift/peakler/tree/main/scripts).
- Discuss and implement the following set of validation benchmarks and scripts:
- Began our own evaluation of power consumption using metrics and mechanisms we established in our prior work. This allows us to gather data that we can use to corroborate and validate Kepler's data.
- Measured the bare MOC server's power consumption under stress-ng, to serve as a baseline for comparison against Kepler metrics and to validate against internal behavior that Red Hat and IBM have observed. Details and analysis of our findings can be found in: https://github.com/SustainableOpenShift/peakler/blob/main/experiments/stress-ng/jupyter-notebook/graph1.ipynb, https://github.com/SustainableOpenShift/peakler/blob/main/experiments/stress-ng/jupyter-notebook/stress-ng-vm-graph.ipynb
- Developed a set of microbenchmarks to run inside a Kubernetes pod and use Kepler to validate the collected power metrics: https://github.com/SustainableOpenShift/peakler/blob/main/experiments/microbenchmarks/jupyter-notebook/kepler_alu_parallel.ipynb
- We were able to validate Kepler's power metrics against the bare-metal measurements.
- We noticed accuracy issues with other metrics, such as instruction counts, and have notified the Kepler team.
- Measured Kepler’s inherent power costs on the MOC server node (https://github.com/SustainableOpenShift/peakler/blob/main/experiments/microbenchmarks/jupyter-notebook/kepler_sleep.ipynb)
- The bare node idles at around 56 W; after Kubernetes+Kepler is deployed, idle power rises to around 74 W, roughly 18 W of standing overhead.
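The arithmetic behind these power comparisons is simple but easy to get subtly wrong: RAPL exposes cumulative energy in microjoules, and the counter wraps around. A minimal sketch of the conversion (ours, not code from the repo; the wrap constant below is illustrative and should really be read from sysfs):

```python
# Average power from two cumulative RAPL energy_uj samples, handling counter
# wraparound. RAPL_WRAP_UJ is an illustrative value; in practice read it from
# /sys/class/powercap/intel-rapl:0/max_energy_range_uj on the target node.
RAPL_WRAP_UJ = 262_143_328_850

def avg_power_w(e0_uj, e1_uj, elapsed_s, wrap_uj=RAPL_WRAP_UJ):
    """Average watts between two cumulative microjoule readings."""
    delta_uj = e1_uj - e0_uj
    if delta_uj < 0:          # counter wrapped between the two samples
        delta_uj += wrap_uj
    return delta_uj / 1e6 / elapsed_s

# Sanity check against the numbers above: 560 J over 10 s is the ~56 W
# bare-node idle; 740 J over 10 s is the ~74 W Kubernetes+Kepler idle.
print(avg_power_w(0, 560_000_000, 10.0))  # 56.0
print(avg_power_w(0, 740_000_000, 10.0))  # 74.0
```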
- Given these validation experiments, we decided the next step is to find a realistic Kubernetes scheduler workload with which to explore the use of Kepler and PEAKS. Based on a recently published paper, Cilantro [2], we have begun exploring a more complex and comprehensive benchmark for evaluating performance and power optimizations in Kubernetes on hardware we can access.
- We have deployed the system and reproduced the experimental results demonstrated in the paper using the Hotel Reservation benchmark from DeathStarBench [3]: (https://github.com/SustainableOpenShift/peakler/blob/main/experiments/cilantro/jupyter-notebook/graph1.ipynb, https://github.com/SustainableOpenShift/peakler/blob/main/experiments/cilantro/jupyter-notebook/r320_4nodes.ipynb)
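The per-container validation work above rests on Kepler's cumulative energy counters as scraped by Prometheus. A minimal sketch of turning two scrapes into average watts per container (the metric and label names below match what recent Kepler releases export, but treat them as assumptions and check your deployment's /metrics output):

```python
import re

# Convert Kepler's cumulative per-container energy counters, as they appear
# in a Prometheus text scrape, into average watts over the scrape interval.
METRIC = re.compile(
    r'kepler_container_joules_total\{[^}]*container_name="([^"]+)"[^}]*\}\s+([0-9eE.+-]+)')

def parse_scrape(text):
    """Map container_name -> cumulative joules from a Prometheus text scrape."""
    return {m.group(1): float(m.group(2)) for m in METRIC.finditer(text)}

def container_watts(scrape_t0, scrape_t1, interval_s):
    """Average power per container between two scrapes taken interval_s apart."""
    j0, j1 = parse_scrape(scrape_t0), parse_scrape(scrape_t1)
    return {c: (j1[c] - j0.get(c, 0.0)) / interval_s for c in j1}

# Illustrative scrapes 15 s apart: 300 J consumed -> 20 W average.
t0 = 'kepler_container_joules_total{container_name="stress",mode="dynamic"} 100'
t1 = 'kepler_container_joules_total{container_name="stress",mode="dynamic"} 400'
print(container_watts(t0, t1, 15.0))  # {'stress': 20.0}
```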
Current Goals/Efforts:
Using the benchmark from DeathStarBench and the Cilantro system, we have begun adding a custom Kubernetes scheduler that is Kepler-aware, so that it can make power-optimization decisions: https://github.com/SustainableOpenShift/cilantro/blob/main/experiments/microservices/starters/cilantro_cfgs/kind/config_cilantro_scheduler_bayop.yaml
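To make the idea concrete, here is a sketch (ours, not the actual scheduler code) of the kind of scoring step such a scheduler could apply: each candidate node reports its free CPU and a marginal-power estimate, and the scheduler places the pod where the estimated power increase is lowest. The field names and the constant marginal-power model are assumptions; in practice the per-node `w_per_cpu` figure would be fitted from Kepler measurements per node type.

```python
# Power-aware node selection sketch. Each node dict carries:
#   "name": node identifier
#   "free_cpu": schedulable CPUs remaining
#   "w_per_cpu": marginal watts per CPU (hypothetical, fitted from Kepler data)
def pick_node(nodes, pod_cpu):
    """Return the name of the feasible node with the lowest estimated
    power increase for this pod, or None if no node fits."""
    feasible = [n for n in nodes if n["free_cpu"] >= pod_cpu]
    if not feasible:
        return None
    best = min(feasible, key=lambda n: pod_cpu * n["w_per_cpu"])
    return best["name"]

nodes = [
    {"name": "old-r320", "free_cpu": 8, "w_per_cpu": 6.0},
    {"name": "new-node", "free_cpu": 8, "w_per_cpu": 3.0},
]
print(pick_node(nodes, pod_cpu=2))  # new-node
```

A real implementation would plug this score into Cilantro's scheduler hooks and would also have to weigh performance objectives such as P99 latency, not power alone.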
Two main questions our work seeks to address:
- Cilantro already provides the mechanisms to implement a custom scheduler in Kubernetes. While its authors focus only on performance (i.e., P99 latency), we are interested in using their approach to explore energy efficiency in scheduling instead.
- Recent work [4] has illustrated that not all components of a complicated application such as Hotel Reservation (which consists of over 20 microservices deployed across various Kubernetes pods) need to be deployed on the latest and greatest hardware to maintain performance. This suggests that a more carbon-aware approach is also feasible, in which older/recycled hardware (and therefore lower embodied carbon) is used in place of newer hardware.
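The second question comes down to an accounting exercise: amortized embodied carbon plus operational carbon. A sketch with purely illustrative numbers (every figure below is hypothetical, chosen only to show the shape of the trade-off):

```python
# Total carbon attributable to running a service for a slice of a server's
# lifetime: amortized embodied carbon plus operational (grid) carbon.
def total_carbon_kg(embodied_kg, lifetime_h, runtime_h, avg_w, grid_kg_per_kwh):
    """Embodied carbon prorated over the runtime slice, plus energy carbon."""
    embodied = embodied_kg * (runtime_h / lifetime_h)
    operational = (avg_w * runtime_h / 1000.0) * grid_kg_per_kwh
    return embodied + operational

year_h = 24 * 365
# Older server: already manufactured, so no new embodied cost, but less efficient.
old = total_carbon_kg(embodied_kg=0.0, lifetime_h=year_h, runtime_h=year_h,
                      avg_w=120.0, grid_kg_per_kwh=0.4)
# Newer server: more efficient, but a share of its manufacture is charged here.
new = total_carbon_kg(embodied_kg=1200.0, lifetime_h=4 * year_h, runtime_h=year_h,
                      avg_w=80.0, grid_kg_per_kwh=0.4)
print(round(old), round(new))  # 420 580
```

Under these made-up inputs the recycled server wins despite drawing more power, which is exactly the kind of result a carbon-aware scheduler would need real measurements to confirm or refute.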
Talks/Posters/Demonstrations:
- Talk proposal (titled: Power Efficiency Aware Kubernetes Scheduler) under submission to Red Hat Devconf 2024
Bibliography
- Han Dong, Sanjay Arora, Yara Awad, Tommy Unger, Orran Krieger, and Jonathan Appavoo. Slowing Down for Performance and Energy: An OS-Centric Study in Network Driven Workloads. https://arxiv.org/abs/2112.07010, 2021.
- Romil Bhardwaj, Kirthevasan Kandasamy, Asim Biswal, Wenshuo Guo, Benjamin Hindman, Joseph Gonzalez, Michael Jordan, and Ion Stoica. Cilantro: Performance-Aware Resource Allocation for General Objectives via Online Feedback. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI '23).
- Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). Association for Computing Machinery, New York, NY, USA, 3–18. https://doi.org/10.1145/3297858.3304013
- Jaylen Wang, Udit Gupta, and Akshitha Sriraman. 2023. Peeling Back the Carbon Curtain: Carbon Optimization Challenges in Cloud Computing. In Proceedings of the 2nd Workshop on Sustainable Computer Systems (HotCarbon ’23). Association for Computing Machinery, New York, NY, USA, Article 8, 1–7. https://doi.org/10.1145/3604930.3605718