Red Hat Research Quarterly

BigDataStack delivers with contributions from industry and university partners

Yosef Moatti

Yosef Moatti, IBM, graduated from the Ecole Politechnique of Paris and holds an engineering doctorate from Télécom Paris. He has been involved in cloud technologies for the past ten years.

about the author

Oshrit Feder

Oshrit Feder, IBM, holds a BSc in Computer science from Tel Aviv university and an MBA from the Technion and has been working with cloud technologies since 2008.

about the author

Guy Khazma

Guy Khazma, IBM, is a research staff member in the IBM Cloud and Data Technologies group and has been working on SQL analytics over object storage using Apache Spark.

about the author

Gal Lushi

Gal Lushi, IBM, is a research staff member in the IBM Cloud and Data Technologies group and has been working on SQL analytics over object storage using Apache Spark as well as Analytics over encrypted data.

about the author

Paula Ta-Shma

Paula Ta-Shma, IBM, is a senior technical staff member in the Cloud and Data Technologies group at IBM Research-Haifa and is responsible for a group of research efforts in the area of hybrid data, with a particular focus on handling big data for both analytics and machine learning.

about the author

Luis Tomás Bolivar

Luis Tomás Bolivar is a Principal Software Engineer at Red Hat, Spain. He is currently working in the Ecosystem Engineering group and on research activities with a focus on cloud computing in general, and automation, networking, and AI in particular. He holds a PhD in Computer Science (University of Castilla-La Mancha, Spain) and has been involved in several EU projects.

about the author

Miki Kenneth

Miki Kenneth, Red Hat, Director of Software Engineering, manages the Red Hat engineering center in Tel Aviv. She is passionate about bringing industry, government, and academic partners together.

about the author

Josh Salomon

Josh Salomon, Red Hat, is an architect in the Ceph storage team, with more than thirty years of experience in software development and architecture roles.

Related Projects

Big Data Stack EU Project: An European Open Source Initiative

Article featured in

Red Hat Research Quarterly

August 2021

Download PDF

Subscribe now

In this issue

From the Director

Programmable networks, hardware—what’s next, programmable enterprises?

Hugh Brock

News

Europe RIG launched June 3, 2021

Matej Hrušovský

News

Red Hat joins IBM PhD Fellowship program

Marek Grác

News

Gender diversity on the rise among engineers

Matej Hrušovský

Feature

User authentication for open source developers: what do they use?

Agáta Kružíková

Milan Brož

Interview

The right idea at the right time: networking researchers use open source for real-world results

Toke Høiland-Jørgensen

Feature

Faster hardware through software

Gordon Haff

Feature

“When one teaches, two learn”: making the most of technical research mentorship

Matej Hrušovský

Lis Strenger

Feature

BigDataStack delivers with contributions from industry and university partners

Yosef Moatti

Oshrit Feder

Guy Khazma

Gal Lushi

Paula Ta-Shma

Luis Tomás Bolivar

Miki Kenneth

Josh Salomon

Column

Building better through research

Heidi Dempsey

Data skipping and network performance improvement technologies prove their value in data-intensive applications.

In 2018, fourteen industry and university partners launched a collaborative project designed to optimize big data analytics performance, efficiency, and scalability. Known as the BigDataStack, the project, funded by the European Commission Horizon 2020 Work Programme, was featured for the first time in Red Hat Research Quarterly in the May 2019 issue. In January 2021, the BigDataStack consortium delivered their complete solution.

The project has more than met its objectives. The final review report opens by saying, “BigDataStack has delivered exceptional results with significant immediate or potential impact,” which testifies to the project’s quality. Indeed, many significant research efforts in this project led to state-of-the-art technologies, with no less than three chosen by the EU Innovation Radar program, which aims at discovering cutting-edge EU-funded innovations. Since it would be impossible within the scope of this article to address, even very briefly, all of these significant results, we have chosen to focus on two open source technologies: Xskipper¹, developed by IBM, and Kuryr, developed with contributions from Red Hat. We detail their technological background and their impact on BigDataStack use cases and, more broadly, on industrial deployments.

These two technologies tackle and solve acute performance problems. Xskipper gives a performance boost for SQL analytics with big data workloads, while the contribution to Kuryr solves an important performance problem with networking. Although they deal with different levels of big data infrastructure, they have a lot in common. First, they both contribute significant performance improvements that directly affect use cases of BigDataStack. Second, both were deployed in production as early as one year ahead of BigDataStack’s completion, and in the course of 2020 they proved their value in real production environments. Finally, both were open sourced and identified by the EU Innovation Radar program as key innovator technologies.

Xskipper technology

The Xskipper framework was developed by IBM as an extension to Apache Spark, arguably the most prominent analytics engine for large-scale data processing. Xskipper permits the creation, management, and deployment of data skipping indexes. Its concrete impact is to enable a huge performance acceleration of Apache Spark SQL queries when targeting datasets in an object store.

Let’s briefly explain what data skipping is all about. For a given dataset stored in an object store, the initial step is to build the index, or, in other words, to build summary metadata for each object of the dataset. For example, if the dataset has a column that represents the temperature, the summary metadata could include the minimum and maximum temperatures of all the data records contained in the object. This summary metadata is significantly smaller than the data itself and can be indexed and stored in object storage. Once the metadata is indexed, SQL queries that apply a predicate on the temperature column—for example, queries looking for temperatures >30°C—can benefit from the index by skipping over all the objects whose metadata does not overlap with the predicate (e.g., we can skip over all objects whose maximum temperature is lower than 30°C, since their data cannot possibly satisfy the query).

Beyond this extremely simple example, Xskipper extends data skipping technology in numerous novel ways, such as supporting user-defined functions, handling Boolean conditions over SQL predicates, and adding other indexes out of the box—for example, a Bloom filter. For details about how to reduce I/O and accelerate SQL performance by orders of magnitude using Xskipper, see the June 2020 IBM blog post “Data skipping for IBM Cloud SQL Query.”

Kuryr Container Network Interface

Kuryr acts as an integration bridge between traditional Red Hat® OpenStack® networking and the container community. Red Hat is the main contributor to the project, and during the course of BigDataStack Red Hat developed performance improvements as well as codebase modernization critical for Kuryr adoption by customers.

Before discussing the contribution of Red Hat to Kuryr, let’s address the rationale of running Red Hat® OpenShift® over OpenStack. Deployments of OpenShift clusters over OpenStack are an important part of the hybrid cloud solution. OpenShift is infrastructure independent: it can be run on public cloud, virtualization, bare metal, or anything that can boot Red Hat®Enterprise Linux®. As enterprises increasingly move workloads to the cloud, they still typically keep a sizable on-premise IT stack, which will be part of the infrastructure of a hybrid cloud solution. This positions OpenStack as an excellent solution to deliver the on-premise portion of hybrid cloud, where OpenShift workloads are run over on-premise OpenStack, similar to the public cloud.

Traditional OpenShift installations leverage OpenShiftSDN, which is specific to OpenShift. Using OpenShiftSDN means that your containers run on a network within a network. This setup suffers from doubleencapsulation, and it introduces an additional layer of complexity that becomes apparent when troubleshooting network issues. Double encapsulation also affects network performance due to the overhead of running a network within a network. In 2018, when BigDataStack was launched, running OpenShift over an OpenStack installation meant having to suffer from double encapsulation.

Red Hat contributions focused on the Kuryr OpenStack upstream project, which avoids the double encapsulation handicap by allowing direct access to OpenStack networking. Kuryr creates OpenStack resources to provide networking for the Kubernetes/OpenShift object: it uses Neutron Ports for pods, Neutron networks/subnets for namespaces, Neutron security groups and security group rules for the network policies, and Octavia load balancers for services. This setup reduces the complexity of the networking layer and also improves network performance. The configuration is detailed in the July 2019 Red Hat blog post “Accelerate your OpenShift network performance on OpenStack with Kuryr.”

Red Hat made four key contributions during the project:

integrating Kuryr into OpenShift 4 (i.e., into the Cluster Network Operator),
modernizing its codebase to better align with the Kubernetes models (custom resource definitions [CRD] based),
implementing network policy support by leveraging OpenStack security groups (needed to keep feature parity with other SDNs, such as OpenShiftSDN or OVN-Kubernetes), and
supporting East/West distributed load balancing through OVN Octavia integration, leading to remarkable performance improvements.

The latter was one of the most important contributions. Before OVN Octavia integration, Kuryr was relying on an OpenStack Octavia Amphora driver that required one load balancer VM per Kubernetes service. This led to resource wastage and control plane slowness due to booting up and configuring the VM (unlike just adding a few OVS flows). OVN Octavia integration was key for Kuryr adoption by customers, as it remarkably decreases the resource consumption by removing the need for one specific VM for each service. It also improves the latency/throughput for pod-to-svc communication (up to two times, as it avoids the extra hop on the network to reach the amphora VM), and it improves control plane action performance by one order of magnitude. Because there is no need to create a VM, it achieves a reduction from minutes to seconds.

Use cases within and beyond BigDataStack

The joint use of Kuryr and Xskipper was instrumental in sizable performance improvements. These improvements were demonstrated with a joint IBM/Danaos talk at Think 2019 with the BigDataStack maritime use case, in which BigDataStack algorithms were used to optimize preventative maintenance on Danaos’ large fleet of container ships.

The ultimate test came with production use cases. The Weather Company use case was one of the best examples of the impact of Xskipper. Thanks to Xskipper and complementary technology, SQL queries on a multi-TB weather dataset could be run 40 times faster, while achieving an order-of-magnitude cost reduction at the same time. For more information, see Paula Ta-Shma’s presentation at DATA+AI Summit Europe 2020, where this use case and its outcomes are detailed.

The benefits of implementing Kuryr are illustrated in its use by Services Australia. Services Australia is responsible for the delivery of advanced, high-quality, and accessible social, health, and child support services and payments. The organization was interested in utilising the benefits of running OpenShift on OpenStack with Kuryr, primarily by allocating floating IPs to OpenShift services. This feature facilitated inter-cluster communication, in addition to the performance enhancements that come with avoiding double encapsulation. The organization achieved results including better network performance (up to 9 times), streamlined networking, and the potential to expose services externally, without administrative intervention. More information is available on the BigDataStack blog.

Conclusion

We have shown that Xskipper and the Red Hat contributions to the Kuryr project are two state-of-the-art technologies that significantly enhance the performance of big data workloads. Even though the BigDataStack consortium delivered its complete stack in 2021 , this project will certainly continue influencing big data infrastructure. For example, Xskipper is evolving in various directions, including integration with Red Hat Open Data Hub, and the team is involved in discussions on data skipping indexes for Apache Iceberg. As for Kuryr, customer adoption is quickly increasing, and new features and enhancements are being requested and implemented, such as support for IPv6 or SCTP protocols and the automatic reconciliation of removed OpenStack resources.

Funded by the European Commission Horizon 2020 Work Programme, the BigDataStack project will deliver a complete collection of open and interoperable technology building blocks for a variety of big data stakeholders, including infrastructure operators, application developers, data providers, data scientists, and data consumers. The BigDataStack project combines business knowledge with academic insights, bringing together engineers, developers, and researchers from the worlds of industry and higher education. Overall, fourteen partners are collaborating on the project including Red Hat, IBM, Atos, and the University of Piraeus Research Center.

The BigDataStack is an open technology stack for big data applications and operations. The project is based on a unique data-driven architecture that optimizes big data analytics performance, efficiency, and scalability by dynamically adapting compute, storage, and network resources based on service attributes, data flows, and application interdependencies.

For an overview of the architecture, theory of operation (dimensioning, deployment, operational phases), and use cases see the article “Researchers and industry deliver open source technology stack for big data applications,” Red Hat Research Quarterly 1:1, written by Dimosthenis Kyriazis, Associate Professor, University of Piraeus Technical Coordinator, BigDataStack Consortium.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 779747.

SHARE THIS ARTICLE

Composing a research symphony

Heidi Dempsey

Many talents contributed to one goal: a shared production-level research cloud. It was a chilly morning at Boston University, and I was looking for a quiet place to gather my thoughts and do some writing. I passed two painters covering up scuffs on the white walls and a man with a floor machine busily tracing […]

Feature

Translation layers for the cloud: speeding storage performance

Peter Desnoyers

A guide to understanding the hidden algorithms that manage the data in our everyday world, from smartphones to cloud apps. We look at which ones perform faster—and why.

Feature

How to open source cloud operations

Marcel Hild

Open source has become a dominant paradigm for developing software. One major factor for its success is its transparency: if you have a problem with the software, you can peek into the details of the code, search the issue tracker, ask for help, and maybe even provide a fix. This means that even though most users don’t write code, the mere fact that everything is open will help the majority of users. Now it’s time to apply the open source model to the cloud.

Feature

When machine learning meets big data processing: From human-native tasks to machine-native tasks

Ilya Kolchinsky

Since the inception of artificial intelligence research, computer scientists have aimed to devise machines that think and learn like human beings. What else could AI do?

News

Publication highlights—August 2022

Red Hat Research collaborates with universities and government agencies to produce papers that bring open source contributions along with them. This is a sampling of recent publications and conference presentations; to see more visit the publications page on the Red Hat Research website.

Interview

The right idea at the right time: networking researchers use open source for real-world results

Toke Høiland-Jørgensen

We invited Red Hat Principal Kernel Engineer Toke Høiland-Jørgensen to interview Anna Brunström, currently a Full Professor and Research Manager for the Distributed Systems and Communications Research Group at Karlstad University, Sweden. Prof. Brunström has a background in distributed systems, but her main area of work over the last years has been in computer networking. Their wide-ranging conversation covers programmable networking, open data, diversity in IT fields, and more.

Column

Focus on artificial intelligence and machine learning | February 2024

Sanjay Arora

AI/ML research is driving performance and efficiency gains and building better developer tools. Red Hat Research and its university partners focus strategically on projects with the most promise to shape the future of how we use technology. Each quarter, RHRQ will publish an overview of our research in a specific area, such as edge computing, […]

Feature

Applying lessons from our upstream hypervisor fuzzer to improve kernel fuzzing

Alexander Bulekov

Bandan Das

Could a grammarless approach increase its effectiveness? Low-level systems such as Linux kernels and hypervisors form the foundation of cloud systems today. The virtual machines (VMs) provided by hypervisors are attractive targets for attackers. Bugs in hypervisors create the risk of an attacker in a malicious VM, compromising the isolation guarantees provided by the hypervisor, […]

From the Editor

AI is changing research collaborations. How will open source research impact AI?

Shaun Strohmer

Research isn’t necessarily about getting the right answers, but asking the right questions. After five years as the editor of RHRQ, I’d like to think I’ve gained some perspective on the questions we want this publication to address. To me, the big-picture questions we’re trying answer four times a year, across multiple disciplines and continents, […]