Data skipping and network performance improvement technologies prove their value in data-intensive applications.
In 2018, fourteen industry and university partners launched a collaborative project designed to optimize big data analytics performance, efficiency, and scalability. Known as the BigDataStack, the project, funded by the European Commission Horizon 2020 Work Programme, was featured for the first time in Red Hat Research Quarterly in the May 2019 issue. In January 2021, the BigDataStack consortium delivered their complete solution.
The project has more than met its objectives. The final review report opens by saying, “BigDataStack has delivered exceptional results with significant immediate or potential impact,” which testifies to the project’s quality. Indeed, many significant research efforts in this project led to state-of-the-art technologies, with no less than three chosen by the EU Innovation Radar program, which aims at discovering cutting-edge EU-funded innovations. Since it would be impossible within the scope of this article to address, even very briefly, all of these significant results, we have chosen to focus on two open source technologies: Xskipper1, developed by IBM, and Kuryr, developed with contributions from Red Hat. We detail their technological background and their impact on BigDataStack use cases and, more broadly, on industrial deployments.
These two technologies tackle and solve acute performance problems. Xskipper gives a performance boost for SQL analytics with big data workloads, while the contribution to Kuryr solves an important performance problem with networking. Although they deal with different levels of big data infrastructure, they have a lot in common. First, they both contribute significant performance improvements that directly affect use cases of BigDataStack. Second, both were deployed in production as early as one year ahead of BigDataStack’s completion, and in the course of 2020 they proved their value in real production environments. Finally, both were open sourced and identified by the EU Innovation Radar program as key innovator technologies.
Xskipper technology
The Xskipper framework was developed by IBM as an extension to Apache Spark, arguably the most prominent analytics engine for large-scale data processing. Xskipper permits the creation, management, and deployment of data skipping indexes. Its concrete impact is to enable a huge performance acceleration of Apache Spark SQL queries when targeting datasets in an object store.
Let’s briefly explain what data skipping is all about. For a given dataset stored in an object store, the initial step is to build the index, or, in other words, to build summary metadata for each object of the dataset. For example, if the dataset has a column that represents the temperature, the summary metadata could include the minimum and maximum temperatures of all the data records contained in the object. This summary metadata is significantly smaller than the data itself and can be indexed and stored in object storage. Once the metadata is indexed, SQL queries that apply a predicate on the temperature column—for example, queries looking for temperatures >30°C—can benefit from the index by skipping over all the objects whose metadata does not overlap with the predicate (e.g., we can skip over all objects whose maximum temperature is lower than 30°C, since their data cannot possibly satisfy the query).
Beyond this extremely simple example, Xskipper extends data skipping technology in numerous novel ways, such as supporting user-defined functions, handling Boolean conditions over SQL predicates, and adding other indexes out of the box—for example, a Bloom filter. For details about how to reduce I/O and accelerate SQL performance by orders of magnitude using Xskipper, see the June 2020 IBM blog post “Data skipping for IBM Cloud SQL Query.”
Kuryr Container Network Interface
Kuryr acts as an integration bridge between traditional Red Hat® OpenStack® networking and the container community. Red Hat is the main contributor to the project, and during the course of BigDataStack Red Hat developed performance improvements as well as codebase modernization critical for Kuryr adoption by customers.
Before discussing the contribution of Red Hat to Kuryr, let’s address the rationale of running Red Hat® OpenShift® over OpenStack. Deployments of OpenShift clusters over OpenStack are an important part of the hybrid cloud solution. OpenShift is infrastructure independent: it can be run on public cloud, virtualization, bare metal, or anything that can boot Red Hat® Enterprise Linux®. As enterprises increasingly move workloads to the cloud, they still typically keep a sizable on-premise IT stack, which will be part of the infrastructure of a hybrid cloud solution. This positions OpenStack as an excellent solution to deliver the on-premise portion of hybrid cloud, where OpenShift workloads are run over on-premise OpenStack, similar to the public cloud.
Traditional OpenShift installations leverage OpenShiftSDN, which is specific to OpenShift. Using OpenShiftSDN means that your containers run on a network within a network. This setup suffers from doubleencapsulation, and it introduces an additional layer of complexity that becomes apparent when troubleshooting network issues. Double encapsulation also affects network performance due to the overhead of running a network within a network. In 2018, when BigDataStack was launched, running OpenShift over an OpenStack installation meant having to suffer from double encapsulation.
Red Hat contributions focused on the Kuryr OpenStack upstream project, which avoids the double encapsulation handicap by allowing direct access to OpenStack networking. Kuryr creates OpenStack resources to provide networking for the Kubernetes/OpenShift object: it uses Neutron Ports for pods, Neutron networks/subnets for namespaces, Neutron security groups and security group rules for the network policies, and Octavia load balancers for services. This setup reduces the complexity of the networking layer and also improves network performance. The configuration is detailed in the July 2019 Red Hat blog post “Accelerate your OpenShift network performance on OpenStack with Kuryr.”
Red Hat made four key contributions during the project:
- integrating Kuryr into OpenShift 4 (i.e., into the Cluster Network Operator),
- modernizing its codebase to better align with the Kubernetes models (custom resource definitions [CRD] based),
- implementing network policy support by leveraging OpenStack security groups (needed to keep feature parity with other SDNs, such as OpenShiftSDN or OVN-Kubernetes), and
- supporting East/West distributed load balancing through OVN Octavia integration, leading to remarkable performance improvements.
The latter was one of the most important contributions. Before OVN Octavia integration, Kuryr was relying on an OpenStack Octavia Amphora driver that required one load balancer VM per Kubernetes service. This led to resource wastage and control plane slowness due to booting up and configuring the VM (unlike just adding a few OVS flows). OVN Octavia integration was key for Kuryr adoption by customers, as it remarkably decreases the resource consumption by removing the need for one specific VM for each service. It also improves the latency/throughput for pod-to-svc communication (up to two times, as it avoids the extra hop on the network to reach the amphora VM), and it improves control plane action performance by one order of magnitude. Because there is no need to create a VM, it achieves a reduction from minutes to seconds.
Use cases within and beyond BigDataStack
The joint use of Kuryr and Xskipper was instrumental in sizable performance improvements. These improvements were demonstrated with a joint IBM/Danaos talk at Think 2019 with the BigDataStack maritime use case, in which BigDataStack algorithms were used to optimize preventative maintenance on Danaos’ large fleet of container ships.
The ultimate test came with production use cases. The Weather Company use case was one of the best examples of the impact of Xskipper. Thanks to Xskipper and complementary technology, SQL queries on a multi-TB weather dataset could be run 40 times faster, while achieving an order-of-magnitude cost reduction at the same time. For more information, see Paula Ta-Shma’s presentation at DATA+AI Summit Europe 2020, where this use case and its outcomes are detailed.
The benefits of implementing Kuryr are illustrated in its use by Services Australia. Services Australia is responsible for the delivery of advanced, high-quality, and accessible social, health, and child support services and payments. The organization was interested in utilising the benefits of running OpenShift on OpenStack with Kuryr, primarily by allocating floating IPs to OpenShift services. This feature facilitated inter-cluster communication, in addition to the performance enhancements that come with avoiding double encapsulation. The organization achieved results including better network performance (up to 9 times), streamlined networking, and the potential to expose services externally, without administrative intervention. More information is available on the BigDataStack blog.
Conclusion
We have shown that Xskipper and the Red Hat contributions to the Kuryr project are two state-of-the-art technologies that significantly enhance the performance of big data workloads. Even though the BigDataStack consortium delivered its complete stack in 2021 , this project will certainly continue influencing big data infrastructure. For example, Xskipper is evolving in various directions, including integration with Red Hat Open Data Hub, and the team is involved in discussions on data skipping indexes for Apache Iceberg. As for Kuryr, customer adoption is quickly increasing, and new features and enhancements are being requested and implemented, such as support for IPv6 or SCTP protocols and the automatic reconciliation of removed OpenStack resources.
Funded by the European Commission Horizon 2020 Work Programme, the BigDataStack project will deliver a complete collection of open and interoperable technology building blocks for a variety of big data stakeholders, including infrastructure operators, application developers, data providers, data scientists, and data consumers. The BigDataStack project combines business knowledge with academic insights, bringing together engineers, developers, and researchers from the worlds of industry and higher education. Overall, fourteen partners are collaborating on the project including Red Hat, IBM, Atos, and the University of Piraeus Research Center.
The BigDataStack is an open technology stack for big data applications and operations. The project is based on a unique data-driven architecture that optimizes big data analytics performance, efficiency, and scalability by dynamically adapting compute, storage, and network resources based on service attributes, data flows, and application interdependencies.
For an overview of the architecture, theory of operation (dimensioning, deployment, operational phases), and use cases see the article “Researchers and industry deliver open source technology stack for big data applications,” Red Hat Research Quarterly 1:1, written by Dimosthenis Kyriazis, Associate Professor, University of Piraeus Technical Coordinator, BigDataStack Consortium.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 779747.