Red Hat Research Quarterly

When machine learning meets big data processing: From human-native tasks to machine-native tasks

Ilya Kolchinsky

Ilya Kolchinsky is a research scientist with Red Hat Research, specializing in the various aspects of AI-based system optimization. He has a PhD and BS in Computer Science, both from Technion, Israel Institute of Technology. His past and present research interests include cloud optimization, ML-driven resource management in containerized deployments, pattern mining in streaming data, stream and complex event processing optimization, distributed systems, automatic software testing/debugging, anomaly detection, and more.

Article featured in

Red Hat Research Quarterly

October 2020


Since the inception of artificial intelligence research, computer scientists have aimed to devise machines that think and learn like human beings. What else could AI do?

Image classification, language-to-language translation, and speech recognition are among the most prominent examples of tasks attributed to humans in which modern machine learning technologies have achieved great success.

Unfortunately, as a consequence of this vision, important tasks that are not perceived as human native are commonly neglected by most AI-related research communities. This category of problems, which we call machine native, is characterized by: 1) being unsolvable by a human without the aid of a computer, and 2) the existence of a known, not necessarily efficient algorithm capable of solving the problem. Hard combinatorial optimization problems such as the traveling salesman problem or finding the maximum clique in a graph are obvious examples of this category. Many practical machine-native tasks have virtually no known efficient solutions and could benefit greatly from approaches based on the recent groundbreaking achievements in machine learning.

In this article, we provide a glimpse into a number of ongoing research directions that address this second type of AI-assisted task in the context of the computer systems collectively known as big data processing systems.

Big data (stream) processing

As we enter the era of big data, a large number of data-driven systems and applications have become an integral part of our daily lives, and this trend is accelerating dramatically. By some estimates, 1.7MB of data are created every second for every person on earth and more than 2.5 quintillion bytes of new data are generated every day, and the International Data Corporation projects that the total will reach 163 zettabytes by 2025. Many practical challenges encountered by modern big data systems are further exacerbated by the growing volume, velocity, and variety of continuously generated data, presented to them in the form of near-infinite data streams. The complexity of big data processing systems grows over time, together with user requirements and data volume, and increases exponentially with system scale.

A typical big data processing application involves hundreds to thousands of operators connected by communication channels to form a directed graph referred to as a data processing network. This network is constructed according to the dedicated query evaluation plan, which is derived from the queries submitted by the system users. For the most part, each operator is relatively simple and serves a generic purpose, whereas their composition in every segment of the data processing network implements an application-specific requirement.
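To make the notion of a data processing network more tangible, here is a minimal sketch in Python. It is not taken from any particular engine: the `Operator` class, the helper functions, and the operator names are all hypothetical, and a real network would contain far more operators and run them concurrently. The sketch only shows how simple, generic operators can be connected by channels into a small directed graph whose composition expresses an application-specific requirement.

```python
class Operator:
    """A generic operator: applies `fn` to each item and forwards the results."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn            # maps one input item to a list of output items
        self.downstream = []    # outgoing communication channels
        self.received = []      # filled only by sinks (operators with no downstream)

    def connect(self, other):
        """Add a channel from this operator to `other`; return `other` for chaining."""
        self.downstream.append(other)
        return other

    def push(self, item):
        outputs = self.fn(item)
        if not self.downstream:
            self.received.extend(outputs)
        for out in outputs:
            for op in self.downstream:
                op.push(out)


# Generic, reusable building blocks (hypothetical helpers).
def map_op(name, f):
    return Operator(name, lambda x: [f(x)])


def filter_op(name, pred):
    return Operator(name, lambda x: [x] if pred(x) else [])


# Application-specific composition: one source feeding two branches.
source = map_op("parse", int)
evens = filter_op("evens", lambda x: x % 2 == 0)
squares = map_op("square", lambda x: x * x)
sink_a = map_op("sink_a", lambda x: x)
sink_b = map_op("sink_b", lambda x: x)

source.connect(evens).connect(sink_a)
source.connect(squares).connect(sink_b)

for raw in ["1", "2", "3", "4"]:
    source.push(raw)

print(sink_a.received)  # [2, 4]
print(sink_b.received)  # [1, 4, 9, 16]
```

Each operator here is trivial on its own; the behavior of the application comes entirely from how the operators are wired together.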

Big data processing optimization using deep reinforcement learning

The problem of data processing optimization dates back to the inception of early database systems. The input to this problem is a user query scheduled for execution, and the task is to convert this query into a series of low-level operations comprising an evaluation plan. The same query could correspond to multiple possible evaluation plans. For example, if a user wishes to extract and combine data from n tables, there are n! orders of accessing these tables. Even more possibilities are introduced if the target system contains multiple implementation options for some of the operators, if the computation can be distributed over multiple nodes, etc. As different plans could have differences of a few orders of magnitude in their performance characteristics, such as execution time and resource consumption, the task of selecting the optimal plan is of utmost importance for any data processing system.
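The size of the plan space and the gap between good and bad plans can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the table names and row counts are invented, a single selectivity value is applied to every join, and the cost of a plan is simply the sum of its intermediate result sizes. Real optimizers use far richer cost models, and the n! figure counts only orderings in which tables are joined one at a time.

```python
from itertools import permutations
from math import factorial

# Hypothetical table sizes (in rows) and one assumed join selectivity.
TABLE_ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20, "promos": 5}
SELECTIVITY = 0.001  # assumed fraction of row pairs that survive each join


def plan_cost(order):
    """Toy cost: sum of intermediate result sizes when joining tables in `order`."""
    rows = TABLE_ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * TABLE_ROWS[table] * SELECTIVITY
        cost += rows
    return cost


plans = list(permutations(TABLE_ROWS))
costs = {plan: plan_cost(plan) for plan in plans}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)

print(len(plans), factorial(len(TABLE_ROWS)))  # both are 24 for n = 4
print("best: ", best, costs[best])
print("worst:", worst, costs[worst])
print("ratio:", costs[worst] / costs[best])
```

Even with only four tables and this crude cost model, the worst ordering is several orders of magnitude more expensive than the best one, and the number of orderings grows factorially as tables are added.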

Picking the best-performing evaluation plan is a challenging task due to the extremely high number of possible solutions. Since the early 1970s, a plethora of methods and algorithms has been developed in an attempt to solve this problem. In spite of these efforts, existing solutions often prove either inefficient or imprecise. A plan optimization algorithm cannot afford to scan the huge plan space or a substantial fraction thereof and is instead forced to rely on heuristics, which might or might not work.

One approach that could come to the rescue here is deep reinforcement learning, the very same method that gained fame as the driving force behind AI-based chess, backgammon, and Go players. In reinforcement learning, the trained model learns to perform sequences of moves leading to states that provide maximum reward, such as a victory in a game. Learning proceeds by trial and error, and deep neural networks are used to handle the huge space of possible states. In the data processing optimization domain, crafting an efficient query evaluation plan could be considered a “game,” with the set of “moves” defined as all possible selections and placements of an operator in a particular position. By continuously creating plans, applying them to sample data, and measuring the resulting performance, an optimizer implementing this paradigm could gradually learn the most efficient plans.
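The “game” view of plan construction can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method used in any specific system: it uses tabular Q-learning rather than a deep neural network, because the toy state space (subsets of four hypothetical tables) is tiny. The table sizes, the single selectivity value, and the reward (the negative size of each intermediate result) are all assumptions. A “move” is the choice of the next table to join, and the agent explores by choosing moves at random.

```python
import random
from itertools import permutations

# Hypothetical table sizes (in rows) and one assumed join selectivity.
TABLE_ROWS = {"orders": 1_000_000, "customers": 50_000, "regions": 20, "promos": 5}
SELECTIVITY = 0.001


def step(rows, table):
    """Join `table` into the current intermediate result; return (new size, reward)."""
    if rows is None:                       # first table: nothing to join yet
        return TABLE_ROWS[table], 0.0
    new_rows = rows * TABLE_ROWS[table] * SELECTIVITY
    return new_rows, -new_rows             # reward = negative intermediate size


def plan_cost(order):
    rows, cost = None, 0.0
    for table in order:
        rows, reward = step(rows, table)
        cost -= reward
    return cost


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = {}  # (frozenset of joined tables, next table) -> estimated total reward
    for _ in range(episodes):
        joined, rows = frozenset(), None
        while len(joined) < len(TABLE_ROWS):
            remaining = [t for t in TABLE_ROWS if t not in joined]
            action = rng.choice(remaining)              # trial and error: random move
            rows, reward = step(rows, action)
            nxt = joined | {action}
            future = max(
                (q.get((nxt, t), 0.0) for t in TABLE_ROWS if t not in nxt),
                default=0.0,
            )
            # Learning rate 1 is enough because this toy environment is deterministic.
            q[(joined, action)] = reward + future
            joined = nxt
    return q


def greedy_plan(q):
    """Follow the learned values, always picking the highest-valued next table."""
    joined, plan = frozenset(), []
    while len(joined) < len(TABLE_ROWS):
        remaining = [t for t in TABLE_ROWS if t not in joined]
        best = max(remaining, key=lambda t: q.get((joined, t), 0.0))
        plan.append(best)
        joined = joined | {best}
    return plan


q = train()
learned = greedy_plan(q)
optimal_cost = min(plan_cost(p) for p in permutations(TABLE_ROWS))
print(learned, plan_cost(learned), optimal_cost)
```

In this toy setting the learned plan can be checked against an exhaustive search. In a realistic setting that check is exactly what is infeasible, which is why a neural network would replace the lookup table `q` and generalize across states it has never visited.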

Deep neural networks as an efficient alternative to traditional big data processing mechanisms

One could suggest an alternative approach to the query processing optimization problem discussed above. Instead of devising smart algorithms for arranging the operators into an efficient evaluation plan, why not take a step further and replace the entire data processing engine with a pretrained deep neural network capable of answering the query?

While seemingly unrealistic at first, this idea has a number of clear advantages. First, since a neural network merely approximates the expensive computation that a query processing engine performs directly, it is expected to run considerably faster and to consume fewer resources. For example, if the user-defined query is to correlate two data streams A and B and to find all pairs of A’s and B’s satisfying a predefined condition, the neural network will not have to compare all candidate A-B pairs but instead will settle for a cheaper computation based on the function it learned during training. Second, since the inference time (i.e., the time required to produce an output for a given input) of a trained network is essentially constant for a fixed input size, complex optimization methods and algorithms for maximizing the performance of a query evaluation plan would no longer be needed.

The main disadvantage of the neural network-based data processing approach is the possibility of returning imprecise or erroneous results due to the imperfection of the learning process. An ongoing challenge for research is to find ways to achieve high levels of precision by utilizing ensemble methods or other novel regularization techniques. At the same time, trading off result accuracy (up to a certain level) for performance is acceptable or even highly desirable in many modern big data applications.

Predicting future big data stream query results

As indicated above, the primary task of a big data engine is to deliver up-to-date query results to the end users. It might be even more useful to go a step further and predict the future returned values based on the trends observed in the continuously generated streaming data. Such functionality could be highly beneficial in real-time processing scenarios where a particular action must be triggered and executed promptly (typically within milliseconds) following an occurrence of a particular data item or a combination thereof, and where even the fastest data processing techniques fail to provide a sufficiently small detection latency. Furthermore, in some situations the goal is to prevent a certain event from occurring rather than to react to it, a use case that cannot be realized without the ability to predict the future state with some degree of confidence.

For a data processing system to provide future query answers, it needs a snapshot of the expected future data values. The long-established field of time series forecasting addresses exactly that need. An increasingly active area of research, it received an unprecedented boost in recent years following a breakthrough in deep learning. Multiple research teams around the world have demonstrated that certain types of neural networks (such as LSTM, TCN, and Transformer) can achieve remarkable success in learning data trends and predicting future data stream content based on past history. While many data analytics and stream analytics frameworks provide time series forecasting as a separate feature, typically as a part of a larger data mining package, incorporating this technology into the core of the query processing engine is yet to become a major trend.

By combining a state-of-the-art time series forecasting method and an efficient mechanism for processing the raw data and acquiring the query results (such as one of those described above), a future generation of big data processing engines could offer a new capability of accurately predicting query answers that will enhance the proactive response abilities of user applications.
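The combination of forecasting and query evaluation can be sketched as follows in Python. This is only an illustration under strong assumptions: a least-squares straight line stands in for the neural forecasting models mentioned above, the “query” is a simple threshold check, and the data, function names, and threshold are all made up. The point is the structure: the query runs on forecast values rather than on values that have already arrived.

```python
def linear_forecast(history, horizon):
    """Fit a least-squares line to `history` and extrapolate `horizon` steps ahead."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + k) for k in range(horizon)]


def threshold_query(values, limit):
    """Toy query: positions within `values` where the value exceeds `limit`."""
    return [i for i, v in enumerate(values) if v > limit]


# Observed window of a stream (made-up CPU load percentages).
cpu_load = [50.0, 54.0, 58.0, 62.0, 66.0, 70.0]

predicted = linear_forecast(cpu_load, horizon=5)
alerts = threshold_query(predicted, limit=80.0)

print(predicted)  # [74.0, 78.0, 82.0, 86.0, 90.0]
print(alerts)     # [2, 3, 4]
```

Here the query reports that the threshold is expected to be crossed three steps from now, which would give an application time to act before the event occurs. Swapping `linear_forecast` for a trained forecasting model would leave the rest of the flow unchanged.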

What does the future hold?

In this short article, we have barely scratched the surface of the immense unrealized potential of machine learning in the area of big data processing. In the Technion University research team, undergraduate and graduate students are working side by side to produce innovative solutions for these and many other open challenges in human-native, machine-native, and hybrid problem domains.

We are looking for projects that will help us test these techniques. Those interested in finding out more about our project ideas and/or looking for collaboration opportunities are kindly invited to contact Dr. Ilya Kolchinsky at ikolchin@redhat.com.
