Red Hat Research Quarterly

When machine learning meets big data processing: From human-native tasks to machine-native tasks

About the author

Ilya Kolchinsky

Ilya Kolchinsky is a research scientist with Red Hat Research and Technion, Israel Institute of Technology. He has a PhD and BSc in Computer Science, both from the Technion. Ilya’s research interests span a wide range of topics in big data processing, such as distributed event-based systems, data stream mining, and applications of AI and machine learning in stream processing engines.

Since the inception of artificial intelligence research, computer scientists have aimed to devise machines that think and learn like human beings. What else could AI do?

Image classification, language-to-language translation, and speech recognition are some of the most prominent examples of major tasks attributed to humans in which great success has been achieved by modern machine learning technologies. 

Unfortunately, as a consequence of this vision, important tasks that are not perceived as human native are commonly neglected by most AI-related research communities. This category of problems, which we call machine native, is characterized by: 1) being unsolvable by a human without the aid of a computer, and 2) the existence of a known, not necessarily efficient algorithm capable of solving the problem. Hard combinatorial optimization problems such as the traveling salesman problem or finding the maximum clique in a graph are obvious examples of this category. Many practical machine-native tasks have virtually no known efficient solutions and could benefit greatly from approaches based on the recent groundbreaking achievements in machine learning.
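As a minimal illustration of the second property (a known but not necessarily efficient algorithm), the sketch below solves the traveling salesman problem exhaustively. The distance matrix `DIST` and the function name `brute_force_tsp` are hypothetical and chosen only for this example. The number of tours grows factorially with the number of cities, which is why this approach is correct but impractical at scale.

```python
from itertools import permutations


def brute_force_tsp(dist):
    """Return (best_length, best_tour) by checking every tour that starts at city 0."""
    n = len(dist)
    best_length, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        # Sum the legs of the closed tour, including the return to city 0.
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour


# Hypothetical symmetric distance matrix for four cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(brute_force_tsp(DIST))  # (18, (0, 1, 3, 2))
```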

In this article, we will provide a glimpse into a number of ongoing research directions addressing this second type of AI-assisted tasks in the context of the category of computer systems collectively known as big data processing systems.

Big data (stream) processing

As we enter the era of big data, a large number of data-driven systems and applications have become an integral part of our daily lives, and this trend is accelerating dramatically. By one estimate, 1.7MB of data are created every second for every person on earth; by another, over 2.5 quintillion bytes of new data are generated every day; and the International Data Corporation projects a total of 163 zettabytes by 2025. Many practical challenges encountered by modern big data systems are further exacerbated by the growing volume, velocity, and variety of continuously generated data, presented to them in the form of near-infinite data streams. The complexity of big data processing systems grows over time, together with user requirements and data volume, and increases exponentially with system scale.

A typical big data processing application involves hundreds to thousands of operators connected by communication channels to form a directed graph referred to as a data processing network. This network is constructed according to the dedicated query evaluation plan, which is derived from the queries submitted by the system users. For the most part, each operator is relatively simple and serves a generic purpose, whereas their composition in every segment of the data processing network implements an application-specific requirement.
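A data processing network of this kind can be sketched as a small directed graph of generic operators. The operator names, the `UPSTREAM` channel map, and `run_network` below are hypothetical. For brevity the sketch evaluates a single finite batch rather than an unbounded stream, and it is not modeled on any particular engine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical generic operators: each maps a list of records to a list of records.
OPERATORS = {
    "source": lambda _: [3, 8, 15, 4, 23],
    "filter_large": lambda xs: [x for x in xs if x > 5],
    "double": lambda xs: [2 * x for x in xs],
    "sink_sum": lambda xs: [sum(xs)],
}

# Communication channels: operator -> the upstream operators feeding it.
UPSTREAM = {
    "source": [],
    "filter_large": ["source"],
    "double": ["filter_large"],
    "sink_sum": ["double"],
}


def run_network(operators, upstream):
    """Evaluate every operator once, in an order that respects the channels."""
    outputs = {}
    for name in TopologicalSorter(upstream).static_order():
        # Concatenate the records arriving on all incoming channels.
        inputs = [rec for up in upstream[name] for rec in outputs[up]]
        outputs[name] = operators[name](inputs)
    return outputs


print(run_network(OPERATORS, UPSTREAM)["sink_sum"])  # [92]
```

Each operator is simple and generic; the application-specific behavior comes only from how `UPSTREAM` wires them together.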

Big data processing optimization using deep reinforcement learning

The problem of data processing optimization dates back to the inception of early database systems. The input to this problem is a user query scheduled for execution, and the task is to convert this query into a series of low-level operations comprising an evaluation plan. The same query could correspond to multiple possible evaluation plans. For example, if a user wishes to extract and combine data from n tables, there are n! orders of accessing these tables. Even more possibilities are introduced if the target system contains multiple implementation options for some of the operators, if the computation can be distributed over multiple nodes, etc. As different plans could have differences of a few orders of magnitude in their performance characteristics, such as execution time and resource consumption, the task of selecting the optimal plan is of utmost importance for any data processing system.
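The size of the plan space can be made concrete with a toy example. The sketch below enumerates all n! left-deep join orders for three tables and scores each one. The table cardinalities, the uniform "one in `MATCH_ONE_IN` pairs matches" selectivity, and the cost model (the size of the first table scanned plus the sizes of all intermediate results) are simplifying assumptions made for illustration, not a real optimizer's cost model.

```python
from itertools import permutations
from math import factorial

# Hypothetical table sizes (number of rows).
CARDINALITY = {"orders": 1000, "customers": 100, "regions": 10}
MATCH_ONE_IN = 100  # assumed: one in 100 candidate row pairs satisfies the join


def plan_cost(order):
    """Toy cost: rows scanned from the first table plus all intermediate result sizes."""
    size = CARDINALITY[order[0]]
    cost = size
    for table in order[1:]:
        size = size * CARDINALITY[table] // MATCH_ONE_IN
        cost += size
    return cost


def best_plan():
    """Exhaustively score every join order and return (plan, cost, plan count)."""
    plans = list(permutations(CARDINALITY))
    best = min(plans, key=plan_cost)
    return best, plan_cost(best), len(plans)


print(best_plan())  # (('regions', 'customers', 'orders'), 120, 6)
```

Even in this tiny example the worst order costs 2100 against 120 for the best. Exhaustive enumeration stops being feasible after a handful of tables, since 10 tables already yield 3,628,800 orders.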

Picking the best-performing evaluation plan is a challenging task due to the extremely high number of possible solutions. Since the early 1970s, a plethora of methods and algorithms has been developed to attempt to solve this problem. In spite of these efforts, existing solutions often prove either inefficient or imprecise. A plan optimization algorithm cannot afford to scan the huge plan space or a substantial fraction thereof and is instead forced to utilize heuristics, which might or might not work.

One approach that could come to the rescue here is known as deep reinforcement learning—the very same method that gained fame as the driving force behind AI-based chess, backgammon, and Go players. In reinforcement learning, the trained model learns to perform sequences of moves leading to states providing maximum reward, such as a victory in a game. The learning process is performed by way of trial and error, and deep neural networks are utilized to handle the huge possible state space. In the data processing optimization domain, the process of crafting an efficient query evaluation plan could be considered a “game,” with a set of “moves” defined as all possible selections and placements of an operator in a particular position. By continuously creating plans, applying them on sample data, and measuring the resulting performance, an optimizer implementing this paradigm could gradually learn the most efficient plans.
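The "game" formulation can be sketched with tabular Q-learning, a much simpler relative of the deep variant described above: a lookup table stands in for the neural network, which is only workable because the toy state space is tiny. The cardinalities, the cost model, the hyperparameters, and the name `learn_plan` are all assumptions made for this illustration. A state is the partial plan built so far, a move appends one more table, and the reward is the negative cost of that move.

```python
import random

# Hypothetical table sizes and selectivity, used only to score plans.
CARDINALITY = {"orders": 1000, "customers": 100, "regions": 10}
MATCH_ONE_IN = 100  # assumed: one in 100 candidate row pairs satisfies the join


def next_size(size, table):
    """Rows produced after appending `table`; `size` is None for an empty plan."""
    if size is None:
        return CARDINALITY[table]
    return size * CARDINALITY[table] // MATCH_ONE_IN


def learn_plan(episodes=500, epsilon=0.2, seed=0):
    """Learn a join order by trial and error; returns the greedy plan after training."""
    rng = random.Random(seed)
    q = {}  # (partial plan, next table) -> estimated return (negative remaining cost)
    for _ in range(episodes):
        plan, size = (), None
        while len(plan) < len(CARDINALITY):
            moves = [t for t in CARDINALITY if t not in plan]
            if rng.random() < epsilon:  # explore a random move
                move = rng.choice(moves)
            else:  # exploit the current estimates (unseen moves default to 0)
                move = max(moves, key=lambda t: q.get((plan, t), 0.0))
            size = next_size(size, move)
            nxt = plan + (move,)
            remaining = [t for t in CARDINALITY if t not in nxt]
            future = max((q.get((nxt, t), 0.0) for t in remaining), default=0.0)
            # Learning rate 1 and no discounting, since this toy "game" is deterministic.
            q[(plan, move)] = -size + future
            plan = nxt
    # Read out the learned plan by always taking the best-valued move.
    plan = ()
    while len(plan) < len(CARDINALITY):
        moves = [t for t in CARDINALITY if t not in plan]
        plan += (max(moves, key=lambda t: q.get((plan, t), 0.0)),)
    return plan


print(learn_plan())
```

In a realistic optimizer the reward would come from actually running candidate plans on sample data, and the table `q` would be replaced by a deep network that generalizes across queries.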

Deep neural networks as an efficient alternative to traditional big data processing mechanisms

One could suggest an alternative approach to the query processing optimization problem discussed above. Instead of devising smart algorithms for arranging the operators into an efficient evaluation plan, why not take a step further and replace the entire data processing engine with a pretrained deep neural network capable of answering the query?

While seemingly unrealistic at first, this idea has a number of clear advantages. First, since a neural network merely approximates the expensive computation that a query processing engine directly performs, the former is expected to run considerably faster and to consume fewer resources. For example, if the user-defined query is to correlate two data streams A and B and to find all pairs of A's and B's satisfying a predefined condition, the neural network will not have to actually compare all candidate A-B pairs but instead will settle for a cheaper computation based on the function it learned during training. Second, since the inference time (i.e., the time required to provide an output given an input) of a trained network is constant, complex optimization methods and algorithms for maximizing the performance of a query evaluation plan would no longer be needed.

The main disadvantage of the neural network-based data processing approach is the possibility of returning imprecise or erroneous results due to the imperfection of the learning process. An ongoing challenge for research is to find ways to achieve high levels of precision by utilizing ensemble methods or novel regularization techniques. At the same time, trading off result accuracy (up to a certain level) for performance is acceptable or even highly desirable in many modern big data applications.

Predicting future big data stream query results

As indicated above, the primary task of a big data engine is to deliver up-to-date query results to the end users. It might be even more useful to go a step further and predict the future returned values based on the observed trends in the continuously generated streaming data. Such functionality could be highly beneficial in real-time processing scenarios where a particular action must be triggered and executed promptly (typically within milliseconds) following an occurrence of a particular data item or a combination thereof, and where even the most efficient data processing techniques fail to provide a sufficiently small detection latency. Furthermore, in some situations the goal is to prevent a certain event from occurring rather than react to it, a use case that cannot be realized without an ability to predict the future state with some degree of confidence.

For a data processing system to provide future query answers, there is a need to get a snapshot of the expected future data values. The long-established field of time series forecasting was designed to do exactly that. An increasingly active area of research, it received an unprecedented boost in recent years following a breakthrough in deep learning. It was demonstrated by multiple research teams around the world that certain types of neural networks (such as LSTM, TCN, and Transformer) could achieve remarkable success in learning the data trends and predicting the future data stream content based on past history. While many data analytics and stream analytics frameworks provide time series forecasting as a separate feature, typically as a part of a larger data mining package, incorporating this technology into the core of the query processing engine is yet to become a major trend.

By combining a state-of-the-art time series forecasting method and an efficient mechanism for processing the raw data and acquiring the query results (such as one of those described above), a future generation of big data processing engines could offer a new capability of accurately predicting query answers that will enhance the proactive response abilities of user applications.
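The combination of forecasting and query evaluation can be sketched in a few lines. Here a least-squares linear trend stands in for a learned forecaster such as an LSTM, TCN, or Transformer, and the "query" is a simple threshold alert evaluated on the forecast rather than on observed data. The function names, the sample `READINGS`, and the threshold are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares slope and intercept for values observed at t = 0, 1, 2, ...

    Requires at least two observations.
    """
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_v - slope * mean_t


def forecast(values, horizon):
    """Extrapolate the fitted trend `horizon` steps past the last observation."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(horizon)]


def first_predicted_alert(values, horizon, threshold):
    """Steps ahead (1-based) at which the forecast first exceeds `threshold`, else None."""
    for k, predicted in enumerate(forecast(values, horizon), start=1):
        if predicted > threshold:
            return k
    return None


# Hypothetical sensor readings from the most recent window of the stream.
READINGS = [70.0, 72.0, 74.0, 76.0, 78.0]

print(forecast(READINGS, 3))                     # [80.0, 82.0, 84.0]
print(first_predicted_alert(READINGS, 3, 81.0))  # 2
```

An application could act on the predicted alert two steps before the threshold is actually crossed, which is the proactive behavior described above. The quality of that decision depends entirely on the accuracy of the forecaster.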

What does the future hold?

In this short article, we have barely scratched the surface of the immense unrealized potential of machine learning in the area of big data processing. In the Technion research team, undergraduate and graduate students are working side by side to produce innovative solutions for these and many other open challenges in human-native, machine-native, and hybrid problem domains.

We are looking for projects that will help us test these techniques. Those interested in finding out more about our project ideas and/or looking for collaboration opportunities are kindly invited to contact Dr. Ilya Kolchinsky at
