Real-Time Data Stream Processing
Introduction – why stream processing?
It is commonly known that we are living in the era of Big Data. At no other time in human history have our lives been as overloaded with data, dependent on data, and controlled by data as they are today. In fact, 90% of the world’s data was generated over the last two years. It is estimated that 1.7MB of data is created every second for every person on Earth, for a total of over 2.5 quintillion bytes of new data every day. As predicted by the International Data Corporation, the total amount of generated data will reach 163 zettabytes (1 ZB = 10^21 bytes) by 2025.
The gargantuan rates of data generation create a demand for a technology capable of scalably and efficiently processing the newly created information. As the data processing speed must exceed the data generation speed to guarantee adequate performance, developing such a technology becomes a challenging task. Additional factors contributing to this challenge include the high variety of the data and the rising complexity of the operations performed by modern applications. To address these issues, a new generation of innovative data processing solutions has emerged. These solutions are collectively known as Big Data technologies.
In addition to the high data generation rate, many modern systems are characterized by extremely short “relevance time” of the data they work with. In many domains, newly created data becomes obsolete mere seconds (or milliseconds) after its collection as newer, “fresh” data becomes available. As a result, data processing becomes an extremely time-critical task, requiring real-time Big Data processing solutions. Some of the applications exhibiting this characteristic include:
– real-time detection of cyber attacks and intrusion attempts in a computer network containing thousands of machines;
– monitoring a continuous stream of price updates from stock markets for correlation patterns;
– on-the-fly recognition of speeding vehicles, illegally overtaking vehicles, traffic jams, or accidents in a smart city;
– online detection of an emerging mass sentiment in social networks;
– tracking an array of sensors in a smart home to identify situations requiring immediate action, such as a fire or a burglary;
– real-time analysis of the data collected from thousands of wearable sensors used by hundreds of patients in a healthcare facility.
Figure 1. Selected examples of applications requiring stream processing functionality.
In each of the above domains, data items originate at various sources at rates exceeding hundreds of thousands of items per second – such as periodic readings from multiple smart sensors, the entire traffic of a monitored network, images arriving from cameras installed at different geographical points, and so on. The system is required to promptly and efficiently analyze these large volumes of data to extract or deduce the needed information. In addition to rigid time constraints, even stricter resource limitations typically apply (e.g., memory and processing power), as the work is often distributed among multiple, possibly computationally limited devices, for example when the edge computing paradigm is utilized.
The growing need to process massive amounts of data in real time and the scarcity of resources to do so gave rise to a new type of data processing systems adapted for this streaming scenario. Also known as stream processing systems or SP systems, they differ from the traditional database-oriented systems in a number of ways. SP engines operate directly on the data stream, strive to use as little local storage as possible (for reasons of speed and capacity), rely exclusively on one-pass algorithms, and tend to prioritize high throughput and fast response time over the accuracy of the produced results.
This post explains the basic concepts and ideas behind stream processing and provides an overview of the main paradigms and approaches to implementing a SP system.
So, what can we do with a data stream?
The functionality provided by SP systems can be broadly described as “executing computations on data streams”. This vague definition encompasses several distinct types of operations that we will describe shortly.
A stream processing application is composed of a number of stream operators, each responsible for an “atomic” (albeit potentially quite complex) operation over the data stream. A stream operator belongs to one of the following categories:
– a stateless computation on the data items;
– a sliding aggregate / statistic computation of the entire stream;
– a pattern detection operation.
We will now discuss each of the above operator types.
Stream calculus
The most basic stream operators merely apply a given function on each data item of the input stream. They independently inspect each object passing by and process it separately from the rest according to a predefined algorithm. These operators are called stream calculus operators. As opposed to more complex operators that we will cover later on, stream calculus operators lack any kind of “operator state” to be maintained between different invocations. These stateless processing units can be classified into three major types: filter, transform, and join. As we explain in detail below, a stream processing application usually contains a number of such units connected by communication channels.
A filter operator, as its name suggests, checks all incoming data items for some predefined condition and only forwards those satisfying it. For example, assume that our SP system receives a stream of temperature readings from a sensor. A filter operator displayed in Figure 2(a) checks the value of each reading and only allows those reporting about an abnormally high or low temperature to pass to the next operator.
A transform operator is very similar to a filter. However, instead of deciding whether to drop an item, it unconditionally applies a function on each incoming object. Continuing the example above, Figure 2(b) depicts a transform operator that converts the temperature readings from Celsius to Fahrenheit degrees. Transform operators are a useful tool for enriching a SP system with advanced mathematical and relational functionality.
A slightly more sophisticated category of stream calculus operators combines items from two or more streams into larger objects based on some attribute or predicate – a so-called join operator. An example is shown in Figure 2(c). Here, in addition to a temperature sensor, we also assume that a humidity sensor is installed in the same room. Our join operator accepts the two streams these sensors emit and combines the measurements taken at the same time into full temperature and humidity reports.
Figure 2. Basic stream calculus operators: (a) filter; (b) transform; (c) join.
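To make these three operator types more concrete, the following Python sketch models each of them as a simple generator function. The field names (ts, temp, humidity) and the one-to-one timestamp alignment in the join are assumptions made for illustration; this is not the API of any particular SP framework.

# Stateless stream calculus operators sketched as Python generators.
# Field names (ts, temp, humidity) are illustrative assumptions.

def temperature_filter(readings, low=0.0, high=30.0):
    # Filter: forward only abnormally low or high temperature readings.
    for r in readings:
        if r["temp"] < low or r["temp"] > high:
            yield r

def to_fahrenheit(readings):
    # Transform: convert every reading from Celsius to Fahrenheit.
    for r in readings:
        yield {**r, "temp": r["temp"] * 9.0 / 5.0 + 32.0}

def join_by_timestamp(temps, humids):
    # Join: combine temperature and humidity readings taken at the same time.
    # Assumes both streams are ordered and aligned one-to-one by timestamp.
    for t, h in zip(temps, humids):
        yield {"ts": t["ts"], "temp": t["temp"], "humidity": h["humidity"]}

# Usage: chain the operators just like the boxes in Figure 2.
temps = ({"ts": i, "temp": 18.0 + 2 * i} for i in range(10))
humids = ({"ts": i, "humidity": 40.0 + i} for i in range(10))
for report in to_fahrenheit(temperature_filter(join_by_timestamp(temps, humids))):
    print(report)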
Aggregations and statistics
Another field of responsibility of a stream processing system is the calculation of statistics representing the entire stream during a specified time period. For example, a user might wish to obtain the average temperature in the building during the last 10 minutes, the peak hourly price of a given stock ticker, or the daily number of speeding vehicles detected by an array of traffic cameras. More advanced aggregation types include quantiles, heavy hitters (the most frequently appearing items in the stream), count of distinct items, and more. These computations are handled by a dedicated category of operators known as aggregation operators or aggregators.
The differences between aggregators and stream calculus operators are twofold. First, a statistic collecting operator acts on the entire stream, or a given fraction thereof, rather than processing each data item independently. Second, to guarantee the correctness of the provided answers, an aggregator must maintain an internal state, sometimes of significant size and dimensionality.
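As a rough illustration of what such an internal state might look like, here is a minimal Python sketch of a sliding-window average. The window length and the (timestamp, value) input format are assumptions, and production aggregators typically rely on far more compact summaries.

from collections import deque

class SlidingAverage:
    # Sketch of a stateful aggregator: average over the last `window` seconds.
    # The deque of (timestamp, value) pairs is the operator state discussed above.
    def __init__(self, window=600.0):
        self.window = window
        self.items = deque()
        self.total = 0.0

    def update(self, ts, value):
        self.items.append((ts, value))
        self.total += value
        # Evict readings that have fallen out of the sliding window.
        while self.items and self.items[0][0] < ts - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total / len(self.items)

# Usage: feed each reading through the aggregator as it arrives.
# avg = SlidingAverage(window=600.0); print(avg.update(ts=0.0, value=21.5))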
Modern SP systems further extend the notion of an aggregation operator by allowing it to contain even more powerful functionality. In particular, many frameworks provide a variety of aggregators based on machine learning, making it possible to activate data mining, anomaly detection, predictive analytics, and sentiment analysis operations on a data stream.
Pattern detection
The third service typically provided by stream processing infrastructures is that of recognizing instances of user-defined patterns at runtime. As an example, a stock monitoring application user might wish to detect a case of some stock ticker experiencing a rapid rise, followed by a similar rise of another predefined stock, while a third stock drops at the same rate. Alternatively, an intrusion prevention system in a computer network might need to perform an action as soon as some part of the observed traffic conforms to a behavioral signature of a known malware tool. As both stock behavior and network traffic patterns exhibit very complex real-life trends, using a combination of calculus and aggregation operators in the above cases is impractical and more advanced solutions are required.
In fact, on-the-fly detection of patterns turns out to be such a complicated process that a dedicated subfield of stream processing called complex event processing (or CEP for short) has emerged to address its challenges. CEP systems break a pattern down into atomic parts and craft a pattern-specific evaluation structure that processes the relevant data items (known as 'primitive events' in CEP terminology) and gradually combines them into sets matching the requested pattern (also called 'complex events'). Depending on the system, this structure can be implemented as a state machine (a CEP automaton), an evaluation tree, or a network of simpler operators.
Figure 3. An example of a state machine for pattern detection over a data stream.
Figure 3 presents an example of a CEP automaton detecting a simple pattern of three stock price updates, of Microsoft’s, Google’s, and Apple’s stocks respectively, occurring in this particular order and having ascending prices. In this case, the entire state machine logic is placed in a single CEP operator inspecting the input stream for data items matching the required conditions (Figure 4). In more advanced scenarios, an entire network segment (referred to as an EPN – event processing network) is introduced.
Figure 4. A CEP streaming operator.
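To give a flavor of the automaton in Figure 3, the sketch below hand-codes it in Python. The event format and ticker symbols are assumptions for illustration, and a real CEP engine would generate equivalent logic from a declarative pattern specification rather than require the user to write it.

class AscendingTripleAutomaton:
    # Sketch of the CEP state machine from Figure 3: a MSFT, then GOOG, then AAPL
    # price update, arriving in that order with strictly ascending prices.
    ORDER = ["MSFT", "GOOG", "AAPL"]

    def __init__(self):
        self.state = 0        # index of the ticker we are currently waiting for
        self.partial = []     # primitive events matched so far

    def on_event(self, ticker, price):
        if self.state < len(self.ORDER) and ticker == self.ORDER[self.state]:
            if not self.partial or price > self.partial[-1][1]:
                self.partial.append((ticker, price))
                self.state += 1
                if self.state == len(self.ORDER):
                    complex_event = self.partial   # the full pattern match
                    self.state, self.partial = 0, []
                    return complex_event
        return None

# Usage: non-matching events are simply skipped by this simplified automaton.
cep = AscendingTripleAutomaton()
for evt in [("MSFT", 101.0), ("IBM", 50.0), ("GOOG", 120.0), ("AAPL", 130.0)]:
    match = cep.on_event(*evt)
    if match:
        print("complex event detected:", match)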
Complex event processing is an extremely broad and extensive topic deserving a post (or even a series) of its own. Barring a few exceptions, most SP systems only incorporate a very limited subset of the existing CEP functionality.
Combining the operators into a stream processing network
Naturally, only a few limited use cases can be satisfied using a single streaming operator. Most real-life scenarios require a mixture of multiple stream calculus, aggregation, and pattern detection operators executing a vast and diverse array of interconnected tasks. As an example, consider the room temperature sensor scenario that we examined above. One possible user query could be formulated as follows: “Raise an alarm if the average temperature reported by all sensors exceeds 30°C for at least three measurements within 10 minutes”. To provide the requested functionality, a SP system needs an aggregation operator for computing the running average over multiple streams, a filter operator for isolating the abnormal temperature values, and a CEP operator in order to detect a sequence of three measurements. The operators would have to be connected to form a processing chain as illustrated in Figure 5.
Figure 5. Combining three basic streaming operators to implement a basic SP application for firing high temperature notifications.
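Reusing the SlidingAverage sketch from the aggregation section, a toy end-to-end version of this chain might look as follows. The threshold, window, and event format come from the query above, but the function itself is only an illustration of how the three operators compose, not of how a real SP engine wires them.

from collections import deque   # SlidingAverage is the sketch defined above

def high_temperature_alarm(readings, threshold=30.0, window=600.0, count=3):
    # Sketch of the chain in Figure 5:
    #   1. aggregation: running average over the incoming measurements,
    #   2. filter: keep only averages above the threshold,
    #   3. CEP-style step: fire once `count` abnormal averages occur within `window` seconds.
    avg = SlidingAverage(window)
    hits = deque()                       # timestamps of abnormal averages
    for ts, value in readings:
        current = avg.update(ts, value)  # aggregation operator
        if current <= threshold:         # filter operator
            continue
        hits.append(ts)                  # pattern detection operator
        while hits and hits[0] < ts - window:
            hits.popleft()
        if len(hits) >= count:
            yield ("ALARM", ts, current)

# Usage: any iterable of (timestamp, temperature) pairs can drive the pipeline.
# for alarm in high_temperature_alarm(sensor_readings): ...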
A typical real-life stream processing application involves hundreds to thousands of operators connected by communication channels to form a directed graph, referred to as a stream processing network. For the most part, each operator is relatively simple and serves a generic purpose, whereas their composition implements an application-specific feature. Figure 6 depicts an example of a SP network created using IBM Streams, one of the most popular SP systems. Three node types can be identified in the network. The data sources or providers (marked in light green) are the origins at which the raw, unprocessed data is generated. They encapsulate a variety of actual data sources, such as databases, file systems, Kafka, RabbitMQ, and many more. The data sinks or consumers (marked in dark green) are the final destinations of the processing pipeline, representing the end users. Every sink corresponds to a distinct path in the SP network, and thus consumes an output of a different query. The remainder of the nodes are the operators, covered above in detail.
Figure 6. A sample stream processing network created using IBM Streams.
Even though modern SP systems offer a rich variety of generic ready-to-use operators, many applications involve user-defined functionality that cannot be achieved by combining and configuring the existing tools. For example, Celsius-to-Fahrenheit (and vice versa) conversion from our temperature sensor example above is an entirely domain-specific function and hence would not be included in most standard operator libraries. To overcome this limitation, SP systems offer programming interfaces for implementing custom operators, thus allowing users to extend the default operator set with new capabilities. These “plugin” operators are treated as generic containers executing user-supplied code, typically written in Java, Python, SQL, or a system-specific declarative language.
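As an illustration of what such a plugin might look like in a Python-based API, consider the sketch below. The CustomOperator base class and its hook methods are hypothetical, since every SP system defines its own extension interface.

# Hypothetical plugin interface: the base class and hook names are assumptions
# made for illustration; each SP system defines its own custom-operator API.

class CustomOperator:
    def setup(self):              # called once before the stream starts
        pass

    def process(self, item):      # called by the engine for every incoming item
        raise NotImplementedError

class CelsiusToFahrenheit(CustomOperator):
    # User-supplied code wrapped in a generic operator container.
    def process(self, item):
        item["temp_f"] = item["temp_c"] * 9.0 / 5.0 + 32.0
        return item               # forwarded downstream by the engine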
Popular stream processing systems
A plethora of SP frameworks has been developed and successfully launched during the last decade. While intended for similar use cases and providing similar functionality, these systems differ in many aspects.
Some of the widely used commercial SP systems are TIBCO Streaming, IBM Streams, Microsoft StreamInsight, Oracle CEP, Sybase Aleri Streaming, SAS ESP, Apama, Amazon Kinesis Data Analytics, SQLStream, and StreamAnalytix. Active open source projects providing stream processing capabilities include Spark, Storm, Flink, Heron, Samza, Esper, and Siddhi.
In the remainder of this post, we will examine the major types of SP systems and their most important properties.
UI paradigm: dataflow-based vs. query-based
One of the main characteristics of a stream processing system is the way a user specifies the desired system functionality, i.e., the user interface.
Most currently available systems are strictly dataflow-based. They provide a graphical interface for directly specifying (“drawing”) the desired flows and explicitly defining the processing network topology. As illustrated above in Figure 6, a user creates a graph of “boxes and arrows” representing the operators, sources, sinks, and communication channels comprising the streaming application.
While the dataflow-based approach is very intuitive and allows for seamless incorporation of user-defined operators into the SP network, it completely lacks the separation between “what” action a user wishes to perform and “how” it should be implemented. In many cases, a user-supplied query can be “translated” into multiple stream processing networks of identical functionality, yet of significantly varying complexity and performance.
Continuing our temperature sensor example from above, consider a simple setup of three sensors installed in three different rooms A, B, and C. We would like to detect a situation where room B’s temperature exceeds room A’s, and at the same time room C’s temperature exceeds room B’s. Figure 7 presents three possible network topologies implementing this pattern query. In addition, assume that readings from sensor C arrive every 100 milliseconds, while A and B generate their measurements every 10 milliseconds. Then, the network in 7(a) would perform much worse than the other two, since its first filter would experience more load due to the higher total arrival rate of its input data. As the size of the network grows and the arrival rates of the different data types become more diverse, the number of possible network topologies increases and determining the most efficient one becomes a highly difficult task. Dataflow-based systems delegate the responsibility for this performance-critical decision to the user, which constitutes the main drawback of this approach.
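A back-of-the-envelope calculation makes the gap explicit. Since the exact topologies of Figure 7 are not reproduced here, the numbers below only assume that the first operator of topology (a) consumes the A and B streams, while the alternative topologies start from the slower C stream.

# Rough load estimate at the entry operator, under the stated arrival rates:
# A and B produce a reading every 10 ms, C every 100 ms.
rate_a, rate_b, rate_c = 100, 100, 10      # events per second

entry_load_topology_a = rate_a + rate_b    # starts with A and B: 200 events/s
entry_load_topology_bc = rate_c            # starts with C: 10 events/s

print(entry_load_topology_a, entry_load_topology_bc)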
In contrast, query-based stream processing engines overcome the above issue by offering a different UI paradigm. Instead of creating a SP network, a user provides a human-readable query written in a SQL-like declarative language. This query describes “what” the desired output of a streaming application is rather than “how” to build it. In the above example, the query for ascending temperature readings could be formulated as follows (in the SASE language):
PATTERN AND(A a, B b, C c)
WHERE (a.temp < b.temp AND b.temp < c.temp)
The query expression is internally translated into a stream processing network similar to the one a user explicitly creates in a dataflow-based system. However, it is now the responsibility of the system to apply complex optimization mechanisms in order to select the representation of the input query that results in optimal performance. While only a handful of existing stream processing systems support this approach, its tremendous optimization potential makes it a viable candidate to conquer the future market of stream processing applications.
Figure 7. Different network topologies for detecting a pattern of three readings from sensors A, B, and C, such that A.value < B.value < C.value.
Architecture: centralized vs. multi-core vs. fully distributed
Early SP frameworks generally assumed a centralized architecture. In contrast, later systems were designed with parallelization in mind, ranging from simple shared-memory models to increasingly complicated ones supporting execution in a fully distributed environment. Since SP networks consist of a multitude of independent units, they are inherently parallelizable and can greatly benefit from running on top of a computing cluster or a cloud.
Two main types of parallelization are employed in stream processing systems. First, different stream operators can be assigned to different cores, VMs, or machines. This simple parallelization type is called functional or pipeline parallelism. While simple to implement, its parallelization degree is limited by the number of operators in the application.
The second parallelization method is based on replicating operators of the same type to create multiple instances of the same flow. The input stream is split into multiple sub-streams, each communicating with a different operator or network replica and containing a subset of the input data. For example, the input of a transform operator converting an uppercase word to lowercase could be split according to the first letter of the input word, with different instances covering different ranges of letters. Due to the need to split the data, this parallelization type is called data parallelism. While trivial to implement on a stateless stream calculus operator, increasingly sophisticated methods are required to parallelize stateful (aggregation and CEP) operators.
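A minimal sketch of this kind of key-based stream splitting is shown below; the routing rule and the replica count are illustrative assumptions, not the partitioning scheme of any specific framework.

def route_by_first_letter(word, num_replicas):
    # Data parallelism sketch: choose which operator replica handles a word,
    # based on its first letter, so each replica sees a disjoint sub-stream.
    if not word:
        return 0
    return (ord(word[0].lower()) - ord("a")) % num_replicas

# Each replica runs the same stateless lowercase transform on its sub-stream.
replicas = [[] for _ in range(4)]
for word in ["STREAM", "Processing", "IS", "Parallel"]:
    replicas[route_by_first_letter(word, 4)].append(word.lower())
print(replicas)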
Input data assumptions: deterministic vs. probabilistic
As mentioned above, data items processed by SP systems originate at a variety of external sources. Sometimes we know for sure that the data accepted from these sources is reliable and correct (barring a bad connection channel or a malicious intervention). This is the case for stock price reports and system logs among others. Other data sources are not always guaranteed to deliver the correct values. This is typically the case for all kinds of sensors. For example, an IoT temperature tracker might be faulty, and a smart camera might make an error in object recognition. The readings of these imprecise sensors incur some degree of uncertainty, making it much more difficult to produce accurate results.
Most of today’s SP systems assume all incoming data to be deterministic, certain, and correct. In contrast, probabilistic or uncertain SP systems, of which only academic prototypes are currently available, perceive each data item as an uncertain object whose value is specified by some probability distribution. Still a very active research field, this latter category of systems incorporates complex algorithms with the aim of producing outputs that are as correct and stable as possible given the uncertainty in the inputs. While managing to partially mitigate the problem of uncertain inputs, these systems suffer from increased complexity and significant performance degradation.
Cost model: throughput-oriented vs. latency-oriented vs. other
Lastly, stream processing systems can be distinguished by the performance objective they attempt to optimize. The most popular choices are maximizing the throughput (number of data items processed per second) and minimizing the latency (system response time). The former tends to be the favorite performance metric of dataflow-based SP systems focusing on aggregation/statistic type of functionality, while CEP-oriented systems often prefer the latter. Other, less commonly used metrics include network bandwidth (in distributed stream processing systems), power efficiency, and memory consumption. Mixed cost models combining two or more performance objectives are also widely employed.
The choice of a performance metric is crucial for the design and the functionality of a stream processing system. In some cases, conflicts occur between the different options. For example, processing the input items in batches rather than one-by-one can significantly improve the throughput, but at the same time would increase the latency due to the introduced delay.
The future of stream processing
SP systems have achieved notable success in many real-life scenarios. However, stream processing is still an actively evolving and developing field. The rapidly growing market demand for stream processing solutions serves as a catalyst for further technological advances in this area.
There is no shortage of open problems, unresolved challenges, and directions for future work on stream processing. Utilizing advanced optimization techniques to further improve real-time system performance, making SP systems more adaptive and resilient to on-the-fly changes in the workload, and fitting them to operate in cloud-based environments are among the leading trends in SP-related research.
One particularly interesting research direction considers incorporating state-of-the-art machine learning techniques to increase the scalability of stream processing systems, broaden their range of abilities, and ease their deployment. Machine learning can help achieve the above goals in a number of ways.
First, as mentioned above, optimizing a query-based stream processing system is a complicated task. Rather than relying on sophisticated and computation-heavy optimization algorithms, it might be beneficial to make the system learn “good” configurations by trial and error using reinforcement learning. The learning procedure could be executed automatically at runtime and only incur negligible overhead.
Second, in addition to teaching the system how to do the given job, it might also be possible to make it learn what to do. By utilizing state-of-the-art data mining techniques, a stream processing system could analyze the input data stream and past user choices to make intelligent decisions regarding interesting patterns to be detected.
Finally, the functionality of a SP engine could be extended to predict future outputs of a SP network rather than only providing query answers, aggregate results, and pattern matches based on currently available data. This could become a reality by utilizing recent advances in predictive analysis and time series forecasting to predict future data workloads.