===== ===== ===== IoT Data Processing Models and Frameworks===== Processing frameworks and processing engines are responsible for computing over data in a data system. While there is no authoritative definition setting apart "engines" from "frameworks", it is sometimes useful to define the former as the actual component responsible for operating on data and the latter as a set of elements designed to do the same. For instance, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used in tandem. For instance, Apache Spark, another framework, can hook into Hadoop to replace MapReduce. This interoperability between components is one reason that big data systems have great flexibility. While the systems which handle this stage of the data lifecycle can be complex, the goals on a broad level are very similar: operate over data to increase understanding, surface patterns, and gain insight into complex interactions. To simplify the discussion of these components, we will group these processing frameworks by the state of the data they are designed to handle. Some systems handle data in batches, while others process data in a continuous stream as it flows into the system. Still, others can manage data in either of these ways. ===Batch Processing Systems=== Batch processing has a long history within the big data world. Batch processing involves operating over a large, static dataset and returning the result at a later time when the computation is complete. The datasets in batch processing are typically bounded: batch datasets represent a limited collection of data persistent: data is almost always backed by some permanent storage large: batch operations are often the only option for processing extensive sets of data Batch processing is well-suited for calculations where access to a complete set of records is required. For instance, when calculating totals and averages, datasets must be treated holistically instead of as a collection of individual records. These operations require that state is maintained for the duration of the calculations. Tasks that need vast volumes of data are often best handled by batch operations. Whether the datasets are processed directly from permanent storage or loaded into memory, batch systems are built with large quantities in mind and have the resources to handle them. Because batch processing excels at managing large volumes of persistent data, it frequently is used with historical data. The trade-off for handling large quantities of data is a longer computation time. Because of this, batch processing is not appropriate in situations where processing time is especially significant. Because this methodology heavily depends on permanent storage, reading and writing multiple times per task, it tends to be somewhat slow. On the other hand, since disk space is typically one of the most abundant server resources, it means that MapReduce can handle enormous datasets. MapReduce has incredible scalability potential and has been used in production on tens of thousands of nodes. Apache Hadoop: Apache Hadoop is a processing framework that exclusively provides batch processing. Hadoop was the first big data framework to gain significant traction in the open-source community. Based on several papers and presentations by Google about how they were dealing with tremendous amounts of data at the time, Hadoop reimplemented the algorithms and component stack to make large-scale batch processing more accessible. Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model that is best suited for handling extensive datasets where time is not a significant factor. The low cost of components necessary for a well-functioning Hadoop cluster makes this processing inexpensive and useful for many use cases. Compatibility and integration with other frameworks and engines mean that Hadoop can often serve as the foundation for multiple processing workloads using diverse technology. ===Stream Processing Systems=== Stream processing systems compute over data as it enters the system. It requires a different processing model than the batch paradigm. Instead of defining operations to apply to an entire dataset, stream processors determine processes that will be used to each individual data item as it passes through the system. The datasets in stream processing are considered "unbounded". It has a few important implications: * the total dataset is only defined as the amount of data that has entered the system so far; * the working dataset is perhaps more relevant and is limited to a single item at a time; * processing is event-based and does not "end" until explicitly stopped. Results are immediately available and will be continually updated as new data arrives. Stream processing systems can handle a nearly unlimited amount of data, but they only process one (true stream processing) or very few (micro-batch processing) items at a time, with a minimal state being maintained in between records. While most systems provide methods of maintaining some state, stream processing is highly optimised for more functional processing with few side effects. Functional operations focus on discrete steps that have limited state or side-effects. Performing the same operation on the same piece of data will produce the same output independent of other factors. This kind of processing fits well with streams because state between items is usually some combination of challenging, limited, and sometimes undesirable. So while some type of state management is generally possible, these frameworks are much simpler and more efficient in their absence. This type of processing lends itself to certain kinds of workloads. Processing with near real-time requirements is well served by the streaming model. Analytics, server or application error logging, and other time-based metrics are a natural fit because reacting to changes in these areas can be critical to business functions. Stream processing is a good fit for data where you must respond to changes or spikes and where you're interested in trends over time. * Apache Storm. * Apache Samza. ===Hybrid Processing Systems=== Some processing frameworks can handle both batch and stream workloads. These frameworks simplify diverse processing requirements by allowing the same or related components and APIs to be used for both types of data. The way that this is achieved varies significantly between Spark and Flink, the two frameworks we will discuss. It is mainly a function of how the two processing paradigms are brought together and what assumptions are made about the relationship between fixed and unfixed datasets. While projects focused on one processing type may be a close fit for specific use-cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data, but they also have their integrations, libraries, and tools for doing things like graph analysis, machine learning, and interactive querying. * Apache Spark. * Apache Flink.