=====IoT Data Lifecycle=====
Data processing is the conversion of raw data into meaningful information. Data is manipulated to produce results that lead to the resolution of a problem or the improvement of an existing situation. Like a production process, it follows a cycle in which inputs (raw data) are fed into a process (computer systems, software, etc.) to produce output (information and insights). Organisations generally employ computer systems to carry out a series of operations on the data in order to present, interpret, or obtain information. The process includes activities such as data entry, summarisation, calculation, and storage. A useful and informative output is presented in appropriate forms, such as diagrams, reports, and graphics.
The lifecycle of data within an IoT system proceeds from data production to aggregation, transfer, optional filtering and preprocessing, and finally to storage and archiving. Querying and analysis are the endpoints that initiate (request) and consume data production, although data products can also be "pushed" to consuming IoT services. Production, collection, aggregation, filtering, and some basic querying and preliminary processing are considered online, communication-intensive operations. Intensive preprocessing, long-term storage and archival, and in-depth processing/analysis are considered offline, storage-intensive operations.
Storage operations aim at making data available in the long term for constant access and updates, while archival is concerned with read-only data. Since some IoT systems may generate, process, and store data in-network for real-time and localised services, with no need to propagate this data further up to concentration points in the system, edge devices that combine both processing and storage elements may exist as autonomous units in the cycle. In the following paragraphs, each element of the IoT data lifecycle is explained.
- **Querying**: data-intensive systems rely on querying as the core process for accessing and retrieving data. In the context of the IoT, a query can be issued either to request real-time data to be collected for temporal monitoring purposes or to retrieve a certain view of the data stored within the system. The first case is typical of (mostly localised) real-time requests for data. The second case represents more globalised views of data and in-depth analysis of trends and patterns.
- **Production**: data production involves the sensing and transfer of data by the edge devices within the IoT framework. The data is reported to interested parties periodically (as in a subscribe/notify model), pushed up the network to aggregation points and subsequently to database servers, or sent as a response triggered by queries that request data from sensors and smart objects. Data is usually time-stamped and possibly geo-stamped, and can take the form of simple key-value pairs or rich (unstructured) audio/image/video content, with varying degrees of complexity in between; a sketch of such a reading follows this list.
- **Collection**: the sensors and smart objects within the IoT may store the data for a certain time interval or report it to governing components. Data may be collected at concentration points or gateways within the network, where it is further filtered and processed, and possibly fused into compact forms for efficient transmission. Wireless communication technologies such as Zigbee, Wi-Fi and mobile networks are used by objects to send data to collection points. Collection is the first stage of the cycle and is crucial, since the quality of the data collected heavily impacts the output. The collection process needs to ensure that the data gathered are both well defined and accurate, so that subsequent decisions based on the findings are valid. This stage provides both the baseline from which to measure and a target for what to improve. Some types of data collection include census (data collection about everything in a group or statistical population), sample survey (a collection method that includes only part of the total population), and administrative by-product (data collection as a byproduct of an organisation's day-to-day operations).
- **Aggregation/fusion**: transmitting all the raw data out of the network in real-time is often prohibitively expensive, given the increasing data streaming rates and the limited bandwidth. Aggregation and fusion techniques deploy summarisation and merging operations in real-time to compress the volume of data to be stored and transmitted; a windowed-summarisation sketch follows this list.
- **Delivery**: as data is filtered, aggregated, and possibly processed either at the concentration points or at the autonomous virtual units within the IoT, the results of these processes may need to be sent further up the system, either as final responses or for storage and in-depth analysis. Wired or wireless broadband communications may be used at this stage to transfer data to permanent data stores.
- **Preprocessing**: IoT data will come from different sources with varying formats and structures. Data may need to be preprocessed to handle missing data, remove redundancies, and integrate data from different sources into a unified schema before being committed to storage. Preparation is the manipulation of data into a form suitable for further analysis and processing: constructing a dataset from one or more data sources and checking it for accuracy, since raw data cannot be processed as-is. Analysing data that has not been carefully screened for problems can produce highly misleading results, as results depend heavily on the quality of the prepared data. This preprocessing is a procedure known in data mining as data cleaning (a cleaning sketch follows this list). Schema integration does not imply brute-force fitting of all the data into a fixed relational (tables) schema, but rather a more abstract definition of a consistent way to access the data without having to customise access for each source's data format(s). Probabilities at different levels in the schema may be added at this phase to the IoT data items in order to handle the uncertainty that may be present in the data or to deal with the lack of trust that may exist in data sources.
- **Storage/update and archiving**: this phase handles the efficient storage and organisation of data, as well as the continuous update of data with new information as it becomes available. Archiving refers to the offline long-term storage of data that is not immediately needed for the system's ongoing operations. The importance of this step is that it allows quick access and retrieval of the processed information, allowing it to be passed on to the next stage directly when needed. The core of centralised storage is the deployment of storage structures that adapt to the various data types and the frequency of data capture. Relational database management systems are a popular choice; they organise data into a table schema with predefined interrelationships and metadata for efficient retrieval at later stages. NoSQL key-value stores are gaining popularity as storage technologies for their support of big data storage with no reliance on a relational schema or the strong consistency requirements typical of relational database systems (a toy key-value sketch follows this list). Storage can also be decentralised for autonomous IoT systems, where data is kept at the objects that generate it and is not sent up the system. However, due to the limited capabilities of such objects, storage capacity remains limited in comparison to the centralised storage model.
- **Processing/analysis**: this phase involves the ongoing retrieval and analysis operations performed on stored and archived data in order to gain insights into historical data and predict future trends, or to detect abnormalities in the data that may trigger further investigation or action (an anomaly-detection sketch follows this list). Task-specific preprocessing may be required to filter and clean data before meaningful operations can take place. When an IoT subsystem is autonomous and does not require permanent storage of its data, but rather keeps the processing and storage in the network, in-network processing may be performed in response to real-time or localised queries.
- **Output and interpretation**: this is the stage at which processed information is transmitted to the user. Output is presented to users in various visual formats such as diagrams, infographics, printed reports, audio, video, etc. The output needs to be interpreted so that it can provide meaningful information that will guide the future decisions of the organisation.
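To make the production stage concrete, below is a minimal Python sketch of a time-stamped, geo-stamped key-value reading of the kind described in the **Production** item. The field names (`device_id`, `ts`, `lat`, `lon`, `readings`) are illustrative assumptions, not part of any particular IoT standard.

```python
import json
import time

def make_reading(device_id, lat, lon, **values):
    """Build a time-stamped, geo-stamped key-value reading (illustrative schema)."""
    return {
        "device_id": device_id,
        "ts": time.time(),       # time stamp: seconds since the epoch
        "lat": lat, "lon": lon,  # geo stamp
        "readings": values,      # simple key-value payload
    }

# A device pushing a reading up the network, e.g. under a subscribe/notify
# model, would serialise it and hand it to the transport layer.
reading = make_reading("sensor-42", 52.52, 13.405,
                       temperature_c=21.7, humidity_pct=48.0)
print(json.dumps(reading))
```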
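The **Aggregation/fusion** item can be illustrated with windowed summarisation: instead of forwarding every raw sample, a concentration point emits one compact summary per window. This is only one possible summarisation policy, sketched under the assumption of a numeric sample stream.

```python
from statistics import mean

def summarise(stream, window=10):
    """Compress a raw sample stream: one (min, mean, max) summary per window."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) == window:
            yield {"min": min(buf), "mean": mean(buf), "max": max(buf)}
            buf.clear()
    if buf:  # flush the final, possibly partial window
        yield {"min": min(buf), "mean": mean(buf), "max": max(buf)}

# 100 raw samples shrink to 10 summaries before transmission upstream.
summaries = list(summarise(range(100), window=10))
print(len(summaries), summaries[0])
```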
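For the **Preprocessing** item, the sketch below shows one simple cleaning and schema-integration policy: records from two hypothetical sources with different formats are mapped onto a shared schema, and records with missing values are dropped. The source formats and field names are invented for illustration.

```python
# Two hypothetical sources reporting temperature under different schemas.
source_a = [{"id": "a1", "temp": 21.5}, {"id": "a2", "temp": None}]
source_b = [{"sensor": "b1", "temperature_f": 70.3}]

def unify(record):
    """Map a source-specific record onto one shared schema; drop records
    with missing values (one simple data-cleaning policy among many)."""
    if "temp" in record:                  # source A reports Celsius directly
        if record["temp"] is None:
            return None                   # handle missing data by dropping
        return {"sensor_id": record["id"], "temp_c": record["temp"]}
    if "temperature_f" in record:         # source B reports Fahrenheit
        return {"sensor_id": record["sensor"],
                "temp_c": round((record["temperature_f"] - 32) * 5 / 9, 2)}
    return None                           # unknown format: reject

clean = [r for r in (unify(x) for x in source_a + source_b) if r is not None]
print(clean)  # every surviving record now follows the same schema
```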
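The **Storage/update and archiving** item mentions NoSQL key-value stores; the toy sketch below imitates their spirit with an in-memory dictionary keyed by device and timestamp, so that payloads of different shapes can coexist without a fixed relational schema. A real deployment would use an actual store; this is purely illustrative.

```python
# A toy in-memory key-value store: readings live under a (device_id, ts) key,
# and payloads of different shapes coexist with no fixed relational schema.
store = {}

def put(reading):
    """Insert a reading, or update it if the same key arrives again."""
    store[(reading["device_id"], reading["ts"])] = reading

put({"device_id": "sensor-42", "ts": 1700000000.0,
     "readings": {"temperature_c": 21.7}})
put({"device_id": "cam-7", "ts": 1700000005.0,
     "blob_ref": "frame-0001"})  # a differently shaped record is fine
print(len(store))
```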
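Finally, for the **Processing/analysis** item, a minimal sketch of detecting abnormalities in stored data: a z-score test flags readings that deviate from the historical mean by more than a chosen number of standard deviations. The threshold and the data are invented; real analyses would be considerably richer.

```python
from statistics import mean, stdev

def anomalies(history, threshold=3.0):
    """Flag readings deviating from the historical mean by more than
    `threshold` standard deviations (a simple z-score test)."""
    mu, sigma = mean(history), stdev(history)
    return [x for x in history if sigma and abs(x - mu) > threshold * sigma]

stored = [21.0, 21.2, 20.9, 21.1, 35.0, 21.0, 20.8]  # archived temperatures
print(anomalies(stored, threshold=2.0))  # -> [35.0]
```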
Depending on the architecture of an IoT system and actual data management requirements in place, some of the steps described above can be omitted. Nevertheless, it is possible to distinguish three main patterns for the IoT data flow.
* In relatively autonomous IoT systems, data proceeds from query to production to in-network processing and then delivery.
* In more centralised systems, the data flow starts from production, proceeds to collection and filtering/aggregation/fusion, and ends with data delivery to the initiating (possibly global or near-real-time) queries.
* In fully centralised systems, the data flow extends further beyond production and aggregation to include preprocessing, permanent data storage and archival, and in-depth processing and analysis.