====== Data Preparation for Data Analysis ====== {{:en:iot-open:czapka_p.png?50|General audience classification icon}}{{:en:iot-open:czapka_b.png?50|General audience classification icon}}{{:en:iot-open:czapka_e.png?50|General audience classification icon}}\\

===== Introduction =====

In most cases, data must be prepared before analysis or before applying processing methods. There may be different reasons for this: missing values, sensor malfunctions, different time scales, different units, a specific format required by a given method or algorithm, and many more. Data preparation is therefore as necessary as the analysis itself. While data preparation is usually very specific to a given problem, some general cases and preprocessing tasks prove to be widely useful. Data preprocessing also depends on the nature of the data – it is usually very different for data where the time dimension is essential (time series) than for data where it is not, such as a log of discrete cases for classification with no internal causal dependencies among entries. It must be emphasised that whatever preprocessing is done, it needs to be carefully documented, and the reasoning behind it must be explained so that others can understand the results obtained during the analysis.

===== "Static data" =====

Some of the methods explained here might also be applied to time series, but this must be done with full awareness of the possible implications. Usually, the data should be formatted as a table consisting of rows representing data entries or events and fields representing features of the entry. For instance, a row might represent a room climate data entry, where the fields (factors) represent air temperature, humidity level, CO2 level and other vital measurements. For the sake of simplicity, in this chapter it is assumed that data is formatted as a table.

==== Filling the missing data ====

One of the most common situations is missing sensor measurements, which might be caused by communication channel issues, IoT node malfunctioning or other reasons. Since most data analysis methods require complete entries, it is necessary to ensure that all data fields are present before applying the analysis methods. There are several common approaches to dealing with missing values, illustrated by the sketch after this list:

  * **Random selection** – as the name suggests, this method randomly selects one of the possible values of the data field. If the field is categorical, representing a limited set of possible values, for instance, a set of colours or operation modes, one value from the list is randomly selected. In the case of a continuous value, a random value from the interval is selected. Despite its simplicity, the method is adequate for filling gaps when the fraction of missing values is insignificant. If the fraction of missing values is significant, the method should not be applied because of its implications for the data analysis.
  * **Informed selection** – in essence, this method does the same as random selection, except that additional information on the value distribution of the field (factor) is used. For discrete factor values, the most common value might be selected, while for continuous values, an average value might be selected according to the distribution characteristics. There might be more complex situations which cannot be described by a Gaussian distribution. In those cases, the data analyst needs to make an informed decision on the particular selection mechanism that represents the specifics of the distribution.
  * **Value marking** – this approach might be applied in cases where there is a chance that missing data is a consequence of some critical process; for instance, whenever the engine's temperature reaches a critical value, the pressure sensor stops functioning due to overheating. The analyst might or might not know about the issue; in any case, it is essential to mark those situations to find possible causalities in the data. A dedicated new category, such as "empty", might be introduced if the factor is categorical. In the case of continuous values, a dedicated "impossible" value might be assigned, such as the maximum integer value, the minimum integer value, zero, and others.
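As a minimal illustration of these three strategies, the following Python sketch uses the pandas library (introduced later in this chapter); the data frame, column names and marker value are hypothetical and chosen only for demonstration.

<code python>
import numpy as np
import pandas as pd

# Hypothetical room-climate log with gaps; column names are illustrative only
df = pd.DataFrame({
    "temperature": [21.4, np.nan, 22.1, 21.9, np.nan],
    "mode": ["eco", None, "comfort", "eco", None],
})

rng = np.random.default_rng(seed=0)
known_modes = df["mode"].dropna().unique()

# Random selection: replace missing categories with a randomly chosen known category
df["mode_random"] = df["mode"].apply(
    lambda v: rng.choice(known_modes) if pd.isna(v) else v)

# Informed selection: replace missing numeric values with the column mean
df["temperature_filled"] = df["temperature"].fillna(df["temperature"].mean())

# Value marking: flag gaps with a dedicated category or an "impossible" value
df["mode_marked"] = df["mode"].fillna("empty")
df["temperature_marked"] = df["temperature"].fillna(-9999.0)
</code>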
==== Scaling ====

Scaling is a frequently used method for continuous numerical factors. The main reason for it is that different factors are often observed on different value intervals. It is essential for methods like clustering, where a multi-dimensional Euclidean distance is used: in the case of different scales, one of the dimensions might overwhelm the others simply because of the higher order of magnitude of its numerical values. Usually, scaling is performed by applying a linear transformation of the data with set minimum and maximum values, which mark the desired value interval. In most software packages, like Python Pandas ((https://pandas.pydata.org/)), scaling is implemented as a simple-to-use function. However, it might also be done manually if needed:
{{ :en:iot-reloaded:equationone.png?600 | Scaling}} Scaling
where:
  * V<sub>old</sub> – the old measurement,
  * V<sub>new</sub> – the new (scaled) measurement,
  * m<sub>min</sub> – minimum value of the measurement interval,
  * m<sub>max</sub> – maximum value of the measurement interval,
  * I<sub>min</sub> – minimum value of the desired interval,
  * I<sub>max</sub> – maximum value of the desired interval.
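For completeness, a minimal sketch of this linear transformation in Python with pandas follows; the function name, the example CO2 values and the target interval [0, 1] are assumptions made only for illustration.

<code python>
import pandas as pd

def scale(series: pd.Series, i_min: float = 0.0, i_max: float = 1.0) -> pd.Series:
    """Linearly map the measurements onto the desired interval [i_min, i_max]."""
    m_min, m_max = series.min(), series.max()
    return (series - m_min) / (m_max - m_min) * (i_max - i_min) + i_min

# Hypothetical CO2 readings rescaled to the interval [0, 1]
co2 = pd.Series([410.0, 455.0, 520.0, 600.0])
co2_scaled = scale(co2)
</code>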
==== Normalisation ====

Normalisation is effective when the data distribution is unknown or known to be non-Gaussian (not following the bell curve of a Gaussian distribution). It is beneficial for data with varying scales, especially when using algorithms that do not assume any specific data distribution, such as k-nearest neighbours and artificial neural networks. Normalisation does not change the scale of the values but changes their distribution to resemble a Gaussian distribution. This technique is mainly used in machine learning and is performed with appropriate software packages because the calculations are more complex than those of scaling.

==== Adding dimensions ====

Sometimes, it is necessary to emphasise a particular phenomenon in the data. For instance, it might be very helpful to emphasise the changes in a factor's value, i.e., values that are more distant from 0 should become even larger, while those closer to 0 should not be raised as much. In this case, a simple technique is applying an exponent function to the factor values – squaring them or raising them to the power of 4. If negative values are present, odd powers might be used instead. A variation of the technique is summing up different factor values before or after applying the exponent; in this case, a group of similar values representing the same phenomenon emphasises it. Any other function can be applied to represent the specifics of the problem.

===== Time series =====

Time series usually represent the dynamics of some process, and therefore, the order of the data entries has to be preserved. This means that in most cases, all of the methods mentioned above might be used as long as the data order remains the same. A time series is simply a set of data – usually events – arranged by a time marker. Typically, time series are arranged in the order in which the events occur or are recorded. Several important consequences follow from this simple fact:

  * The sequence of events must be respected in any data manipulation;
  * The arrangement of events in time is not only the order of data arrival but a reflection of a certain process and its development in time;
  * The sequence of events reflects the causal relations of this process, which we try to discover through data analysis.

=== Time Series Analysis Questions ===

Therefore, there are several questions that data analysis typically tries to answer:

  * Is the process stationary, or is it variable over time?
  * If the process is dynamic, is there a direction of development:
    * Is the process chaotic or regular?
    * Is there periodicity in the dynamics of the process?
  * Are there any regularities between the individual changes of the parameters characterising the process – correlation?
  * Does the dynamics of the process depend on changes in parameters of the external environment that we can influence, i.e., is the process adaptive?

=== Some definitions ===

**Autocorrelation** – A process is autocorrelated if the similarity of the values of a given observation is a function of the time between observations. In other words, the difference between the values of the observations depends on the interval between the observations. This does not mean that the process values are identical, but that the difference between them is similar. The process can equally well be decaying or growing in the mean value or amplitude of the measurements, but the difference between subsequent measurements is always the same (or close to it).

**Seasonality** – A process is seasonal if the deviation from the average value is repeated periodically. This does not mean the values must match perfectly, but there must be a general tendency to deviate periodically from the average value. A perfect example is a sinusoid.

**Stationarity** – A process is stationary if its statistical properties do not change over time. Generally, the mean and variance over a period serve as good measures. In practice, a certain tolerance interval is used to tell whether a process is stationary, since ideal (noise-free) cases do not tend to occur in practice. For example, temperature measurements over several years are stationary and seasonal. They are not autocorrelated because temperatures are still relatively variable across days. Numerically, stationarity is evaluated with the so-called Dickey-Fuller test ((Dickey, D. A.; Fuller, W. A. (1979). "Distribution of the Estimators for Autoregressive Time Series with a Unit Root". Journal of the American Statistical Association. 74 (366): 427–431. doi:10.1080/01621459.1979.10482531. JSTOR 2286348.)), which uses a linear regression model to measure change over time at a given time step. The model's t-test ((Blair, R. Clifford; Higgins, James J. (1980). "A Comparison of the Power of Wilcoxon's Rank-Sum Statistic to That of Student's t Statistic Under Various Nonnormal Distributions". Journal of Educational Statistics. 5 (4): 309–335. doi:10.2307/1164905. JSTOR 1164905.)) indicates how statistically strong the hypothesis of process stationarity is.
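As a brief illustration, stationarity can be checked in Python with the Dickey-Fuller test provided by the statsmodels package (assuming it is installed); the synthetic series below is merely a stand-in for real measurements.

<code python>
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller  # augmented Dickey-Fuller test

# Synthetic stand-in for a measured signal: noise around a constant mean
rng = np.random.default_rng(seed=0)
series = pd.Series(21.0 + rng.normal(0.0, 0.5, 200))

adf_statistic, p_value, *_ = adfuller(series)
print(f"ADF statistic: {adf_statistic:.3f}, p-value: {p_value:.4f}")
# A small p-value (e.g. below 0.05) supports rejecting the unit-root hypothesis,
# i.e. the series is likely stationary.
</code>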
==== Time series modelling ====

In many cases, it is necessary to emphasise the main pattern of the time series while removing the "noise". In general, there are two main techniques – decimation and smoothing. Both are widely used but need to be treated carefully.

=== Moving average (sliding average) ===

The essence of the method is to obtain an average value within a certain time window M, thereby giving inertia to the incoming signal and reducing the noise's impact on the overall analysis result. Different effects might be obtained depending on the size of the time window M.

{{ :en:iot-reloaded:equationtwo.png?400 | Moving average}} Moving average
where:
  * SMA<sub>t</sub> – the new smoothed value at time instant t;
  * X<sub>i</sub> – the i-th measurement at time instant i;
  * M – the time window.

The image below demonstrates the effects of time window sizes of 10 and 100 measurements on an incoming signal from a freezer's thermometer:

  * First, it needs to be emphasised that the moving average adds a slight lag to the incoming data, i.e., the rises and falls of the values are slightly behind the original values.
  * In the case of M = 10, the overall shape of the time series is preserved, while the noise is removed.
  * In the case of M = 100, the shape of the time series is transformed into a new function, which no longer represents the main features of the original measurements. For instance, rises are replaced by falls and vice versa, while the data spike merges with the following rise and forms one larger rise of the signal. As a result, the essential features of the original signal are lost.
{{ :en:iot-reloaded:slidingaverage.png?600 | Moving average}} Moving average
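A minimal sketch of computing such moving averages with pandas is given below; the synthetic freezer-temperature series and the variable names are assumptions made for illustration.

<code python>
import numpy as np
import pandas as pd

# Synthetic stand-in for the freezer thermometer readings
rng = np.random.default_rng(seed=1)
temps = pd.Series(-18.0 + rng.normal(0.0, 0.5, 1000))

# Simple moving averages with the two window sizes discussed above;
# the first M-1 values are NaN until a full window is available
sma_10 = temps.rolling(window=10).mean()
sma_100 = temps.rolling(window=100).mean()
</code>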
=== Exponential moving average ===

The exponential moving average is widely used in noise filtering, for example, in the analysis of changes in stock markets. Its main idea is that each measurement's weight (influence) decreases exponentially as the measurement gets older. Thus, the evaluation relies more on recent measurements and less on older ones.
{{ :en:iot-reloaded:equationthree.png?400 | Exponential moving average}} Exponential moving average
where:
  * EMA<sub>t</sub> – the new smoothed value at time instant t;
  * X<sub>i</sub> – the i-th measurement at time instant i;
  * Alpha – a smoothing factor between 0 and 1, which reflects the weight of the last (most recent) measurement.

As seen in the picture below, the exponential moving average preserves the shape of the initial signal for different values of the weighting factor. It has minimal lag while removing the noise, which makes it a handy smoothing technique.
{{ :en:iot-reloaded:exponentialaverage.png?600 | Exponential moving average}} Exponential moving average
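For reference, a minimal pandas sketch of the exponential moving average is shown below; the synthetic series and the two alpha values are illustrative assumptions.

<code python>
import numpy as np
import pandas as pd

# Synthetic stand-in for a noisy measurement signal
rng = np.random.default_rng(seed=2)
signal = pd.Series(-18.0 + rng.normal(0.0, 0.5, 1000))

# Exponential moving average; with adjust=False pandas applies the recursive form
# EMA_t = alpha * X_t + (1 - alpha) * EMA_(t-1)
ema_fast = signal.ewm(alpha=0.3, adjust=False).mean()    # follows the signal more closely
ema_smooth = signal.ewm(alpha=0.05, adjust=False).mean() # smoother, slightly more lag
</code>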
=== Decimation ===

Decimation is a technique of excluding some entries from the initial time series to reduce overwhelming or redundant data. As the name suggests, every tenth entry is usually excluded, reducing the data by 10%. It is a simple method that brings significant benefits in cases of over-measured processes with slow dynamics. With preserved time stamps, the data still allows the application of general time-series analysis techniques such as forecasting.
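A minimal pandas sketch of this idea follows; the data frame, the timestamp column and the stronger downsampling variant are hypothetical and shown only for illustration.

<code python>
import pandas as pd

# Hypothetical over-sampled measurement log with preserved timestamps
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="s"),
    "value": range(100),
})

# Decimation as described above: exclude every tenth entry (about a 10% reduction)
decimated = df.drop(df.index[9::10]).reset_index(drop=True)

# A stronger variant keeps only every tenth entry (reduction to about 10% of the data)
downsampled = df.iloc[::10].reset_index(drop=True)
</code>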