Table of Contents

Data Products Development

In the previous chapter, some essential properties of Big Data systems have been discussed and how and why IoT systems relate to Big Data problems. In any IoT implementation, data processing is the system's heart, which at least transforms into a data product. While it is still mainly a software subsystem, its development differs significantly from that of a regular software product. The difference is expressed through the roles involved and the lifecycle itself. It is often wrongly assumed that the main contributor is the data scientist responsible for developing a particular data processing or forecasting algorithm. It is somewhat valid, except other essential roles are vital to success. The team of developers playing the roles might be as small as three or as large as 20 people, depending on the scale of the project. The leading roles are explained below.

Business user

Business users have good knowledge of the application domain and, in most cases, benefit significantly from the developed data product. They know how to transform data into a business value in the organisation. Typically, they take positions like Production manager, Business/market analyst, and Domain expert.

Project sponsor

He is the one who defines the business problem and is triggering the birth of the project. He represents the project's scope and volume and meets the necessary provisions. While he defines project priorities, he does not have deep knowledge or skills in the technology, algorithms, or methods used.

Project manager

As in most software projects, the project manager is responsible for meeting project requirements and specifications within the given time frame and available provisions. He selects the needed talents, chooses development methods and tools, and selects goals for the development team members. Usually, he reports to the project sponsor and ensures that information flows within the team.

Business information analyst

He possesses deep knowledge in the given business domain, supported by his skills and experience. Therefore, he is a valuable asset for the team in understanding the data's content, origin, and possible meaning. He defines the key performance indicators (KPI) and metrics to assess the project's success level. He selects information and data sources to prepare information and data dashboards for the organisation's decision-makers.

Database administrator

He is responsible for configuring the development environment and Database (one, many, or a complex distributed system). In most cases, the configuration must meet specific performance requirements, which must be maintained. He ensures secure access to the data for the team members. During the project, he backs up data, restores it if needed, updates configuration, and provides other support.

Data engineer

Data engineers usually have deep technical knowledge of data manipulation methods and techniques. During the project, he tuned data manipulation procedures, SQL queries, and memory management and developed specific stored or server-side procedures. He is responsible for extracting particular data chunks for the Sandbox environment and formatting and tuning them according to data scientists' needs.

Data scientist

Develops or selects data processing models needed to meet the project specifications. Develops, tests and implements data processing methods and algorithms; develops decision-making support methods and their implementations for some projects. Provides needed research capacities for selecting and developing the data processing methods and models.

As it might be noticed, there is no doubt that the Data Scientist is playing a vital role, but only in cooperation with the other roles. For a single person, depending on their competencies and capacities, roles might overlap, or a single team member could provide several roles. Once the team is built, the development process can start. As with any other product development, data product development follows a specific life cycle of phases. Depending on particular project needs, there might be variations, but the data product development follows the well-known waterfall pattern in most cases. The phases are explained in the figure 1:

 Data Product Life Cycle
Figure 1: Data Product Life Cycle

Discovery

The project team learns about the problem domain, the problem itself, its structure, and possible data sources and defines the initial hypothesis. The phase involves interviewing the stakeholders and other potentially related parties to reach as broad an insight as necessary. It said that during this phase, the problem is farmed – defined the analytical problem, indicators of the success for the potential solutions, business goals and scope. To understand business needs, the project sponsor is involved in the process from the very beginning. The identified data sources might include external systems or APIs, sensors of different types, static data sources, official statistics and other vital sources. One of the primary outcomes of the phase is the Initial Hypothesis (IH), which concisely represents the team's vision of the problem and potential solution simultaneously. For instance, “Introduction of deep learning models for sensor time series forecast provides at least 25% better performance over statistical methods used at the moment.” Whatever the IH is, it is a much better starting point than defining the hypothesis during the project implementation in later phases.

Data preparation

The phase focuses on creating a sandbox system by extracting, transforming and loading it into a sandbox system (ETL – Extract, Transform, Load). This is usually the most prolonged phase in terms of time and can take up 50% of the total time allocated to the project. Unfortunately, most teams tend to underestimate this time consumption, which costs the project manager and analysts dearly, leading to losing trust in the project's success. Data scientists given a unique role and authority in the team tend to “skip” this phase and go directly to phase 3 or 4, which is costly because of incorrect or insufficient data to solve the problem.

  1. Data analysis sandbox - The client's operational data, log (window), raw streams, etc., are copied. There is a possibility of a natural conflict where Data scientists want everything, and IT «service» provides a minimum. The needs must, therefore, be explained through a thorough argument. The sandbox can be 5 – 10 times larger than the original dataset!
  2. Carrying out ETLs - The data is retrieved, transformed and loaded back into the sandbox system. Sometimes, simple data filtering excludes outliers and cleans the data. Due to the volume of data, there may be a need for parallelisation of data transfers, which leads to the need for appropriate software and hardware infrastructure. In addition, various web services and interfaces are used to obtain context.
  3. Exploring the content of the data - The main task is to get to know the content of the extracted data. A data catalogue or vocabulary is created (small projects can skip this step). Data research allows for identifying data gaps and technology flaws, as well as teams' own and extraneous data (for determining responsibilities and limitations).
  4. Data conditioning - Slicing and combining are the most common actions in this step. The compatibility of data subsets with each other after the performed manipulations is checked to exclude systematic errors – errors that occur as a result of incorrect manipulation (formatting of data, filling in voids, etc…). During this step, the team ensures the time, metadata, and content match.
  5. Reporting and visualising - This step uses general visualisation techniques, providing a high-level overview – value distributions, histograms, correlations, etc. explaining the data content. It is necessary to check whether the data represent the problem sphere, how the value distributions “behave” throughout the dataset, and whether the details are sufficient to solve the problem.

Model planning

The main task of the phase is to select model candidates for data clustering, classification or other needs consistent with the Initial Hypothesis from Phase 1.

  1. Exploring data and selecting variables - The aim is to discover and understand variables' interrelationships through visualisations. The identified stakeholders are an excellent source of relevant insights about internal data relationships – even if they do not know the reasons! These steps allow the selection of key factors instead of checking all against all.
  2. Selection of methods or models - During this step, the team creates a list of methods that match the data and the problem. A typical approach is making many trim model prototypes using ready-made tools and prototyping packages, such as R, SPSS, Excel, Python, and other specific tools. Tools typical of the phase might include but are not limited to R or Python, SQL and OLAP, Matlab, SPSS, and Excel (for simpler models).

Model development

During this phase, the initially selected trim models are implemented on a full scale concerning the gathered data. The main question is whether the data is enough to solve the problem. There are several steps to be performed:

  1. Data preparation - Specific subsets of data are created, such as training, testing, and validation. The data is adjusted to the selected initial data formatting and structuring methods.
  2. Model development - Usually, conceptually, it is very complex but relatively short in terms of time.
  3. Model testing - The models shall be operated and tuned using the selected tools and training datasets to optimise the models and ensure their resilience to incoming data variations. All decisions must be documented! This is important because all other team roles require detailed decision-making reasoning, especially during communication and operationalisation.
  4. Key points to be answered during the phase area:
    • Is the model accurate enough?
    • Are the results obtained meaningful in relation to the objectives set?
    • Don't models make unacceptable mistakes?
    • Is the data enough?

In some areas, false positives are more dangerous than false negatives. For example, targeting systems may inadvertently target “their own”.

Communication

During this phase, the results must be compared against the established quality criteria and presented to those involved in the project. It is important not to show any drafts outside a group of data scientists! - The methods used by most of those involved are too complex, which leads to incorrect conclusions and unnecessary communication to the team. Usually, the team is biased in not accepting the results, which falsifies the hypotheses, taking it too personally. However, the data led the team to the conclusions, not the team itself! Anyway, it must be verified that the results are statistically reliable. If not, the results are not presented. It is also essential to present all the obtained side results, as they almost always provide additional value to the business. The general conclusions need to be complemented by sufficiently broad insights into the interpretation of the results, which is necessary for users of the results and decision-makers.

Operationalisation

The results presented are first integrated into the pilot project before full-scale implementation, after which the widespread roll-out follows the pilot's tests in the production environment. During this phase, some performance gaps may require replacing, for instance, Python or R code with compiled code. Expectations for each of the roles during this phase: