In the previous chapter, the essential properties of Big Data systems were discussed, along with how and why IoT systems relate to Big Data problems. In any IoT implementation, data processing is the heart of the system and ultimately takes shape as a data product. While it is still mainly a software subsystem, its development differs significantly from that of a regular software product. The difference is expressed through the roles involved and the development lifecycle itself. It is often wrongly assumed that the main contributor is the data scientist, who is responsible for developing a particular data processing or forecasting algorithm. This is partly true, but other roles are equally vital to success. The team playing these roles might be as small as three or as large as twenty people, depending on the scale of the project. The leading roles are explained below.
Business users have good knowledge of the application domain and, in most cases, benefit significantly from the developed data product. They know how to transform data into business value for the organisation. Typically, they hold positions such as production manager, business/market analyst, or domain expert.
The project sponsor defines the business problem and triggers the birth of the project. He determines the project's scope and volume and provides the necessary resources. While he defines project priorities, he does not have deep knowledge of or skills in the technologies, algorithms, or methods used.
As in most software projects, the project manager is responsible for meeting project requirements and specifications within the given time frame and with the available resources. He selects the needed talent, chooses development methods and tools, and sets goals for the development team members. Usually, he reports to the project sponsor and ensures that information flows within the team.
The business intelligence analyst possesses deep knowledge of the given business domain, supported by skills and experience. He is therefore a valuable asset to the team in understanding the data's content, origin, and possible meaning. He defines the key performance indicators (KPIs) and metrics used to assess the project's level of success, and he selects information and data sources to prepare information and data dashboards for the organisation's decision-makers.
The database administrator (DBA) is responsible for configuring the development environment and the database (one, many, or a complex distributed system). In most cases, the configuration must meet specific performance requirements, which must be maintained throughout the project. He ensures secure access to the data for the team members. During the project, he backs up data, restores it if needed, updates the configuration, and provides other support.
Data engineers usually have deep technical knowledge of data manipulation methods and techniques. During the project, the data engineer tunes data manipulation procedures, SQL queries, and memory management, and develops specific stored or server-side procedures. He is responsible for extracting particular data chunks into the sandbox environment and formatting and tuning them according to the data scientists' needs.
The data scientist develops or selects the data processing models needed to meet the project specifications. He develops, tests, and implements data processing methods and algorithms and, for some projects, develops decision-making support methods and their implementations. He also provides the research capacity needed for selecting and developing the data processing methods and models.
As can be seen, the data scientist undoubtedly plays a vital role, but only in cooperation with the other roles. Depending on competencies and capacities, roles might overlap, or a single team member could fill several roles. Once the team is built, the development process can start. As with any other product, data product development follows a specific life cycle of phases. Depending on particular project needs there might be variations, but in most cases data product development follows the well-known waterfall pattern. The phases are explained in Figure 1:
In the first phase, the project team learns about the problem domain, the problem itself, its structure, and possible data sources, and defines the initial hypothesis. The phase involves interviewing the stakeholders and other potentially related parties to gain as broad an insight as necessary. It is said that during this phase the problem is framed: the analytical problem is defined, along with success indicators for potential solutions, business goals, and scope. To understand business needs, the project sponsor is involved in the process from the very beginning. The identified data sources might include external systems or APIs, sensors of different types, static data sources, official statistics, and other vital sources. One of the primary outcomes of the phase is the Initial Hypothesis (IH), which concisely represents the team's vision of both the problem and a potential solution. For instance: "Introduction of deep learning models for sensor time series forecasting provides at least 25% better performance than the statistical methods currently in use." Whatever the IH is, it is a much better starting point than defining the hypothesis during implementation in later phases.
The second phase focuses on creating an analytics sandbox by extracting data from the identified sources, transforming it, and loading it into the sandbox (ETL: Extract, Transform, Load). This is usually the most prolonged phase and can take up to 50% of the total time allocated to the project. Unfortunately, most teams tend to underestimate this time consumption, which costs the project manager and analysts dearly and erodes trust in the project's success. Data scientists, given a unique role and authority in the team, tend to "skip" this phase and go directly to phase 3 or 4, which proves costly when the data turns out to be incorrect or insufficient to solve the problem.
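As a minimal sketch of what this phase produces, the following Python snippet extracts a bounded slice of sensor readings, cleans it, and loads it into the sandbox. The connection string, table and column names, valid ranges, and resampling interval are illustrative assumptions, not part of any particular project.

```python
# Minimal ETL sketch for the analytics sandbox (illustrative names throughout).
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://etl_user:password@prod-db/telemetry")  # hypothetical source

# Extract: pull a bounded time window rather than the whole table.
query = text("""
    SELECT device_id, measured_at, temperature, humidity
    FROM sensor_readings
    WHERE measured_at >= :start AND measured_at < :end
""")
with engine.connect() as conn:
    raw = pd.read_sql(query, conn, params={"start": "2023-01-01", "end": "2023-02-01"})

# Transform: enforce types, drop duplicates, discard physically impossible readings.
raw["measured_at"] = pd.to_datetime(raw["measured_at"], utc=True)
raw = raw.drop_duplicates(subset=["device_id", "measured_at"])
raw = raw[raw["temperature"].between(-50, 60) & raw["humidity"].between(0, 100)]

# Aggregate onto a regular 10-minute grid per device, the shape most models expect.
clean = (
    raw.set_index("measured_at")
       .groupby("device_id")[["temperature", "humidity"]]
       .resample("10min")
       .mean()
)

# Load: store the prepared slice in the sandbox for the data scientists.
clean.to_parquet("sandbox/sensor_readings_2023-01.parquet")
```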
The main task of the third phase is to select candidate models for data clustering, classification, or other needs consistent with the Initial Hypothesis from the first phase.
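In practice, candidate selection often amounts to screening a few models against the sandbox data with cross-validation before committing to one. The sketch below assumes a labelled sandbox file, feature and target names, and a candidate list that are purely illustrative.

```python
# Screening candidate classifiers with cross-validation (illustrative data and names).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

data = pd.read_parquet("sandbox/labelled_training_set.parquet")  # hypothetical prepared data
X, y = data[["temperature", "humidity"]], data["failure_within_24h"]

candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Report the mean and spread of the F1 score for each candidate.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A trivial baseline is kept in the comparison so that any selected model has to demonstrate value over the simplest possible approach.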
During the fourth phase, the initially selected candidate models are implemented at full scale on the gathered data. The main question is whether the data is sufficient to solve the problem. There are several steps to be performed:
In some areas, false positives are more dangerous than false negatives. For example, targeting systems may inadvertently target “their own”.
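One way to make such an asymmetry explicit during model building is to score models with a cost function that weights false positives and false negatives differently, rather than with plain accuracy. The cost values below are illustrative assumptions.

```python
# Cost-sensitive comparison of two classifiers with identical accuracy (illustrative costs).
from sklearn.metrics import confusion_matrix

def weighted_error_cost(y_true, y_pred, fp_cost=10.0, fn_cost=1.0):
    """Total misclassification cost when false positives are costlier than misses."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp * fp_cost + fn * fn_cost

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
model_a = [0, 0, 0, 1, 1, 1, 1, 0]   # one false positive, one false negative
model_b = [0, 0, 0, 0, 1, 1, 0, 0]   # no false positives, two false negatives

print(weighted_error_cost(y_true, model_a))  # 11.0 -> penalised for the false positive
print(weighted_error_cost(y_true, model_b))  # 2.0  -> preferred despite equal accuracy
```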
During the fifth phase, the results must be compared against the established quality criteria and presented to those involved in the project. It is important not to show drafts outside the group of data scientists: the methods used are too complex for most of those involved, which leads to incorrect conclusions and unnecessary communication within the team. Usually, the team is biased against accepting results that falsify its hypotheses, taking them too personally; however, it is the data that leads the team to the conclusions, not the team itself. In any case, it must be verified that the results are statistically reliable; if they are not, they should not be presented. It is also essential to present all of the obtained side results, as they almost always provide additional value to the business. The general conclusions need to be complemented by sufficiently broad insight into the interpretation of the results, which is necessary for the users of the results and the decision-makers.
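One simple way to check that an observed improvement is statistically reliable before presenting it is a paired test on per-window errors of the new model against the current baseline. The error values below are illustrative, and the paired t-test is just one possible choice of test.

```python
# Paired t-test on per-window forecast errors: baseline vs. new model (illustrative numbers).
import numpy as np
from scipy import stats

baseline_errors  = np.array([4.1, 3.8, 5.0, 4.4, 4.9, 4.2, 4.6, 4.0])  # e.g. MAE per test window
new_model_errors = np.array([3.1, 3.0, 3.9, 3.2, 3.6, 3.3, 3.5, 3.1])

t_stat, p_value = stats.ttest_rel(baseline_errors, new_model_errors)
improvement = 1 - new_model_errors.mean() / baseline_errors.mean()

print(f"mean improvement: {improvement:.1%}, p-value: {p_value:.4f}")
# Report the improvement only if p_value is below the agreed significance level.
```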
The presented results are first integrated into a pilot project before full-scale implementation; the widespread roll-out follows once the pilot has been tested in the production environment. During this phase, performance gaps may require replacing, for instance, Python or R code with compiled code. Expectations for each of the roles during this phase: