In the article on data valorization we began a brief journey through the many ways Machine Learning techniques can be combined with Internet of Things technologies.
The applications arising from this combination are among the most interesting and advanced in the Industry 4.0 world, and interest in their operational and economic implications keeps growing.
Continuing down this path, after discussing how to prepare data for Machine Learning, we saw that extensive IoT data collection alone does not guarantee the success of the predictive project one wishes to undertake: what matters is the ability to extract the crucial moments and aspects of the process being monitored.
Last but not least, to extract knowledge from past experience, the information collected must truly tell the story of what happened: as in any project where data must turn into value, the theme of data quality cannot be overlooked.
IoT data cleaning: what is it?
Data Cleaning (also referred to as Data Cleansing) is a topic as frequently mentioned as it is vast to address: why is it so important?
As in previous articles, we will try to understand the main critical issues that can arise when first approaching a large amount of IoT data, in order to see where Data Cleansing can intervene and which situations it can resolve.

Throughout the time the sensors have been collecting data from the field, events of any kind may have occurred and introduced defects into the plant's data records:
- The time series of a monitored variable may contain missing or significantly out-of-range values, creating gaps in the complete trend of the phenomenon (a simple detection sketch follows this list);
- Some characteristics of the sensor, internal or external, may have drifted over time. If the monitoring system is not properly hardened against these changes, values sampled at different times no longer map unambiguously to the same physical phenomenon.
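To make the first case concrete, the sketch below flags missing and out-of-range samples in a sensor time series. It is a minimal illustration, assuming a pandas DataFrame with a timestamp index; the column name "temperature_c" and the plausibility bounds are purely illustrative, not taken from the article.

```python
import pandas as pd
import numpy as np

def flag_suspect_samples(df: pd.DataFrame,
                         column: str = "temperature_c",
                         lower: float = -20.0,
                         upper: float = 120.0) -> pd.DataFrame:
    """Mark missing and physically implausible samples in an IoT time series."""
    out = df.copy()
    out["is_missing"] = out[column].isna()
    # Out-of-range only applies to values that are actually present.
    out["is_out_of_range"] = ~out[column].between(lower, upper) & ~out["is_missing"]
    return out

# Example usage: a 1-minute sampled series containing one gap and one implausible value.
idx = pd.date_range("2023-01-01 08:00", periods=6, freq="1min")
series = pd.DataFrame({"temperature_c": [21.3, 21.4, np.nan, 21.6, 250.0, 21.8]}, index=idx)
print(flag_suspect_samples(series))
```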
These two simple examples describe situations that appear similar but differ profoundly in terms of how far data cleaning techniques can intervene.
The cases described in the first example do not raise great concern about the proper functioning of the apparatus: the key is to identify them and apply the appropriate techniques to reconstruct the complete history.
The second example, instead, draws attention to the inescapable need for the good health of the apparatus producing the information, on which its reliability depends.
If this pillar gives way, no inference drawn from any analysis can be statistically justified any longer.
Examples of Data Cleaning Approaches on IoT Data
In some cases the situation is easily manageable: think of a slowly varying quantity, as temperature trends often are. Even in the presence of a few gaps in sampling, we can estimate the average trend with good accuracy.
If, on the other hand, we wanted to reconstruct the continuity of the original time series, a suitable interpolation would be enough to obtain a sequence faithful to reality.
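As a simple illustration of this kind of reconstruction, the following sketch fills short gaps in a slowly varying temperature series with time-based interpolation. The sampling rate, values and gap limit are assumptions made for the example, not prescriptions from the article.

```python
import pandas as pd
import numpy as np

# A slow variable sampled every minute, with a short blackout of two samples.
idx = pd.date_range("2023-01-01 08:00", periods=8, freq="1min")
temperature = pd.Series([21.0, 21.1, np.nan, np.nan, 21.5, 21.6, 21.6, 21.7], index=idx)

# Time-based linear interpolation, limited to gaps of at most 3 samples
# so that long blackouts are not silently invented.
reconstructed = temperature.interpolate(method="time", limit=3)
print(reconstructed)
```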
Other situations are more delicate and call for a more focused approach: think of the behaviour of a current during a transient phase, where instantaneous features such as spikes and derivatives often carry strong indications of possible process anomalies.
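For this more delicate case, the hypothetical sketch below shows why blunt smoothing or interpolation is not enough: it computes a discrete derivative of a current signal and flags samples whose rate of change exceeds a threshold, so that genuine spikes are preserved as features rather than cleaned away. The signal values, sampling period and threshold are assumptions for illustration only.

```python
import pandas as pd
import numpy as np

# A current signal during a transient: the spike near the middle is informative,
# so it must not be treated as noise and interpolated away.
idx = pd.date_range("2023-01-01 08:00:00", periods=8, freq="100ms")
current_a = pd.Series([2.0, 2.1, 2.1, 2.2, 9.5, 2.3, 2.2, 2.2], index=idx)

# Discrete derivative in amperes per second.
dt_s = idx.to_series().diff().dt.total_seconds()
derivative = current_a.diff() / dt_s

# Flag fast transients; the 20 A/s threshold is an illustrative assumption.
spikes = derivative.abs() > 20.0
print(pd.DataFrame({"current_a": current_a, "dA_dt": derivative, "spike": spikes}))
```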
The IoT Cloud Data Lake as a Strategic Choice
As seen above, there are situations where no data cleaning technique can reconstruct a consistent sequence of values from the collected data.
That is why the first fundamental condition for the successful application of Machine Learning algorithms lies in the quality of the digitization process of the production line.
Today, a strategic choice is certainly to undertake the building of a data lake according to the most advanced methodologies, leveraging the right platforms, so that the power of data can be harnessed as soon as possible.
The use of platforms provided by major Cloud players (SAP, Google, AWS) can both guarantee universally recognized solution quality and be the fastest way to achieve an efficient and reliable data lake.