Differences

This shows you the differences between two versions of the page.

en:iot-reloaded:preprocessing [2024/09/25 12:44] – created agrisnik
en:iot-reloaded:preprocessing [2024/09/25 12:47] (current) agrisnik
Line 1: Line 1:
-===== Level 5 Headline =====
+===== Data preprocessing in clustering =====
 +{{:en:iot-open:czapka_p.png?50|General audience classification icon}}{{:en:iot-open:czapka_b.png?50|General audience classification icon}}{{:en:iot-open:czapka_e.png?50|General audience classification icon}}\\
  
 +Before clustering, several important preprocessing steps have to be performed:
  
-Significant data preprocessing steps before using clustering
 +  * **Check if the data is metric:** In clustering, the primary measure is (in most cases) Euclidean distance, which requires numeric data. While it is possible to encode arbitrary data using numerical values, the encoded values must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of naturally metric data are temperature, exam scores and the like. Bad examples are gender and colour, whose categories have no natural order or distance.
 +  * **Select the proper scale:** Because distance is computed across all dimensions, the values of each dimension should be on the same scale. For instance, customers' monthly incomes in euros and their credit ratios are typically on different scales – incomes in the thousands, ratios between 0 and 1. If the scales are not adjusted, the income dimension will dominate the distance estimates between points, distorting the overall clustering results. To avoid this trap, a universal scale is usually applied to all dimensions. For instance:
 +     * **Unity interval:** the factor's minimum value is subtracted from the given point's value, and the result is divided by the factor's range (maximum minus minimum), giving a value between 0 and 1.
 +     * **Z-scale:** the factor's average value is subtracted from the given point's original value, and the result is divided by the factor's standard deviation, giving values distributed around 0 with a standard deviation of 1.
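
The two scaling options above can be sketched in a few lines of Python. This is only a minimal illustration, not the page author's implementation: the function names ''unity_interval'' and ''z_scale'', the sample incomes and credit ratios, and the use of the population standard deviation are all assumptions made for the example.

```python
import math
import statistics


def unity_interval(values):
    """Rescale values to the interval [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical, so there is nothing to spread out.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def z_scale(values):
    """Standardise values: (x - mean) / standard deviation.

    Assumption: the population standard deviation is used here.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical customers: monthly income in euros and credit ratio.
incomes = [1200.0, 3400.0, 2100.0, 5000.0]
ratios = [0.10, 0.90, 0.85, 0.20]

# Points before and after rescaling each dimension to the unity interval.
raw = list(zip(incomes, ratios))
scaled = list(zip(unity_interval(incomes), unity_interval(ratios)))

# Euclidean distance between the first two customers in both spaces.
raw_distance = math.dist(raw[0], raw[1])
scaled_distance = math.dist(scaled[0], scaled[1])
```

In the raw data, the distance between the first two customers is almost exactly their income difference of 2200 euros, so the large difference in their credit ratios (0.10 versus 0.90) has practically no effect. After rescaling, both dimensions contribute on a comparable scale.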
  
-- Check if the used data is metric: In clustering, the primary measure is Euclidian distance (in most cases), which requires numeric data. While it is possible to encode some arbitrary data using numerical values, they must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of natural metric data are temperature, exam assessments or alike. Bad examples: gender, colour.  
-- Select the proper scale: For the same reasons as the distance measure, the values of each dimension should be on the same scale. For instance, customers' monthly incomes in euros and their credit ratios are typically at different scales – the incomes in thousands, while ratios between 0 and 1. If scales are not adjusted, the income dimension will dominate distance estimation among points, deforming the overall clustering results. A universal scale is usually applied to all dimensions to avoid this trap. For instance:  
-o Unity interval: a minimal factor value is substructed from the given point value and divided by the interval value, giving the result 0 to 1. 
-o Z-scale: The factor's average value is substructed from the original value of the given point and then divided by the factor's standard deviation, which provides results distributed around 0 with a standard deviation of 1.  
- 
-Summary about clustering 
-- Besides the discussed, there are many other clustering methods; however, all of them, including the discussed, require prior knowledge about the problem domain; 
-- All of the clustering methods require setting some parameters which drive the algorithms. In most cases, the value setting might not be intuitive and may require interesting fintuning.   
-- Proper data coding in clustering may provide a significant value even in complex application domains, including medicine, customer behaviour analysis, and finetuning of other data analysis algorithms.   
-- In data analysis, clustering is used among the first methods to acquire the internal structure of the data before applying more informed methods. 
  
en/iot-reloaded/preprocessing.1727268276.txt.gz · Last modified: 2024/09/25 12:44 by agrisnik
CC Attribution-Share Alike 4.0 International