Differences

This shows you the differences between two versions of the page.

Link to this comparison view

--- en:iot-reloaded:k_means [2024/12/02 21:07] – [K-Means] ktokarz
+++ en:iot-reloaded:k_means [2024/12/10 21:36] (current) – pczekalski
@@ Line 1: / Line 1: @@
 ===== K-Means =====
-The first method discussed here is one of the most commonly used – K-Means. K-means clustering is a method that splits the initial set of points (objects) into groups, using distance measure, which represents a distance from the given point of the group to the group's centre representing a group's prototype - centroid. The result of the clustering is N points grouped into K clusters, where each point has assigned a cluster index, which means that the distance from the point of the cluster centroid is closer than the distance to any other centroids of other clusters. Distance measure employs Euclidian distance, which requires scaled or normalised data to avoid the dominance of a single dimension over others.
+The first method discussed here is one of the most commonly used – K-means. K-means clustering is a method that splits the initial set of points (objects) into groups, using distance measure, representing a distance from the given point of the group to the group's centre representing a group's prototype: centroid. The result of the clustering is N points grouped into K clusters, where each point has assigned a cluster index, which means that the distance from the point of the cluster centroid is closer than the distance to any other centroids of other clusters. Distance measure employs Euclidian distance, which requires scaled or normalised data to avoid the dominance of a single dimension over others.
 The algorithm steps schematically are represented in the following figure {{ref>K-means_steps}}:
 <figure K-means_steps>
-{{ :en:iot-reloaded:clustering_1.png?600 | |  K-means steps}}
+{{ :en:iot-reloaded:clustering_1.png?600 | K-means Steps}}
-<caption> K-means steps </caption>
+<caption> K-means Steps </caption>
 </figure>
 In the figure:
-  * **STEP 1:**	Initial data set, where points do not belong to any of the clusters;
+  * **STEP 1:**	Initial data set where points do not belong to any of the clusters.
-  * **STEP 2:**	Cluster initial centers are selected randomly;
+  * **STEP 2:**	Cluster initial centres are selected randomly.
-  * **STEP 3:**	For each point, the closest cluster centre is selected, which is the point marker;
+  * **STEP 3:**	For each point, the closest cluster centre, which is the point marker, is selected.
-  * **STEP 4:**	Cluster mark is assigned to each point;
+  * **STEP 4:**	Cluster mark is assigned to each point.
-  * **STEP 5:**	The initial cluster centre is being refined to minimise the average distance to the cluster centre from each cluster point. As a result, cluster centres might not be physical points any more; instead, they become imaginary.
+  * **STEP 5:**	The initial cluster centre is being refined to minimise the average distance to it from each cluster point. As a result, cluster centres might no longer be physical points; instead, they become imaginary.
-  * **STEP 6:**	Cluster marks of the points are updated;
+  * **STEP 6:**	Cluster marks of the points are updated.
 Steps 4-6 are repeated until changes in cluster positions do not change or changes are insignificant. The distance is measured using Euclidian distance:
 <figure Euclidian distance>
-{{ {{ :en:iot-reloaded:clustereq1.png?300 |  Euclidian distance}}
+{{ {{ :en:iot-reloaded:clustereq1.png?300 |  Euclidian Distance}}
-<caption> Euclidian distance </caption>
+<caption> Euclidian Distance </caption>
 </figure>
-, where:
+where:
-  * **Data points** - points {x<sub>i</sub>}, i = 1, ... ,N   in multi-dimensional Euclidian space i.e. each point is vector;
+  * **Data points** - points {x<sub>i</sub>}, i = 1, ... ,N   in multi-dimensional Euclidian space i.e. each point is a vector.
   * **K** – number of clusters set by the user.
-  * **r<sub>nk</sub>** – an indicator variable with values {0,1} – indicates if data point x<sub>n</sub> belongs to cluster k;
+  * **r<sub>nk</sub>** – an indicator variable with values {0,1} – indicates if data point x<sub>n</sub> belongs to cluster k.
-  * **m<sub>k</sub>** – centroid of kth  cluster;
+  * **m<sub>k</sub>** – centroid of kth  cluster.
-  * **D** – Squared Sum of all distances di to their assigned cluster centroids;
+  * **D** – Squared Sum of all distances di to their assigned cluster centroids.
-  * **Goal** is to find such values of variables r<sub>nk</sub> and m k to minimise D;
+  * **Goal** is to find such values of variables r<sub>nk</sub> and m k to minimise D.
 Example of initial data and assigned cluster marks with cluster centres after running the K-means algorithm (figure {{ref>K-means_example_1}}):
 <figure K-means_example_1>
-{{ {{ :en:iot-reloaded:Clustering_2.png?900 |  K-means example}}
+{{ {{ :en:iot-reloaded:Clustering_2.png?900 |  K-means Example with Two Clusters}}
-<caption> K-means example with two clusters </caption>
+<caption> K-means Example with Two Clusters </caption>
 </figure>
-Unfortunately, the K-Menas algorithm does not possess automatic mechanisms to select the number of clusters K, i.e., the user must set it.
+Unfortunately, the K-means algorithm does not possess automatic mechanisms to select the number of clusters K, i.e., the user must set it.
 Example of setting different numbers of cluster centres (figure {{ref>K-means_example_2}}):
 <figure K-means_example_2>
-{{ {{ :en:iot-reloaded:Clustering_3.png?900 |  K-means example}}
+{{ {{ :en:iot-reloaded:Clustering_3.png?900 | K-means Example with Three Clusters}}
-<caption> K-means example with three clusters </caption>
+<caption> K-means Example with Three Clusters </caption>
 </figure>
@@ Line 73: / Line 73: @@
   * Other methods, like the Silhouette score, can complement the elbow method to help determine the optimal NC.
-<figure Elbow example>
+<figure Elbow_example>
-{{ {{ :en:iot-reloaded:Clustering_4.png?600 |  Elbow example}}
+{{ {{ :en:iot-reloaded:Clustering_4.png?600 | Elbow Example on Two Synthetic Data Sets}}
-<caption> Elbow example on two synthetic data sets </caption>
+<caption> Elbow Example on Two Synthetic Data Sets </caption>
 </figure>
-The figure above demonstrates more and less obvious "elbows", where users could select the number of clusters equal to 3 or 4.
+The figure above (figure {{ref>Elbow_example}}) demonstrates more and less obvious "elbows", where users could select the number of clusters equal to 3 or 4.
 ==== Silhouette Score ====
@@ Line 85: / Line 85: @@
 The Silhouette score considers two main factors for each data point:
-  * **Cohesion (a(i))** - The cohesion measure for the ith point is the average distance between the point and all other points in the same cluster. It measures the point's proximity to other points in its cluster. A low a(i) value indicates that the point is tightly grouped with other points in the same cluster;
+  * **Cohesion (a(i))** - The cohesion measure for the ith point is the average distance between the point and all other points in the same cluster. It measures the point's proximity to other points in its cluster. A low a(i) value indicates that the point is tightly grouped with other points in the same cluster.
-  * **Separation (b(i))** – The separation measure fo the ith point estimates the average distance between the point and points in the nearest neighbouring cluster - the cluster that is not its own but is closest to it. A large value for b(i) indicates that the point is far away from the closest other cluster, meaning it is well-separated;
+  * **Separation (b(i))** – The separation measure for the ith point estimates the average distance between the point and points in the nearest neighbouring cluster - the cluster that is not its own but is closest to it. A large value for b(i) indicates that the point is far away from the closest other cluster, meaning it is well-separated.
 The silhouette score for a point i is then calculated as:
 <figure Silhouette score >
-{{ {{ :en:iot-reloaded:ClusterEq2.png?300 |  Silhouette score}}
+{{ {{ :en:iot-reloaded:ClusterEq2.png?300 | Silhouette Score}}
-<caption> Silhouette score </caption>
+<caption> Silhouette Score </caption>
 </figure>
-,where:
+where:
   * s(i) is the silhouette score for point i.
   * a(i) is the average distance from point i to all other points in the same cluster.
-  * b(i) is the average distance from point I to all points in the nearest other cluster.
+  * b(i) is the average distance from point i to all points in the nearest other cluster.
-  * s(i) ≈+1 indicated that the point i is well clustered;
+  * s(i) ≈+1 indicated that the point i is well clustered.
-  * s(i) around 0 indicates that the point lies close to the boundary between clusters;
+  * s(i) around 0 indicates that the point lies close to the boundary between clusters.
-  * s(i) ≈-1 indicated that the point i was most probably wrongly assigned to the cluster;
+  * s(i) ≈-1 indicated that the point i was most probably wrongly assigned to the cluster.
 Steps of the method:
@@ Line 107: / Line 107: @@
     * Computing the SC for different values of NC, typically starting from NC=2 reasonable maximum value (e.g., 10 or 20).
     * Plotting the SC values on the y-axis and the number of clusters NC on the x-axis.
-  - Observe the plot:
+  - **Observe the plot:**
     * As the number of clusters NC increases, the SC shows different score values, which may or may not gradually decrease, as in the case of the "elbow" method.
-    * The main goal is to observe the maximum SC value and the corresponding NC value;
+    * The main goal is to observe the maximum SC value and the corresponding NC value.
   - **Select optimal NC:**
     * The value of NC at the maximum SC value is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance.
@@ Line 118: / Line 118: @@
   * The Silhouette score might not always match intuitive or domain-specific clustering insights.
-An example is provided in the following figure:
+An example is provided in the following figure {{ref>Silhouette_example}}:
-<figure Silhouette example>
+<figure Silhouette_example>
-{{ {{ :en:iot-reloaded:Clustering_5.png?400 |  Elbow example}}
+{{ {{ :en:iot-reloaded:Clustering_5.png?400 | Silhouette Example on a Synthetic Data Set}}
-<caption> Silhouette example on a synthetic data set </caption>
+<caption> Silhouette Example on a Synthetic Data Set </caption>
 </figure>
 The user should look for the highest score, which in this case is for the 3-cluster option.

en/iot-reloaded/k_means.1733173646.txt.gz · Last modified: 2024/12/02 21:07 by ktokarz