This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| en:iot-reloaded:k_means [2024/12/02 21:07] – [K-Means] ktokarz | en:iot-reloaded:k_means [2024/12/10 21:36] (current) – pczekalski | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ===== K-Means ===== | ===== K-Means ===== | ||
| - | The first method discussed here is one of the most commonly used – K-Means. K-means clustering is a method that splits the initial set of points (objects) into groups, using distance measure, | + | The first method discussed here is one of the most commonly used – K-means. K-means clustering is a method that splits the initial set of points (objects) into groups, using distance measure, |
| The algorithm steps schematically are represented in the following figure {{ref> | The algorithm steps schematically are represented in the following figure {{ref> | ||
| <figure K-means_steps> | <figure K-means_steps> | ||
| - | {{ : | + | {{ : |
| - | < | + | < |
| </ | </ | ||
| In the figure: | In the figure: | ||
| - | * **STEP 1: | + | * **STEP 1: |
| - | * **STEP 2: | + | * **STEP 2: |
| - | * **STEP 3:** For each point, the closest cluster centre | + | * **STEP 3:** For each point, the closest cluster centre, which is the point marker, is selected. |
| - | * **STEP 4: | + | * **STEP 4: |
| - | * **STEP 5:** The initial cluster centre is being refined to minimise the average distance to the cluster centre | + | * **STEP 5:** The initial cluster centre is being refined to minimise the average distance to it from each cluster point. As a result, cluster centres might no longer |
| - | * **STEP 6: | + | * **STEP 6: |
| Steps 4-6 are repeated until changes in cluster positions do not change or changes are insignificant. The distance is measured using Euclidian distance: | Steps 4-6 are repeated until changes in cluster positions do not change or changes are insignificant. The distance is measured using Euclidian distance: | ||
| <figure Euclidian distance> | <figure Euclidian distance> | ||
| - | {{ {{ : | + | {{ {{ : |
| - | < | + | < |
| </ | </ | ||
| - | , where: | + | where: |
| - | * **Data points** - points {x< | + | * **Data points** - points {x< |
| * **K** – number of clusters set by the user. | * **K** – number of clusters set by the user. | ||
| - | * **r< | + | * **r< |
| - | * **m< | + | * **m< |
| - | * **D** – Squared Sum of all distances di to their assigned cluster centroids; | + | * **D** – Squared Sum of all distances di to their assigned cluster centroids. |
| - | * **Goal** is to find such values of variables r< | + | * **Goal** is to find such values of variables r< |
| Example of initial data and assigned cluster marks with cluster centres after running the K-means algorithm (figure {{ref> | Example of initial data and assigned cluster marks with cluster centres after running the K-means algorithm (figure {{ref> | ||
| <figure K-means_example_1> | <figure K-means_example_1> | ||
| - | {{ {{ : | + | {{ {{ : |
| - | < | + | < |
| </ | </ | ||
| - | Unfortunately, | + | Unfortunately, |
| Example of setting different numbers of cluster centres (figure {{ref> | Example of setting different numbers of cluster centres (figure {{ref> | ||
| <figure K-means_example_2> | <figure K-means_example_2> | ||
| - | {{ {{ : | + | {{ {{ : |
| - | < | + | < |
| </ | </ | ||
| Line 73: | Line 73: | ||
| * Other methods, like the Silhouette score, can complement the elbow method to help determine the optimal NC. | * Other methods, like the Silhouette score, can complement the elbow method to help determine the optimal NC. | ||
| - | < | + | < |
| - | {{ {{ : | + | {{ {{ : |
| - | < | + | < |
| </ | </ | ||
| - | The figure above demonstrates more and less obvious " | + | The figure above (figure {{ref> |
| ==== Silhouette Score ==== | ==== Silhouette Score ==== | ||
| Line 85: | Line 85: | ||
| The Silhouette score considers two main factors for each data point: | The Silhouette score considers two main factors for each data point: | ||
| - | * **Cohesion (a(i))** - The cohesion measure for the ith point is the average distance between the point and all other points in the same cluster. It measures the point' | + | * **Cohesion (a(i))** - The cohesion measure for the ith point is the average distance between the point and all other points in the same cluster. It measures the point' |
| - | * **Separation (b(i))** – The separation measure | + | * **Separation (b(i))** – The separation measure |
| The silhouette score for a point i is then calculated as: | The silhouette score for a point i is then calculated as: | ||
| <figure Silhouette score > | <figure Silhouette score > | ||
| - | {{ {{ : | + | {{ {{ : |
| - | < | + | < |
| </ | </ | ||
| - | ,where: | + | where: |
| * s(i) is the silhouette score for point i. | * s(i) is the silhouette score for point i. | ||
| * a(i) is the average distance from point i to all other points in the same cluster. | * a(i) is the average distance from point i to all other points in the same cluster. | ||
| - | * b(i) is the average distance from point I to all points in the nearest other cluster. | + | * b(i) is the average distance from point i to all points in the nearest other cluster. |
| - | * s(i) ≈+1 indicated that the point i is well clustered; | + | * s(i) ≈+1 indicated that the point i is well clustered. |
| - | * s(i) around 0 indicates that the point lies close to the boundary between clusters; | + | * s(i) around 0 indicates that the point lies close to the boundary between clusters. |
| - | * s(i) ≈-1 indicated that the point i was most probably wrongly assigned to the cluster; | + | * s(i) ≈-1 indicated that the point i was most probably wrongly assigned to the cluster. |
| Steps of the method: | Steps of the method: | ||
| Line 107: | Line 107: | ||
| * Computing the SC for different values of NC, typically starting from NC=2 reasonable maximum value (e.g., 10 or 20). | * Computing the SC for different values of NC, typically starting from NC=2 reasonable maximum value (e.g., 10 or 20). | ||
| * Plotting the SC values on the y-axis and the number of clusters NC on the x-axis. | * Plotting the SC values on the y-axis and the number of clusters NC on the x-axis. | ||
| - | - Observe the plot: | + | - **Observe the plot:** |
| * As the number of clusters NC increases, the SC shows different score values, which may or may not gradually decrease, as in the case of the " | * As the number of clusters NC increases, the SC shows different score values, which may or may not gradually decrease, as in the case of the " | ||
| - | * The main goal is to observe the maximum SC value and the corresponding NC value; | + | * The main goal is to observe the maximum SC value and the corresponding NC value. |
| - **Select optimal NC:** | - **Select optimal NC:** | ||
| * The value of NC at the maximum SC value is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance. | * The value of NC at the maximum SC value is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance. | ||
| Line 118: | Line 118: | ||
| * The Silhouette score might not always match intuitive or domain-specific clustering insights. | * The Silhouette score might not always match intuitive or domain-specific clustering insights. | ||
| - | An example is provided in the following figure: | + | An example is provided in the following figure |
| - | < | + | < |
| - | {{ {{ : | + | {{ {{ : |
| - | < | + | < |
| </ | </ | ||
| The user should look for the highest score, which in this case is for the 3-cluster option. | The user should look for the highest score, which in this case is for the 3-cluster option. | ||