Table of Contents

K-Means


The first method discussed here is one of the most commonly used: K-Means. K-means clustering splits an initial set of points (objects) into groups using a distance measure, which represents the distance from a given point to the group's centre, the centroid, which serves as the group's prototype. The result of the clustering is N points grouped into K clusters, where each point is assigned the index of the cluster whose centroid is closer to it than the centroid of any other cluster. The distance measure employed is the Euclidean distance, which requires scaled or normalised data to avoid the dominance of a single dimension over the others. The algorithm's steps are represented schematically in the following figure:

Figure 1: K-means steps

In the figure, steps 4-6 are repeated until the cluster centres no longer change or the changes become insignificant. The distance is measured using the Euclidean distance:

Figure 2: Euclidean distance

where:
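
The assignment and update steps shown in the figure can be sketched in pure Python. This is a minimal illustration rather than an optimised implementation; the function names, the random choice of initial centres, and the fixed iteration cap are assumptions of this example, not part of the text:

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # pick K random initial centres
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members))
                                 if members else centroids[j])
        if new_centroids == centroids:                # converged: centres stopped moving
            break
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
labels, centroids = kmeans(points, k=2)
```

On this toy data the two points near the origin end up in one cluster and the two points near (5, 5) in the other, regardless of which initial centres are sampled.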

An example of initial data with assigned cluster labels and cluster centres after running the K-means algorithm:

Figure 3: K-means example with two clusters

Unfortunately, the K-Means algorithm has no automatic mechanism for selecting the number of clusters K; the user must set it. An example of setting different numbers of cluster centres:

Figure 4: K-means example with three clusters

Elbow method

In K-means clustering, a practical technique called the Elbow method is used to select the number of clusters. It is based on finding the point at which adding more clusters no longer significantly improves the model's performance. As explained above, K-means clustering minimises the sum of squared errors (SSE), i.e., the squared distances between each point and its corresponding cluster centroid. Since the optimal number of clusters (NC) is not known initially, the NC is increased iteratively. The SSE decreases as the number of clusters increases because the distances to the cluster centres also decrease. However, there is a point where the improvement in SSE diminishes significantly. This point is referred to as the "elbow" [1].

Steps of the method:

  1. Plot SSE against the number of clusters:
    • Computing the SSE for different values of NC, typically from NC=2 up to a reasonable maximum value (e.g., 10 or 20).
    • Plotting the SSE values on the y-axis and the number of clusters NC on the x-axis.
  2. Observe the plot:
    • As the number of clusters NC increases, the SSE will decrease because clusters become more specialised.
    • Initially, adding clusters will result in a significant drop in SSE.
    • After a certain point, the reduction in SSE slows down and no longer shows a significant drop.
  3. The “elbow” point:
    • The point on the curve where the rate of decrease sharply levels off forms the “elbow.”
    • This is where adding more clusters beyond this point doesn't significantly reduce SSE, indicating that the clusters are likely well-formed.
  4. Select optimal NC:
    • The value of NC at the elbow point is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance.

Since the method requires running the K-means algorithm repeatedly, which may be resource-demanding, a sample of the data may be used to determine the NC first, after which K-means is run on the whole dataset.
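
The steps above can be sketched in pure Python. This is a minimal illustration, assuming a basic K-means with deterministic farthest-point seeding (so the run is reproducible); the function names are chosen for this example:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans_sse(points, k, max_iter=100):
    """Run a basic K-means and return the final sum of squared errors (SSE)."""
    centroids = [points[0]]
    while len(centroids) < k:             # farthest-point initialisation
        centroids.append(max(points,
                             key=lambda p: min(euclidean(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                  for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members))
                                 if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(euclidean(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Three tight, well-separated groups: the true NC is 3.
data = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5),
        (10, 0), (10, 0.1), (10.1, 0)]
sse = {k: kmeans_sse(data, k) for k in range(2, 6)}
```

On this toy data the SSE drops sharply from NC=2 to NC=3 and then flattens, so the "elbow" suggests NC=3.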

Limitations:

Figure 5: Elbow example on two synthetic data sets

The figure above demonstrates a more obvious and a less obvious "elbow", for which the user could select a number of clusters equal to 3 or 4, respectively.

Silhouette Score

The Silhouette Score is a metric used to evaluate the quality of a clustering result. It measures how similar an object (point) is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from −1 to +1, where higher values indicate better-defined clusters [2].

The Silhouette score considers two main factors for each data point:

    • a(i), the mean distance from point i to all other points in its own cluster (cohesion);
    • b(i), the lowest mean distance from point i to the points of any other cluster (separation).

The silhouette score for a point i is then calculated as:

Figure 6: Silhouette score

where:

Steps of the method:

  1. Plot silhouette score (SC) against the number of clusters:
    • Computing the SC for different values of NC, typically from NC=2 up to a reasonable maximum value (e.g., 10 or 20).
    • Plotting the SC values on the y-axis and the number of clusters NC on the x-axis.
  2. Observe the plot:
    • As the number of clusters NC increases, the SC takes different values, which, unlike the SSE in the elbow method, do not necessarily decrease gradually.
    • The main goal is to find the maximum SC value and the corresponding NC value.
  3. Select optimal NC:
    • The value of NC at the maximum SC value is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance.
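
The score can be computed directly from its definition. A minimal pure-Python sketch (the function names are chosen for this example), evaluating a good and a deliberately bad labelling of two well-separated groups:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_score(points, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    clusters = {}
    for idx, lbl in enumerate(labels):
        clusters.setdefault(lbl, []).append(idx)
    scores = []
    for i, (p, lbl) in enumerate(zip(points, labels)):
        same = [points[j] for j in clusters[lbl] if j != i]
        if not same:                      # singleton cluster: s(i) = 0 by convention
            scores.append(0.0)
            continue
        a = sum(euclidean(p, q) for q in same) / len(same)            # cohesion
        b = min(sum(euclidean(p, points[j]) for j in members) / len(members)
                for l2, members in clusters.items() if l2 != lbl)     # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette_score(points, [0, 0, 1, 1])   # clusters match the two groups
bad = silhouette_score(points, [0, 1, 0, 1])    # clusters mix the two groups
```

The well-separated labelling scores close to +1, while the mixed labelling scores below zero.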

Limitations:

An example is provided in the following figure:

Figure 7: Silhouette example on a synthetic data set

The user should look for the highest score, which in this case is for the 3-cluster option.


[1] Robert L. Thorndike (December 1953). "Who Belongs in the Family?". Psychometrika. 18 (4): 267–276. doi:10.1007/BF02289263.
[2] Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.