Clustering is a family of methods within unsupervised machine learning. It finds regularities in data when no group or class label is available, relying on the structure of the data itself. Because of this, clustering is often used in a data analysis workflow before classification or other analysis steps, to find natural groups that may exist in the data.
This provides insight into the data's internal organisation: possible groups, their number and distribution, and other internal regularities that might help us better understand the data content. To illustrate, consider grouping customers by estimated income. It is natural to assume some threshold values, such as 1 kEUR per month, 10 kEUR per month, etc. However:
Customers' behaviour most probably also depends on factors such as occupation, age, total household income, and others. While the need to consider these factors is obvious, the grouping itself is not: it is unclear how exactly the different factors interact to decide which group a given customer belongs to. That is where clustering shows its strength: it reveals the natural internal structure of the data (the customers, in this example).
In this context, a cluster refers to a collection of data points aggregated together because of certain similarities [1]. Within this chapter, two different approaches to clustering are discussed:
In both cases, a distance measure is used to estimate the distance between points or objects and the density of points around a given point. Therefore, all factors used should be numerical, assuming a Euclidean space.
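As a minimal sketch of these two quantities, the Python snippet below computes pairwise Euclidean distances and counts, for each point, how many other points lie within a fixed radius. The customer values, the `radius` of 3.0, and the helper names `pairwise_euclidean` and `neighbour_counts` are illustrative assumptions, not part of any method described in this chapter.

```python
import numpy as np

# Hypothetical customers: [monthly income in kEUR, age in years].
# The values are made up purely for illustration.
points = np.array([
    [1.0, 25.0],
    [1.2, 27.0],
    [1.1, 26.0],
    [9.5, 52.0],
    [10.0, 54.0],
])


def pairwise_euclidean(x):
    """Return the matrix of Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]      # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)


def neighbour_counts(dist, radius):
    """Count the other points within `radius` of each point (self excluded)."""
    return (dist <= radius).sum(axis=1) - 1


distances = pairwise_euclidean(points)
density = neighbour_counts(distances, radius=3.0)

print(np.round(distances, 2))
print(density)
```

In this toy data the first three customers each have two neighbours within the radius and the last two have one each, which hints at two groups. Note that age spans a much larger numeric range than income here, so it dominates the distance; that is one reason the input factors need attention before clustering.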
Before starting clustering, several necessary steps have to be performed: