===== DBSCAN =====
{{:en:iot-open:czapka_p.png?50|General audience classification icon}}{{:en:iot-open:czapka_b.png?50|General audience classification icon}}{{:en:iot-open:czapka_e.png?50|General audience classification icon}}\\
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) employs density measures to mark points in high-density regions and those in low-density regions – the noise. Because of this natural behaviour of the algorithm, it is particularly useful in signal processing and similar application domains.
((Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.). A density-based algorithm for discovering clusters in large spatial databases with noise (PDF). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. CiteSeerX 10.1.1.121.9220. ISBN 1-57735-004-9.)).
One of the essential concepts is the point's p neighbourhood, which is the set of points reachable within the user-defined distance eps (epsilon):
, where:
* **p** – the point of interest;
* **N(p)** – neighbourhood of the point p;
* **q** – any other point;
* **distance(p,q)** – Euclidian distance between points q and p;
* **eps** – epsilon – user-defined distance constant;
* **D** – the initial set of points available for the algorithm;
The algorithm treats different points differently depending on density and neighbouring points distribution around the point – its neighbourhood:
- **Core Points:**
* A point is a core point if it has at least MinPts neighbours within a distance eps, where MinPts and eps are user-defined parameters, i.e. N(p) ≥ MinPts.
- **Directly Density-Reachable points:**
* A point is directly density-reachable from a core point if it lies within the distance eps of the core point.
- **Border Points:**
* A border point is not a core point but within the eps distance of a core point. Border points are part of a cluster but do not have enough neighbours to be core points.
- **Noise points:**
* Points that are not core and are not reachable from any core point are considered noise or outliers.
* **The main steps in DBSCAN:**
* Selects a point from the data set – it might be the first in the data set or any randomly selected;
* If it is a core point, form a cluster by grouping it with all directly density-reachable points.
* Move to the next unvisited point and return to step 1.
* Border points are added to the nearest cluster, and points that are not reachable from any core point are marked as noise.
* **Advantages:**
* Can detect clusters of arbitrary shape.
* Naturally identifies outliers or noise.
* Unlike K-means, DBSCAN does not require specifying the number of clusters upfront.
* **Disadvantages:**
* Results depend heavily on the choice of eps and MinPts.
* It struggles with clusters of varying densities since eps is fixed.
DBSCAN is great for discovering clusters in data with noise, especially when clusters are not circular or spherical.
Some application examples:
A typical application in signal processing:
==== Selecting eps and MinPts values ====
Usually, MinPts is selected using some prior knowledge of the data and its internal structure. If it is done, the following steps might be applied:
* Unordered List ItemCalculate the average distance between every point and its k-nearest neighbours, where k = MinPts;
* The average distances are sorted and depicted on a chart, where x – is the index of the sorted average distance, y – is the distance value.
* The optimal eps value is when y increases rapidly, as shown in the following picture on artificial sample data.
The red horizontal line shows a possible eps value, around 0,045.