Classification assigns a class label to a given object, indicating that the object belongs to the selected class or group. In contrast to clustering, the classes must already exist. In many cases, clustering might be a prior step to classification, used to discover the classes. Classification might be understood slightly differently in different contexts. However, in the context of this book, it describes the process of assigning labels of pre-existing classes to objects depending on their features.
Classification is used in almost all domains of modern data analysis, including medicine, signal processing, pattern recognition, different types of diagnostics and other more specific applications.
The classification process consists of two steps: first, an existing data sample is used to train the classification model; second, the model is used to classify unseen objects, thereby predicting the class to which each object belongs. As with any other prediction, the quality of the model output is described by the error rate, i.e., the share of wrong predictions among all predictions. Usually, objects that belong to a given class are called positive examples, while those that do not belong are called negative examples.
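The two steps can be sketched in a few lines of Python. This is only an illustration, not a method prescribed by this chapter: it assumes a nearest-centroid classifier, chosen here for brevity, and the function names `train` and `classify` as well as the sample data are hypothetical.

```python
from collections import defaultdict
from math import dist


def train(samples, labels):
    """Step 1: learn one centroid (mean feature vector) per class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, features):
    """Step 2: assign the class whose centroid is closest to the object."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical two-feature training sample with pre-existing classes.
train_x = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
train_y = ["harmless", "harmless", "spam", "spam"]

model = train(train_x, train_y)          # step 1: training
prediction = classify(model, (7.5, 9.0))  # step 2: classifying an unseen object
```

Here the unseen object lies closer to the centroid of the "spam" examples, so it is assigned to that class.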
Depending on a particular output, several cases might be identified:
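True positive (TP): an object that belongs to the class is classified as belonging to it. Example: A SPAM message is classified as SPAM, or a patient classified as being in a particular condition is, in fact, experiencing this condition.
False positive (FP): an object that does not belong to the class is classified as belonging to it. Example: A harmless message is classified as SPAM, or a patient who is not experiencing a certain condition is classified as being in this condition.
True negative (TN): an object that does not belong to the class is classified as not belonging to it. Example: A harmless message is classified as harmless, or a patient not experiencing a certain condition is classified as not experiencing it.
False negative (FN): an object that belongs to the class is classified as not belonging to it. Example: A SPAM message is classified as harmless, or a patient experiencing a certain condition is classified as not experiencing it.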
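The four cases above are commonly abbreviated TP, FP, TN and FN (true/false positives and negatives). A minimal sketch of how they might be counted from a list of actual classes and a list of predicted classes follows; the function name `count_outcomes` and the sample data are hypothetical.

```python
def count_outcomes(actual, predicted, positive):
    """Count TP, FP, TN and FN for the class treated as 'positive'."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the object really is positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the object really is positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical ground truth and model output for six messages.
actual = ["spam", "spam", "harmless", "harmless", "spam", "harmless"]
predicted = ["spam", "harmless", "harmless", "spam", "spam", "harmless"]

counts = count_outcomes(actual, predicted, positive="spam")
```

With this data, two SPAM messages are caught, one is missed, one harmless message is wrongly flagged, and two are correctly let through.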
By counting the number of samples that fall into each of the mentioned cases while training the model, it is possible to describe the model's accuracy mathematically. Here are the most commonly used statistics:
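As a sketch only, the snippet below computes four statistics under their usual textbook definitions, which are assumed here: accuracy, precision, recall and the F1 score, plus the error rate. The inputs are the counts of true positives (`tp`), false positives (`fp`), true negatives (`tn`) and false negatives (`fn`); the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common statistics from the four outcome counts."""
    total = tp + fp + tn + fn
    # Share of all predictions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Share of predicted positives that really are positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Share of real positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Share of all predictions that were wrong.
    error_rate = (fp + fn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": error_rate,
    }


# Hypothetical counts from 100 classified samples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Guarding each division against a zero denominator keeps the function usable on degenerate samples, for example when the model never predicts the positive class.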
The classification model is trained using the initial sample data, which is split into training and testing subsamples. Usually, the training is done using the following steps:
The average statistics are used to describe the model.
The model's results on the test subsample depend on different factors: noise in the data, the proportion of classes represented in the data (how evenly the classes are distributed), and others that are out of the developer's reach. However, by manipulating the sample split, it is possible to provide more data for training and thereby expect better training results, since seeing more examples might lead to a better grasp of the class features. Yet seeing too much might lead to a loss of generality (overfitting) and, consequently, lower accuracy on test subsamples or previously unseen examples. Therefore, it is necessary to maintain a good balance between the training and testing subsamples, usually 70% for training and 30% for testing, or 60% for training and 40% for testing. In real applications, if the initial data sample is large enough, a third subsample is used: a validation set that is used only once to acquire the final statistics and is not provided to developers. It usually holds a small but representative subsample of 1-5% of the initial data sample.
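A 70/30 split such as the one mentioned above might be produced as follows. This is a minimal sketch: the function name `train_test_split`, its parameters and the stand-in data are hypothetical, and shuffling before the cut is an assumption made so that both subsamples are drawn at random.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data and cut it into training and testing parts."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(samples)   # copy, so the caller's data stays untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data sample of 100 objects, represented here by their ids.
data = list(range(100))
train_set, test_set = train_test_split(data, train_fraction=0.7)
```

Passing `train_fraction=0.6` would give the 60/40 split instead.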
Unfortunately, the data sample is not large enough in many practical cases. Therefore, testing techniques are used that ensure the reliability of the statistics while respecting the scarcity of data. This family of methods is called cross-validation; it uses only the training and testing subsets, which saves data because no separate validation set has to be held out.
In the random sample case (figure 1), most of the data is used for training, and only a few randomly selected samples are used to test the model. The procedure is repeated many times to estimate the model's average accuracy. The random selection has to be made without replacement. If replacement is used, the method is called bootstrapping, which is widely used and generally gives more optimistic estimates.
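The difference between the two sampling schemes can be sketched as below. The function names are hypothetical, `evaluate` stands for any user-supplied function that trains a model on the training indices and returns its accuracy on the test indices, and testing the bootstrap model on the samples that were never drawn is an assumption of this sketch.

```python
import random


def random_subsample(n_samples, n_test, rng):
    """Pick a small test set at random WITHOUT replacement; the rest trains."""
    test_idx = set(rng.sample(range(n_samples), n_test))
    train_idx = [i for i in range(n_samples) if i not in test_idx]
    return train_idx, sorted(test_idx)


def bootstrap_sample(n_samples, rng):
    """Draw the training set WITH replacement (bootstrapping).

    Assumption: the samples that were never drawn serve as the test set.
    """
    train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train_idx)
    test_idx = [i for i in range(n_samples) if i not in drawn]
    return train_idx, test_idx


def average_accuracy(evaluate, n_samples, n_test, repeats=100, seed=0):
    """Repeat the random split many times and average the accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = random_subsample(n_samples, n_test, rng)
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)
```

With replacement, the same object can appear several times in the training set, which is one reason the resulting estimates tend to look more favourable.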
This approach splits the training set into smaller sets called splits (in figure 2, there are three splits). Then, for each split, the following steps are performed:
The overall performance of the k-fold cross-validation is the average of the performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications.
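A minimal sketch of k-fold cross-validation follows, assuming that each split serves as the test set exactly once while the remaining splits form the training set. The function names are hypothetical, and `evaluate` again stands for a user-supplied function that trains on the training indices and returns a performance score on the test indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, one per split."""
    # Spread any remainder over the first splits so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(evaluate, n_samples, k):
    """Average the score obtained on each of the k splits."""
    scores = [evaluate(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Three splits over a hypothetical sample of ten objects.
folds = list(k_fold_indices(10, 3))
```

The sketch cuts the data in its original order; in practice the sample is often shuffled first so that each split contains a mix of classes.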
This approach splits the training set into smaller sets called splits in the same way as the previous methods described here (in figure 3, there are three splits). Then, for each split, the following steps are performed:
This method requires many iterations due to the limitations of the testing set.
Within the following sub-chapters, two very widely used algorithm groups are discussed: