Decision Tree-based Classification Models

Introduction

Classification assigns a class label to a given object, indicating that the object belongs to the selected class or group. In contrast to clustering, the classes must exist beforehand. In many cases, clustering might be a prior step to classification. The term may be understood slightly differently in different contexts. However, in the context of this book, it describes the process of assigning labels of pre-existing classes to objects depending on their features.

Classification is used in almost all domains of modern data analysis, including medicine, signal processing, pattern recognition, different types of diagnostics and other more specific applications.

Interpretation of the model output

The classification process consists of two steps: first, an existing data sample is used to train the classification model, and then, in the second step, the model is used to classify unseen objects, thereby predicting to which class each object belongs. As with any other prediction, in classification, the model output is described by the error rate, i.e., how often the prediction is correct and how often it is wrong. Usually, objects that belong to a given class are called positive examples, while those that do not belong are called negative examples.

Depending on a particular output, several cases might be identified:

True positive (TP) – the object belongs to the class and is classified as a class member.

Example: A SPAM message is classified as SPAM, or a patient classified as being in a particular condition is, in fact, experiencing this condition.

False positive (FP) – the object that does not belong to the class is classified as a class member.

Example: A harmless message is classified as SPAM, or a patient who is not experiencing a certain condition is classified as being in this condition.

True negative (TN) – the object that does not belong to the class is classified as not being a class member.

Example: A harmless message is classified as harmless, or a patient not experiencing a certain condition is classified as not experiencing it.

False negative (FN) – the object that belongs to the class is classified as not belonging to it.

Example: A SPAM message is classified as harmless, or a patient experiencing a certain condition is classified as not experiencing it.
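The four cases above can be counted directly from pairs of actual and predicted labels. The following is a minimal sketch; the function name confusion_counts, the "ham" label for harmless messages, and the six example messages are assumptions made only for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN and FN, treating `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # class member classified as a member
        elif p == positive and a != positive:
            fp += 1          # non-member classified as a member
        elif p != positive and a != positive:
            tn += 1          # non-member classified as a non-member
        else:
            fn += 1          # class member classified as a non-member
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical labels for six messages ("ham" stands for a harmless message).
actual = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]

counts = confusion_counts(actual, predicted, positive="spam")
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```

Which class is treated as positive is a modelling choice; swapping it exchanges the roles of TP and TN, and of FP and FN.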

By counting the number of test samples that fall into each of the mentioned cases, it is possible to describe the model's accuracy mathematically. Here are the most commonly used statistics:

Sensitivity = TP/(TP+FN)
Specificity = TN/(FP+TN)
Positive predictive value = TP/(TP+FP)
Negative predictive value = TN/(TN+FN)
Accuracy = (TP+TN)/(TP+FP+TN+FN)
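
The statistics above translate into a few lines of code. This is a sketch only: the function name classification_statistics and the example counts are assumptions, and returning NaN for an empty denominator is one possible convention rather than a rule from the text.

```python
def classification_statistics(tp, fp, tn, fn):
    """Compute the five statistics listed above from the four case counts."""

    def ratio(numerator, denominator):
        # Assumption: report NaN when a statistic is undefined (empty denominator).
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts obtained on a test subsample of 100 objects.
stats = classification_statistics(tp=40, fp=10, tn=45, fn=5)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the accuracy is 85/100 = 0.85, while the sensitivity (40/45) is higher than the specificity (45/55), which a single accuracy figure would hide.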

Training the models

The classification model is trained using the initial sample data, which is split into training and testing subsamples. Usually, the training is done using the following steps:

1. The sample is split into training and testing subsamples.
2. The training subsample is used to train the model.
3. The test subsample is used to acquire accuracy statistics as described earlier.
4. Steps 1 – 3 are repeated several times (usually at least 10 – 25) to acquire average model statistics.

The average statistics are used to describe the model.
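
The repeated split, train and test procedure can be sketched as follows. Everything here is illustrative: the threshold "model" is a toy stand-in for a real classifier, the data are synthetic, and the names train_threshold_model, accuracy and repeated_evaluation, as well as the default train_share and repeats values, are assumptions.

```python
import random
from statistics import mean


def train_threshold_model(samples):
    """Toy stand-in for a real classifier: the midpoint between class means."""
    positives = [x for x, label in samples if label == 1]
    negatives = [x for x, label in samples if label == 0]
    return (mean(positives) + mean(negatives)) / 2


def accuracy(threshold, samples):
    """Share of samples whose side of the threshold matches their label."""
    correct = sum((x > threshold) == (label == 1) for x, label in samples)
    return correct / len(samples)


def repeated_evaluation(data, train_share=0.7, repeats=25, seed=0):
    """Repeat split, train and test, then average the test accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_share)
        train, test = shuffled[:cut], shuffled[cut:]  # split the sample
        model = train_threshold_model(train)          # train on one part
        scores.append(accuracy(model, test))          # test on the other
    return mean(scores), scores


# Synthetic one-feature data: class 0 centred at 0, class 1 centred at 2.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), 0) for _ in range(100)]
data += [(data_rng.gauss(2, 1), 1) for _ in range(100)]

average_accuracy, scores = repeated_evaluation(data)
print(f"average accuracy over {len(scores)} runs: {average_accuracy:.3f}")
```

The individual scores vary from run to run because each split is different, which is why the average, rather than any single run, is used to describe the model.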

The model's results on the test subsample depend on different factors: noise in the data, the proportion of classes represented in the data (how evenly the classes are distributed), and others that are out of the developer's reach. However, by manipulating the sample split, it is possible to provide more data for training and thereby expect better training results, since seeing more examples might lead to a better grasp of the class features. On the other hand, seeing too much might lead to a loss of generality and, consequently, lower accuracy on test subsamples or previously unseen examples. Therefore, it is necessary to maintain a good balance between training and testing subsamples, usually 70% for training and 30% for testing, or 60% for training and 40% for testing. In real applications, if the initial data sample is large enough, a third subsample is used: a validation set that is used only once to acquire final statistics and is not provided to developers. It is usually a small but representative subsample of 1-5% of the initial data sample.
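
A split into training, testing and an optional held-back validation subsample might look like the sketch below. The function name split_sample, its parameters and the choice to take the validation share first are assumptions; a plain random shuffle also does not guarantee that the validation subsample is representative of all classes, which a real application would need to check.

```python
import random


def split_sample(data, train_share=0.7, validation_share=0.0, seed=0):
    """Shuffle the sample and cut it into training, testing and validation parts.

    The validation part is taken from the whole sample first; the remainder
    is divided between training and testing according to train_share.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_share)
    validation, rest = shuffled[:n_validation], shuffled[n_validation:]
    n_train = round(len(rest) * train_share)
    return rest[:n_train], rest[n_train:], validation


# 1000 hypothetical object identifiers; 5% are held back for validation.
train, test, validation = split_sample(
    range(1000), train_share=0.7, validation_share=0.05
)
print(len(train), len(test), len(validation))  # 665 285 50
```

Fixing the seed keeps the validation subsample the same between runs, so it can be set aside once and left untouched during development.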

Unfortunately, the data sample is not large enough in many practical cases. Therefore, several testing techniques are used to ensure the reliability of statistics while respecting the scarcity of data. Together they are called cross-validation: they use only the training and testing subsets, which saves data because no separate validation set is set aside.

Random sample

In the random sample approach (figure 1), most of the data is used for training, and only a few randomly selected samples are used to test the model. The procedure is repeated many times to obtain the model's average accuracy. The random selection has to be made without replacement. If the selection is made with replacement, the method is called bootstrapping, which is widely used and generally gives more optimistic estimates.
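
The difference between drawing without and with replacement can be illustrated on sample indices. This is a sketch under assumptions: the function names are hypothetical, and testing a bootstrap round on the samples that were never drawn (often called out-of-bag samples) is one common convention that the text above does not prescribe.

```python
import random


def random_test_indices(n, test_size, rng):
    """Pick test indices without replacement; all other indices are for training."""
    test = set(rng.sample(range(n), test_size))
    train = [i for i in range(n) if i not in test]
    return train, sorted(test)


def bootstrap_indices(n, rng):
    """Pick n training indices with replacement, so some indices repeat.

    Assumption: indices that were never drawn form the test set.
    """
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


rng = random.Random(1)
train, test = random_test_indices(n=20, test_size=3, rng=rng)
print("random sample:", len(train), "training indices,", len(test), "test indices")

boot_train, boot_test = bootstrap_indices(n=20, rng=rng)
print("bootstrap:", len(set(boot_train)), "distinct training indices of", len(boot_train))
```

Either function would be called once per repetition, with the resulting scores averaged over all repetitions.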

K-folds

This approach divides the training set into k smaller sets called folds and evaluates the model over k splits, each of which uses a different fold for testing (in figure 2, there are three splits). Then, for each split, the following steps are performed:

The model is trained using k-1 folds: in figure 2, every split (row) is divided into k folds, and in the i-th split the i-th fold is set aside for testing, while the remaining k-1 folds are used for training.
The model's accuracy is assessed on the fold that was set aside for that split.

The overall performance for the k-fold cross-validation is the average of the performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications.
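
The way k-fold cross-validation rotates the testing fold can be sketched with a small index generator. The name k_fold_splits is an assumption, and the sketch deliberately does not shuffle; in practice the sample is usually shuffled before the folds are formed.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k splits."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]  # the i-th fold is set aside for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for split_number, (train, test) in enumerate(k_fold_splits(7, 3), start=1):
    print(f"split {split_number}: train={train} test={test}")
# split 1: train=[3, 4, 5, 6] test=[0, 1, 2]
# split 2: train=[0, 1, 2, 5, 6] test=[3, 4]
# split 3: train=[0, 1, 2, 3, 4] test=[5, 6]
```

Every sample is used for testing exactly once and for training k-1 times, which is how the method makes full use of a small sample.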

One out

This approach, commonly known as leave-one-out cross-validation, builds splits in the same way as the previous methods (in figure 3, there are three splits), but each split sets aside exactly one sample. Then, for each split, the following steps are performed:

The model is trained using n-1 samples, and the single remaining sample is used for testing the model's performance.
The overall performance for the one-out cross-validation is the average of the performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications.

This method requires many iterations, one per sample, because each testing set contains only a single sample.
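
A minimal sketch of the one-out procedure follows. The 1-nearest-neighbour rule is only a toy stand-in for a real classifier, and the function names and the seven one-feature samples are assumptions made for illustration.

```python
def nearest_neighbour_label(train, x):
    """Toy classifier: return the label of the closest training sample."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]


def leave_one_out_accuracy(samples):
    """Test on each sample in turn, training on the remaining n-1 samples."""
    hits = 0
    for i, (x, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # everything except sample i
        hits += nearest_neighbour_label(train, x) == label
    return hits / len(samples)


# Hypothetical (feature, class) pairs; the sample at 2.0 lies between the groups.
samples = [
    (1.0, "a"), (1.2, "a"), (0.8, "a"),
    (3.0, "b"), (3.3, "b"), (2.9, "b"), (2.0, "b"),
]

loo_accuracy = leave_one_out_accuracy(samples)
print(f"one-out accuracy: {loo_accuracy:.3f}")  # 6 of 7 correct -> 0.857
```

The loop runs n times, once per sample, which shows why the method becomes expensive for large samples when training the model is costly.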

Within the following sub-chapters, two very widely used algorithm groups are discussed:

Decision Trees - a fundamental set of methods and their variants.
Random Forests - one of the best out-of-the-box methods, widely used by data analysts.