====== Decision tree-based classification Models ======
{{:en:iot-open:czapka_p.png?50|General audience classification icon}}{{:en:iot-open:czapka_b.png?50|General audience classification icon}}{{:en:iot-open:czapka_e.png?50|General audience classification icon}}\\

===== Introduction =====
Classification assigns a class label to a given object, indicating that the object belongs to the selected class or group. In contrast to clustering, the classes must already exist. In many cases, clustering can serve as a preliminary step to classification.
The term is understood slightly differently in different contexts. In this book, it describes the process of assigning labels of pre-existing classes to objects depending on their features.

Classification is used in almost all domains of modern data analysis, including medicine, signal processing, pattern recognition, different types of diagnostics and other more specific applications.

Within this chapter, two very widely used algorithm groups are discussed:
  * [[en:iot-reloaded:Decision trees|]] - a fundamental set of methods and their variants;
  * [[en:iot-reloaded:Random forests|]] - one of the best out-of-the-box methods, widely used by data analysts.

===== Interpretation of the model output =====
The classification process consists of two steps: first, an existing data sample is used to train the classification model; second, the model is used to classify unseen objects, thereby predicting the class to which each object belongs. As with any other prediction, the output of a classification model is characterised by its error rate, i.e., the proportion of correct versus wrong predictions. Usually, objects that belong to a given class are called positive examples, while those that do not belong are called negative examples.
Depending on the outcome, four cases can be identified:
  * True positive (TP) – the object belongs to the class and is classified as a class member. **Example:** A SPAM message is classified as SPAM, or a patient classified as having a certain condition is, in fact, experiencing this condition;
  * False positive (FP) – the object does not belong to the class but is classified as a class member. **Example:** A harmless message is classified as SPAM, or a patient who is not experiencing a certain condition is classified as having it;
  * True negative (TN) – the object is classified as not being a member of the class and, in fact, is not a member. **Example:** A harmless message is classified as harmless, or a patient not experiencing a certain condition is classified as not experiencing it;
  * False negative (FN) – the object belongs to the class but is classified as not belonging to it. **Example:** A SPAM message is classified as harmless, or a patient experiencing a certain condition is classified as not experiencing it.

By counting the number of samples that fall into each of these cases, it is possible to describe the model's accuracy mathematically. The most commonly used statistics are:
  * Sensitivity = TP / (TP + FN)
  * Specificity = TN / (FP + TN)
  * Positive predictive value = TP / (TP + FP)
  * Negative predictive value = TN / (TN + FN)
  * Accuracy = (TP + TN) / (TP + FP + TN + FN)

===== Training the models =====
The classification model is trained using the initial sample data, which is split into training and testing subsamples. Usually, training follows these steps:
  - The sample is split into training and testing subsamples;
  - The training subsample is used to train the model;
  - The test subsample is used to acquire the accuracy statistics described earlier;
  - Steps 1 – 3 are repeated several times (usually at least 10 – 25) to acquire average model statistics.

The average statistics are used to describe the model.
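
The accuracy statistics acquired in step 3 can be computed directly from the four outcome counts. The following is a minimal Python sketch, not a reference implementation: the function names ''count_outcomes'' and ''classification_statistics'' and the spam/ham sample data are hypothetical and were made up for illustration.

```python
def count_outcomes(actual, predicted, positive):
    """Count TP, FP, TN and FN for one class treated as 'positive'."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # class member classified as a member
            else:
                fp += 1   # non-member classified as a member
        else:
            if a == positive:
                fn += 1   # class member classified as a non-member
            else:
                tn += 1   # non-member classified as a non-member
    return tp, fp, tn, fn


def classification_statistics(tp, fp, tn, fn):
    """Compute the five statistics from the four outcome counts."""
    def ratio(numerator, denominator):
        # A statistic is undefined when its denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical test-subsample results of a spam filter (illustration only).
actual = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]

counts = count_outcomes(actual, predicted, positive="spam")
stats = classification_statistics(*counts)
print("TP, FP, TN, FN:", counts)
for name, value in stats.items():
    print(name, "=", value)
```

Returning ''None'' for an empty denominator is a design choice of this sketch; it makes the undefined case explicit, for example when the test subsample contains no positive examples.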
The model's results on the test subsample depend on different factors: noise in the data, the proportion of classes represented in the data (how evenly the classes are distributed), and others that are beyond the developer's control. However, by changing the split of the sample, it is possible to provide more data for training and thereby expect better training results, since seeing more examples might lead to a better grasp of the class features. On the other hand, seeing too much might lead to a loss of generality and, consequently, lower accuracy on test subsamples or previously unseen examples. Therefore, it is necessary to maintain a good balance between training and testing subsamples, usually 70% for training and 30% for testing, or 60% for training and 40% for testing.

In real applications, if the initial data sample is large enough, a third subsample is used: a validation set, which is used only once to acquire the final statistics and is not provided to developers. It is usually a small but representative subsample of 1 – 5% of the initial data sample.

Unfortunately, in many practical cases the data sample is not large enough. Therefore, several testing techniques are used to ensure the reliability of the statistics while respecting the scarcity of data. Together, these techniques are called cross-validation: they use only the training and testing subsets, so no data has to be set aside for a validation set.

===== Random sample =====
{{ :en:iot-reloaded:classification_1.png?800 | | Random sample}} Random sample
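
The random-sample scheme of this section can be sketched with index lists alone. The sketch below is illustrative, not a definitive implementation: the helper names ''random_split'' and ''bootstrap_split'', the 30% test fraction and the fixed seed are assumptions. Using the never-drawn samples as the bootstrap test set is also an assumption of this sketch, not something the text prescribes.

```python
import random


def random_split(n_samples, test_fraction, rng):
    """Split indices 0..n_samples-1 at random, WITHOUT replacement.

    Every sample lands in exactly one of the two subsamples.
    """
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, test)


def bootstrap_split(n_samples, rng):
    """Draw training indices WITH replacement (bootstrapping).

    Assumption of this sketch: samples that were never drawn are
    used as the test subsample.
    """
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


rng = random.Random(42)  # fixed seed so that a run is repeatable

# Repeat the random split several times, as the procedure requires.
for repetition in range(3):
    train, test = random_split(10, test_fraction=0.3, rng=rng)
    print("split", repetition, "train:", sorted(train), "test:", sorted(test))

train, test = bootstrap_split(10, rng)
print("bootstrap train:", sorted(train), "test:", test)
```

In the bootstrap output, some indices appear in the training list more than once, which is the visible difference from selection without replacement.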
In the random sample approach, most of the data is used for training, and only a few randomly selected samples are used to test the model. The procedure is repeated many times to obtain a reliable estimate of the model's average accuracy. The random selection has to be made without replacement. If selection with replacement is used, the method is called bootstrapping, which is widely used and generally gives more optimistic estimates.

===== K-folds =====
{{ :en:iot-reloaded:classification_2.png?800 | | K-folds}} K-folds
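
The k-fold scheme of this section can be sketched as an index generator plus an averaging step. This is a minimal sketch under stated assumptions: the names ''k_fold_splits'' and ''cross_validate'' are hypothetical, and ''evaluate'' stands for any user-supplied function that trains a model on the training indices and returns a score for the test indices.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k splits.

    In split i, the i-th fold is used for testing and the remaining
    k-1 folds are used for training.
    """
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over all splits."""
    scores = [evaluate(train, test)
              for train, test in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)


# Show which indices are used for training and testing with k = 3.
for train, test in k_fold_splits(6, 3):
    print("train:", train, "test:", test)
```

The sketch keeps the samples in their original order so that the folds are easy to follow; with real data, the samples are usually shuffled before the folds are formed.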
This approach evaluates the model over several splits of the training set (in the figure above, there are three splits). Every split (a row in the figure) is divided into k folds, and the fold used for testing changes from split to split. For each split, the following steps are performed:
  * The model is trained using k-1 folds: in the i-th split, the i-th fold is set aside for testing, while the remaining k-1 folds are used for training;
  * The model's accuracy is assessed using the fold that was set aside.

The overall performance of the k-fold cross-validation is the average of the individual performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications.

===== One out =====
{{ :en:iot-reloaded:classification_3.png?800 | | One out}} One out
This approach splits the training set in the same way as the methods described above (in the figure above, there are three splits), except that each test set consists of a single sample. For each split, the model is trained using n-1 samples, and only one sample is used for testing the model's performance.

The overall performance of the one-out cross-validation is the average of the individual performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications. This method requires many iterations because each split tests only one sample, so n splits are needed to test every sample once.
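
The one-out procedure can be sketched as a loop over all n samples. The sketch below is illustrative only: the 1-nearest-neighbour classifier is a stand-in model chosen for brevity (this chapter is about decision trees), and the function names and the one-dimensional sample data are hypothetical.

```python
def nearest_neighbour_predict(train_points, train_labels, point):
    """Toy 1-nearest-neighbour classifier on 1-D features (stand-in model)."""
    nearest = min(range(len(train_points)),
                  key=lambda i: abs(train_points[i] - point))
    return train_labels[nearest]


def leave_one_out_accuracy(points, labels):
    """Train on n-1 samples, test on the remaining one, average over all n."""
    correct = 0
    for i in range(len(points)):
        # Everything except sample i forms the training subsample.
        train_points = points[:i] + points[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        prediction = nearest_neighbour_predict(
            train_points, train_labels, points[i])
        correct += prediction == labels[i]
    return correct / len(points)


# Hypothetical 1-D measurements with two classes (illustration only).
points = [1.0, 1.2, 1.9, 3.1, 3.4, 2.2]
labels = ["low", "low", "low", "high", "high", "high"]

loo_accuracy = leave_one_out_accuracy(points, labels)
print("one-out accuracy:", loo_accuracy)
```

The loop runs n times, once per sample, which illustrates why this method needs more iterations than k-folds with a small k.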