Decision trees are the most commonly used base technique in classification. To describe the idea of decision trees, consider the simple data set presented in figure 1:
In this data set, x_n indicates the n-th observation; each column refers to a particular factor, while the last column, “Call for technical assistance,” refers to the class variable with values Yes or No.
To build a decision tree for the given problem of calling for technical assistance, one might construct a tree in which each path from the root to a leaf represents a separate example x_n, with the complete set of factors and their values corresponding to that example. This solution would provide the necessary outcome – all examples would be classified correctly. However, there are two significant problems:
Referring to Occam's razor principle [1], the most desirable model is the most compact one, i.e., one that uses only the factors necessary to make a valid decision. This means that one needs to select the most relevant factor, then the next most relevant one, and so on until the decision can be made without a doubt.
In figure 2 above, on the left, the factor “The engine is running” is considered, which has two possible outcomes: Yes and No. For the outcome Yes, the target class variable has an equal number of positive (Yes) and negative (No) values, which does not help much in deciding, since the split is still 50/50. The same is true for the outcome No. So, checking whether the engine works does not bring the decision any closer.
The right side of figure 2 considers a different factor with similar possible outcomes: “There are small children in the car.” For the outcome No, all the examples have the same class value, No, which makes this outcome ideal for deciding, since there is no variability in the class variable. The situation is slightly less clear-cut for the outcome Yes, which yields six examples with a positive class value and one with a negative. While there is some variability, it is much smaller than for the previously considered factor.
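To make the comparison concrete, the class distribution produced by such a split can be inspected directly. The sketch below uses a small made-up sample in the spirit of figure 1; the exact rows, counts for the No branch, and column names are assumptions, not the original data.

```python
from collections import Counter, defaultdict

# A made-up sample in the spirit of figure 1 (not the original data).
# Each row holds the value of the "small children in the car" factor
# and the class variable "Call for technical assistance".
rows = [
    {"children": "Yes", "call": "Yes"},
    {"children": "Yes", "call": "Yes"},
    {"children": "Yes", "call": "Yes"},
    {"children": "Yes", "call": "Yes"},
    {"children": "Yes", "call": "Yes"},
    {"children": "Yes", "call": "Yes"},
    {"children": "Yes", "call": "No"},
    {"children": "No",  "call": "No"},
    {"children": "No",  "call": "No"},
    {"children": "No",  "call": "No"},
]

# Group the class values by the factor's outcome and count them.
by_outcome = defaultdict(Counter)
for row in rows:
    by_outcome[row["children"]][row["call"]] += 1

for outcome, counts in by_outcome.items():
    print(outcome, dict(counts))
# Yes {'Yes': 6, 'No': 1}   <- little variability
# No  {'No': 3}             <- no variability at all
```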
In this simple example, it is obvious that checking whether there are children in the car is more effective than checking the engine status. However, a quantitative measure is needed to assess the potential effectiveness of a given factor. In 1986, Ross Quinlan proposed the ID3 algorithm [2], which employs an Entropy measure:

$$E(D) = -\sum_{c=1}^{C} p(c)\,\log_2 p(c)$$

where:
E(D) - Entropy for a given data set D.
C - Total number of values c of the class variable in the given data set D.
p(c) - The proportion of examples with class value c to the total number of examples in D.
E(D) = 0 when only one class value is represented (the most desirable case), and E(D) = 1 when the two class values are evenly distributed in D (the least desirable case).
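A minimal sketch of this entropy measure, applied to the two splits discussed above, might look as follows (the function name and the 50/50 branch size are assumptions made for illustration; the six-to-one counts come from figure 2):

```python
from collections import Counter
from math import log2

def entropy(class_values):
    """E(D): sum over class values c of -p(c) * log2(p(c))."""
    counts = Counter(class_values)
    total = len(class_values)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

# The "engine is running = Yes" branch: an even 50/50 split.
print(entropy(["Yes", "Yes", "No", "No"]))   # 1.0  (least desirable)

# The "small children = Yes" branch: six positives, one negative.
print(entropy(["Yes"] * 6 + ["No"]))         # ~0.59

# A branch with a single class value only.
print(entropy(["No", "No", "No"]))           # 0.0  (most desirable)
```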
To select a particular factor, it is necessary to estimate how much uncertainty is removed from the data set after applying a specific factor (test). Quinlan proposed using Information gain:

$$IG(D, A) = E(D) - \sum_{t \in T} p(t)\,E(t)$$

where:
IG(D,A) - Information gain of the dataset D, when factor A is applied to split it into subsets.
E(D) - Entropy for a given data set D.
T - Subsets of D, created by applying factor A.
p(t) - The proportion of examples in subset t to the total number of examples in D.
E(t) - Entropy of subset t.
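Continuing the sketch above (and reusing the entropy function and the assumed rows from the earlier snippets), information gain and the greedy choice of the splitting factor described in the next paragraph could be sketched as:

```python
from collections import defaultdict

def information_gain(rows, factor, class_name="call"):
    """IG(D, A) = E(D) - sum over subsets t of p(t) * E(t)."""
    parent = entropy([row[class_name] for row in rows])
    # Split D into subsets, one per value of the chosen factor.
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[factor]].append(row[class_name])
    weighted = sum(len(t) / len(rows) * entropy(t) for t in subsets.values())
    return parent - weighted

def best_factor(rows, factors, class_name="call"):
    """Greedy selection step: pick the factor with the largest information gain."""
    return max(factors, key=lambda f: information_gain(rows, f, class_name))
```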
The factor with the most significant information gain is selected to split the data set into subsets. Each subset is then split further in the same way. The procedure continues until each subset has zero entropy or no factors remain to test. In essence, the approach is a greedy search with a single hypothesis that is refined at each iteration. It uses statistics from the entire data set, which makes it relatively robust to missing values, contradictions, or errors. Since the algorithm seeks the best-fitting decision tree, it may fall into a local-minimum trap, where generalisation is lost. To avoid such solutions, it is necessary to simplify or generalise the decision tree. There are two common approaches:
However, knowing the best factor for splitting the data set is not always enough, because estimating a factor's value has a cost of its own. For instance, in the medical domain, the most effective diagnostic methods may also be the most expensive and, therefore, not always the most appropriate. Over time, several alternatives to information gain have been developed that take the cost of factor value estimation into account:
Alternative 1:
Alternative 2:
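The specific cost-sensitive criteria are given above as formulas and are not reproduced here. Purely to illustrate how measurement costs could enter the selection step, the sketch below uses one simple hypothetical criterion, information gain divided by the factor's cost; it is not necessarily identical to either of the alternatives listed, and the cost values are made up.

```python
def cost_weighted_score(rows, factor, cost, class_name="call"):
    # Hypothetical criterion for illustration only: trade the information
    # gained off against the cost of measuring the factor.
    return information_gain(rows, factor, class_name) / cost

def best_affordable_factor(rows, factor_costs, class_name="call"):
    # factor_costs: e.g. {"children": 1.0, "engine": 5.0} (made-up costs).
    return max(
        factor_costs,
        key=lambda f: cost_weighted_score(rows, f, factor_costs[f], class_name),
    )
```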
Currently, many other alternatives to the well-known ID3 family are in use: ILA [3], RULES 6 [4], CN2 [5], CART [6]. The algorithms mentioned here do not use entropy-based estimates, which reduces their computational complexity.