Random forests


Random forests are among the best out-of-the-box methods and are highly valued by developers and data scientists. To better understand how they work, consider an imaginary weather forecast problem, represented by the following true decision tree:

Figure 1: Weather forecast example

Now, one might consider several forecast agents (friends or neighbours), where each provides their own forecast depending on the factor values. Some forecasts will be higher than the actual value and some lower. However, since they all use some experience-based knowledge, the collected forecasts will be distributed around the actual value. The Random Forest (RF) method works in the same way: it uses hundreds of forecast agents (decision trees) and then applies majority voting.

Figure 2: Weather forecast voting example
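To make the voting idea concrete, the following sketch trains a number of decision trees on bootstrap samples and combines their predictions by majority vote. The synthetic dataset and all parameter values are assumptions made for illustration; scikit-learn's DecisionTreeClassifier plays the role of the individual "forecast agents".

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 100
trees = []
for i in range(n_trees):
    # Each "forecast agent" is a decision tree grown on a bootstrap sample
    # (rows drawn with replacement from the training set).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx]))

# Majority voting: each tree casts one vote per test sample.
votes = np.array([t.predict(X_test) for t in trees])   # shape: (n_trees, n_test)
majority = (votes.mean(axis=0) >= 0.5).astype(int)     # valid for 0/1 class labels
print("Ensemble accuracy:", (majority == y_test).mean())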

Some advantages:

RF features:

Each tree in the forest is grown as follows:

1. If the training set contains N cases, N cases are sampled at random with replacement from the original data (a bootstrap sample); this sample becomes the training set for growing the tree.
2. If there are M input variables, a number m (much smaller than M) is specified such that, at each node, m variables are selected at random out of the M and the best split on these m is used to split the node; the value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible; there is no pruning.

Additional considerations

Correlation Between Trees in the Forest: The correlation between any two trees in a Random Forest refers to the similarity of their predictions across the same dataset. When trees are highly correlated, they are likely to make similar mistakes on the same inputs. In other words, if many trees make similar errors, aggregating their predictions cannot average those errors out, and the overall error rate of the forest increases. The Random Forest method addresses this by introducing randomness in two main ways:

- Bagging (bootstrap aggregation): each tree is grown on a bootstrap sample drawn with replacement from the training data, so different trees see different training cases.
- Random feature selection: at each node, only a random subset of m features is considered for the best split, so different trees tend to split on different variables.
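As a rough illustration, both sources of randomness map onto constructor parameters of scikit-learn's RandomForestClassifier. The dataset and parameter values below are assumptions for the sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees ("forecast agents")
    bootstrap=True,        # randomness source 1: bootstrap sample of rows per tree
    max_features="sqrt",   # randomness source 2: m features considered at each split
    random_state=0,
).fit(X, y)

print("Trees in the forest:", len(forest.estimators_))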

Strength of Each Individual Tree: The strength of an individual tree refers to its classification accuracy on new data, i.e., its ability to perform as a strong classifier on its own. In Random Forest terminology, a tree is strong if it has a low error rate. If each tree classifies well independently, the aggregated predictions of the forest will be more accurate.

Each tree's strength depends on various factors, including its depth and the features it uses for splitting. However, there is a trade-off between correlation and strength. For example, reducing m (the number of features considered at each split) increases the diversity among the trees, lowering correlation, but it may also reduce the strength of each tree, as it may limit the tree's access to highly predictive features.

Despite this trade-off, Random Forests balance these dynamics by optimising m to minimise the ensemble error. Generally, a moderate reduction in m lowers correlation without significantly compromising the strength of each tree, thus leading to an overall decrease in the forest’s error rate.
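This trade-off can be observed empirically. In the sketch below (the synthetic data and the candidate values of m are assumptions), the strength of the individual trees is estimated by their mean accuracy, and their correlation is approximated by the mean pairwise agreement of their predictions:

import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for m in (1, 4, 10, 20):
    forest = RandomForestClassifier(n_estimators=50, max_features=m, random_state=0)
    forest.fit(X_train, y_train)
    preds = np.array([t.predict(X_test) for t in forest.estimators_])

    # "Strength": average accuracy of the individual trees on held-out data.
    strength = np.mean([(p == y_test).mean() for p in preds])
    # Proxy for correlation: average fraction of identical predictions per tree pair.
    agreement = np.mean([(preds[i] == preds[j]).mean()
                         for i, j in combinations(range(len(preds)), 2)])
    print(f"m={m:2d}  mean tree accuracy={strength:.3f}  mean pairwise agreement={agreement:.3f}")

Typically, smaller values of m give lower agreement (less correlation between trees) but also somewhat weaker individual trees, which is exactly the trade-off described above.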

Implications for the Forest Error Rate: The forest error rate in a Random Forest model is influenced by both the correlation among the trees and the strength of each individual tree. Specifically:

- Increasing the correlation between any two trees in the forest increases the forest error rate.
- Increasing the strength of the individual trees decreases the forest error rate.

Consequently, an ideal Random Forest model balances between individually strong and sufficiently diverse trees, typically achieved by tuning the m parameter.
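One common way to tune m in practice is to compare the out-of-bag (OOB) accuracy of forests grown with different values of max_features; the dataset and candidate values below are assumptions for this sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

best_m, best_oob = None, -1.0
for m in (1, 2, 4, 8, 16, 20):
    forest = RandomForestClassifier(
        n_estimators=300, max_features=m,
        bootstrap=True, oob_score=True, random_state=0,
    ).fit(X, y)
    print(f"max_features={m:2d}  OOB accuracy={forest.oob_score_:.3f}")
    if forest.oob_score_ > best_oob:
        best_m, best_oob = m, forest.oob_score_

print("Selected m:", best_m)

The value of m with the highest OOB accuracy would then be used for the final forest.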

For further reading on practical implementations, the scikit-learn package for Python is highly recommended [1].


[1] Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.