====== Random forests ======
{{:en:iot-open:czapka_p.png?50|General audience classification icon}}{{:en:iot-open:czapka_b.png?50|General audience classification icon}}{{:en:iot-open:czapka_e.png?50|General audience classification icon}}\\
[[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro|Random forests]] are among the best out-of-the-box methods and are highly valued by developers and data scientists. To better understand the process, consider an imaginary weather forecast problem represented by the following true decision tree:
{{ :en:iot-reloaded:classification_6.png?800 | | Weather forecast example}} Weather forecast example
Now, consider several forecast agents, such as friends or neighbours, each providing their own forecast depending on the factor values. Some forecasts will be higher than the actual value and some lower. However, since they all rely on some **experience-based knowledge**, the collected forecasts will be distributed around the actual value. The Random forest (RF) method uses hundreds of forecast agents (decision trees) and then applies majority voting.
{{ :en:iot-reloaded:classification_7.png?800 | | Weather forecast voting example}} Weather forecast voting example
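The voting idea can be illustrated with a minimal sketch. It is not an actual random forest: the agents are simulated, not trained trees. It assumes that every agent is right independently with the same probability, which real trees only approximate. The weather labels, the ''accuracy'' value and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

LABELS = ["sunny", "cloudy", "rainy"]  # hypothetical weather classes


def agent_forecast(truth, accuracy, rng):
    """One simulated agent: correct with probability `accuracy`, else a random wrong label."""
    if rng.random() < accuracy:
        return truth
    return rng.choice([label for label in LABELS if label != truth])


def majority_vote(forecasts):
    """Return the label that received the most votes."""
    return Counter(forecasts).most_common(1)[0][0]


def simulate(n_agents, n_days=500, accuracy=0.6, seed=0):
    """Fraction of days on which the majority vote of `n_agents` matches the truth."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_days):
        truth = rng.choice(LABELS)
        votes = [agent_forecast(truth, accuracy, rng) for _ in range(n_agents)]
        if majority_vote(votes) == truth:
            correct += 1
    return correct / n_days


single = simulate(n_agents=1)    # one agent on its own
forest = simulate(n_agents=101)  # many independent agents voting together
print(f"single agent: {single:.2f}, voting ensemble: {forest:.2f}")
```

Under these assumptions, the ensemble should be noticeably more accurate than a single agent, because individual errors rarely agree with one another. The gain shrinks when the agents make correlated mistakes, which is why a random forest tries to keep its trees as independent as possible.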
Some advantages:
  * RF uses more knowledge than a single decision tree;
  * The more diverse the initial information sources, the more diverse the models are and the more robust the final estimate is;
  * This holds because a single data source might suffer from data anomalies, which are then reflected in model anomalies.

RF features:
  * Each tree in the forest uses a randomly selected subset of factors;
  * Each tree is trained on a randomly sampled subset of the training data;
  * Apart from that, each tree is trained as usual;
  * This increases the trees' independence from data anomalies;
  * When a decision is made, it is a simple average over the whole forest (a majority vote in the case of classification).

Each tree in the forest is grown as follows:
  * If the number of cases in the training set is N, a sample of N cases is taken at random, but with replacement, from the original data. Some cases will be represented more than once;
  * This sample will be the training set for growing the tree;
  * If there are M input factors, a number m<