Random Forest
Random forest for classification. A random forest is an ensemble classifier that consists of many decision trees and outputs the majority vote of the individual trees. The method combines the bagging idea with random selection of features.
Each tree is constructed using the following algorithm (a code sketch follows these steps):
1. If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample is the training set for growing the tree.
2. If there are M input variables, specify a number m << M such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible. There is no pruning.
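A minimal sketch of this procedure, assuming numeric features and integer class labels; scikit-learn's DecisionTreeClassifier stands in for the tree grower here (an assumption for illustration, not this library's implementation), with max_features approximating the per-node selection of m variables:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=None, seed=42):
    """Grow n_trees trees, each on a bootstrap sample of the N cases,
    trying only m of the M features at every node split; no pruning."""
    rng = np.random.default_rng(seed)
    n, M = X.shape
    m = m if m is not None else int(np.floor(np.sqrt(M)))  # common heuristic for m
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # step 1: sample N cases with replacement
        tree = DecisionTreeClassifier(     # step 2: m random features per node;
            max_features=m,                # step 3: no depth limit, so grown fully
            random_state=int(rng.integers(2**31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Output the majority vote of the individual trees."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Growing each tree on a different bootstrap sample, with a fresh random feature subset at every node, decorrelates the trees; this is what makes the majority vote more accurate than any individual tree.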
The advantages of random forest are:
- For many data sets, it produces a highly accurate classifier.
- It runs efficiently on large data sets.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of which variables are important in the classification.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses (illustrated after this list).
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
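The internal generalization estimate is the out-of-bag (OOB) error: each tree's bootstrap sample leaves out roughly a third of the cases, and scoring every case only with the trees that never saw it yields a built-in validation estimate. A short sketch using scikit-learn (again an assumption about tooling, not the API documented here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True asks the forest to score each case with the trees whose
# bootstrap samples excluded it, giving the internal generalization estimate.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB accuracy estimate: {clf.oob_score_:.3f}")
```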
The disadvantages are:
- Random forests are prone to over-fitting for some data sets. This is even more pronounced on noisy data.
- For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.
Return
A random forest classification model.
Parameters
- a symbolic description of the model to be fitted.
- the data frame of the explanatory and response variables.
- the number of trees.
- the number of randomly selected features used to determine the decision at a node of the tree; floor(sqrt(dim)) generally gives good performance, where dim is the number of variables.
- the maximum depth of the tree.
- the maximum number of leaf nodes in the tree.
- the minimum size of leaf nodes.
- the sampling rate for training each tree: 1.0 means sampling with replacement, while a value less than 1.0 means sampling without replacement.
- the decision tree node split rule.
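The parameter names are not given above, so as an illustration here is how the same knobs map onto scikit-learn's RandomForestClassifier. The names and values below are sklearn's, chosen for the sketch, not the documented API; sklearn also takes X/y arrays rather than a formula and data frame, and its sampling options only approximate the sampling-rate semantics described above.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=500,       # the number of trees
    max_features="sqrt",    # m = floor(sqrt(dim)) features tried at each node
    max_depth=20,           # the maximum depth of the tree
    max_leaf_nodes=500,     # the maximum number of leaf nodes in the tree
    min_samples_leaf=5,     # the minimum size of leaf nodes
    criterion="gini",       # the node split rule (e.g. Gini impurity)
)
# clf.fit(X, y) takes the explanatory variables X and the response y in
# place of the formula/data-frame interface described above.
```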