Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Examples

Here's an example using the iris data:



> iris.rf <- randomForest(Species ~ ., iris, sampsize=c(10, 20, 10))

This will randomly sample 10, 20 and 10 entities from the three classes of species (with replacement) to grow each tree.

You can also name the classes in the sampsize specification:

> samples <- c(setosa=10, versicolor=20, virginica=10)
> iris.rf <- randomForest(Species ~ ., iris, sampsize=samples)

You can also perform stratified sampling using a variable other than the class labels, for example to even up the class distribution. Andy Liaw gives the example of multi-center clinical trial data, where you want to draw the same number of patients from each center to grow each tree. In that case you can do something like:

> randomForest(..., strata=center,
               sampsize=rep(min(table(center)), nlevels(center)))

This samples the same number of patients (the size of the smallest center) from each center to grow each tree.
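
The clinical trial data itself is not available here, but the following self-contained sketch illustrates the same pattern on the iris data, using a made-up center variable (a hypothetical stand-in, unrelated to the class labels):

> library(randomForest)
> set.seed(42)
> # Hypothetical grouping variable standing in for the clinical center.
> center <- factor(sample(c("A", "B", "C"), nrow(iris), replace=TRUE))
> # Draw the same number of rows from each center for every tree.
> iris.rf <- randomForest(Species ~ ., data=iris, strata=center,
                          sampsize=rep(min(table(center)), nlevels(center)))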

To be confident that the random forest score is simply the proportion of trees voting for the positive class, we can build one tree, then several, and compare the scores we get. We start with a single tree (note that we use the Rattle-generated commands, as listed in the Log tab, and thus the Rattle internal variables).

First build a single tree:

> set.seed(123)
> crs$rf <- randomForest(as.factor(Adjusted) ~ ., 
            data=crs$dataset[crs$sample,c(2:10,13)], 
            ntree=1, importance=TRUE, na.action=na.omit)
> crs$pr <- predict(crs$rf, 
                    crs$dataset[-crs$sample, c(2:10,13)], 
                    type="prob")[,2]
> summary(as.factor(crs$pr))
   0    1 NA's 
 423  139   38

Now build two trees and rerun the code:

> set.seed(123)
> crs$rf <- randomForest(as.factor(Adjusted) ~ ., 
            data=crs$dataset[crs$sample,c(2:10,13)], 
            ntree=2, importance=TRUE, na.action=na.omit)
> crs$pr <- predict(crs$rf, 
                    crs$dataset[-crs$sample, c(2:10,13)], 
                    type="prob")[,2]
> summary(as.factor(crs$pr))
   0  0.5    1 NA's 
 353  124   85   38

And then four trees:

> set.seed(123)
> crs$rf <- randomForest(as.factor(Adjusted) ~ ., 
            data=crs$dataset[crs$sample,c(2:10,13)], 
            ntree=4, importance=TRUE, na.action=na.omit)
> crs$pr <- predict(crs$rf, 
                    crs$dataset[-crs$sample, c(2:10,13)], 
                    type="prob")[,2]
> summary(as.factor(crs$pr))
   0 0.25  0.5 0.75    1 NA's 
 293   98   68   62   41   38

Thus, we can see that when we have four trees voting, the score will be either 0 (no tree voted in favour of the case), 0.25 (one tree in favour), 0.5 (two trees in favour), 0.75 (three trees in favour), or 1 (all trees in favour).
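
The same check can be reproduced without the Rattle internal variables. The sketch below builds a four tree forest on the iris data with a hypothetical binary target standing in for Adjusted; the predicted probabilities can again only be the vote proportions 0, 0.25, 0.5, 0.75 and 1:

> library(randomForest)
> set.seed(123)
> # Hypothetical binary target built from iris, standing in for Adjusted.
> iris2 <- data.frame(iris[,1:4], IsVirginica=factor(iris$Species == "virginica"))
> train <- sample(nrow(iris2), 100)
> rf4 <- randomForest(IsVirginica ~ ., data=iris2[train,], ntree=4)
> pr <- predict(rf4, iris2[-train,], type="prob")[,2]
> summary(as.factor(pr))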
