Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Examples

Here's an example using the iris data:



> iris.rf <- randomForest(Species ~ ., iris, sampsize=c(10, 20, 10))

This will randomly sample 10, 20 and 10 entities from the three classes of species (with replacement) to grow each tree.

You can also name the classes in the sampsize specification:

> samples <- c(setosa=10, versicolor=20, virginica=10)
> iris.rf <- randomForest(Species ~ ., iris, sampsize=samples)

You can also perform stratified sampling using a variable other than the class labels, for example to even up the class distribution. Andy Liaw gives the example of multi-center clinical trial data, where you want to draw the same number of patients from each center to grow each tree. In that case you can do something like:

> randomForest(..., strata=center,
               sampsize=rep(min(table(center)), nlevels(center)))

This samples the same number of patients (the size of the smallest center) from each center to grow each tree.
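
The clinical trial data itself is not available here, but the following self-contained sketch illustrates the same pattern on the iris data, using a made-up center variable (a hypothetical stand-in, unrelated to the class labels):

> library(randomForest)
> set.seed(42)
> # Hypothetical grouping variable standing in for the clinical center.
> center <- factor(sample(c("A", "B", "C"), nrow(iris), replace=TRUE))
> # Draw the same number of rows from each center for every tree.
> iris.rf <- randomForest(Species ~ ., data=iris, strata=center,
                          sampsize=rep(min(table(center)), nlevels(center)))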

To be confident that the random forest score is simply the proportion of trees voting for the positive class, we can build one tree, then several, and compare the scores we get. We start with a single tree (note that we use the Rattle-generated commands, as listed in the Log tab, and thus the Rattle internal variables).

First build a single tree:

> set.seed(123)
> crs$rf <- randomForest(as.factor(Adjusted) ~ ., 
            data=crs$dataset[crs$sample,c(2:10,13)], 
            ntree=1, importance=TRUE, na.action=na.omit)
> crs$pr <- predict(crs$rf, 
                    crs$dataset[-crs$sample, c(2:10,13)], 
                    type="prob")[,2]
> summary(as.factor(crs$pr))
   0    1 NA's 
 423  139   38

Now build two trees and rerun the code:

> set.seed(123)
> crs$rf <- randomForest(as.factor(Adjusted) ~ ., 
            data=crs$dataset[crs$sample,c(2:10,13)], 
            ntree=2, importance=TRUE, na.action=na.omit)
> crs$pr <- predict(crs$rf, 
                    crs$dataset[-crs$sample, c(2:10,13)], 
                    type="prob")[,2]
> summary(as.factor(crs$pr))
   0  0.5    1 NA's 
 353  124   85   38

And then four trees:

> set.seed(123)
> crs$rf <- randomForest(as.factor(Adjusted) ~ ., 
            data=crs$dataset[crs$sample,c(2:10,13)], 
            ntree=4, importance=TRUE, na.action=na.omit)
> crs$pr <- predict(crs$rf, 
                    crs$dataset[-crs$sample, c(2:10,13)], 
                    type="prob")[,2]
> summary(as.factor(crs$pr))
   0 0.25  0.5 0.75    1 NA's 
 293   98   68   62   41   38

Thus, we can see that when we have four trees voting, the score will be either 0 (no tree voted in favour of the case), 0.25 (one tree in favour), 0.5 (two trees in favour), 0.75 (three trees in favour), or 1 (all trees in favour).
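
The same check can be reproduced without the Rattle internal variables. The sketch below builds a four tree forest on the iris data with a hypothetical binary target standing in for Adjusted; the predicted probabilities can again only be the vote proportions 0, 0.25, 0.5, 0.75 and 1:

> library(randomForest)
> set.seed(123)
> # Hypothetical binary target built from iris, standing in for Adjusted.
> iris2 <- data.frame(iris[,1:4], IsVirginica=factor(iris$Species == "virginica"))
> train <- sample(nrow(iris2), 100)
> rf4 <- randomForest(IsVirginica ~ ., data=iris2[train,], ntree=4)
> pr <- predict(rf4, iris2[-train,], type="prob")[,2]
> summary(as.factor(pr))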
