Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Model Tuning

What is the right value to use for each of the tuning variables of the model building algorithms that we use in data mining? These settings can make the difference between a good and a poor model.

The caret package provides a unified interface to many of the model builders we have covered in this book, and it also provides a parameter tuning approach. For example:



> library(rattle)
> library(caret)
> data(audit)
> mysample <- sample(nrow(audit), 1400)
> myrpart <- train(audit[mysample, c(2,4:5,7:10)], 
                   as.factor(audit[mysample, c(13)]), "rpart")
Model 1: maxdepth=6
 collapsing over other values of maxdepth
> myrpart
Call:
train.default(x = audit[mysample, c(2, 4:5, 7:10)], y = as.factor(audit[mysample, 
    c(13)]), method = "rpart")

1400 samples, 7 predictors

largest class: 77.71% (0)

summary of bootstrap (25 reps) sample sizes:
    1400, 1400, 1400, 1400, 1400, 1400, ... 

boot resampled training results across tuning parameters:

  maxdepth  Accuracy  Kappa  Accuracy SD  Kappa SD  Optimal
  2         0.817     0.423  0.0142       0.0386           
  3         0.818     0.413  0.0171       0.0617    *      
  6         0.814     0.412  0.019        0.0488           

Accuracy was used to select the optimal model
> myrpart$finalModel
n= 1400 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 1400 312 0 (0.77714286 0.22285714)  
   2) Marital=Absent,Divorced,Married-spouse-absent,Unmarried,Widowed 773  38 0 (0.95084088 0.04915912) *
   3) Marital=Married 627 274 0 (0.56299841 0.43700159)  
     6) Education=College,HSgrad,Preschool,Vocational,Yr10,Yr11,Yr12,Yr1t4,Yr5t6,Yr7t8,Yr9 409 129 0 (0.68459658 0.31540342)  
      12) Deductions< 1708 400 120 0 (0.70000000 0.30000000) *
      13) Deductions>=1708 9   0 1 (0.00000000 1.00000000) *
     7) Education=Associate,Bachelor,Doctorate,Master,Professional 218  73 1 (0.33486239 0.66513761) *

Similarly, we can replace rpart with rf to tune a random forest in the same way.
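As a minimal sketch of that substitution, the call below reuses the predictors and target from the rpart session and changes only the method. It assumes the audit dataset still has the column layout used above and that the randomForest package is installed, since caret's "rf" method relies on it. The seed and the name myrf are hypothetical choices made for illustration.

```r
library(rattle)   # provides the audit dataset
library(caret)    # train(); method "rf" also needs the randomForest package

data(audit)
set.seed(42)      # hypothetical seed, only to make the sample reproducible
mysample <- sample(nrow(audit), 1400)

# Same predictors and target as the rpart example; only the method changes.
myrf <- train(audit[mysample, c(2, 4:5, 7:10)],
              as.factor(audit[mysample, 13]),
              method = "rf")

myrf              # resampled performance for each candidate value of mtry
myrf$finalModel   # the random forest refitted with the selected mtry
```

For a random forest, caret tunes mtry (the number of variables tried at each split) rather than maxdepth, so the resampling table is indexed by mtry instead.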

The tune function from the e1071 package provides a simple, if sometimes computationally expensive, approach to find a good value for a collection of tuning variables. We explore the use of this function here.

The tune function provides a number of global tuning variables, set through tune.control, that affect how the tuning happens. The nrepeat variable (number of repeats) specifies how often the training should be repeated. The repeat.aggregate variable identifies a function that specifies how to combine the training results over the repeated training. The sampling variable identifies the sampling scheme to use, allowing for cross-validation, bootstrapping or a simple train/test split. For each type of sample, further variables are supplied, including, for example, cross = 10 to set the cross-validation to be 10-fold. The sampling.aggregate variable specifies a function to combine the training results over the various training samples. A good default (provided by tune) is to train once with 10-fold cross-validation.
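The settings described above can be sketched as follows. This is an illustration only: the svm model builder, the built-in iris dataset, the seed, and the candidate gamma and cost values are stand-ins chosen here, not taken from the text. The control values are written out explicitly even though they match the defaults described above.

```r
library(e1071)

set.seed(42)  # hypothetical seed, only so the cross-validation folds are reproducible

# Spell out the global tuning variables: train once, 10-fold cross-validation,
# and average the results over folds and over repeats.
ctrl <- tune.control(nrepeat = 1,
                     repeat.aggregate = mean,
                     sampling = "cross",
                     cross = 10,
                     sampling.aggregate = mean)

# Grid search over illustrative gamma and cost values for an SVM on iris.
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(gamma = 2^(-2:0), cost = 2^(0:2)),
              tunecontrol = ctrl)

summary(tuned)            # error for every parameter combination
tuned$best.parameters     # combination with the lowest cross-validated error
tuned$best.performance    # the corresponding error estimate
```

Every combination in ranges is trained once per fold, so the cost grows with the product of the grid size and the number of folds, which is why the approach can become computationally expensive.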



Copyright © 2004-2008 Togaware Pty Ltd