DATA MINING
Desktop Survival Guide by Graham Williams
Model Tuning
What is the right value to use for each of the tuning variables (parameters) of the model building algorithms that we use in data mining? The variable settings can make the difference between a good and a poor model.
The caret package, as well as providing a unified interface to many of the model builders we have covered in this book, provides an approach to parameter tuning. Here are a couple of examples:
> library(rattle)
> library(caret)
> data(audit)
> mysample <- sample(nrow(audit), 1400)
> myrpart <- train(audit[mysample, c(2,4:5,7:10)],
+                  as.factor(audit[mysample, c(13)]), "rpart")
Model 1: maxdepth=6
collapsing over other values of maxdepth
> myrpart
Call:
train.default(x = audit[mysample, c(2, 4:5, 7:10)], y = as.factor(audit[mysample,
c(13)]), method = "rpart")
1400 samples, 7 predictors
largest class: 77.71% (0)
summary of bootstrap (25 reps) sample sizes:
1400, 1400, 1400, 1400, 1400, 1400, ...
boot resampled training results across tuning parameters:
  maxdepth  Accuracy  Kappa  Accuracy SD  Kappa SD  Optimal
  2         0.817     0.423  0.0142       0.0386
  3         0.818     0.413  0.0171       0.0617    *
  6         0.814     0.412  0.019        0.0488
Accuracy was used to select the optimal model
> myrpart$finalModel
n= 1400
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 1) root 1400 312 0 (0.77714286 0.22285714)
   2) Marital=Absent,Divorced,Married-spouse-absent,Unmarried,Widowed 773 38 0 (0.95084088 0.04915912) *
   3) Marital=Married 627 274 0 (0.56299841 0.43700159)
     6) Education=College,HSgrad,Preschool,Vocational,Yr10,Yr11,Yr12,Yr1t4,Yr5t6,Yr7t8,Yr9 409 129 0 (0.68459658 0.31540342)
      12) Deductions< 1708 400 120 0 (0.70000000 0.30000000) *
      13) Deductions>=1708 9 0 1 (0.00000000 1.00000000) *
     7) Education=Associate,Bachelor,Doctorate,Master,Professional 218 73 1 (0.33486239 0.66513761) *
Similarly, we can replace "rpart" with "rf" in the call to train to tune a random forest instead of a decision tree.
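As a sketch of that substitution, the call below tunes a random forest with train. It assumes the randomForest package is installed alongside caret, and it uses the built-in iris data rather than audit so that it can be run anywhere. The seed, the 5-fold cross-validation set up through trainControl, and tuneLength = 3 are illustrative choices, not settings taken from the session above. For method = "rf", caret tunes mtry, the number of variables tried at each split.

```r
library(caret)

set.seed(42)  # arbitrary seed so the resampling is repeatable

# Assumption: iris stands in for the audit data used above.
myrf <- train(iris[, 1:4], iris$Species,
              method = "rf",
              trControl = trainControl(method = "cv", number = 5),
              tuneLength = 3)   # try three candidate values of mtry

myrf            # resampled accuracy for each candidate mtry
myrf$bestTune   # the value of mtry that was selected
```

In more recent versions of caret, method = "rpart" tunes the complexity parameter cp, and method = "rpart2" tunes maxdepth as in the output shown above, so the exact tuning table may differ from the one printed here.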
The tune function from the e1071 package provides a simple, if sometimes computationally expensive, approach to finding a good value for a collection of tuning variables. We explore the use of this function here.
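A minimal sketch of a call to tune might look like the following. It assumes the e1071 package is installed and tunes a support vector machine (svm, also from e1071) on the built-in iris data; the model, the dataset, and the candidate values of gamma and cost are illustrative assumptions, not choices made in the text. Every combination of the values listed in ranges is tried, which is where the computational expense comes from.

```r
library(e1071)

set.seed(42)  # arbitrary seed so the resampling is repeatable

# Grid search: 3 values of gamma x 3 values of cost = 9 candidate models,
# each assessed with tune's default resampling scheme.
mytune <- tune(svm, Species ~ ., data = iris,
               ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)))

summary(mytune)           # error estimate for every combination tried
mytune$best.parameters    # the combination with the lowest error
mytune$best.performance   # the error estimate for that combination
```

The returned object also holds the full table of results in mytune$performances, which is useful for checking whether the best value sits on the edge of the grid.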
The tune function provides a number of global tuning variables, set through tune.control, that affect how the tuning happens. The nrepeat variable (number of repeats) specifies how often the training should be repeated. The repeat.aggregate variable identifies a function that specifies how to combine the training results over the repeated training. The sampling variable identifies the sampling scheme to use, allowing for cross-validation, bootstrapping, or a simple train/test split. Each sampling scheme has further variables of its own: for example, cross = 10 sets the cross-validation to be 10-fold. The sampling.aggregate variable specifies a function to combine the training results over the various training samples. A good default (provided by tune) is to train once with 10-fold cross-validation.
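The sampling schemes described above can be sketched as three tune.control settings passed to tune through its tunecontrol argument. As before, the svm model, the iris data, the candidate cost values, and the particular numbers chosen for nboot and fix are assumptions made for illustration only.

```r
library(e1071)

set.seed(42)  # arbitrary seed so the resampling is repeatable

# Train once with 10-fold cross-validation (the default, written out).
cv10 <- tune.control(nrepeat = 1, sampling = "cross", cross = 10)

# Bootstrapping: 20 bootstrap samples, with the per-sample errors
# combined by the median rather than the mean.
boot20 <- tune.control(sampling = "bootstrap", nboot = 20,
                       sampling.aggregate = median)

# A simple train/test split: two thirds of the data used for training.
split23 <- tune.control(sampling = "fix", fix = 2/3)

cost_grid <- list(cost = 2^(0:2))

fit_cv    <- tune(svm, Species ~ ., data = iris,
                  ranges = cost_grid, tunecontrol = cv10)
fit_boot  <- tune(svm, Species ~ ., data = iris,
                  ranges = cost_grid, tunecontrol = boot20)
fit_split <- tune(svm, Species ~ ., data = iris,
                  ranges = cost_grid, tunecontrol = split23)

# Compare the error estimates produced by each sampling scheme.
fit_cv$performances
fit_boot$performances
fit_split$performances
```

A single train/test split is the cheapest of the three but gives the noisiest error estimate, which is one reason 10-fold cross-validation is a reasonable default.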