Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Min Bucket (minbucket)

The minbucket is the minimum number of observations in any terminal leaf node.

The two variables minbucket and minsplit are closely related. In rpart, if only one of them is specified then the other is calculated from it so that $minsplit = 3*minbucket$.
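
The relationship can be checked by inspecting the list that rpart.control returns. This is only a small sketch; the object names ctl_bucket, ctl_split and ctl_both are made up for illustration.

```r
library(rpart)

# Only minbucket is given: minsplit is derived as 3 * minbucket.
ctl_bucket <- rpart.control(minbucket = 100)
print(ctl_bucket$minsplit)

# Only minsplit is given: minbucket is derived as round(minsplit / 3).
ctl_split <- rpart.control(minsplit = 30)
print(ctl_split$minbucket)

# Both are given: each value is used exactly as supplied.
ctl_both <- rpart.control(minsplit = 50, minbucket = 40)
print(c(ctl_both$minsplit, ctl_both$minbucket))
```

The first call should report a minsplit of 300 and the second a minbucket of 10.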

Using rpart directly, we specify minbucket within an option called control, which takes the result of a function called rpart.control. In this example we set minbucket to 100:



> library(rpart)
> audit <- read.csv(url("http://rattle.togaware.com/audit.csv"))
> audit.rpart <- rpart(TARGET_Adjusted ~ Age + Marital
                                       + Occupation
                                       + Deductions,
                       data=audit,
                       method="class",
                       control=rpart.control(minbucket=100))
> audit.rpart

Changing minbucket can result in different variables being chosen at different nodes. Compare the tree obtained with the command above (with minbucket set to 100) to the result when minbucket is set to 10. Note how node 7 was originally split using Age, but with the minimum bucket size set to 10 the node is split on Deductions. We can see why: the resulting node 15 has only 30 entities, fewer than the 100 required when minbucket is 100:



[...] 
  control=rpart.control(minbucket=100))
[...]
   7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)  
    14) Age< 36.5 151  72 0 (0.52317881 0.47682119) *
    15) Age>=36.5 365 128 1 (0.35068493 0.64931507) *


[...]
  control=rpart.control(minbucket=10))
[...]
    7) Occupation=Clerical [...] 516 207 1 (0.40116279 0.59883721)  
     14) Deductions< 1299.833 486 207 1 (0.42592593 0.57407407)  
[...]
     15) Deductions>=1299.833 30   0 1 (0.00000000 1.00000000) *
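
One way to make this kind of comparison without reading two printouts side by side is to line up the nodes of both trees by node number. The following is a rough sketch only. It assumes the kyphosis data that ships with rpart as a stand-in for the audit data, the minbucket values 15 and 3 are arbitrary, and node_vars is a hypothetical helper rather than part of rpart.

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) stands in for the audit data.
fit_large <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minbucket = 15))
fit_small <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minbucket = 3))

# Hypothetical helper: one row per node, giving the node number, the
# variable it is split on ("<leaf>" for terminal nodes) and its size.
node_vars <- function(fit) {
  data.frame(node = as.integer(rownames(fit$frame)),
             var  = as.character(fit$frame$var),
             n    = fit$frame$n,
             stringsAsFactors = FALSE)
}

# Keep only the node numbers that appear in both trees.
comparison <- merge(node_vars(fit_large), node_vars(fit_small),
                    by = "node", suffixes = c(".large", ".small"))
print(comparison)
```

Rows where var.large and var.small differ are nodes that are split differently under the two settings. A shared node number only refers to the same set of entities if every split above it is the same in both trees, so the n columns are worth checking too.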

Whilst the default is to set minbucket to be one third of minsplit, there is no requirement for minbucket to be less than minsplit. A node will always have at least minbucket entities, and it will be considered for splitting if it has at least minsplit entities and if, on splitting, each of its children has at least minbucket entities.
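
These rules can be checked on a fitted tree by looking at the node sizes recorded in its frame component. The sketch below is illustrative only. It again assumes the kyphosis data from rpart rather than the audit data, and deliberately sets minbucket larger than minsplit.

```r
library(rpart)

# Assumption: kyphosis (81 rows, bundled with rpart) stands in for the
# audit data. Note that minbucket is larger than minsplit here.
min_split  <- 10
min_bucket <- 20
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = min_split,
                                     minbucket = min_bucket))

# Separate the terminal nodes from the nodes that were split.
is_leaf     <- fit$frame$var == "<leaf>"
leaf_sizes  <- fit$frame$n[is_leaf]
split_sizes <- fit$frame$n[!is_leaf]

# Every leaf holds at least minbucket entities.
print(all(leaf_sizes >= min_bucket))

# Every node that was split holds at least minsplit entities and, since
# both children need minbucket entities, at least 2 * minbucket as well.
print(all(split_sizes >= min_split))
print(all(split_sizes >= 2 * min_bucket))
```

All three checks should print TRUE. When minbucket exceeds half of minsplit, as here, it is minbucket rather than minsplit that limits which nodes can be split.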

Copyright © 2004-2008 Togaware Pty Ltd