Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Resources

Decision trees have been around for a long time as a mechanism for structuring a series of questions and choosing the next question to ask on the basis of the answer to the previous question. In data mining we commonly identify decision trees as the knowledge representation scheme targeted by the family of techniques originating from ID3 in 1979.

C4.5 was made available together with a book (, ) that served as a guide to using the code; the code itself made up half of the book in printed form and was also supplied electronically.

The similar technique of classification and regression trees (CART) was independently developed at around the same time.

() present a comprehensive empirical comparison of many of the modern model builders. An older comparison is known as the Statlog comparison (, ).

Traditional decision tree induction, as epitomised by CART and ID3/C4.5, does not employ any test of statistical significance when deciding which variable to use to partition the data. Conditional trees address this by using a conditional distribution to measure the association between the output and the input variables, taking distributional properties into account.
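A minimal sketch of fitting a conditional inference tree in R, assuming the party package is installed and using the iris data purely as an illustration:

  library(party)
  # Fit a conditional inference tree; splits are chosen via permutation
  # tests of association rather than a raw information measure.
  ct <- ctree(Species ~ ., data = iris)
  print(ct)
  plot(ct)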

Traditional decision tree algorithms also suffer from overfitting and from a bias toward selecting variables with many possible splits. Because they make no use of statistical significance, they cannot, as noted by (), distinguish between significant and insignificant improvements in the information measure.
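One practical guard against overfitting is cost-complexity pruning guided by cross-validated error. A minimal sketch with rpart, using its bundled kyphosis data and an illustrative cp value:

  library(rpart)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
  printcp(fit)                     # cross-validated error at each complexity value
  pruned <- prune(fit, cp = 0.05)  # prune back to a simpler tree (cp chosen for illustration)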

The Borgelt collection (See Chapter 48) contains dtree, a generic implementation of the decision tree divide-and-conquer algorithm. Weka (See Chapter 53) also provides a freely available implementation of a decision tree induction algorithm (J48) within its Java-based framework. Decision tree induction is a fundamental data mining tool and implementations of C4.5 or its variations are available in most commercial data mining toolkits, including Clementine (See Chapter 54) and STATISTICA (See Chapter 60).
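From R, Weka's J48 can also be reached through the RWeka package; a minimal sketch, assuming RWeka and a Java runtime are installed:

  library(RWeka)
  j48 <- J48(Species ~ ., data = iris)
  summary(j48)   # training-set evaluation of the induced tree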

Can library(rgl) be used to visualise a decision tree?

() discusses generalised degrees of freedom and shows that, to obtain an unbiased estimate of $R^2$ from recursive partitioning (decision tree building), the adjusted $R^2$ formula must be used with a number of effective parameters far exceeding the number of final splits, and shows how those degrees of freedom can be estimated. Decision tree building can result in seemingly simple predictive models, but that simplicity can be an illusion.
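For reference, with $n$ observations and $p$ effective parameters the adjusted $R^2$ is
\[
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\]
and the point above is that for a decision tree the effective $p$ is much larger than the number of final splits would suggest.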

An alternative implementation is provided by the tree package, although rpart is generally preferred.
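A minimal sketch fitting the same illustrative model with each package (iris again as example data):

  library(tree)
  library(rpart)
  fit.tree  <- tree(Species ~ ., data = iris)   # tree package
  fit.rpart <- rpart(Species ~ ., data = iris)  # rpart package
  summary(fit.rpart)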

For multivariate trees use the mvpart package.
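A minimal sketch of a multivariate tree with mvpart on synthetic data; the matrix-response formula follows the package's documented usage, and all variable names here are illustrative assumptions:

  library(mvpart)
  set.seed(42)
  dat <- data.frame(y1 = rnorm(100), y2 = rnorm(100),
                    x1 = runif(100), x2 = runif(100))
  # A matrix on the left-hand side requests a multivariate tree.
  fit <- mvpart(data.matrix(dat[, c("y1", "y2")]) ~ x1 + x2, data = dat)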

Visualise trees with maptree and pinktoe.
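A minimal sketch of drawing a fitted rpart tree with maptree's draw.tree; pinktoe is not shown here:

  library(rpart)
  library(maptree)
  fit <- rpart(Species ~ ., data = iris)
  draw.tree(fit, cex = 0.8)   # an alternative rendering of the fitted tree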

The algorithm complexity figure comes from ().
