Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Decision Trees

One of the classic machine learning techniques, widely deployed in data mining, is decision tree induction. Using a simple algorithm and a simple knowledge structure, the approach has proven to be very effective. These simple tree structures represent a classification (or regression) model. Starting at the root node, a simple question is asked (usually a test on a variable value, like Age < 35). The branches emanating from the node correspond to the alternative answers. For example, with a test of Age < 35 the alternatives would be Yes and No. We follow the branch matching the answer to the next node and repeat. Once a leaf node is reached (one from which no branches emanate) we take the decision or classification associated with that node. Some form of probability may also be associated with the nodes, indicating a degree of certainty for the decision.

Decision tree algorithms handle mixed types of variables and missing values, and are robust to outliers, to monotonic transformations of the input, and to irrelevant inputs. Their predictive power, however, tends to be poorer than that of other techniques.

The model is expressed in the form of a simple decision tree (the knowledge representation). At each node of the tree we test the value of one of the variables, and depending on its value, we follow one of the branches emanating from that node. Thus, each branch can be thought of as having a test associated with it, for example Age < 35. This branch then leads to another node where there will be another variable to test, and so on, until we reach a leaf node of the tree. The leaf node represents the decision to be made. For example, it may be a yes or no for deciding whether an insurance claim appears to be fraudulent.
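
The traversal just described can be sketched in a few lines of Python. This is a minimal illustration only, not the representation used by any particular tool: the Node class, the classify function, the ClaimAmount variable and all thresholds and counts are hypothetical. Each leaf stores class counts, so a simple probability can be reported alongside the decision.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # Internal nodes carry a test of the form record[variable] < threshold.
    # Leaf nodes leave variable as None and carry class counts instead.
    variable: Optional[str] = None
    threshold: Optional[float] = None
    yes: Optional["Node"] = None  # branch followed when the test is true
    no: Optional["Node"] = None   # branch followed when the test is false
    counts: Dict[str, int] = field(default_factory=dict)

    def is_leaf(self) -> bool:
        return self.variable is None


def classify(node: Node, record: dict):
    """Walk from the root to a leaf and return (decision, probability)."""
    while not node.is_leaf():
        node = node.yes if record[node.variable] < node.threshold else node.no
    decision = max(node.counts, key=node.counts.get)
    probability = node.counts[decision] / sum(node.counts.values())
    return decision, probability


# Hypothetical tree for the fraudulent-claim example (all values invented).
tree = Node(
    variable="Age", threshold=35,
    yes=Node(
        variable="ClaimAmount", threshold=5000,
        yes=Node(counts={"no": 40, "yes": 10}),
        no=Node(counts={"no": 5, "yes": 15}),
    ),
    no=Node(counts={"no": 90, "yes": 10}),
)

print(classify(tree, {"Age": 29, "ClaimAmount": 8000}))  # ('yes', 0.75)
```

Here the record answers Yes to Age < 35 and No to ClaimAmount < 5000, so it ends at the leaf where 15 of 20 training cases were fraudulent.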

In searching for the decision tree that best models our data, alternative trees are considered in a top-down fashion, beginning with the choice of the variable on which to first partition the data (at the root node).
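
As a rough sketch of that first step, the following Python picks the root test by scoring every candidate partition. The text above does not say how alternatives are compared, so the use of Gini impurity here is an assumption, as are the function names (gini, best_split) and the toy data. A full tree builder would then apply the same search recursively to each resulting partition.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, target):
    """Return (weighted impurity, variable, threshold) for the best binary
    test 'variable < threshold' over all numeric input variables."""
    best = None
    variables = [v for v in rows[0] if v != target]
    for var in variables:
        values = sorted({r[var] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[target] for r in rows if r[var] < t]
            right = [r[target] for r in rows if r[var] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, var, t)
    return best


# Invented toy data: Age separates the classes cleanly, Income does not.
rows = [
    {"Age": 22, "Income": 30, "Fraud": "yes"},
    {"Age": 28, "Income": 80, "Fraud": "yes"},
    {"Age": 33, "Income": 45, "Fraud": "yes"},
    {"Age": 41, "Income": 35, "Fraud": "no"},
    {"Age": 52, "Income": 70, "Fraud": "no"},
    {"Age": 60, "Income": 50, "Fraud": "no"},
]

score, variable, threshold = best_split(rows, "Fraud")
print(variable, threshold, score)  # Age 37.0 0.0
```

Because only the ordering of values matters when choosing which records fall on each side of a threshold, a search of this kind selects the same partition after any monotonic transformation of an input.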

[Figure: rattle-audit-evaluate-riskchart-rpart]

Decision trees are the building blocks of data mining. Since their development back in the 1980s they have been the most widely deployed data mining model builder. The attraction lies in the simplicity of the resulting model, where a decision tree (at least one that is not too large) is quite easy to view, to understand, and, indeed, to explain to management! However, decision trees do not deliver the best performance in terms of the risk charts, and so there is a trade-off between performance and simplicity of explanation and deployment.



Copyright © 2004-2008 Togaware Pty Ltd