One of the classic machine learning techniques, widely deployed in
data mining, is decision tree induction. Using a simple algorithm and
a simple knowledge structure, the approach has proven to be very
effective. These simple tree structures represent a classification
(and regression) model. Starting at the root node, a simple question
is asked (usually a test on a variable value, such as comparing Age
with 35). The branches emanating from the node correspond to alternative
answers. For example, with a test comparing Age with 35 the
alternatives would be Yes and No. Once a leaf node
is reached (one from which no branches emanate) we take the decision
or classification associated with that node. Some form of probability
may also be associated with the nodes, indicating a degree of
certainty for the decision. Decision tree algorithms handle mixed types of
variables, handle missing values, are robust to outliers and to monotonic
transformations of the input, and are robust to irrelevant
inputs. Predictive power tends to be poorer than that of other techniques.
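The robustness to monotonic transformations mentioned above can be illustrated with a small sketch. A threshold test partitions the observations according to their ordering only, so applying an increasing function (here the logarithm) to both the values and the threshold leaves the partition unchanged. The income values and the threshold below are hypothetical, chosen purely for illustration.

```python
import math

# Hypothetical incomes and a hypothetical split threshold.
incomes = [12_000, 48_000, 51_000, 250_000, 1_000_000]
threshold = 50_000

# Which observations follow the "Yes" branch of the test income < threshold?
original = [x < threshold for x in incomes]

# The same test after a monotonic (log) transformation of the input.
transformed = [math.log(x) < math.log(threshold) for x in incomes]

# Both tests send every observation down the same branch.
print(original == transformed)
```

The same reasoning suggests why a single extreme value (an outlier) has limited influence: it changes the position of that observation only at the end of the ordering.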
The model is expressed in the form of a simple decision tree (the
knowledge representation). At each node of the tree we test the value
of one of the variables, and depending on its value, we follow one of
the branches emanating from that node. Thus, each branch can be
thought of as having a test associated with it, for example
a comparison of Age with 35. This branch then leads to another node where
there will be another variable to test, and so on, until we reach a
leaf node of the tree. The leaf node represents the decision to be
made. For example, it may be a yes or no for
deciding whether an insurance claim appears to be fraudulent.
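The traversal just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular tool. The class names Node and Leaf, the ClaimAmount variable, the probabilities, and the tree itself are all hypothetical, and the sketch assumes each test has the form "variable < threshold", with the Yes branch followed when the test holds.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    decision: str        # the classification taken at this leaf
    probability: float   # degree of certainty attached to the decision


@dataclass
class Node:
    variable: str                # variable tested at this node
    threshold: float             # assumed test: record[variable] < threshold
    yes: Union["Node", Leaf]     # branch followed when the test holds
    no: Union["Node", Leaf]      # branch followed otherwise


def classify(tree, record):
    """Walk from the root to a leaf and return (decision, probability)."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if record[node.variable] < node.threshold else node.no
    return node.decision, node.probability


# A hypothetical tree for flagging possibly fraudulent insurance claims.
tree = Node(
    "Age", 35,
    yes=Node("ClaimAmount", 10_000,
             yes=Leaf("no", 0.90),
             no=Leaf("yes", 0.70)),
    no=Leaf("no", 0.95),
)

print(classify(tree, {"Age": 28, "ClaimAmount": 25_000}))
print(classify(tree, {"Age": 50, "ClaimAmount": 25_000}))
```

Each call follows exactly one path from the root to a leaf, which is why the resulting model is easy to read off as a sequence of tests.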
In searching for a decision tree to best model our data, alternative
decision trees are considered in a top-down fashion, beginning with
the choice of the variable used to initially partition the data (at the
root node).
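As a rough sketch of that first step, the code below picks the variable and threshold for the root node by trying every candidate split and keeping the one that leaves the purest partitions. The text above does not name a selection measure, so the use of Gini impurity here is an assumption, as are the function names and the tiny data set. A full algorithm would then repeat the same search within each partition.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (variable, threshold, weighted impurity) of the best binary split."""
    best = None
    n = len(rows)
    for variable in rows[0]:
        values = sorted({row[variable] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[variable] < threshold]
            right = [y for row, y in zip(rows, labels) if row[variable] >= threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (variable, threshold, score)
    return best


# Hypothetical observations with two numeric variables and a yes/no target.
rows = [
    {"Age": 25, "Income": 40},
    {"Age": 30, "Income": 80},
    {"Age": 40, "Income": 45},
    {"Age": 50, "Income": 90},
]
labels = ["yes", "yes", "no", "no"]

print(best_split(rows, labels))
```

In this toy data set a split on Age separates the two classes completely, so it is chosen for the root node ahead of any split on Income.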
Decision trees are the building blocks of data mining. Since their
development back in the 1980s they have been the most widely deployed
data mining model builder. The attraction lies in the simplicity of
the resulting model, where a decision tree (at least one that is not
too large) is quite easy to view, to understand, and, indeed, to
explain to management! However, decision trees do not deliver the best
performance in terms of the risk charts, and so there is a trade-off
between performance and simplicity of explanation and deployment.
Copyright © 2004-2008 Togaware Pty Ltd