Desktop Survival Guide
by Graham Williams

Evaluation and Deployment

Todo: This chapter is still under review.

From David Hand:

``Performance criteria play a crucial role in data mining. We need simple numerical summaries which can be compared automatically. Poor choice of performance criterion yields a poor model. Example: Error rate is inappropriate with unbalanced classes. For rare disease detection, minimum error rate is achieved by classifying everyone as healthy.

Different types of classification model performance criteria:

  1. Problem-based criteria
  2. Model fit criteria
  3. Classification accuracy criteria

1. Problem-based criteria

  - speed of construction
  - speed of classification
  - ability to handle very large data sets
  - effectiveness on small n, large p problems
  - ability to cope with incomplete data
  - interpretability
  - ability to identify important aspects
  - ease of dynamic updating

All problem and context dependent.

2. Model fit criteria

Accuracy of probability estimates:

  - log-likelihood: $\sum_{i=1}^{n} \left[ c_i \ln p(1|x_i) + (1 - c_i) \ln\left(1 - p(1|x_i)\right) \right]$
  - Brier score: $\sum_{i=1}^{n} \left( c_i - p(1|x_i) \right)^2$

But goodness of fit of the model is not the same as accuracy of prediction. Poorer fit can mean greater accuracy.

3. Classification accuracy criteria

See plot on slide 16 of KDD09 presentation, etc. How to choose t''
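The points in the quotation above can be made concrete with a small sketch. It is written in Python purely for illustration (the chapter itself works with R and Rattle), and it assumes that c_i is the 0/1 class label and p(1|x_i) is the model's estimated probability of class 1, which is our reading of the slide notation. The data and both sets of probability estimates are made up, and the function names are hypothetical.

```python
import math

def error_rate(actual, predicted):
    """Proportion of cases whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def log_likelihood(actual, prob):
    """Sum of c*ln(p) + (1 - c)*ln(1 - p); values closer to zero mean a better fit."""
    return sum(c * math.log(p) + (1 - c) * math.log(1 - p)
               for c, p in zip(actual, prob))

def brier_score(actual, prob):
    """Sum of (c - p)^2; smaller values mean a better fit."""
    return sum((c - p) ** 2 for c, p in zip(actual, prob))

def classify(prob, threshold=0.5):
    """Turn probability estimates into 0/1 classes using a fixed threshold."""
    return [1 if p > threshold else 0 for p in prob]

# Made-up rare-disease data: 2 diseased cases (1) among 100.
actual = [1] * 2 + [0] * 98

# Classifying everyone as healthy gives a low error rate yet finds no disease.
everyone_healthy = [0] * 100
print(error_rate(actual, everyone_healthy))  # 0.02

# Two hypothetical sets of probability estimates for the same cases.
prob_a = [0.4, 0.4] + [0.01] * 98   # close to the labels, but misses both diseased cases at 0.5
prob_b = [0.6, 0.6] + [0.3] * 98    # further from the labels, but on the right side of 0.5

for name, prob in (("A", prob_a), ("B", prob_b)):
    print(name,
          "log-likelihood:", round(log_likelihood(actual, prob), 3),
          "Brier:", round(brier_score(actual, prob), 3),
          "error rate:", error_rate(actual, classify(prob)))
```

In this contrived case, estimates A fit better on both fit criteria, yet estimates B classify every case correctly at a threshold of 0.5, which is one way that poorer fit can go with greater accuracy.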

Evaluating a model on a testing dataset that was held back from model building is often referred to as cross validation. The model has been built without access to this testing dataset.

Evaluating the performance of model building is important. We need to measure how any model we build will perform on previously unseen cases. A measure also allows us to ascertain how well a model performs in comparison to other models we might choose to build, using either the same model builder or a very different one. A common approach is to measure the error rate as the proportion of cases that the model classifies incorrectly (or, equivalently, the accuracy as the proportion it classifies correctly). Common methods for presenting and estimating the empirical error rate include confusion matrices and cross-validation.
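As a rough illustration of these ideas, the following Python sketch tabulates a confusion matrix, computes the error rate, and estimates the error rate by k-fold cross-validation. It is a minimal sketch under stated assumptions, not Rattle's implementation: the data are made up, and the toy model builder (a cut-point halfway between the two class means) and all function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair, much as a cross-tabulation would."""
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    """Proportion of cases classified incorrectly."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def build_threshold_model(xs, ys):
    """Toy model builder (hypothetical): predict 1 above the midpoint of the class means.
    Assumes both classes are present in the training data."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > cut else 0

def cross_validated_error(xs, ys, k, build):
    """k-fold cross-validation: each fold is held out once while the model
    is built on the remaining cases, then tested on the held-out fold."""
    n = len(xs)
    errors = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = build([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors += sum(model(xs[i]) != ys[i] for i in test_idx)
    return errors / n

# Made-up data: one input variable and a 0/1 target with two awkward cases.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

model = build_threshold_model(xs, ys)
predicted = [model(x) for x in xs]

cm = confusion_matrix(ys, predicted)
for a in (0, 1):
    print("actual", a, [cm[(a, p)] for p in (0, 1)])

training_error = error_rate(ys, predicted)
cv_error = cross_validated_error(xs, ys, 5, build_threshold_model)
print("error rate on the training data:", training_error)
print("5-fold cross-validated error rate:", cv_error)
```

The error rate measured on the data used to build the model tends to be optimistic, which is why the cross-validated figure, computed only on held-out cases, is the one to compare across models.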

The various approaches to measuring performance include Lift, area under the ROC curve, the F-score, average precision, precision/recall, squared error, and risk charts.
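Several of these measures are simple to compute by hand, as the following Python sketch suggests. It uses the standard textbook definitions of precision, recall, the F-score (here the balanced F1 form), area under the ROC curve (as the proportion of positive/negative pairs ranked correctly), and lift. The labels and scores are made up, the function names are hypothetical, and the code assumes at least one actual and one predicted positive.

```python
def precision_recall_f(actual, predicted):
    """Precision, recall and balanced F-score for 0/1 labels.
    Assumes at least one predicted positive and one actual positive."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

def auc(actual, scores):
    """Area under the ROC curve: the proportion of (positive, negative) pairs
    in which the positive case has the higher score (ties count as half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def lift(actual, scores, fraction):
    """Rate of positives among the top-scored fraction of cases,
    relative to the overall rate of positives."""
    ranked = [a for _, a in sorted(zip(scores, actual), reverse=True)]
    top = ranked[:max(1, round(fraction * len(ranked)))]
    return (sum(top) / len(top)) / (sum(actual) / len(actual))

# Made-up labels and model scores for eight cases.
actual = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
predicted = [1 if s > 0.5 else 0 for s in scores]

precision, recall, f_score = precision_recall_f(actual, predicted)
area = auc(actual, scores)
top_quarter_lift = lift(actual, scores, 0.25)

print("precision:", precision, "recall:", recall, "F-score:", round(f_score, 3))
print("AUC:", round(area, 3))
print("lift in the top 25%:", round(top_quarter_lift, 3))
```

Precision, recall and the F-score depend on the chosen threshold (0.5 here), whereas the area under the ROC curve and lift are computed from the ranking of the scores.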

In this chapter we explore Rattle's tools for reporting the performance of a model and its approaches to evaluating the output of data mining. We cover the confusion matrix (produced underneath by R's table function), Rattle's new Risk Chart for effectively displaying model performance, including a measure of the success of each case, and the ROCR package for the graphical presentation of numerous evaluations, including the common approaches included in Rattle. Moving into R illustrates how to fine tune the presentations for your own needs.

This chapter also touches on issues around the deployment of our models, and in particular Rattle's Scoring option, which allows us to load a new dataset and apply our model to that dataset, saving the scores, together with any identity data, to a file for actioning.
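The general shape of scoring can be sketched outside of Rattle as follows. This Python sketch is only an illustration of the idea (apply a model to new data, keep the identifiers, write the scores to a file) and says nothing about how Rattle's Scoring option is implemented. The dataset, the column names, the toy model and the function names are all hypothetical, and an in-memory buffer stands in for real files.

```python
import csv
import io

def score_dataset(rows, model, id_columns):
    """Apply the model to each row, keeping only the identifiers and the score."""
    scored = []
    for row in rows:
        out = {col: row[col] for col in id_columns}
        out["score"] = model(row)
        scored.append(out)
    return scored

def write_scores(scored, id_columns, handle):
    """Write the identifiers and scores as CSV to an open file-like object."""
    writer = csv.DictWriter(handle, fieldnames=list(id_columns) + ["score"])
    writer.writeheader()
    writer.writerows(scored)

def toy_model(row):
    """Hypothetical stand-in for a previously built model."""
    return 1 if float(row["income"]) > 50000 else 0

# Hypothetical new dataset; in practice this would be read from a file.
new_data = io.StringIO("id,age,income\n101,25,30000\n102,60,80000\n")
rows = list(csv.DictReader(new_data))

scored = score_dataset(rows, toy_model, ["id"])

# In practice this would be an open output file rather than a buffer.
out = io.StringIO()
write_scores(scored, ["id"], out)
print(out.getvalue())
```

Keeping the identity columns alongside the scores is what allows the scored cases to be matched back to the original records when they are acted upon.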


For the past 25 years, two methods have been used to evaluate computer-based methods for classifying patient samples, but Swedish researchers at Uppsala University have found that this methodology is worthless when applied to practical problems. Such classification methods are the basis for many technical applications, such as recognizing human speech, images, and fingerprints, and are now being used in new fields such as health care. However, to evaluate the performance of a classification model, a number of trial examples that were never used in the design of the model are needed. Unfortunately, there are seldom tens of thousands of test samples available for this type of evaluation, often because the samples are too rare or too expensive to collect for use in an evaluation. Numerous methods have been proposed to solve this problem, and since the 1980s two methods have dominated the field: cross validation and resampling/bootstrapping. The Uppsala researchers used both theory and computer simulations to show that those methods are worthless in practice when the total number of examples is small in relation to the natural variation that exists among different observations. What constitutes a small number depends on the problem being studied. The researchers say it is essentially impossible to determine whether the number of examples used is sufficient. "Our main conclusion is that this methodology cannot be depended on at all, and that it therefore needs to be immediately replaced by Bayesian methods, for example, which can deliver reliable measures of the uncertainty that exists," says Uppsala University professor Mats Gustafsson, who co-directed the study with professor Anders Isaksson. "Only then will multivariate analyses be in any position to be adopted in such critical applications as health care."

Copyright © 2004-2010 Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010