
Issues: Under-represented Classes

Consider the problem of fraud investigation, perhaps in insurance claims. Suppose some 10,000 cases have been investigated and just 5% of those (500 cases) were found to be fraudulent. This is a typical scenario for many organisations. Through modelling we wish to improve the deployment of our resources so that the 95% of cases that are not fraudulent need not all be investigated, while the 5% that are fraudulent are still identified. Each case of actual fraud also has a dollar value associated with it, representing the magnitude of the risk associated with the case.

The advantage of a decision tree approach is that the resulting tree (and particularly if we traverse each path through the tree to obtain a set of rules) can be easily understood and explained, allowing decision makers the opportunity to understand the changes being suggested.

An aim of the modelling here is to deliver a model that allows us to trade caseload off against coverage whilst maximising the recovery of the dollars represented by the risk.

The first step is to build a decision tree. Because of the skewness of the outcome we might ``trick'' rpart into working harder to identify the frauds. As Breiman et al. (1984) indicate, different costs of misclassification can be modelled by modifying the loss matrix, by using different prior probabilities for the classes, or by using different weights for the response classes. The first two are achieved in rpart through the parms argument, which records the options we want for the tree building: loss and prior can be set within the parms list. Another approach is to use rpart's weights argument (to weight each case individually) or its cost argument (the relative cost of obtaining each variable's value, which can be used to tune the choice of variables in the model).

In using prior, the relative prior probability assigned to each class adjusts the importance of misclassifications for that class. Priors may thus be interpreted much like case weights, although case weights are treated as case multipliers.

In fraud detection it is desirable not to misclassify the actual cases of fraud; that is, a more accurate classification is desired for some classes than for others, something that is not captured by the relative class sizes. If the criterion for predictive accuracy is misclassification cost, as it often is, then minimising cost amounts to minimising the proportion of misclassified cases when the priors are taken to be proportional to the class sizes and the misclassification costs are equal for every class. A loss matrix specifies the loss incurred when an observation of one class (say 1) is erroneously classified by our model as another class (say 0).
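
To see how priors and losses interact, note that rpart folds a loss matrix into ``altered priors'': each class prior is scaled by that class's total misclassification loss (the row sum of the loss matrix) and the result is renormalised. A small sketch of the calculation, taking the approximate class proportions of the audit data introduced below and the loss matrix used in the example below:

prior <- c(0.74, 0.26)                 # approximate audit class proportions
loss <- matrix(c(0, 1, 2, 0), byrow=TRUE, ncol=2)
altered <- prior * rowSums(loss)       # scale each prior by its total loss
altered/sum(altered)                   # roughly c(0.59, 0.41)

The fraud class thus carries an effective prior of about 0.41 rather than 0.26, which is precisely what makes rpart work harder to identify the frauds.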

For the following examples we use the audit dataset from the rattle package. This dataset consists of a number of input variables with the target (Adjusted) in the last column. The outcome is binary (0/1) with the positive case (1) being under-represented. The measure of the risk (Adjustment) is assumed to be in the second last column. The risk is the dollar amount recovered from the review of the fraud and should not be used as an input variable in the modelling (thus we use audit[,-12] to remove this column from the data). The examples here show the R code behind Rattle as presented in Chapter [*].
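
Before modelling it is worth confirming the class imbalance. The observed class proportions are also the default priors that rpart will use:

library(rattle)
data(audit)
table(audit$Adjusted)                  # counts for the binary target
prop.table(table(audit$Adjusted))      # observed proportions, the default priors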

Using prior to over-emphasise the under-represented outcome:

library(rpart)
library(rattle)
data(audit)
# Equal priors give each class the same influence on the splits.
audit.rpart <- rpart(Adjusted ~ ., data=audit[,-12], method="class",
                     parms=list(prior=c(.5, .5)))

Using loss:

library(rpart)
library(rattle)
data(audit)
# Rows index the true class and columns the predicted class: here,
# misclassifying a true fraud (1) as a non-fraud (0) costs twice as much.
loss <- matrix(c(0, 1, 2, 0), byrow=TRUE, ncol=2)
audit.rpart <- rpart(Adjusted ~ ., data=audit[,-12], method="class",
                     parms=list(loss=loss))

Using weights based on the value of the risk:

library(rpart)
library(rattle)
data(audit)
# Scale each case's dollar value to a weight between 1 and 11.
weight <- abs(audit$Adjustment)/max(audit$Adjustment)*10 + 1
audit.rpart <- rpart(Adjusted ~ ., data=audit[,-12], weights=weight, method="class")

Now we apply the model to the data to obtain probabilistic predictions. (Note that we are applying the model to the same data it was trained on, which will give us an optimistic estimate of performance; Rattle uses training/test sets to give a better estimate.) The result is, for each case, the probability of it being a fraud:

# The second column holds the probability of the positive class (1).
audit.predict <- predict(audit.rpart, audit)[,2]
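
For a less optimistic estimate we could instead hold out a test set, which is essentially what Rattle does. A minimal sketch only (the 70/30 split, the seed, and the variable names are illustrative, not Rattle's actual code):

set.seed(42)                                         # illustrative seed
train <- sample(nrow(audit), round(0.7*nrow(audit)))
audit.test.rpart <- rpart(Adjusted ~ ., data=audit[train, -12], method="class")
audit.test.predict <- predict(audit.test.rpart, audit[-train,])[,2]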

Now, using Rattle (see Chapter [*]), we can produce a Risk Chart that orders the cases by their probability of being a fraud and plots, for each choice of caseload, the coverage and the risk (percentage of dollars) recovered.

library(rattle)
eval <- evaluateRisk(audit.predict, audit$Adjusted, audit$Adjustment)
plotRisk(eval$Caseload, eval$Precision, eval$Recall, eval$Risk)
title(main="Risk Chart using rpart on the audit dataset", 
      sub=paste("Rattle", Sys.time(), Sys.info()["user"]))

This produces: [Figure: Risk Chart using rpart on the audit dataset.]

The plot can easily be used to tell a story about the tradeoff between recovering all risk cases and the amount of effort expended. The solid black diagonal line can be thought of as the baseline. The so-called optimal line (the caseload where the sum of the distances of the Revenue and Adjustments lines from the baseline is maximal) is an interesting point to consider. The story is that if our investigators investigated only 25% of the cases they currently investigate, they would identify 64% of the cases that were found to be fraudulent and recover 72% of the dollars that were recovered. The remaining 75% of the investigative resources could be better deployed, perhaps to target higher risk populations where the returns are greater. Note that the Strike Rate has increased from 26% over the original dataset to 67% at this optimal point.

Perhaps an even better story is that with half of the resources currently deployed on investigations (a caseload of 50%), our model recovers almost 90% of the frauds and marginally more than 90% of the dollars known to be recoverable.
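
These percentages can be read directly from the evaluation. A sketch, assuming evaluateRisk records Caseload as a proportion with one row per distinct score:

idx <- which.min(abs(eval$Caseload - 0.25))   # row nearest a 25% caseload
eval[idx, c("Caseload", "Precision", "Recall", "Risk")]

Here Precision is the strike rate at that caseload.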

We do note that here we are assuming the caseload directly reflects the actual workload (i.e., every case takes the same amount of effort).

Such Risk Charts are used to compare the performance of alternative models, where the aim is often to extend the red (Revenue) and green (Recall) lines toward the top left corner of the plot, or to maximise the area under these curves.
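
A single summary number for such comparisons is the area under the Recall versus Caseload curve, approximated here with the trapezoidal rule (a sketch, assuming eval is ordered by increasing Caseload):

with(eval, sum(diff(Caseload)*(head(Recall, -1) + tail(Recall, -1))/2))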
