Desktop Survival Guide
by Graham Williams
Unbalanced classification is a common problem in areas such as fraud detection, rare disease diagnosis, and network intrusion detection. The problem is that one class is heavily underrepresented in the data. For example, cases of fraud in a very large medical insurance dataset may make up less than 1% of the records. In compliance work, where claims are reviewed for compliance, the proportion of claims that require adjustment is often only around 10%. In such circumstances, if we build a model in the usual way, aiming to minimise the overall error rate, the most accurate model may simply predict that there is never any fraud. With 1% fraud, such a model is 99% accurate yet of very little use, because it identifies none of the cases we actually care about.
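The arithmetic behind that observation can be sketched in a few lines of Python. The dataset here is hypothetical (1,000 claims, of which 10 are fraudulent), and the helper names `accuracy` and `recall` are chosen only for this illustration.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the prediction matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive=1):
    """Fraction of actual positive cases that were predicted positive."""
    preds_for_positives = [p for a, p in zip(actual, predicted) if a == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical data: 1 = fraud, 0 = no fraud; 10 frauds in 1,000 claims (1%).
actual = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class.
always_negative = [0] * len(actual)

print(accuracy(actual, always_negative))  # 0.99
print(recall(actual, always_negative))    # 0.0
```

The model scores 99% accuracy while finding none of the fraud cases, which is why accuracy alone is a poor guide on unbalanced data.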
Data mining of unbalanced datasets will often involve adjusting the modelling in some way. One approach is to down-sample the majority class to even up the classes. Alternatively, we might over-sample entities from the rare class, and by so doing increase the weight of the minority class. Such approaches can work, but it is not always clear that they will. Under-sampling can lead to a loss of information, whilst over-sampling may lead to over-fitting. Adaptive under-sampling, however, can reduce the loss of information, and it can produce better results than over-sampling while also being more efficient.
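A minimal sketch of simple random down-sampling and over-sampling follows, using only the Python standard library. It assumes each row is a tuple whose last element is the class label, and that the minority class is smaller than the majority. The function names and the example data are hypothetical, and this is plain random resampling, not the adaptive under-sampling mentioned above.

```python
import random
from collections import Counter


def split_by_class(rows, minority):
    """Separate rows into (majority, minority) lists using the last field as label."""
    minor = [r for r in rows if r[-1] == minority]
    major = [r for r in rows if r[-1] != minority]
    return major, minor


def down_sample(rows, minority, seed=0):
    """Keep all minority rows and a random, same-sized subset of majority rows."""
    major, minor = split_by_class(rows, minority)
    rng = random.Random(seed)
    return rng.sample(major, len(minor)) + minor  # sampling without replacement


def over_sample(rows, minority, seed=0):
    """Keep all rows and add minority rows, drawn with replacement, until balanced."""
    major, minor = split_by_class(rows, minority)
    rng = random.Random(seed)
    extra = rng.choices(minor, k=len(major) - len(minor))
    return major + minor + extra


# Hypothetical data: 10 fraud rows and 990 non-fraud rows.
rows = [(i, "fraud") for i in range(10)] + [(i, "ok") for i in range(10, 1000)]

down = down_sample(rows, "fraud")
up = over_sample(rows, "fraud")

print(Counter(label for _, label in down))  # 10 of each class
print(Counter(label for _, label in up))    # 990 of each class
```

Down-sampling discards 980 of the 990 majority rows here, which is the loss of information noted above. Over-sampling repeats each fraud row many times, which is one way over-fitting can arise. In practice only the training data would be resampled, so that evaluation still reflects the original class distribution.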
An important thing to know when we have an unequal distribution of negative and positive cases is the misclassification cost: what is the cost of incorrectly classifying a positive case as a negative (a false negative), and what is the cost of incorrectly classifying a negative case as a positive (a false positive)? Often these will be different. In fraud detection, for example, it is important to identify as many cases of fraud as possible, and we might be willing to accept some false positives in order to do so. False negatives thus have a very high cost. If the misclassification cost is the same for false positives and false negatives, then a reasonable strategy is simply to minimise the number of misclassified examples, regardless of whether they belong to the majority class or the minority class.
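The effect of unequal costs can be illustrated with a small calculation. The two models, their error counts, and the cost values below are all hypothetical, chosen only to show how the preferred model changes with the cost assumption.

```python
def total_cost(false_negatives, false_positives, cost_fn, cost_fp):
    """Total misclassification cost for given error counts and per-error costs."""
    return false_negatives * cost_fn + false_positives * cost_fp


# Hypothetical results on 1,000 cases containing 10 frauds.
# Model A always predicts "no fraud": it misses all 10 frauds, no false alarms.
model_a = {"false_negatives": 10, "false_positives": 0}
# Model B catches 9 of the 10 frauds but raises 50 false alarms.
model_b = {"false_negatives": 1, "false_positives": 50}

# Equal costs: minimising cost is the same as minimising the number of errors.
equal_a = total_cost(**model_a, cost_fn=1, cost_fp=1)   # 10
equal_b = total_cost(**model_b, cost_fn=1, cost_fp=1)   # 51

# Unequal costs: a missed fraud is assumed to cost 100 times a false alarm.
unequal_a = total_cost(**model_a, cost_fn=100, cost_fp=1)  # 1000
unequal_b = total_cost(**model_b, cost_fn=100, cost_fp=1)  # 150

print(equal_a, equal_b)      # 10 51
print(unequal_a, unequal_b)  # 1000 150
```

With equal costs, model A looks better because it makes fewer errors. Once false negatives are assumed to be far more expensive, model B is clearly preferable despite making more errors overall.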
We illustrate two approaches to dealing with unbalanced datasets elsewhere in the book. There, one approach is to modify the weights, and the second is to down-sample to balance up the classes. Both have been found to be very effective when coupled with random forests.
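As a rough illustration of the weighting idea, the sketch below computes class weights that are inversely proportional to class frequency. This is a common "balanced" heuristic and is an assumption here, not necessarily the weighting used in the book's own examples. The function name is hypothetical, and passing the weights to a particular random forest implementation is left out.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n / (k * count), where n is the number of cases,
    k the number of classes, and count the number of cases in that class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical labels: 10 fraud cases and 990 non-fraud cases.
labels = ["fraud"] * 10 + ["ok"] * 990
weights = balanced_class_weights(labels)

print(weights["fraud"])  # 50.0
print(weights["ok"])     # roughly 0.505
```

Under these weights each class contributes the same total weight (10 × 50 for fraud, 990 × about 0.505 for the rest), so errors on the rare class count for as much in aggregate as errors on the majority.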
Copyright © 2004-2010 Togaware Pty Ltd