DATA MINING
Desktop Survival Guide by Graham Williams 


Understanding rare cases and probabilities:
(Paraphrased from a New York Times article on cancer tests, "Mammogram Math", December 2009.)
Suppose that we have a predictive model that identifies fraudsters with 95 percent accuracy, in the following sense: if someone is a fraudster, then the model will predict them as a fraudster 95 percent of the time. If a person is known not to be a fraudster, then the model will incorrectly identify them as a fraudster 1 percent of the time. Overall, we might know that 0.5 percent of people (i.e., one out of 200) are actually fraudsters.
We might apply the predictive model to the population. If the model predicts someone as a fraudster, does this mean they really are likely to be a fraudster? No.
Suppose 100,000 people are put through the predictive model. We would expect that on average 500 of these 100,000 people (0.5 percent of 100,000) will be fraudsters. Since 95 percent of these 500 people will be scored as fraudsters by the predictive model, on average 475 people (.95 x 500) will be actual fraudsters who are correctly identified.
Of the remaining 99,500 people who are not fraudsters, 1 percent will be predicted as fraudsters, resulting in 995 false positives (.01 x 99,500 = 995). Thus a total of 1,470 people will be identified as fraudsters (995 + 475 = 1,470). Most of them (995) will be false positives. The probability of being a fraudster, given that the model predicted you to be a fraudster, is only 475/1,470, or about 32 percent!
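The arithmetic above can be checked with a short sketch. This is only an illustration in Python: the function name expected_counts and its parameter names are assumptions made here, not anything from the book, and exact fractions are used to avoid floating-point rounding in the counts.

```python
from fractions import Fraction


def expected_counts(population, fraud_rate, hit_rate, false_alarm_rate):
    """Expected outcome counts when a model scores a whole population.

    fraud_rate:       share of the population who are fraudsters
    hit_rate:         share of fraudsters the model predicts as fraudsters
    false_alarm_rate: share of non-fraudsters the model predicts as fraudsters
    (All names are hypothetical, chosen for this illustration.)
    """
    fraudsters = population * fraud_rate
    non_fraudsters = population - fraudsters
    true_positives = fraudsters * hit_rate
    false_positives = non_fraudsters * false_alarm_rate
    flagged = true_positives + false_positives
    return {
        "fraudsters": fraudsters,
        "non_fraudsters": non_fraudsters,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "flagged": flagged,
        # Probability of being a fraudster given a fraudster prediction.
        "p_fraud_given_flagged": true_positives / flagged,
    }


# The figures used in the text.
counts = expected_counts(
    population=100_000,
    fraud_rate=Fraction(5, 1000),       # 0.5 percent
    hit_rate=Fraction(95, 100),         # 95 percent
    false_alarm_rate=Fraction(1, 100),  # 1 percent
)

for name, value in counts.items():
    print(f"{name}: {float(value):,.4g}")
```

With the figures from the text, the final entry is 475/1,470, which is the roughly 32 percent quoted above.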
We contrast this with the probability that you are predicted as a fraudster, given that you are actually a fraudster, which by assumption is 95 percent.
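The two conditional probabilities are linked by Bayes' rule, which gives the same answer without choosing a population size. The following is a minimal sketch in Python, assuming the three rates stated earlier; the function name p_fraud_given_predicted and its parameter names are hypothetical.

```python
from fractions import Fraction


def p_fraud_given_predicted(p_fraud, p_predicted_given_fraud,
                            p_predicted_given_not_fraud):
    """Bayes' rule: P(fraudster | predicted as a fraudster)."""
    # Total probability of being predicted as a fraudster,
    # summed over fraudsters and non-fraudsters.
    p_predicted = (p_predicted_given_fraud * p_fraud
                   + p_predicted_given_not_fraud * (1 - p_fraud))
    return p_predicted_given_fraud * p_fraud / p_predicted


p_predicted_given_fraud = Fraction(95, 100)  # 95 percent, by assumption
posterior = p_fraud_given_predicted(
    p_fraud=Fraction(1, 200),                # one person in 200
    p_predicted_given_fraud=p_predicted_given_fraud,
    p_predicted_given_not_fraud=Fraction(1, 100),
)

print(f"P(predicted | fraudster) = {float(p_predicted_given_fraud):.1%}")
print(f"P(fraudster | predicted) = {float(posterior):.1%}")
```

The gap between the two printed values comes from the small share of fraudsters in the population: the 1 percent of false alarms is drawn from a group 199 times larger than the group of fraudsters.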
Intuition about probabilities can often be out of step with reality.
Copyright © 2004-2010 Togaware Pty Ltd. Support further development through the purchase of the PDF version of the book.