DATA MINING
Desktop Survival Guide
by Graham Williams

Example

Suppose we have a database of patients for whom we have, in the past, identified suitable types of contact lenses. Four categoric variables might describe each patient: Age (having possible values young, pre-presbyopic, and presbyopic), type of Prescription (myope and hypermetrope), whether the patient is Astigmatic (boolean), and TearRate (reduced or normal). The patients are already classified into one of three classes indicating the type of contact lens to fit: hard, soft, none.

Sample data is presented in Table 42.1.

**Table 42.1:** Contact lens training data.

We can make use of this historic data to talk about the probabilities of requiring soft, hard, or no lenses, according to a patient's Age, Prescription, Astigmatic condition and TearRate. Having a probabilistic model we could then make predictions using the model about the suitability of particular lens types for new clients. Thus, if we know historically that the probability of a patient requiring a soft contact lens, given that they were young and had a normal rate of tear production, was 0.9 (i.e., 90%) then we would supply soft lens to most such patients in the future. Symbolically we write this probability as:

$\begin{displaymath} P(soft\vert young,normal) = 0.9 \end{displaymath}$

Generally the real problem is to determine something like:

$\begin{displaymath} P(soft\vert young,myope,no,normal) \end{displaymath}$

To do this we would need to have a sufficient number of examples of

in the database. For databases with many variables, and even for very large databases, it is not likely that we will have sufficient (or even any) examples of all possible combinations of all variable values. This is where naïve Bayes helps.

Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010