Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Missing

Missing values present challenges to data mining and modelling in general. There can be many reasons for missing values: the data may be hard to collect and so not always available (e.g., results of an expensive medical test), or it may simply not be recorded because it is in fact 0 (e.g., spouse income for someone without a spouse). Knowing why the data is missing is important in deciding how to deal with the missing value.

The Show Missing check button of the Summary option of the Explore tab provides a summary of missing values in our dataset. A summary of missing data is displayed in Figure 6.1. Such information is useful in understanding structure in the missing data, and perhaps coming to an understanding of why the data is missing.

Figure 6.1: Missing value summary for a modified version of the audit dataset.

The missing value summary table is presented with the variables from the dataset listed along the top. Each row corresponds to a pattern of missing values, and each pattern is generally shared by a collection of entities. Within a row, a 1 indicates that a value is present, whereas a 0 indicates that a value is missing.

The left hand column records the number of entities that exhibit that pattern, so that the sum of this column (which is not actually shown in the output) will equal the number of entities in our dataset. The right hand column records the number of variables with missing values for each pattern. So the first row, corresponding to no missing values for any variables, has a 0 in this column.

The final row records the number of missing values over the whole dataset for each of the variables, with the total number of missing values recorded at the bottom right.

The rows and columns are sorted in ascending order according to the amount of missing data.
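To make the layout of the table concrete, the following is a minimal sketch in Python of how such a pattern summary could be computed. It is not the code Rattle itself runs: the function name `missing_pattern_table`, the toy records, and the use of `None` to mark a missing value are all assumptions made for illustration.

```python
from collections import Counter


def missing_pattern_table(records, variables):
    """Summarise missing-value patterns; None marks a missing value (an assumption)."""
    # Number of missing values per variable: used for column order and the final row.
    missing_per_var = {
        v: sum(1 for r in records if r.get(v) is None) for v in variables
    }
    # Columns in ascending order of missingness (ties keep their original order).
    columns = sorted(variables, key=lambda v: missing_per_var[v])
    # Each record becomes a pattern: 1 = value present, 0 = value missing.
    patterns = Counter(
        tuple(0 if r.get(v) is None else 1 for v in columns) for r in records
    )
    # Rows are (entity count, pattern, number of missing variables),
    # sorted so that patterns with the fewest missing variables come first.
    rows = sorted(
        ((count, pat, pat.count(0)) for pat, count in patterns.items()),
        key=lambda row: (row[2], -row[0]),
    )
    totals = [missing_per_var[v] for v in columns]
    return columns, rows, totals, sum(totals)


# Hypothetical toy data, not the audit dataset.
records = [
    {"Age": 34, "Marital": "Married", "Income": 52000},
    {"Age": None, "Marital": "Single", "Income": 31000},
    {"Age": 51, "Marital": None, "Income": None},
    {"Age": 29, "Marital": "Single", "Income": 48000},
    {"Age": None, "Marital": None, "Income": None},
    {"Age": 45, "Marital": "Divorced", "Income": None},
]

columns, rows, totals, grand_total = missing_pattern_table(
    records, ["Age", "Marital", "Income"]
)

print(" ", *columns)
for count, pattern, n_missing in rows:
    print(count, *pattern, n_missing)
print(" ", *totals, grand_total)
```

Two checks follow from the description of the table: the entity counts in the left hand column sum to the number of entities, and the count of each pattern multiplied by its number of missing variables sums to the grand total at the bottom right.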

Generally, the first row records the number of entities that have no missing values, as is the case in Figure 6.1, where 1575 entities are complete.

The second row corresponds to a pattern of missing values for the variable Age. There are 39 entities that have just Age missing (and there are 42 entities that have Age missing, overall). This particular row's pattern has just a single variable missing, as indicated by the 1 in the final column.
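The difference between the two counts (39 entities with only Age missing, 42 with Age missing overall) means that 3 entities have Age missing together with at least one other variable. The distinction can be sketched as follows, again using hypothetical toy records with `None` standing in for a missing value rather than the audit data.

```python
def count_missing(records, variable):
    """Return (entities missing only `variable`, entities missing `variable` at all)."""
    # Entities where the variable is missing, regardless of the other variables.
    overall = sum(1 for r in records if r[variable] is None)
    # Entities where the variable is missing and every other variable is present.
    only = sum(
        1
        for r in records
        if r[variable] is None
        and all(value is not None for name, value in r.items() if name != variable)
    )
    return only, overall


# Hypothetical toy data, not the audit dataset.
records = [
    {"Age": None, "Marital": "Single"},   # only Age missing
    {"Age": None, "Marital": None},       # Age missing along with Marital
    {"Age": 40, "Marital": None},         # Age present
    {"Age": 23, "Marital": "Married"},    # complete
]

only_age, age_overall = count_missing(records, "Age")
print(only_age, age_overall)
```

The first count corresponds to a single row of the pattern table, while the second corresponds to the entry for that variable in the final row.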

The final row indicates that there are, for example, 37 missing values for the variable Marital, and that there are 560 missing values altogether in this dataset.

See Section [*] for dealing with missing values through imputation.

Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010