Desktop Survival Guide
by Graham Williams
Often a simple, if not always satisfactory, choice for missing values that are known not to be zero is to use some "central" value of the variable. This is often the mean, median, or mode, and thus usually has limited impact on the centre of the distribution. We might choose the mean, for example, if the variable is otherwise roughly normally distributed (and in particular shows no skewness). If the data does exhibit some skewness (e.g., there are a small number of very large values), then the median might be a better choice, since it is less affected by extreme values.
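As a rough illustration of the choice between mean and median, the following Python sketch fills missing entries (represented here as None) with a central value of the observed data. The function name impute_central and the income figures are hypothetical, chosen only to show how a single very large value pulls the mean but not the median.

```python
import statistics


def impute_central(values, method="median"):
    """Return a copy of values with None replaced by the mean or median
    of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("method must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]


# Hypothetical, right-skewed data: one very large value and two missing ones.
incomes = [30, 32, 35, None, 38, 40, 500, None]

mean_filled = impute_central(incomes, "mean")      # missing entries become 112.5
median_filled = impute_central(incomes, "median")  # missing entries become 36.5
```

With these made-up numbers the mean-based fill (112.5) is larger than every observed value but one, whereas the median-based fill (36.5) sits among the typical values.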
For categoric variables there is, of course, neither a mean nor a median, and so in such cases we might use the mode (the most frequent value) as the default to fill in for the missing values. The mode can also be used for numeric variables.
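A minimal Python sketch of mode imputation might look as follows. The function name impute_mode and the sample values are hypothetical, and missing entries are again assumed to be represented as None. If several values tie for most frequent, this sketch simply takes the one encountered first, which is only one of several reasonable conventions.

```python
from collections import Counter


def impute_mode(values):
    """Return a copy of values with None replaced by the most frequent
    observed value (ties go to the value seen first)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical categoric variable with two missing entries.
marital = ["Married", "Single", None, "Married", "Divorced", None, "Married"]
filled = impute_mode(marital)  # both missing entries become "Married"
```

The same function works unchanged on numeric data, for example impute_mode([1, 2, 2, None]).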
Whilst this approach is simple and computationally quick, it is a very blunt form of imputation and can lead to poor performance from the resulting models.
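One way to see why the approach is blunt is to compare the spread of a variable before and after imputation. The Python sketch below uses made-up ages and assumes, purely for illustration, that five further values were missing and have been filled with the median. The centre of the data is preserved, but every imputed record lands on the same value, so the standard deviation shrinks.

```python
import statistics

# Hypothetical observed ages; assume five more records had Age missing.
ages = [23, 35, 41, 52, 67]
median_age = statistics.median(ages)

# Fill the five missing records with the median of the observed ages.
imputed = ages + [median_age] * 5

sd_before = statistics.stdev(ages)
sd_after = statistics.stdev(imputed)

# The median is unchanged, but the spread is noticeably smaller.
print(median_age, statistics.median(imputed))
print(round(sd_before, 2), round(sd_after, 2))
```

A model built on the imputed data sees a variable that appears less variable than it really is, which is one plausible reason for the poorer performance mentioned above.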
We can see the effect of the imputation of missing values on the variable Age using the mode in Figure
Refer to Data Mining With R, from page 42, for more details.
Copyright © 2004-2010 Togaware Pty Ltd