Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Cleaning the Survey Dataset

We summarise a number of cleaning operations that might be performed on the survey dataset.

Remove entities (rows) with missing (NA) values:

> load("survey.RData")
> survey <- na.omit(survey)
> dim(survey)
[1] 30162    15
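
Before dropping rows it can help to see how many would be lost and which columns contain the missing values. The following is a minimal sketch using only base R, run on a small made-up data frame; the name `toy` and its values are hypothetical and merely stand in for the survey dataset.

```r
# Hypothetical stand-in for the survey dataset (values invented for illustration).
toy <- data.frame(Age            = c(39, 50, NA, 53, 28),
                  Workclass      = factor(c("Private", NA, "Private",
                                            "State-gov", "Private")),
                  Hours.Per.Week = c(40, 13, 40, NA, 40))

# Number of missing values in each column.
na.per.col <- colSums(is.na(toy))

# Logical vector: TRUE for rows that contain no missing values.
complete <- complete.cases(toy)

# na.omit() keeps exactly the complete rows.
cleaned <- na.omit(toy)

print(na.per.col)
print(sum(!complete))   # number of rows that na.omit() drops
print(dim(cleaned))
```

If most of the missing values turn out to sit in one or two columns, dropping those columns may lose less data than dropping every incomplete row.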

Remove non-numeric data (the factor columns), starting again from the original dataset:

> load("survey.RData")
> # Indices of the factor columns, reversed so that deleting one does not shift the others.
> rmcols <- rev(which(sapply(survey, is.factor)))
> for (i in rmcols) survey[[i]] <- NULL
> dim(survey)
[1] 32561     6
> colnames(survey)
[1] "Age"            "fnlwgt"         "Education.Num"  "Capital.Gain"  
[5] "Capital.Loss"   "Hours.Per.Week"
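
The same result can be obtained without a loop by indexing the data frame with a logical vector. This is a minimal sketch, again on a made-up data frame (`toy` is hypothetical and not part of the survey data):

```r
# Hypothetical stand-in for the survey dataset (values invented for illustration).
toy <- data.frame(Age            = c(39, 50, 38),
                  Workclass      = factor(c("Private", "Self-emp", "Private")),
                  Hours.Per.Week = c(40, 13, 40),
                  Sex            = factor(c("Male", "Male", "Female")))

# TRUE for each column that is a factor.
is.fac <- sapply(toy, is.factor)

# Keep the non-factor columns; list-style indexing always returns a data frame,
# even when only one column remains.
numeric.only <- toy[!is.fac]

print(dim(numeric.only))
print(colnames(numeric.only))
```

Because no column is deleted in place, there is no need to reverse the column indices as in the loop version.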



Copyright © 2004-2010 Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010