DATA MINING
Desktop Survival Guide
by Graham Williams

Variable Selection

Variable selection (also known as feature selection) identifies a good subset of the available variables on which to build a model. In many cases, using a good subset of the variables leads to better models, expressed in the simplest of forms. This may include removing redundant input variables. Indeed, the principle of Occam's Razor indicates, and the need to communicate and understand models requires, that it is best to choose the simplest model from among the models that explain the data. Selecting variables also keeps unnecessary variables from confusing the modelling process with noise, and reduces the likelihood of having input variables that are dependent on one another.

Variable selection is important in classification, where it is the process of selecting key features from the collection of available variables (sometimes thousands of them). In such cases, most of the variables are unlikely to be useful for classification purposes. Decision tree algorithms perform automatic feature selection and so are relatively insensitive to prior variable selection. Nearest neighbour classifiers, however, do not perform feature selection: all variables, whether they are relevant or not, are used in building the classifier.

In decision tree algorithms, variables are selected at each step based on a selection criterion, and the number of variables used in the final model is determined by pruning the tree using cross-validation.
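
The selection and pruning just described can be sketched with the rpart package, which is distributed with R as a recommended package. This is only an illustration under stated assumptions: the built-in iris data, the control settings, and the rule of pruning to the lowest cross-validated error are choices made for this sketch, not something prescribed by the text.

```r
library(rpart)  # recommended package distributed with R

set.seed(42)  # the cross-validation folds are chosen at random

# Grow a deliberately large tree on the built-in iris data (assumed example).
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# The complexity table records the cross-validated error (xerror)
# for each candidate tree size.
cpt <- fit$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error.
pruned <- prune(fit, cp = best_cp)

# Variables that actually appear in the splits of the pruned tree.
used_vars <- setdiff(unique(as.character(pruned$frame$var)), "<leaf>")
print(used_vars)
```

The variables listed in used_vars are those the pruned tree retained; any input that does not appear was, in effect, deselected by the algorithm.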

We present an example here of removing columns from data in R.
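
A minimal sketch of removing columns from a data frame using base R follows. The data frame ds and its column names are hypothetical, invented purely for illustration.

```r
# Hypothetical example data; the column names are invented for illustration.
ds <- data.frame(
  id      = 1:5,
  age     = c(23, 35, 41, 29, 52),
  income  = c(32000, 54000, 61000, 45000, 72000),
  comment = c("a", "b", "c", "d", "e"),
  target  = factor(c("no", "yes", "yes", "no", "yes"))
)

# Remove columns by name: keep every column whose name is not to be ignored.
ignore <- c("id", "comment")
ds1 <- ds[, !(names(ds) %in% ignore), drop = FALSE]

# Remove a column by position using a negative index (here the first column).
ds2 <- ds[, -1]

# Remove a single column by assigning NULL to it.
ds3 <- ds
ds3$comment <- NULL

print(names(ds1))  # "age" "income" "target"
```

Removing by name is usually the safer choice, since a positional index silently refers to a different column if the layout of the data changes.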

The dprep package in R provides support for variable selection, including finco and relief selection methods.

Associated with variable selection is variable weighting. The aim here is essentially to score variables according to their relevance or predictive power with respect to an output variable. Algorithms for variable weighting come in two flavours: those that use feedback from the modelling and those that don't. So-called wrapper methods score variables by building models on subsets of the variables and rating the variables according to how well those models perform. Filter algorithms, on the other hand, explore relationships within the data itself, without building the model. The wrapper-based approaches tend to produce better results but are computationally more expensive.
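
The contrast between the two flavours can be sketched in base R. This is only an illustrative sketch, not the method of any particular package. It assumes the built-in mtcars data with mpg as the output variable, absolute correlation as the filter score, and a greedy forward search scored by the cross-validated error of a linear model as the wrapper. The helper cv_rmse is hypothetical and defined here.

```r
set.seed(1)
ds <- mtcars                      # built-in data, assumed for illustration
target <- "mpg"
inputs <- setdiff(names(ds), target)

# Filter: score each input on its own, without building the model,
# here by absolute correlation with the output variable.
filter_score <- sapply(inputs, function(v) abs(cor(ds[[v]], ds[[target]])))
filter_rank <- sort(filter_score, decreasing = TRUE)

# Wrapper: score a subset of inputs by how well a model built on it performs,
# here the 5-fold cross-validated RMSE of a linear model (hypothetical helper).
folds <- sample(rep(1:5, length.out = nrow(ds)))
cv_rmse <- function(vars) {
  f <- reformulate(vars, response = target)
  errs <- sapply(1:5, function(k) {
    fit <- lm(f, data = ds[folds != k, ])
    pred <- predict(fit, newdata = ds[folds == k, ])
    mean((ds[folds == k, target] - pred)^2)
  })
  sqrt(mean(errs))
}

# Greedy forward search: repeatedly add the input that most reduces the
# cross-validated error, stopping when no candidate improves it.
selected <- character(0)
best <- Inf
repeat {
  candidates <- setdiff(inputs, selected)
  if (length(candidates) == 0) break
  scores <- sapply(candidates, function(v) cv_rmse(c(selected, v)))
  if (min(scores) >= best) break
  best <- min(scores)
  selected <- c(selected, names(which.min(scores)))
}

print(round(filter_rank, 3))
print(selected)
```

The filter needs one correlation per variable, whereas the wrapper refits the model five times for every candidate at every step, which shows where the extra computational cost comes from.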

Brought to you by Togaware. This page generated: Sunday, 22 August 2010