DATA MINING
Desktop Survival Guide
by Graham Williams

Variable Roles

When loading data into Rattle, certain special strings at the start of a variable name are used to identify variable roles. For example, if the variable name starts with ID then the variable is given the role of identifier. See Section 5.7.2 for details.

When building a model, each variable plays a specific role. A variable can be an input to the model or the target of the model. A variable can also be identified as a so-called risk variable (a variable which is not used for modelling as such), or it can be ignored completely. The default role for most variables is Input (i.e., an independent variable). Generally, these are the variables that will be used to predict the value of the Target (or dependent) variable.

Variables with particular names have a default role assigned to them. For example, if the variable name begins with ID then the default role is set to Identifier. The special strings that are looked for at the beginning of a variable's name are:

ID Identifier

IGNORE Ignored

IMP Imputed

RISK Risk measure

Some variables receive special treatment. If a variable name begins with IMP_ (for imputed) and the same variable name without the IMP_ prefix is also found in the dataset, then that original, unimputed variable is set to be ignored by default.
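The prefix rules above can be sketched in a few lines of Python. This is only an illustration, not Rattle's own code: the function name default_roles and the PREFIX_ROLES table are hypothetical, the matching is assumed to be case-sensitive, and the IMP_ rule is read as ignoring the original (unprefixed) variable.

```python
# Hypothetical prefix table mirroring the list above (not Rattle's actual code).
PREFIX_ROLES = {
    "ID": "Identifier",
    "IGNORE": "Ignored",
    "IMP": "Imputed",
    "RISK": "Risk measure",
}


def default_roles(names):
    """Assign a default role to each variable name based on its prefix."""
    roles = {}
    for name in names:
        role = "Input"  # the default role for most variables
        for prefix, prefix_role in PREFIX_ROLES.items():
            # Assumption: the prefix match is case-sensitive.
            if name.startswith(prefix):
                role = prefix_role
                break
        roles[name] = role

    # If IMP_x is present alongside x, ignore the original x by default.
    for name in names:
        if name.startswith("IMP_") and name[4:] in roles:
            roles[name[4:]] = "Ignored"
    return roles


roles = default_roles(
    ["ID", "Age", "IMP_Age", "IGNORE_Notes", "RISK_Adjustment", "Adjusted"]
)
for name, role in roles.items():
    print(f"{name}: {role}")
```

In this sketch Age ends up Ignored because IMP_Age is present, while Adjusted keeps the default Input role until a target is chosen.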

Rattle also uses simple heuristics to guess at a Target role for one of the variables. In this example, Adjusted has been selected as the target variable, which is correct. The heuristic examines the number of distinct values that a variable has: if it has fewer than 5, the variable is considered a candidate target. The candidate list is ordered starting with the last variable (often the last variable is the target), and then proceeds from the first variable onwards, stopping at the first variable that meets the conditions of looking like a target.
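The search order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Rattle's implementation: the function guess_target is hypothetical, the dataset is represented as a plain dictionary of column name to values, and no variables are excluded from consideration because of their other roles.

```python
def guess_target(data, max_distinct=5):
    """Return the first variable that looks like a target, or None.

    Candidates are checked starting with the last variable, then from the
    first variable onwards. A candidate qualifies if it has fewer than
    max_distinct distinct values.
    """
    names = list(data)  # dictionaries preserve column order in Python 3.7+
    if not names:
        return None
    candidates = [names[-1]] + names[:-1]
    for name in candidates:
        if len(set(data[name])) < max_distinct:
            return name
    return None


# Made-up values for illustration only.
target_last = {
    "ID": [1, 2, 3, 4, 5, 6],
    "Age": [38, 35, 32, 45, 60, 74],
    "Adjusted": [0, 0, 1, 0, 1, 0],
}
target_middle = {
    "ID": [1, 2, 3, 4, 5, 6],
    "Adjusted": [0, 0, 1, 0, 1, 0],
    "Age": [38, 35, 32, 45, 60, 74],
}

print(guess_target(target_last))
print(guess_target(target_middle))
```

In both layouts the sketch settles on Adjusted: in the first it is the last variable, and in the second the last variable (Age) has too many distinct values, so the search restarts from the first variable.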

Any numeric variable that has a unique value for each observation is automatically identified as an Ident. Any number of variables can be tagged as being an Ident. All Ident variables are ignored when modelling, but they are used after scoring a dataset: they are written to the resulting score file so that the cases that are scored can be identified.
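The identifier rule can be illustrated with a short sketch. It assumes the dataset is a dictionary of column name to values and that a variable counts as numeric when every value is a number; the function find_identifiers is hypothetical and not part of Rattle.

```python
from numbers import Number


def find_identifiers(data):
    """Return names of numeric variables with a unique value per observation."""
    idents = []
    for name, values in data.items():
        # Assumption: a column is numeric if all of its values are numbers
        # (booleans are excluded even though Python treats them as integers).
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if values and is_numeric and len(set(values)) == len(values):
            idents.append(name)
    return idents


# Made-up values for illustration only.
sample = {
    "ID": [1001, 1002, 1003, 1004],
    "Age": [38, 35, 38, 45],
    "Name": ["a", "b", "c", "d"],
}
idents = find_identifiers(sample)
print(idents)
```

Note that under this rule a continuous numeric variable whose values happen to be distinct for every observation would also be flagged, so the automatic choice is worth reviewing.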

At any one time a target is treated as either categoric or numeric. If a numeric variable is chosen as the target and it has 10 or fewer unique values, then RStat will automatically treat it as a categoric variable, provided the Auto radio button (of the Data tab) is chosen. For modelling purposes, the consequence is that only classification model builders (Regression/Multinomial in RStat) will be available. To have regression model builders available (Regression/Linear, Regression/Generalised, and Regression/Poisson in RStat), you need to override the heuristic by selecting the Numeric radio button of the Data tab.
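The Auto rule can be sketched as a small function. This is an illustration only: target_type and its mode argument are hypothetical stand-ins for the Auto and Numeric radio buttons, and the handling of a non-numeric target (always categoric under Auto) is an assumption.

```python
def target_type(values, mode="auto", max_levels=10):
    """Decide whether a target is treated as 'categoric' or 'numeric'."""
    if mode == "numeric":
        return "numeric"  # explicit override of the heuristic
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode!r}")

    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not is_numeric:
        return "categoric"  # assumption: non-numeric targets are categoric
    # A numeric target with 10 or fewer unique values is treated as categoric.
    return "categoric" if len(set(values)) <= max_levels else "numeric"


binary_auto = target_type([0, 1, 0, 1, 1])
binary_forced = target_type([0, 1, 0, 1, 1], mode="numeric")
wide_auto = target_type(list(range(11)))
print(binary_auto, binary_forced, wide_auto)
```

A 0/1 target is therefore treated as categoric under Auto, and only the explicit numeric override makes it available to regression model builders.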

Sometimes not all of the variables in your dataset should be used, or some may not be appropriate for a particular modelling task. For example, the random forest model builder does not handle categoric variables with more than 32 levels, so you may choose to Ignore the Accounts variable. You can change the role of any variable to suit your needs, although you can only have one Target and one Risk.
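A quick check for variables that exceed the 32-level limit might look like the sketch below. It assumes the dataset is a dictionary of column name to values and treats a column as categoric when all of its values are strings; the function too_many_levels and the sample values are hypothetical.

```python
def too_many_levels(data, max_levels=32):
    """Return names of categoric variables with more than max_levels levels."""
    flagged = []
    for name, values in data.items():
        # Assumption: a column of strings is a categoric variable.
        is_categoric = bool(values) and all(isinstance(v, str) for v in values)
        if is_categoric and len(set(values)) > max_levels:
            flagged.append(name)
    return flagged


# Made-up values for illustration only: 33 distinct Accounts levels.
sample_levels = {
    "Accounts": [f"Country{i}" for i in range(33)],
    "Gender": ["Female", "Male"] * 16 + ["Female"],
    "Age": list(range(20, 53)),
}
flagged = too_many_levels(sample_levels)
print(flagged)
```

Variables returned by such a check are candidates for the Ignore role before building a random forest.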

For an example of the use of the Risk variable, see Section 22.3.

Brought to you by Togaware. This page generated: Sunday, 22 August 2010