DATA MINING
Desktop Survival Guide by Graham Williams
The CSV option of the Data tab is an easy way to load data from many different sources into Rattle. CSV stands for "comma separated value" and is a standard file format often used to exchange data between applications. CSV files can be exported from spreadsheets and databases, including OpenOffice Calc, Gnumeric, MS/Excel, SAS/Enterprise Miner, Teradata's Warehouse, and many other applications. This is a good option for importing your data into Rattle, although it does lose metadata (that is, information about the data types of the dataset). Without this metadata R sometimes guesses the wrong data type for a particular column, but this is rarely fatal.
To load a dataset from a CSV file, click the Filename button (Figure 3.2) to display a file chooser dialogue (Figure 3.3).
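As a minimal sketch of how a type can be guessed wrongly, consider a small made-up CSV (the Postcode and Score columns are hypothetical and are not part of the audit dataset). Read with base R's read.csv, the postcodes are taken to be integers and lose their leading zeros; declaring the type through colClasses avoids this.

```r
# Hypothetical three-row CSV; the column names are invented for illustration.
csv_text <- "Postcode,Score\n02134,7\n00501,9\n90210,8"

# Default import: R infers both columns as integer, so leading zeros are lost.
guessed <- read.csv(text = csv_text)

# Explicit import: colClasses declares the intended type of each column.
declared <- read.csv(text = csv_text,
                     colClasses = c(Postcode = "character", Score = "integer"))

str(guessed)
str(declared)
```

This shows the behaviour of read.csv only; it is an assumption, not something stated here, that Rattle's CSV option behaves the same way for such columns.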
We use the CSV file chooser dialogue to browse our file system to find the file we wish to load into Rattle. By default, only files that have a .csv extension will be listed (together with folders).
The pull down menu near the bottom right of the file chooser dialogue (above the Open button) allows us to select which files are listed. We can list only files that end with a .csv or a .txt or else list all files.
The window on the left of the popup allows us to browse to the different file systems available to us, while the series of buttons along the top allow us to navigate through a series of folders on a single file system. Once we have navigated to the folder containing the CSV file we wish to load we can select this file in the main panel of the file chooser dialogue. Clicking the Open button tells Rattle that this is the file we are interested in (without yet actually loading it).
The contents of the text window in Rattle (Figure 3.4) will change to provide a reminder as to what we need to do next. We have not yet told Rattle to actually load the data--we have just identified where the data is. So we now click the Execute button (or press the F5 key) to load the dataset from the file. Since Rattle is a simple graphical interface sitting on top of R itself, the message in the textview also reminds us that some errors encountered by R on loading the data (and in fact during any operation performed by Rattle) may be displayed in the R Console.
A sample CSV file is provided by Rattle and is called
audit.csv. It will have been installed when Rattle was
installed and we can find its actual location with the R command
shown here:
> system.file("csv", "audit.csv", package = "rattle")
[1] "/usr/local/lib/R/site-library/rattle/csv/audit.csv"
> file.show(system.file("csv", "audit.csv", package = "rattle"))
The simplest way to load this file into Rattle is to leave the CSV filename entry empty and click the Execute button. You will be asked whether you would like to load the audit dataset--choose Yes.
The top of the audit file will be similar to the following (perhaps with
quotes around values, although they are not necessary, and perhaps
with some different values):
ID,Age,Employment,Education,Marital,Occupation,Income,Gender,...
1004641,38,Private,College,Unmarried,Service,81838,Female,...
1010229,35,Private,Associate,Absent,Transport,72099,Male,...
1024587,32,Private,HSgrad,Divorced,Clerical,154676.74,Male,...
1038288,45,Private,Bachelor,Married,Repair,27743.82,Male,...
1044221,60,Private,College,Married,Executive,7568.23,Male,...
...
As we can see, a CSV file is actually a normal text file that we could load into any text editor to review its contents. A CSV file usually begins with a header row, listing the names of the variables, each separated by a comma. If any name (or indeed, any value in the file) contains an embedded comma, then that name (or value) will be surrounded by quote marks. The remainder of the file after the header is expected to consist of rows of data that record information about the entities, with fields separated by commas recording the values of the variables for this entity.
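The role of the quote marks can be sketched with a tiny made-up CSV (the names and values below are invented for illustration). Each quoted field is read as a single value even though it contains a comma.

```r
# Two data rows; one name and one occupation contain an embedded comma.
csv_text <- 'Name,Occupation\n"Smith, Jane",Executive\nLee,"Repair, Electrical"'

# read.csv treats double quotes as the quoting character by default.
parsed <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(parsed)    # still two columns, despite the extra commas
parsed$Name
```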
You can choose the field delimiter through the Separator entry. A
comma is the default. To load a .txt file which uses a tab as the
field separator, enter \\t (that is, two backslashes followed by a t)
as the separator. You can also leave the separator empty and any
white space will be used as the separator.
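The same choices can be sketched directly in R with read.table. Note that in R code itself a tab is written "\t" with a single backslash, and an empty sep means any white space. The small inline datasets are made up, reusing a few values from the audit sample.

```r
# Tab-separated and white-space-separated versions of the same small table.
tab_text   <- "ID\tAge\tGender\n1004641\t38\tFemale\n1010229\t35\tMale"
space_text <- "ID Age   Gender\n1004641 38 Female\n1010229   35 Male"

# sep = "\t" splits fields on tabs only.
by_tab <- read.table(text = tab_text, header = TRUE, sep = "\t")

# sep = "" splits fields on any run of white space.
by_space <- read.table(text = space_text, header = TRUE, sep = "")

str(by_tab)
str(by_space)
```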
Any field that is empty (i.e., no value between a pair of commas) or that has the value "NA", "." or "?" is treated as a missing value, which R represents as the special value NA. Support for the "." convention allows the importation of CSV data generated by SAS, whilst "?" is common following its use in some of the early machine learning applications like C4.5.
The contents of the textview of the Data tab have now changed again, as we see in Figure 3.5. The panel contains a brief summary of the dataset. From the summary we see that Rattle has loaded the file we requested, showing the full path to the file. We then see that Rattle has created something called a 'data.frame'. This is a basic data type in R used to store a table of data, where the columns (the variables) can have a mixture of data types. We then see that Rattle has loaded 2,000 entities (called observations or obs. in R), each described by 13 variables. The data type, and the first few values, for each variable are also displayed.
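A minimal sketch of these conventions in plain R uses the na.strings argument of read.csv. The inline data is made up, and the exact arguments Rattle passes to R are an assumption here.

```r
# Made-up rows using each missing-value convention: empty, ".", "?" and "NA".
csv_text <- "ID,Age,Income\n1,38,81838\n2,,.\n3,?,27743.82\n4,NA,7568.23"

# Treat all four spellings as missing when reading the file.
audit_like <- read.csv(text = csv_text, na.strings = c("NA", ".", "?", ""))

# Missing entries are NA, and the columns remain numeric.
is.na(audit_like$Age)
str(audit_like)
```

Testing with is.na, rather than comparing against a string, is the usual way to find such values.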
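A summary of this kind resembles what R's str function prints for a data frame. The sketch below builds a tiny stand-in by hand (three rows and four columns with values taken from the sample file, not the real 2,000 by 13 dataset). Whether Rattle itself calls str is an assumption.

```r
# A small hand-built data frame mixing integer, factor and numeric columns.
audit_like <- data.frame(
  ID         = c(1004641L, 1010229L, 1024587L),
  Age        = c(38L, 35L, 32L),
  Employment = factor(c("Private", "Private", "Private")),
  Income     = c(81838, 72099, 154676.74)
)

dim(audit_like)   # number of observations, then number of variables
str(audit_like)   # one line per variable: its type and first few values
```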
We can start getting an idea of the shape of the data from this simple summary. For example, the first two variables, ID and Age, are both identified as integers (int). The first few values of ID are 1004641, 1010229, 1024587, and so on. They all appear to be of the same length (i.e., the same number of digits), which together with a name like ID provides a very strong indicator that this is some kind of identifier for each entity. The first few values of Age are 38, 35, 32, 45, 60, and so on.
The next variable, Employment, illustrates how R deals with categoric variables. In R terms it is a Factor with 8 levels (i.e., 8 possible values). The levels begin with "Consultant" and "Private". The following sequence of numbers, all of which happen to be 2 for the first 10 entities of this dataset, discloses how R stores categoric data. Effectively, R maintains an integer indexed table, associating the levels with integers, so that "Consultant" is associated with 1, "Private" with 2, and so on. Then only these integers need to be stored for each entity, which is generally more efficient on memory usage. We see this more convincingly for the following categoric variables, Education, Marital, and Occupation (because they have more than just a single level displayed in this summary).
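The level table and the stored integers can be inspected directly. This sketch uses a four-value made-up vector with only two of the levels named above.

```r
# A small categoric variable with two distinct values.
employment <- factor(c("Private", "Private", "Consultant", "Private"))

# The lookup table of levels, sorted alphabetically by default.
levels(employment)

# The integer code stored for each entity: 1 = Consultant, 2 = Private.
as.integer(employment)
```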
The seventh variable, Income, has been identified as a
more general numeric rather than specific integer variable. The
display of the first few values does not actually give us any insight
as to why this might be so, but reviewing the actual CSV data as above
we see that the third entity actually has a value of 154676.74 for
Income, indicating that these values are real numbers
rather than just integers.
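A single value with a decimal part is enough to change the inferred type of the whole column, as this small sketch with read.csv suggests (the inline values are taken from the sample rows above).

```r
# Only whole numbers: the column is read as integer.
ints <- read.csv(text = "Income\n81838\n72099")

# One value with a fractional part: the whole column becomes numeric.
mixed <- read.csv(text = "Income\n81838\n72099\n154676.74")

class(ints$Income)
class(mixed$Income)
```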
We also note that Adjusted, for example, looks like it might be a categoric variable, with values 0 and 1, but R identifies it as an integer! That's fine for our purposes here. We can always change this later.
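One way such a change might be made in plain R is with the factor function. This is a sketch on a made-up vector standing in for Adjusted, not a description of how Rattle performs the conversion.

```r
# A stand-in for the Adjusted column: integer values 0 and 1.
adjusted <- c(0L, 1L, 0L, 0L, 1L)

# Convert to a categoric variable; the levels become "0" and "1".
adjusted_factor <- factor(adjusted)

levels(adjusted_factor)
table(adjusted_factor)   # counts per level
```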
Copyright © 2004-2008 Togaware Pty Ltd Support further development through the purchase of the PDF version of the book.