Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Loading Data

Data comes in many formats and from many sources, as we have already seen in Chapter 4. Rattle, by using R's extensive capabilities, provides direct access to such data. We are fortunate that R is an open system, and is therefore strong on sharing data and interoperating with other applications. R supports the importation of data in most forms.

One of the most common formats for data exchange between applications is the comma separated value file (i.e., a file with a .csv filename extension). This is a simple text file format oriented around rows and columns, using a comma to separate the columns in the file. Such files can be used to transfer data through export and import between spreadsheets, databases, weather monitoring stations, and many other applications. A variation on the idea is to separate the columns with other markers, such as a tab character, which is often associated with files having a .txt filename extension.

These simple data files (the .csv and .txt files) contain no explicit metadata. That is, nothing in the file describes the structure of the data it contains, such as the type of each column. That information often needs to be guessed at by the software reading the data.
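To make the guessing concrete, here is a minimal sketch in R. The three-column sample is made up for illustration and is not taken from weather.csv. We first let read.csv infer the column types, and then declare them ourselves through the colClasses argument.

```r
# A tiny, made-up CSV held in a string so the example is self-contained.
csv_text <- "Date,MinTemp,RainToday
2010-01-01,8.0,No
2010-01-02,14.0,Yes"

# With no metadata available, read.csv() infers each column's type
# from the values it finds.
guessed <- read.csv(text = csv_text)
print(sapply(guessed, class))

# Declaring the types through colClasses removes the guesswork.
declared <- read.csv(text = csv_text,
                     colClasses = c("Date", "numeric", "factor"))
print(sapply(declared, class))
```

How the text columns are guessed depends on the version of R in use, which is one reason to declare the types explicitly when they matter.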

Other types of data sources do provide information about the data, so that our software does not need to make guesses about what it is reading. Files with a .arff extension, for example, extend the .csv format with metadata.
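As a rough illustration of what that metadata looks like, the sketch below writes a two-row ARFF file to a temporary location and reads it back. The relation name, attributes, and values are made up for the example, and we assume the foreign package, which provides read.arff, is installed.

```r
library(foreign)  # provides read.arff()

# A tiny, made-up ARFF file. The header lines declare the name and
# type of each column; the rows after @data are comma separated.
arff_lines <- c(
  "@relation weather_sample",
  "@attribute MinTemp numeric",
  "@attribute RainToday {No,Yes}",
  "@data",
  "8.0,No",
  "14.0,Yes"
)
arff_file <- tempfile(fileext = ".arff")
writeLines(arff_lines, arff_file)

# The declared types are used directly: MinTemp is read as numeric
# and the nominal RainToday column as a factor.
weather_sample <- read.arff(arff_file)
str(weather_sample)
```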

Extracting data directly from a database often delivers the metadata along with the data itself. The Open Database Connectivity (ODBC) standard provides an open method for accessing data stored in a variety of databases, and is fully supported by R. This allows direct connection to a vast collection of data sources including MS/Excel, MS/Access, SQL Server, Oracle, IBM DB2, Teradata, MySQL, Postgres, and SQLite.

We don't need to use the Rattle interface to load a dataset. We could simply use the underlying R commands to do the same. We can directly use functions like read.csv, read.delim, read.arff, and odbcConnect. In the following subsections we will illustrate loading data through the Rattle interface, and then review the underlying R commands.
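As a small sketch of the first two of these functions, the code below writes the same made-up rows to a comma separated file and to a tab separated file, and then reads each back. The data and the temporary file names are illustrative assumptions only.

```r
# Hypothetical sample rows, used only for illustration.
rows <- data.frame(Date = c("2010-01-01", "2010-01-02"),
                   MinTemp = c(8.0, 14.0),
                   RainToday = c("No", "Yes"))

# Write the rows out in both formats to temporary files.
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
write.csv(rows, csv_file, row.names = FALSE)
write.table(rows, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

# read.csv() expects commas between columns; read.delim() expects tabs.
from_csv <- read.csv(csv_file)
from_txt <- read.delim(txt_file)

str(from_csv)
str(from_txt)
```

Both calls return an ordinary R data frame, so whatever we do next with the data does not depend on which file format it came from.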

Once a dataset source has been identified and the Data tab executed, an overview of the data will be displayed in the textview. Figure 5.1 displays the Rattle application after loading the weather.csv file, which is supplied as a sample dataset with the rattle package. We get here by starting up R, loading the rattle package, and starting up Rattle. Clicking the Execute button then brings up an offer to load the weather dataset.



> library(rattle)
> rattle()

Figure 5.1: The file, weather.csv, from the rattle package has been loaded into Rattle.

In this chapter we review the different source data formats and discuss how to load them for data mining. We include a review of options that Rattle provides for identifying how the data is to be used for data mining.



Copyright © Togaware Pty Ltd
Brought to you by Togaware. This page generated: Sunday, 22 August 2010