Data Mining Survivor: R

DATA MINING
Desktop Survival Guide
by Graham Williams

Rattle

Rattle is a graphical data mining application built upon the statistical language R. An understanding of R is not required in order to use Rattle. However, a basic introduction is provided through this book, acting as a springboard into more sophisticated data mining directly in R itself. Rattle is simple to use, quick to deploy, and allows us to rapidly work through the data processing, modelling, and evaluation phases of a data mining project. On the other hand, R provides a very powerful language for performing data mining well beyond the limitations that must be embodied in any graphical user interface and the consequently canned approaches to data mining. When we need to fine tune and further develop our data mining projects we can migrate from Rattle to R.

Rattle uses the Gnome graphical user interface and runs under various operating systems, including GNU/Linux, Macintosh OS/X, and MS/Windows. Its intuitive user interface takes us through the basic steps of data mining, as well as illustrating the actual R code that is used to achieve this. Rattle exposes all of the underlying R code so that it can be deployed directly within R as well as saved in R scripts for future reference. The R code can be loaded into R (outside of Rattle) to repeat any data mining exercise. Being able to repeat our "experiments" is an important aspect of any scientific and deployed endeavour.

While Rattle by itself may be sufficient for all of a user's needs, particularly in the context of our introduction to data mining, it also provides a stepping stone to more sophisticated processing and modelling in R itself. It is worth emphasising that the user is not limited to how Rattle does things. For sophisticated and unconstrained data mining, the experienced user will progress to interacting directly with a powerful statistical software environment.

The purpose of this Chapter is to place us in a position to effectively interact with Rattle so that we can illustrate the data mining process. Of course, we first need to have Rattle available on our computer, and since it is freely available open source software, it is available to anyone.

Appendix A takes us through the installation of Rattle. It is recommended that you install Rattle to work along with the examples presented in this book.

In this Chapter we look at the initial interface presented by Rattle and the basic process for interacting with Rattle, which implements a common data mining process. The Menus and Buttons of the interface are covered in Section , presenting the Rattle interface and its basic environment. Graphical presentations of data and models are an important component of data mining, and Rattle's mechanism for displaying and interacting with graphical plots is introduced in Section 2.10.4.

We generally start up Rattle from a running instance of R. Packaged versions of Rattle (including RStat) may provide an icon or button that hides the initiation of R and simply appears to display the Rattle application. Nonetheless, they all must do the following:

    > library(rattle)
    > rattle()
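When the same two commands are run from a script rather than typed at the prompt, it can help to guard them. The following is a minimal sketch, not part of Rattle: `start_rattle` is a hypothetical helper name, and it assumes we only want the GUI in an interactive session with the rattle package already installed.

```r
# Hypothetical helper (not part of Rattle): start the GUI only when it can run.
start_rattle <- function() {
  # A graphical interface needs an interactive session, so skip batch runs.
  if (!interactive()) {
    message("Not an interactive session; skipping the Rattle GUI.")
    return(invisible(FALSE))
  }
  # requireNamespace() returns FALSE instead of raising an error
  # when the package is not installed.
  if (!requireNamespace("rattle", quietly = TRUE)) {
    message("rattle is not installed; try install.packages(\"rattle\")")
    return(invisible(FALSE))
  }
  library(rattle)   # attach the package
  rattle()          # open the Rattle window
  invisible(TRUE)
}

started <- start_rattle()
```

The value of `started` records whether the GUI was launched, which a calling script can use to decide what to do next.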

The user interface for Rattle follows a typical data mining process. The idea is to progress through the Tabs that form the primary mechanism for operating with Rattle. We work our way from the leftmost tab (the Data tab, where we identify the source of the data to be mined) to the rightmost tab (the Log tab, where we can review all the steps of our mining and save them to a file as a script that can be rerun at a later time).
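As an illustration of rerunning a saved script, the sketch below writes a small stand-in script to a temporary file and replays it with `source()`. The file name and the three commands in it are hypothetical; a script saved from the Log tab would contain whatever commands Rattle actually ran.

```r
# Stand-in for a script saved from the Log tab. The file name and contents
# are hypothetical and are only here to make the example self-contained.
log_file <- file.path(tempdir(), "rattle_log_example.R")
writeLines(c(
  "ds <- iris[iris$Species != 'setosa', ]",
  "ds$Species <- droplevels(ds$Species)",
  "class_counts <- table(ds$Species)"
), log_file)

# Replay the saved steps in a separate environment so that nothing in the
# current workspace is overwritten.
replay <- new.env()
source(log_file, local = replay)

print(replay$class_counts)
```

Sourcing into a separate environment is a design choice for this sketch; calling `source(log_file)` on its own would instead create the objects in the current workspace.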

We introduce data mining in this book using the simple interface provided by Rattle for the common case of what we call the Two Class paradigm in Section . We limit ourselves to just two classes to ensure we develop a good understanding of the technology, but the ideas and algorithms generalise to the Multiple Class paradigm.

It is well reported that a data mining project involves a lot more time than just the time spent building models. The commonly recognised six phases are Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, and Deployment.

The typical work flow process for data mining, in the context of Rattle, can be summarised as:

Load a Dataset;
Select variables and entities for exploring and mining;
Explore the data to understand how it is distributed or spread;
Transform the data to suit our data mining purposes;
Build our Models;
Evaluate the models;
Review the Log of the data mining process.
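The first six steps above can be sketched directly in base R. This is not the code that Rattle generates: the dataset (`iris`, restricted to two classes), the choice of logistic regression via `glm()`, the 70/30 split, and all variable names are assumptions made for illustration.

```r
set.seed(42)  # make the random train/test split reproducible

# 1. Load a dataset (iris ships with R).
ds <- iris

# 2. Select variables and entities: two input columns, and only the rows
#    belonging to two of the three classes.
ds <- ds[ds$Species != "setosa", c("Sepal.Length", "Sepal.Width", "Species")]
ds$Species <- droplevels(ds$Species)

# 3. Explore how the data is distributed.
print(summary(ds))

# 4. Transform: rescale the inputs to mean 0 and standard deviation 1
#    (done on all rows here for brevity).
ds$Sepal.Length <- as.numeric(scale(ds$Sepal.Length))
ds$Sepal.Width  <- as.numeric(scale(ds$Sepal.Width))

# 5. Build a model on a random sample of 70 of the 100 rows.
train_idx <- sample(nrow(ds), size = 70)
train <- ds[train_idx, ]
test  <- ds[-train_idx, ]
model <- glm(Species ~ Sepal.Length + Sepal.Width,
             data = train, family = binomial)

# 6. Evaluate on the held-out rows with a confusion matrix.
#    glm() models the probability of the second factor level (virginica).
prob <- predict(model, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "virginica", "versicolor"),
               levels = levels(ds$Species))
confusion <- table(actual = test$Species, predicted = pred)
print(confusion)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

The seventh step, reviewing the Log, has no direct counterpart here because the script itself is the record of what was done.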

Pictorially, we illustrate a typical work flow that is embodied in the Rattle interface in Figure 1.1.

**Figure 1.1:** Initial steps of the data mining process (Tony Nolan)

**Figure 1.2:** The data mining process

Brought to you by Togaware. This page generated: Sunday, 22 August 2010