DATA MINING
Desktop Survival Guide
by Graham Williams

R

R is a statistical and data mining package consisting of a programming language and a graphics system. It is used throughout this book to illustrate data mining procedures. It is the programming language used to implement the Rattle graphical user interface for data mining. If you are moving to R from SAS or SPSS then you will find () a great resource. An early version is available from http://RforSASandSPSSusers.com.

R is the most sophisticated statistical software available, easily installed, instructional, state-of-the-art, and it is free and open source.

Learning by example is a powerful learning paradigm. Motivated by the programming paradigm of ``programming by example'' (, ), the intention is that you will be able to replicate the examples from the book, and then fine tune them to suit your own needs. This is one of the underlying principles of Rattle where all of the R commands that are used under the graphical user interface are exposed to the user. This makes it a useful teaching tool in learning R for the specific task of data mining, and also a good memory aid!

So R is a language. The basic modus operandi is to write sentences expressed in this language. After a while you will want to do more than issue single, simple commands (sentences), and will instead write paragraphs and full novels in the language! R script files (often with the .R filename extension) are the place to write scripts. You can re-run your scripts to transform, at will and automatically, your source data into information and knowledge. As we progress through this book we will become familiar with the common R commands.
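As a minimal sketch of what such a script might contain, the following uses the iris dataset that ships with R, so it should run as-is. The file name summarise.R and the variable names are hypothetical, chosen only for illustration.

```r
# summarise.R: a hypothetical script name used only for illustration.

# Load the source data (iris is a dataset bundled with R).
ds <- iris

# Transform the data: the mean of each measurement for each species.
means <- aggregate(. ~ Species, data = ds, FUN = mean)

# Report the result.
print(means)
```

Saved to a file, a script like this can be re-run at any time with source("summarise.R") from within R, or with Rscript summarise.R from the command line.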

Whilst for data mining purposes we will use the Rattle graphical user interface, more advanced users will prefer the powerful Emacs editor, augmented with the ESS package. Both run under GNU/Linux, Mac/OSX, and MS/Windows.

We also note that direct interaction with R has a steeper learning curve than using GUI-based systems, but once into R, performing operations over the same or similar datasets becomes very easy using its programming language interface.

R is an interactive, interpreted programming language. It is written in the lower level procedural programming language C, with much of the system on top of this written in R. Where computation requirements are significant, R code is often translated into C code which will generally execute much faster. The details are not important for us here, but this allows R to be surprisingly fast when it needs to be, without the user of R actually needing to be aware of how the function they are using is implemented.
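To get a rough feel for the difference, the sketch below compares an explicit loop written in R with the built-in sum() function, which is implemented in compiled code. The function name loop_sum and the vector size are assumptions made for this illustration, and the timings will vary from machine to machine.

```r
# A numeric vector holding the values 1 to 1,000,000.
x <- as.numeric(1:1e6)

# An explicit loop, evaluated step by step by the R interpreter.
# loop_sum is a hypothetical helper written only for this comparison.
loop_sum <- function(v) {
  total <- 0
  for (value in v) total <- total + value
  total
}

# Time the interpreted loop.
print(system.time(slow <- loop_sum(x)))

# Time the built-in sum(), which runs as compiled code.
print(system.time(fast <- sum(x)))

# Both approaches give the same answer.
stopifnot(isTRUE(all.equal(slow, fast)))
```

From the user's point of view both calls look like ordinary R, which is the point: we call sum() without needing to know how it is implemented.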

The foundation R code consists of something over 300,000 lines of C code and 130,000 lines of R code. These calculations are based on the distribution of R 2.10.1, counting lines of code in C and R files found in the src folder. This includes the R packages that form part of the R foundation (like utils, stats, graphics, tools, and base). But this doesn't count the more than 2,000 other packages that are available for R, many of which are written in R itself, though intermixed with C, Fortran, and Java.
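A count of this kind could be sketched in R itself, as below. This is not necessarily how the figures above were produced: the helper count_lines is hypothetical, it counts every line including blank lines and comments, and a small temporary directory stands in for the src folder of an R source distribution.

```r
# Count the lines in all files under `dir` whose names match `pattern`.
# count_lines is a hypothetical helper, not part of R.
count_lines <- function(dir, pattern) {
  files <- list.files(dir, pattern = pattern, recursive = TRUE,
                      full.names = TRUE)
  sum(vapply(files,
             function(f) length(readLines(f, warn = FALSE)),
             integer(1)))
}

# Build a tiny stand-in for a source tree in a temporary directory.
demo <- file.path(tempdir(), "src_demo")
dir.create(demo, showWarnings = FALSE)
writeLines(c("int main(void) {", "  return 0;", "}"),
           file.path(demo, "main.c"))
writeLines(c("f <- function(x) x + 1", "f(1)"),
           file.path(demo, "f.R"))

# Count C and R source lines separately.
c_lines <- count_lines(demo, "\\.c$")
r_lines <- count_lines(demo, "\\.R$")
cat("C lines:", c_lines, " R lines:", r_lines, "\n")
```

Pointing count_lines at the src folder of an unpacked R source distribution would give figures comparable in spirit to those quoted, though the exact totals depend on which files and lines are counted.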

A new version of R is released twice a year: in April and in October. It is free, so a sensible approach is to upgrade our installation of R on each release. This ensures we keep up with bug fixes and new developments.
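Before upgrading it helps to know which version is currently installed. A minimal sketch follows, in which the comparison against 2.10.1 (the version mentioned above) is only an illustrative threshold.

```r
# Report the version of R that is currently running.
print(R.version.string)

# getRversion() returns a version object that can be compared
# directly against a version string.
current <- getRversion()
is_older <- current < "2.10.1"

if (is_older) {
  message("This installation predates R 2.10.1; consider upgrading.")
} else {
  message("Running R ", as.character(current))
}
```

After installing a new release of R, installed packages can be brought up to date with update.packages(), which needs a network connection to reach a package repository.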

Brought to you by Togaware. This page generated: Sunday, 22 August 2010