DATA MINING
Desktop Survival Guide
by Graham Williams

Exploring Data

As data miners we really need to live and breathe our data. Even before we start building our data mining models we can gain significant insights by checking and exploring the data. These insights can also deliver new discoveries to our clients--discoveries that can offer benefits early on in a data mining project.

Through exploring our data we can discover what the data looks like, the boundaries of the data (like the minimum and maximum values), its numeric characteristics (like the average value), and how the data is distributed (like how spread out the data is). The data begins to tell us a story, and we need to build and understand that story for ourselves. By capturing that story, we can communicate the story back to our clients.
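As a small illustration of the summaries just described, the following sketch computes the boundaries, the average, and the spread of a single numeric variable using only Python's standard library. The `ages` values and the `summarise` helper are invented for illustration and are not part of Rattle.

```python
import statistics

def summarise(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),                  # lower boundary of the data
        "max": max(values),                  # upper boundary of the data
        "mean": statistics.mean(values),     # the average value
        "median": statistics.median(values), # the middle value
        "stdev": statistics.stdev(values),   # sample standard deviation (spread)
    }

# Hypothetical data, invented for illustration only.
ages = [23, 35, 41, 29, 52, 38, 47]
summary = summarise(ages)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median is a quick first check on the shape of the data: a large gap between the two suggests a skewed distribution.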

This task is known as exploratory data analysis (often abbreviated as EDA), and it generally involves getting a basic understanding of a dataset. Statistics, the fundamental tool here, is essentially about uncertainty--understanding it and thereby making allowance for it. It also provides a framework for understanding the discoveries made in data mining. Discoveries need to be statistically sound and statistically significant--the uncertainty associated with modelling needs to be understood.

We explore the shape or distribution of our data before we begin mining. Through this exploration we begin to understand the "lay of the land," just as a miner works to understand the terrain rather than blindly digging for gold. We may also identify problems with the data, including missing values, noise and erroneous data, and skewed distributions. These findings then drive our choice of tools for preparing and transforming our data and for mining it.
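Two of these problems, missing values and skewed distributions, can be checked with very little code. The sketch below is one possible approach in plain Python; it assumes missing values are recorded as `None`, and the `incomes` data and both helper functions are hypothetical.

```python
import statistics

def count_missing(values):
    """Count entries recorded as None (our stand-in for a missing value)."""
    return sum(1 for v in values if v is None)

def skewness(values):
    """Population skewness (third standardised moment) of the non-missing values."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)  # population standard deviation
    return sum(((v - mean) / sd) ** 3 for v in present) / len(present)

# Hypothetical data: two missing entries and one very large value.
incomes = [30, 32, 35, None, 31, 33, 250, None, 34]
missing = count_missing(incomes)
skew = skewness(incomes)
print(f"missing values: {missing}")
print(f"skewness: {skew:.2f}")
```

A skewness near zero suggests a roughly symmetric distribution, while a clearly positive value, as the single large income produces here, indicates a long right tail that may call for a transformation before modelling.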

Rattle provides tools ranging from textual summaries to visually appealing graphical summaries, tools for identifying correlations between variables, and a link to the very sophisticated GGobi tool for visualising data. The Explore tab provides an opportunity to understand our data in various ways.
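To give a feel for what identifying a correlation between two variables involves, here is a minimal sketch of the Pearson correlation coefficient in plain Python. It is not Rattle's implementation; the `pearson` helper and the `heights` and `weights` data are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired observations, invented for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]
r = pearson(heights, weights)
print(f"correlation: {r:.3f}")
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none. The coefficient only captures linear association, so a plot of the two variables remains a useful companion check.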


Brought to you by Togaware. This page generated: Sunday, 22 August 2010