Data Mining Survivor: Data_Transformation

DATA MINING
Desktop Survival Guide
by Graham Williams

Pivot Tables

The reshape package was inspired by pivot tables.

The basic principle to understand in using reshape is the two types of variables in our data. One collection are the identifier variables and the other are the ``measures''. The melt function will reorganise the data so that each identifier variable is a single column (where they may have previously been a mixture of rows and columns).

The aim is to generate various reorganisations and aggregate summaries of the data. For example, with the wine dataset we may like to look at the average values of a number of input variables for each Type. The first step, using the reshape package, is to melt the data frame, which expands the non-identifying variables across the identifying variables:

> wine.molten <- melt(wine, id="Type") > head(wine.molten) Type variable value 1 1 Alcohol 14.23 2 1 Alcohol 13.20 3 1 Alcohol 13.16 4 1 Alcohol 14.37 5 1 Alcohol 13.24 6 1 Alcohol 14.20 > tail(wine.molten) Type variable value 17312 3 Proline 660 17412 3 Proline 740 17512 3 Proline 750 17612 3 Proline 835 17712 3 Proline 840 17812 3 Proline 560

Now we can use cast to recast the data into the shape we want. Here we reshape it by Type and list the mean of each input variable across the values of Type:

> cast(wine.molten, Type ~ variable, mean) Type Alcalinity Alcohol Ash Color Dilution Flavanoids Hue 1 17.03729 13.74475 2.455593 5.528305 3.157797 2.9823729 1.0620339 2 20.23803 12.27873 2.244789 3.086620 2.785352 2.0808451 1.0562817 3 21.41667 13.15375 2.437083 7.396250 1.683542 0.7814583 0.6827083 Magnesium Malic Nonflavanoids Phenols Proanthocyanins Proline 106.3390 2.010678 0.290000 2.840169 1.899322 1115.7119 94.5493 1.932676 0.363662 2.258873 1.630282 519.5070 99.3125 3.333750 0.447500 1.678750 1.153542 629.8958

We can also include the column and row totals. We will illustrate this with a subset of the wine dataset:

> measure <- c("Alcohol", "Malic", "Ash") > wine.molten <- melt(wine, id="Type", measure=measure) > cast(wine.molten, Type ~ variable, mean, margins=c("grand_row", "grand_col")) Type Alcohol Malic Ash . 1 13.74475 2.010678 2.455593 6.070339 2 12.27873 1.932676 2.244789 5.485399 3 13.15375 3.333750 2.437083 6.308194 . 13.00062 2.336348 2.366517 5.901161

In this case the row totals have no meaning but the column totals do.

Also see aggregate:

> aggregate(wine[,-1], list(Type=wine$Type), mean) Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids 1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 Nonflavanoids Proanthocyanins Color Hue Dilution Proline 1 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119 2 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070 3 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958

Another example using reshape.

> dat <- read.table("clipboard", header=TRUE) > dat Q S C 1 1 A 5 2 1 B 10 3 1 C 50 4 1 D 10 5 2 A 20 6 2 E 10 7 2 C 40 8 3 D 5 9 3 F 1 10 3 G 5 11 3 B 75 > res <- reshape(dat, direction = "wide", idvar = "Q", timevar = "S") > res Q C.A C.B C.C C.D C.E C.F C.G 1 1 5 10 50 10 NA NA NA 5 2 20 NA 40 NA 10 NA NA 8 3 NA 75 NA 5 NA 1 5 > res[is.na(res)] <- 0 > names( > res Q C.A C.B C.C C.D C.E C.F C.G 1 1 5 10 50 10 0 0 0 5 2 20 0 40 0 10 0 0 8 3 0 75 0 5 0 1 5

Or the same, but using the reshape package:

> library(reshape) > datm <- melt(dat, id=1:2) > cast(datm, Q ~ S) S A B C D E F G Q A B C D E F G 1 5 10 50 10 NA NA NA 2 20 NA 40 NA 10 NA NA 3 NA 75 NA 5 NA 1 5

With the basics in hand we can now explore the data in a more graphical fashion, beginning with plots that help understand individual variables (barplot, piechart, and line plots), followed by a number of plots that explore relationships between variables (scatterplot and correlation plot).

Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010