Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Pivot Tables

The reshape package was inspired by pivot tables.

The basic principle to understand in using reshape is the two types of variables in our data. One collection are the identifier variables and the other are the ``measures''. The melt function will reorganise the data so that each identifier variable is a single column (where they may have previously been a mixture of rows and columns).

The aim is to generate various reorganisations and aggregate summaries of the data. For example, with the wine dataset we may like to look at the average values of a number of input variables for each Type. The first step, using the reshape package, is to melt the data frame, which expands the non-identifying variables across the identifying variables:

> wine.molten <- melt(wine, id="Type")
> head(wine.molten)
  Type variable value
1    1  Alcohol 14.23
2    1  Alcohol 13.20
3    1  Alcohol 13.16
4    1  Alcohol 14.37
5    1  Alcohol 13.24
6    1  Alcohol 14.20
> tail(wine.molten)
      Type variable value
17312    3  Proline   660
17412    3  Proline   740
17512    3  Proline   750
17612    3  Proline   835
17712    3  Proline   840
17812    3  Proline   560

Now we can use cast to recast the data into the shape we want. Here we reshape it by Type and list the mean of each input variable across the values of Type:

> cast(wine.molten, Type ~ variable, mean)

 Type Alcalinity  Alcohol      Ash    Color Dilution Flavanoids       Hue
    1   17.03729 13.74475 2.455593 5.528305 3.157797  2.9823729 1.0620339
    2   20.23803 12.27873 2.244789 3.086620 2.785352  2.0808451 1.0562817
    3   21.41667 13.15375 2.437083 7.396250 1.683542  0.7814583 0.6827083

 Magnesium    Malic Nonflavanoids  Phenols Proanthocyanins   Proline
  106.3390 2.010678      0.290000 2.840169        1.899322 1115.7119
   94.5493 1.932676      0.363662 2.258873        1.630282  519.5070
   99.3125 3.333750      0.447500 1.678750        1.153542  629.8958

We can also include the column and row totals. We will illustrate this with a subset of the wine dataset:

> measure <- c("Alcohol", "Malic", "Ash")
> wine.molten <- melt(wine, id="Type", measure=measure)
> cast(wine.molten, Type ~ variable, mean, margins=c("grand_row", "grand_col"))
 Type  Alcohol    Malic      Ash        .
    1 13.74475 2.010678 2.455593 6.070339
    2 12.27873 1.932676 2.244789 5.485399
    3 13.15375 3.333750 2.437083 6.308194
    . 13.00062 2.336348 2.366517 5.901161

In this case the row totals have no meaning but the column totals do.

Also see aggregate:

> aggregate(wine[,-1], list(Type=wine$Type), mean)
  Type  Alcohol    Malic      Ash Alcalinity Magnesium  Phenols Flavanoids
1    1 13.74475 2.010678 2.455593   17.03729  106.3390 2.840169  2.9823729
2    2 12.27873 1.932676 2.244789   20.23803   94.5493 2.258873  2.0808451
3    3 13.15375 3.333750 2.437083   21.41667   99.3125 1.678750  0.7814583
  Nonflavanoids Proanthocyanins    Color       Hue Dilution   Proline
1      0.290000        1.899322 5.528305 1.0620339 3.157797 1115.7119
2      0.363662        1.630282 3.086620 1.0562817 2.785352  519.5070
3      0.447500        1.153542 7.396250 0.6827083 1.683542  629.8958

Another example using reshape.



> dat <- read.table("clipboard", header=TRUE)
> dat
   Q S  C
1  1 A  5
2  1 B 10
3  1 C 50
4  1 D 10
5  2 A 20
6  2 E 10
7  2 C 40
8  3 D  5
9  3 F  1
10 3 G  5
11 3 B 75
> res <- reshape(dat, direction = "wide", idvar = "Q", timevar = "S")
> res
  Q C.A C.B C.C C.D C.E C.F C.G
1 1   5  10  50  10  NA  NA  NA
5 2  20  NA  40  NA  10  NA  NA
8 3  NA  75  NA   5  NA   1   5
> res[is.na(res)] <- 0
> names(
> res
  Q C.A C.B C.C C.D C.E C.F C.G
1 1   5  10  50  10   0   0   0
5 2  20   0  40   0  10   0   0
8 3   0  75   0   5   0   1   5

Or the same, but using the reshape package:



> library(reshape)
> datm <- melt(dat, id=1:2)
> cast(datm, Q ~ S)

 S  A  B  C  D  E  F  G
 Q  A  B  C  D  E  F  G
 1  5 10 50 10 NA NA NA
 2 20 NA 40 NA 10 NA NA
 3 NA 75 NA  5 NA  1  5

With the basics in hand we can now explore the data in a more graphical fashion, beginning with plots that help understand individual variables (barplot, piechart, and line plots), followed by a number of plots that explore relationships between variables (scatterplot and correlation plot).

Copyright © Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010