Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Basket Analysis

The simplest association analysis is often referred to as market basket analysis. Within Rattle this is enabled when the Baskets button is checked. In this case, the data is thought of as representing shopping baskets (or any other type of collection of items, such as a basket of medical tests, a basket of medicines prescribed to a patient, a basket of stocks held by an investor, and so on). Each basket has a unique identifier, and the variable specified as an Ident variable in the Data tab is taken as the identifier of a shopping basket. The contents of the basket are then the items contained in the column of data identified as the target variable. For market basket analysis, these are the only two variables used.

To illustrate market basket analysis with Rattle, we will use a very simple dataset consisting of the DVD movies purchased by customers. Suppose the data is stored in the file dvdtrans.csv and consists of the following:

ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile

We load this data into Rattle and choose the appropriate variable roles. In this case it is quite simple:

Image rattle-dvd-variables
On the Associate tab (of the Unsupervised paradigm) ensure the Baskets check box is checked. Click the Execute button to identify the associations:
Image rattle-dvd-associate-top
Here we see a summary of the associations found. There were 38 association rules that met the criteria of having a minimum support of 0.1 and a minimum confidence of 0.1. Of these, 9 were of length 1 (i.e., a single item that has occurred frequently enough in the data), 20 were of length 2 and another 9 of length 3. Across the rules the support ranges from 0.11 up to 0.56. Confidence ranges from 0.11 up to 1.0, and lift from 0.9 up to 9.0.

The lower part of the same textview contains information about the running of the algorithm:

Image rattle-dvd-associate-bot
We can see the variable settings used, noting that Rattle only provides access to a smaller set of settings (support and confidence). The output includes timing information fore the various phases of the algorithm. For such a small dataset, the times are of course essentially 0!

Copyright © Togaware Pty Ltd
Support further development through the purchase of the PDF version of the book.
The PDF version is a formatted comprehensive draft book (with over 800 pages).
Brought to you by Togaware. This page generated: Sunday, 22 August 2010