Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Overview

The Apriori algorithm is the original association rule algorithm. Each transaction is thought of as a basket of items (which we might represent as $\{A,B,C,D,E,F\}$). The algorithm searches for collections of items that appear together in multiple baskets (e.g., $\{A,C,F\}$). From these so-called itemsets it identifies rules like $A, F \Rightarrow C$, which we read as indicating that when $A$ and $F$ appear in a transaction, $C$ will typically also appear in the same transaction.

The basis of an association analysis algorithm is the generation of frequent itemsets. However, naïve approaches are computationally expensive for even moderately sized databases. The Apriori algorithm takes advantage of the simple a priori observation that all subsets of a frequent itemset must also be frequent. That is, if $\{milk, bread, cheese\}$ is a frequent itemset then so must be each of the smaller itemsets $\{milk, bread\}$, $\{milk, cheese\}$, $\{bread, cheese\}$, $\{milk\}$, $\{bread\}$, and $\{cheese\}$. This observation allows the algorithm to search a significantly reduced space: it starts with frequent individual items (eliminating rare items), combines these into itemsets containing just two items and retains only those that are frequent enough, then does the same for itemsets containing three items, and so on.
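The level-wise search described above can be sketched in Python (this is an illustrative implementation, not the book's own code; the function name and the minimum-support interface are assumptions for the example):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise frequent itemset search: a k-itemset is only a
    candidate when all of its (k-1)-subsets are already frequent
    (the apriori property)."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]

    def support(itemset):
        # Proportion of baskets containing every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent individual items (rare items are eliminated).
    items = sorted({i for b in baskets for i in b})
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    all_frequent = {f: support(f) for f in frequent}

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-item
        # candidates, then prune any candidate with an infrequent subset.
        candidates = {a | b for a in prev for b in prev
                      if len(a | b) == k
                      and all(frozenset(s) in prev
                              for s in combinations(a | b, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        all_frequent.update({f: support(f) for f in frequent})
        k += 1
    return all_frequent
```

With the $\{milk, bread, cheese\}$ example and a minimum support of 40%, the function returns all seven itemsets listed above, each with its support.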

Suppose we have a rule of the form $\mathcal{A} \Rightarrow \mathcal{C}$. We call $\mathcal{A}$ the antecedent and $\mathcal{C}$ the consequent; both are non-empty sets of items.

The concept of frequent enough is a parameter of the algorithm, used to control the number of association rules discovered. This support specifies how frequently the items must appear in the whole dataset before the items can be considered as a candidate association rule. For example, the user may choose to consider only sets of items that occur in at least 5% of all transactions. Formally, we define the support for a collection of items $\mathcal{I}$ as the proportion of all baskets in which all items in $\mathcal{I}$ appear. Then we can define the support for an association rule as:

$support(\mathcal{A} \Rightarrow \mathcal{C}) = support(\mathcal{A} \cup \mathcal{C})$
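The definition translates directly into code. As a small sketch (the function names are assumptions for illustration):

```python
def support(itemset, transactions):
    """Proportion of all baskets in which every item of `itemset` appears."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def rule_support(antecedent, consequent, transactions):
    """support(A => C) = support(A union C)."""
    return support(set(antecedent) | set(consequent), transactions)
```

For instance, if $\{milk, cheese\}$ appears together in 3 of 5 baskets, the support of $milk \Rightarrow cheese$ is 0.6.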

A second parameter, the confidence, calculates the proportion of transactions containing $\mathcal{A}$ that also contain $\mathcal{C}$. The confidence specifies a minimal probability for the association rule. For example, the user may choose to only generate rules which are true at least 90% of the time (that is, when $\mathcal{A}$ appears in the basket, $\mathcal{C}$ also appears in the same basket at least 90% of the time). Formally:

$confidence(\mathcal{A} \Rightarrow \mathcal{C}) = support(\mathcal{A} \Rightarrow \mathcal{C}) / support(\mathcal{A})$
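Confidence is then just a ratio of two supports. A minimal sketch, assuming the same basket representation as above (function names are illustrative, not from the book):

```python
def support(itemset, transactions):
    """Proportion of all baskets containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => C) = support(A union C) / support(A)."""
    a = set(antecedent)
    return support(a | set(consequent), transactions) / support(a, transactions)
```

If milk appears in 80% of baskets and milk together with cheese in 60%, the rule $milk \Rightarrow cheese$ has confidence 0.75: cheese appears in three quarters of the baskets that contain milk.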

Copyright © 2004-2010 Togaware Pty Ltd