DATA MINING Desktop Survival Guide by Graham Williams

Overview
The Apriori algorithm is the original association rule algorithm. Each
transaction is thought of as a basket of items (which we might
represent as $\{A, B, C\}$). The algorithm searches for
collections of items that appear together in multiple baskets (e.g.,
$\{A, B\}$). From these so-called itemsets it identifies
rules like $A \wedge B \Rightarrow C$, which we read as indicating that
$A$ and $B$ appearing in a transaction typically entails that $C$ will
also appear in the transaction.
The basis of an association analysis algorithm is the generation of
frequent itemsets. However, naïve approaches will be quite expensive
in computational time with even moderately sized databases. The
Apriori algorithm takes advantage of the simple apriori
observation that all subsets of a frequent itemset must also be
frequent. That is, if $\{A, B, C\}$ is a frequent itemset
then so must each of the smaller itemsets
$\{A, B\}$, $\{A, C\}$, $\{B, C\}$, $\{A\}$, $\{B\}$, and $\{C\}$.
This observation allows the algorithm to consider a
significantly reduced search space by starting with frequent
individual items (eliminating rare items). We can then combine these
into itemsets containing just two items and retain only those that are
frequent enough. Similarly for itemsets containing three items, and so
on.
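The levelwise search just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the book; the basket data and the minimum support threshold are assumptions chosen for the example.

```python
from itertools import combinations

# Illustrative transaction database (an assumption for this sketch).
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def frequent_itemsets(baskets, min_support=0.4):
    """Levelwise (Apriori) search: grow itemsets one item at a time,
    keeping only those whose support meets the threshold."""
    n = len(baskets)
    support = lambda items: sum(items <= b for b in baskets) / n

    # Level 1: frequent individual items (rare items eliminated here).
    items = sorted({i for b in baskets for i in b})
    frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
    result = {fs: support(fs) for fs in frequent}

    k = 2
    while frequent:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in result
                             for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if support(c) >= min_support]
        result.update({c: support(c) for c in frequent})
        k += 1
    return result
```

With these five baskets and a 40% threshold, every subset of $\{A, B, C\}$ survives, since even the full itemset appears in two of the five baskets. The pruning step is where the apriori observation pays off: a candidate is counted against the database only if all of its immediate subsets are already known to be frequent.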
Suppose we have a rule of the form $A \Rightarrow C$. We
call $A$ the antecedent and $C$ the
consequent, and both are non-empty sets of items.
The concept of frequent enough is a parameter of the
algorithm, used to control the number of association rules discovered.
This support specifies how frequently the
items must appear in the whole dataset before the items can be
considered as a candidate association rule.  For example, the user may
choose to consider only sets of items that occur in at least 5% of
all transactions.  Formally we define the support for a
collection of items $I$ as the proportion of all baskets in
which all items in $I$ appear. Then we can define the support
for an association rule as:

$$\mathrm{support}(A \Rightarrow C) = \mathrm{support}(A \cup C)$$
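The definition of support translates directly into code. The following sketch uses the same kind of toy baskets as above; the data and names are illustrative assumptions.

```python
# Illustrative baskets (an assumption, not data from the book).
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, baskets):
    """Proportion of baskets in which every item of `itemset` appears."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def rule_support(antecedent, consequent, baskets):
    """support(A => C) = support(A union C)."""
    return support(set(antecedent) | set(consequent), baskets)
```

Here $\{A, B\}$ appears in three of the five baskets, so the rule $A \Rightarrow B$ has support 0.6 and would pass a 5% threshold easily.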
A second parameter, the confidence,
calculates the proportion of transactions containing $A$ that also contain
$C$. The confidence specifies a minimal
probability for the association rule.  For example, the user may
choose to only generate rules which are true at least 90% of the time
(that is, when $A$ appears in the basket, $C$ also
appears in the same basket at least 90% of the time). Formally:

$$\mathrm{confidence}(A \Rightarrow C) = \frac{\mathrm{support}(A \cup C)}{\mathrm{support}(A)}$$
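Confidence is then the ratio of two support values, as a short sketch shows. The baskets are again an illustrative assumption.

```python
def support(itemset, baskets):
    """Proportion of baskets containing every item of `itemset`."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """confidence(A => C) = support(A union C) / support(A)."""
    a, c = set(antecedent), set(consequent)
    return support(a | c, baskets) / support(a, baskets)

# Illustrative baskets (assumed for the sketch).
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
```

For these baskets, $A$ appears in four of five transactions and $\{A, B\}$ in three, so $A \Rightarrow B$ has confidence 0.75 and would be rejected by a 90% confidence threshold.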
Copyright © Togaware Pty Ltd