Association Rules main function is to Associate
Past Events from the database and predict the future events association with
high probability. Rules can be extended any dimension using various Algorithms.
Interesting applications can be in the Space Flights, Human Response to various
Medicines, Marketing Success/Failure Analysis, Stress Failure Analysis in Civil
Structure Design, Deformation Studies of Bridges, Power Grid faults,
Aerodynamics Effects on Flights, Radio Coverage Patterns of Cellular Networks,
Call Failure Analysis in Telecommunications, Stocks Association in Price
Movement and many other fields. We could effectively bring association of stocks
where price movement was associated with one another, if one stock started
falling others in the Group will start falling most of the time with same %
Deviation.
Association Rules when used in conjunction with
Neural Networks and Time Series created Predictions with Uncanny accuracy.
Many times we were surprised with results that predicted values were
accurate upto .01% Accuracy. But this is not routine and most of the
predictions were within 1% of the Future values and worst case scenario was
5% accuracy.
The aim of
association rule mining is to find interesting and useful patterns in a
transaction database. The database contains transactions which consist of a
set of items and a transaction identifier like market basket. Support is
defined on itemsets and gives the proportion of transactions which contain
Z. It is used as a measure of significance (importance) of an
itemset. Since it basically uses the count of transactions it is often
called a frequency constraint. An itemset with a support > a set min.
support threshold is called a frequent or large itemset. Confidence
is defined as the probability of seeing the rule's consequent under the
condition that the transactions also contain the antecedent. Confidence is
directed and gives different values for the rules X -> Y and Y ->
X.
Support is first
used to find frequent (significant) itemsets exploiting its down-ward
closure property to prune the search space. Then confidence is used in a
second step to produce rules from the frequent itemsets that exceed a min.
confidence threshold.
A problem with confidence is that it is sensitive to the frequency of the
consequent (Y) in the database. Caused by the way confidence is
calculated, consequents with higher support will automatically produce
higher confidence values even if there exists no association between the
items.
Association Rules
The
input for a typical associations-mining algorithm is a set T of itemsets t, each
of which is drawn from a set I of all possible items. Each t is a member of the
power set 2I, but T is not considered a subset of 2I since it may contain
duplicates (it is a multiset). Since I is typically large, the general problem
of finding all common subsets in an arbitrary selection of itemsets is
considered intractable. Therefore input sets in T, and any results derived there
from, are typically assumed to be small. It is an ongoing area of research to
find algorithms which relax this assumption and allow processing of larger sets.
Fixed-Confidence Mining
Commonly, association-finding algorithms attempt to find all sets of elements
which occur in at least a fraction C of the data, where C is a chosen Confidence
Threshold (e.g. 2%). The number of occurrences of a subset is called its
support. Sets whose support exceeds C are called frequent itemsets.
If a set s is frequent, then any subset of s must also be, and most
association-finding algorithms attempt to exploit this fact. Most
association-finding algorithms reduce to a traversal of this subset lattice of I
in some order, extending frequent itemsets and pruning out infrequent sets and
their supersets.
The fixed confidence threshold has little basis in statistics, since some sets
may exceed it simply by random coincidence (thereby defeating the goal of
finding meaningful correlations), and some meaningful associations may occur in
the data without reaching the threshold. However, in practice it does eliminate
vast numbers of insignificant sets.
For a given data set, the set of its frequent itemsets can be described by its
maximal frequent itemsets, which are frequent itemsets S that are not subsets of
any larger frequent itemset T. During mining, finding maximal frequent itemsets
first allows their subsets to be skipped, an important improvement if sets are
large.
K-optimal pattern discovery is a data mining technique
that provides an alternative to the frequent pattern discovery approach that
underlies most association rule learning techniques.
Frequent pattern discovery techniques find all patterns for which there are
sufficient examples in the sample data. In contrast, k-optimal pattern discovery
techniques find the k patterns that optimize a user-specified measure of
interest. The value k is also specified by the user.
Examples of k-optimal pattern discovery techniques include:
k-optimal classification rule discovery
k-optimal subgroup discovery
finding k most interesting patterns using sequential sampling
mining top.k frequent closed patterns without minimum support
k-optimal rule discovery
Apriori
Apriori is a classic algorithm for learning
association rules. Apriori is designed to operate on databases
containing transactions (for example, collections of items
bought by customers, or details of a website frequentation).
Other algorithms are designed for finding association rules in
data having no transactions (Winepi and Minepi), or having no
timestamps (DNA sequencing).
As is common in association rule mining, given a set of itemsets
(for instance, sets of retail transactions, each listing
individual items purchased), the algorithm attempts to find
subsets which are common to at least a minimum number C (the
cutoff, or confidence threshold) of the itemsets. Apriori uses a
"bottom up" approach, where frequent subsets are extended one
item at a time (a step known as candidate generation), and
groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to
count candidate item sets efficiently. It generates candidate
item sets of length k from item sets of length k − 1. Then it
prunes the candidates which have an infrequent sub pattern.
According to the downward closure lemma, the candidate set
contains all frequent k-length item sets. After that, it scans
the transaction database to determine frequent item sets among
the candidates. For determining frequent items quickly, the
algorithm uses a hash tree to store candidate itemsets. This
hash tree has item sets at the leaves and hash tables at
internal nodes . Note that this is not the same kind of hash
tree used in for instance p2p systems
Apriori, while historically significant, suffers from a number
of inefficiencies or trade-offs, which have spawned other
algorithms. Candidate generation generates large numbers of
subsets (the algorithm attempts to load up the candidate set
with as many as possible before each scan). Bottom-up subset
exploration (essentially a breadth-first traversal of the subset
lattice) finds any maximal subset S only after all 2 | S | − 1
of its proper subsets.
Apriori
Algorithm
Association rule mining is to
find out association rules that satisfy the
predefined minimum support and confidence from a
given database. The problem is usually
decomposed into two subproblems. One is to find
those itemsets whose occurrences exceed a
predefined threshold in the database; those
itemsets are called frequent or large itemsets.
The second problem is to generate association
rules from those large itemsets with the
constraints of minimal confidence. Suppose one
of the large itemsets is Lk, Lk = {I1, I2, … ,
Ik}, association rules with this itemsets are
generated in the following way: the first rule
is {I1, I2, … , Ik-1}⇒ {Ik}, by checking the
confidence this rule can be determined as
interesting or not. Then other rule are
generated by deleting the last items in the
antecedent and inserting it to the consequent,
further the confidences of the new rules are
checked to determine the interestingness of
them. Those processes iterated until the
antecedent becomes empty. Since the second
subproblem is quite straight forward, most of
the researches focus on the first subproblem.
The Apriori algorithm finds the frequent sets
L In
Database D.
Find frequent set
Lk
− 1.
Join Step.
Ck
is generated by joining
Lk
− 1with itself
Prune Step.
Any
(k − 1)
-itemset that is not frequent cannot be
a subset of a frequent
k
-itemset, hence should be removed.
where
(Ck:
Candidate itemset of size
k)
(Lk:
frequent itemset of size
k)
Apriori Pseudocode
Apriori
large 1-itemsets
that appear in more than
transactions}
while
Generate(Lk
− 1)
for
transactions
Subset(Ck,t)
for
candidates
return
Applications
Market Basket Analysis
• Consider shopping cart filled with several
items
• Market basket analysis tries to answer the
following questions:
– Who makes purchases?
– What do customers buy together?
– In what order do customers purchase items?
• Given a database of customer transactions,
each transaction is a set of items – deduce
association rules.
•
Co-ocurrences – 80% of all customers purchase items a, b, and c together.
• Association Rules – 60% of all customers who purchase X and Y also buy Z.
• Sequential Patterns – 60% of customers who first buy X also purchase Y
within two weeks.
Confidence and Support
• We prune the set of all possible association rules using two measures of
interest:
– Confidence of a rule: X ->Y has confidence c if P(Y|X)=c.
– Support of a rule: X->Y has support s if P(XY)=s. Also, support of an item
Other Applications
•
Direct Marketing
• Fraud Detection for Medical Insurance
• Floor/Shelf Planning
• Web Site Layout
• Cross-selling
• Frequent Itemsets applications:
– Classification
– Seeds for constructing Bayesian networks
– Web log analysis
– Collaborative filteringset XY