Data Mining for Association Rules and Sequent
(Hardcover)
by Jean-Marc Adamo
The book provides a unified presentation of algorithms for
association rule and sequential pattern discovery. For both mining problems, the
presentation relies on the lattice structure of the search space. All algorithms
are built as processes running on this structure. Proving their properties takes
advantage of the mathematical properties of the structure. Mining for
association rules and sequential patterns is known to be a problem with large
computational complexity. The issue of designing efficient parallel algorithms
should be considered as critical. Most algorithms in the book are devised for
both sequential and parallel execution. Parallel algorithm design takes
advantage of the lattice structure of the search space. Partitioning is
performed via lattice recursive bisection. Database partitioning is also used as
an additional source of parallelism.
Part of the motivation for writing this book was postgraduate teaching. Since
the book only assumes elementary mathematical knowledge in the domains of
lattices, combinatorial optimization, probability calculus, and statistics, it
is fit for use by undergraduate students as well. The algorithms are described
in a C-like pseudo programming language. The computations are shown in great
detail. This makes the book also fit for use by implementers: computer
scientists in many domains as well as industry engineers.
Multi-Threaded Object-Oriented MPI-Based Message
Passing Interface: The ARCH Library (The Springer International Series in
Engineering and Computer Science)
Multi-Threaded Object-Oriented
MPI-Based Message Passing Interface: The ARCH Library presents ARCH, a library
built as an extension to MPI. ARCH relies on a small set of programming
abstractions that allow the writing of well-structured multi-threaded parallel
codes according to the object-oriented programming style. ARCH has been written
with C++. The book describes the built-in classes, and illustrates their use
through several template application cases in several fields of interest:
Distributed Algorithms (global completion detection, distributed process
serialization), Parallel Combinatorial Optimization (A* procedure), Parallel
Image-Processing (segmentation by region growing). It shows how new
application-level distributed data types - such as a distributed tree and a
distributed graph - can be derived from the built-in classes. A feature of
interest to readers is that both the library and the application codes used for
illustration purposes are available via the Internet. The material can be
downloaded for installation and personal parallel code development on the
reader's computer system. ARCH can be run on Unix/Linux as well as Windows
NT-based platforms. Current installations include the IBM-SP2, the CRAY-T3E, the
Intel Paragon, PC-networks under Linux or Windows NT. Multi-Threaded
Object-Oriented MPI-Based Message Passing Interface: The ARCH Library is aimed
at scientists who need to implement parallel/distributed algorithms requiring
complicated local and/or distributed control structures. It can also benefit
parallel/distributed program developers who wish to write codes in the
object-oriented style. The author has been using ARCH for several years as a
medium to teach parallel and network programming. Teachers can employ the
library for the same purpose while students can use it for training. Although
ARCH has been used so far in an academic environment, it will be an effective
tool for professionals as well. Multi-Threaded Object-Oriented MPI-Based Message
Passing Interface: The ARCH Library is suitable as a secondary text for a
graduate level course on Data Communications and Networks, Programming
Languages, Algorithms and Computational Theory and Distributed Computing and as
a reference for researchers and practitioners in industry.
Algorithms for Frequent Itemset
Mining and Database Sanitization [Paperback]
Yu-Chiang Li (Author)
Data mining techniques have been
widely applied in numerous areas and represent an important field of research.
In Chapter 1, the research motivation, objectives and contributions are
introduced. Chapter 2 introduces background work on data mining, share mining,
utility mining, and privacy-preserving data mining. Chapter 3 describes the
proposed NFP-growth method for discovering frequent itemsets. Chapters 4 through
6 explain several novel fast algorithms for share mining --- including FSM, EFSM,
SuFSM, ShFSM, and DCG --- to efficiently generate all share- frequent itemsets.
Furthermore, Chapter 7 presents the Isolated Items Discarding Strategy (IIDS),
which can be applied to any existing level-wise share mining or utility mining
method to reduce candidates and to improve its performance. Next, Chapter 8
introduces the proposed Maximum Item Conflict First (MICF) algorithm, which has
a low sanitization rate and achieves a low misses cost, for hiding all
restrictive itemsets. At the end of Chapters 3 through 8, the experimental
results and evaluates the performance of the proposed algorithms are provided.
Finally, Chapter 9 draws a summary of the dissertation.
Constraint-Based
Mining and Inductive Databases: European Workshop on Inductive Databases and
Constraint Based Mining, Hinterzarten, Germany, March 11-13, ... / Lecture Notes
in Artificial Intelligence)
The interconnected ideas of inductive databases
and constraint-based mining are appealing and have the potential to radically
change the theory and practice of data mining and knowledge discovery. This book
reports on the results of the European IST project "cInQ" (consortium on
knowledge discovery by Inductive Queries) and its final workshop entitled
Constraint-Based Mining and Inductive Databases organized in Hinterzarten,
Germany in March 2004.
The 18
articles in this state-of-the-art survey present the latest results in inductive
querying and constraint-based data mining and also identify future directions of
this newly emerging field lying at the intersection of data mining and database
research. The papers address topical sections on foundations of inductive
database frameworks, optimizing inductive queries on local patterns, optimizing
inductive queries on global patterns, and applications of inductive querying
techniques.
Algorithms for Mining Frequent Itemsets
AIS
Authors: Rakesh Agrawal, Tomasz Imielinski, Arun Swami
When: 1993
Paper: Mining association rules between sets of items in large databases
Where: Preceedings of the ACM SIGMOD Conference on Management of Data,
Washington, D.C.
There was a real buzz in early 90s about how to emulate the biological immune
system in the real world scenarios. The capacity of the immune system to
proliferate cells that produce antibodies whenever it detects a high degree of
matching with an antigen is, without doubts, fascinating. A series of algortihms
were invented and new systems called artificial immune systems were designed.
AIS algorithm uses candidate generation to detect the frequent itemsets. The
candidates are generated on the fly and are compared with previously found
frequent itemsets. The disadvantage of the algorithm is that it generates and
counts too many candidate itemsets that turn out to be small. AIS was the first
algorithm that introduced the problem of generating association rules.
SETM
Authors: Maurice Houtsma, Arun Swami
When: 1995
Paper: Set-oriented mining for association rules in relational databases
Where: The 11th International Conference on Data Engineering
This algorithm was actually created by Houtsma and Swami in October 1993 and
included in a research report while they were working in the IBM Almaden
Research Center but for some reason it was officially released only in 1995. The
algorithm also generates candidates on the fly based on the transaction read
from the database, just like AIS algorithm. But SETM was more created for SQL
computing and uses relational operations. To use standard SQL join operation for
candidate generation, SETM separates candidate generation from counting. It
first generates the candidates using equi-joins and then it sorts them all and
removes the ones that don't meet the minimum support. Like AIS, SETM generates
many candidate itemsets that in the end turn out not be frequent.
APRIORI
Authors: Rakesh Agrawal and Ramakrishnan Srikant
When: 1994
Paper: Fast algorithms for mining association rules
Where: Proceedings of 20th International Conference on Very Large Data Bases,
September 12-15, 1994, Santiago de Chile, Chile
It is by far the most important data mining algorithms for mining frequent
itemsets and associations. It opened new doors and created new modalities to
mine the data. Since its inception, many scholars have improved and optimized
the Apriori algorithm and have presented new Apriori-like algorithms. The
authors became living legends in the data mining communities. They both received
masters and PhDs from University of Wisconsin, Madison and both worked for IBM.
The IBM's Intelligent Miner was created mainly by them. Once coleagues, they now
work for competing companies - Agrawal for Microsoft and Srikant for Google.
Apriori uses a breadth-first search strategy to count the support of itemsets
and uses a candidate generation function which exploits the downward closure
property of support.
DHP - Direct Hashing and Pruning
Authors: J. S. Park, M. Chen, P.S. Yu.
Paper: An effective hash based algorithm for mining association rules.
Where: ACM SIGMOD International Conference on Management of Data
When: May 1995
As its name suggests, DHP uses a hash technique that makes it very efficient for
the generation of candidate itemsets, in particular for the large two-itemsets,
thus greatly improving the performance bottleneck of the whole process. In
addition, DHP employs effective pruning techniques to progressively reduce the
transaction database size. It is however a variation of the Apriori algorithm
like so many other.
Partition
Authors: Ashok Savasere, Edward Omiecinski, Shamkant Navathe
Paper: An efficient algorithm for mining association rules in large databases
Where: The 21st VLDB Conference
When: 1995
The algorithm is fundamentally different from all the previous algorithms in
that it reads the database at most two times to generate all significant
association rules. In the first scan of the database, it generates a set of all
potentially large itemsets by scanning the database once and dividing it in a
number of non-overlaping partitions. This set is a superset of all frequent
itemsets so it may contain itemsets that are not frequent. During the second
scan, counters for each of these itemsets are set up and their actual support is
measured.
ECLAT - Echivalence Class Clustering and Bottom-up Lattice Traversal
Authors: Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei
Li
When: 1997
Paper: New Algorithms for Fast Discovery of Assoctiation Rules
Where: The 3rd International Conference on Knowledge Discovery and Data Mining
It is the first algorithm that uses a vertical data (inverted) layout. ECLAT is
very efficient for large itemsets but less efficient for small ones. The
frequent itemsets are determined using simple tid-list intersections in a
depth-first graph. Few other algorithm that use the same techique are presented
in the same paper (MAXECLAT, CLIQUE, MAXCLIQUE, TOP-DOWN) but ECLAT remained the
best known.
FP-GROWTH
Authors: Jiawei Han, Jian Pei, Yiwen Yin
When: 2000
Paper: Mining Frequent Patterns without Candidate Generation
Where: Proceedings of the 2000 ACM SIGMOD International Conference on Management
of Data, May 16-18, 2000, Dallas, Texas, USA
The Apriori algorithm has its own disadvantages that made other scholars think
of new aproaches to mine frequent patterns. The 2 main downsides are the
possible need of generating a huge number of candidates if the number of
frequent 1-itemsets is high or if the size of the frequent pattern is big the
database has to be scanned repeatedly to match the candidates and determine the
support What if we find a way to mine the frequent patterns without candidate
generation? This would be a big improvement over Apriori. This is what the
frequent-pattern growth (FP-growth) algorithm does. It adopts a
divide-and-conquer strategy and a frequent-pattern tree.
TREE-PROJECTION
Authors: Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad
When: 2000
Paper: A tree projection algorithm for generation of frequent itemsets
Where: Journal of Parallel and Distributed Computing
This inovation brought by this algorithm is the use of a lexicograph tree which
requires substantially less memory than a hash tree. The support of the frequent
itemsets is counted by projecting the transactions onto the nodes of this tree.
This improves the performance of counting the number of transactions that have
frequent itemsets. The lexicograph tree is traversed in a top-down fashion.
PASCAL
Authors: Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, L. Lakhal
When: 2000
Paper: Mining Frequent Patterns with Counting Inference
Where: ACM SIGKDD Explorations Newsletter
The algorithm is named after the French mathematician Blaise Pascal who invented
an early computing device and it is basically an optimization of the Apriori
algorithm. The authors introduce the notion of key patterns and show that other
frequent patterns can be infered from the key patterns without access to the
database. The algorithm finds both frequent and closed sets and it is twice as
fast as Close and 10 times as fast as Apriori but is only practical when the
pattern length is short.
H-MINE
Authors: Jian Pei, Jiawei Han Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
When: 2001
Paper: H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases
Where: IEEE International Conference on Data Mining
The algorithm introduces the concept of hyperlinked data structure (H-struct)
and uses it to dynamically adjust links in the mining process. The authors
actually proposed 2 algorithms. The first one, H-mine(Mem), is a memory based,
efficient pattern-growth algorithm. This algorithm can be scaled up to very
large databases by database partitioning and this is what actual H-mine
algorithm deals with.
RELIM (Recursive Elimination)
Author: Christian Borgelt
When: 2005
Paper: Keeping Things Simple: Finding Frequent Item Sets by Recursive
Elimination
Where: Workshop Open Source Data Mining Software (OSDM'05, Chicago, IL)
As the author mentioned in the paper's abstract, the algorithm is strongly
inspired by FP-growth (it's much simpler though) and very similar to H-mine. It
doesn't use prefix trees and any other complicated structures. The work is done
in one simple recursive function, which can be written with relatively few lines
of code.
Algorithms for Mining Maximal Itemsets
MAX-MINER
Authors: R.J. Bayardo
When: 1998
Paper: Efficiently Mining Long Patterns from Databases
Where: Proceedings of the 1998 ACM SIGMOD International Conference on Management
of Data, Seattle, Washington, United States The algorithm extracts only the
maximal frequent itemsets (you know by now that an itemset is maximal frequent
if it has no superset that is frequent). By extracting the maximal frequent
itemsets and because any frequent itemset is a subset of a maximal frequent
itemset, Max-Miner generates all the frequent itemsets. The algorithm combines a
levelwise bottom-up traversal with a top-down traversal in order ot quickly find
the maximal frequent patterns. Then, all frequent patterns are derived from
these ones and one last database scan is carried on to count their support.
DepthProject
Authors: Ramesh Agrawal, Charu Aggarwal, and V.V.V. Prasad
When: August 2000
Paper: Depth first generation of long patterns
Where: The 7th International Conference on Knowledge Discovery and Data Mining
The algorithm finds frequent itemsets by using depth first search on a
lexicographic tree of itemsets. The authors claimed and proved that this
algorithm is faster than MaxMiner.
MAFIA
Authors: Doug Burdick, Manuel Calimlim, Johannes Gehrke
When: 2001
Paper: MAFIA: A maximal frequent itemset algorithm for transactional databases
Where: Proceedings of the 17th International Conference on Data Engineering,
Heidelberg, Germany
MAFIA is an algorithm for mining maximal frequent itemsets from a transactional
database (it has however the option to mine the closed sets as well). It is
especially efficient when the itemsets in the database are very long. The search
strategy integrates a depth-first traversal of the itemset lattice.
GenMax
Authors: Karam Gouda and Mohammed J. Zaki
When: November 2001
Paper: Efficiently mining maximal frequent itemsets
Where: The 1st IEEE International Conference on Data Mining
GenMax is a backtrack search based algorithm for mining maximal frequent
itemsets. It uses progressive focusing to perform maximality checking, and
diffset propagation to perform fast frequency computation.
Algorithms for Mining Closed Frequent Itemsets
CLOSE
Authors: Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
When: 1999
Paper: Efficient mining of association rules using closed itemset lattices
Where: Information System, Vol 24
Close is another Apriori-like algorithm that directly mines frequent closed
itemsets. The first of this algorithm is to use bottom-up search to identify the
smallest frequent itemset that determines a closed itemset. After finding the
frequent k-itemsets, Close compares the support of each set with its subsets at
the previous level. If the support of an itemset matches the support of any of
its subsets, the itemset is pruned. The second step in Close is to compute the
closure of all the itemsets found in the first step. The authors also created a
variation of this algorithm, called A-CLOSE, that also generates a reduced set
of association rules without having to determine all frequent itemsets, thus
lowering the algorithm computation costs.
CLOSET
Authors: Jian Pei, Jiawei Han, Runying Mao
When: 2000
Paper: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Where: Proceedings of ACM SIGMOD Workshop on Research Issues in Data Mining and
Knowledge Discovery
It is an algorithm for minng closed itemsets that uses an FP-tree strucutre for
mining closed itemsets without candidate generation. It also uses a recursive
divide-and-conquer strategy and a database projection approach to mine long
patterns.
CHARM
Authors: Mohammed Javeed Zaki, Ching-Jiu Hsiao
When: 2002
Paper: CHARM: An Efficient Algorithm for Closed Itemset Mining
Where: SDM - SIAM International Conference on Data Mining, Chicago
It is another algorithm for mining all frequent closed itemsets. There are some
innovative ideas developed by this algorithm that are of a certain value. CHARM
simultaneously explores both the itemset space and transaction space, rather
than only the itermset search space. It also uses a highly efficient hybrid
search method that skips many levels of the IT-tree ( itemset-tidset tree) to
quickly identify the frequent closed itemsets, instead of having to enumerate
many possible subsets.
Reference
Mining association rules between sets of items in large databases - Rakesh
Agrawal, Tomasz Imielinski, Arun Swami
Set-oriented mining for association rules in relational databases - Maurice
Houtsma, Arun Swami
Fast algorithms for mining association rules - Rakesh Agrawal, Ramakrishnan
Srikant
An effective hash based algorithm for mining association rules - J. S. Park, M.
Chen, P.S. Yu.
An efficient algorithm for mining association rules in large databases - Ashok
Savasere, Edward Omiecinski, Shamkant Navathe
New Algorithms for Fast Discovery of Assoctiation Rules - Mohammed Javeed Zaki,
Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li
Mining Frequent Patterns without Candidate Generation - Jiawei Han, Jian Pei,
Yiwen Yin
H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases - Jian
Pei, Jiawei Han Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination -
Christian Borgelt
A tree projection algorithm for generation of frequent itemsets - Ramesh C.
Agarwal, Charu C. Aggarwal, V.V.V. Prasad
Mining Frequent Patterns with Counting Inference - Y. Bastide, R. Taouil, N.
Pasquier, G. Stumme, L. Lakhal
MAFIA: A maximal frequent itemset algorithm for transactional databases - Doug
Burdick, Manuel Calimlim, Johannes Gehrke
Efficiently Mining Long Patterns from Databases - R.J. Bayardo
Depth first generation of long patterns - Ramesh Agrawal, Charu Aggarwal, and
V.V.V. Prasad
Efficiently mining maximal frequent itemsets - Karam Gouda, Mohammed J. Zaki
Efficient mining of association rules using closed itemset lattices - Nicolas
Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets - Jian Pei,
Jiawei Han, Runying Mao
CHARM: An Efficient Algorithm for Closed Itemset Mining - Mohammed Javeed Zaki,
Ching-Jiu Hsiao
Association Rules main function is to Associate
Past Events from the database and predict the future events association with
high probability. Rules can be extended any dimension using various Algorithms.
Interesting applications can be in the Space Flights, Human Response to various
Medicines, Marketing Success/Failure Analysis, Stress Failure Analysis in Civil
Structure Design, Deformation Studies of Bridges, Power Grid faults,
Aerodynamics Effects on Flights, Radio Coverage Patterns of Cellular Networks,
Call Failure Analysis in Telecommunications, Stocks Association in Price
Movement and many other fields. We could effectively bring association of stocks
where price movement was associated with one another, if one stock started
falling others in the Group will start falling most of the time with same %
Deviation.
Association Rules when used in conjunction with
Neural Networks and Time Series created Predictions with Uncanny accuracy.
Many times we were surprised with results that predicted values were
accurate upto .01% Accuracy. But this is not routine and most of the
predictions were within 1% of the Future values and worst case scenario was
5% accuracy.
The aim of
association rule mining is to find interesting and useful patterns in a
transaction database. The database contains transactions which consist of a
set of items and a transaction identifier like market basket. Support is
defined on itemsets and gives the proportion of transactions which contain
Z. It is used as a measure of significance (importance) of an
itemset. Since it basically uses the count of transactions it is often
called a frequency constraint. An itemset with a support > a set min.
support threshold is called a frequent or large itemset. Confidence
is defined as the probability of seeing the rule's consequent under the
condition that the transactions also contain the antecedent. Confidence is
directed and gives different values for the rules X -> Y and Y ->
X.
Support is first
used to find frequent (significant) itemsets exploiting its down-ward
closure property to prune the search space. Then confidence is used in a
second step to produce rules from the frequent itemsets that exceed a min.
confidence threshold.
A problem with confidence is that it is sensitive to the frequency of the
consequent (Y) in the database. Caused by the way confidence is
calculated, consequents with higher support will automatically produce
higher confidence values even if there exists no association between the
items.
Association Rules
The
input for a typical associations-mining algorithm is a set T of itemsets t, each
of which is drawn from a set I of all possible items. Each t is a member of the
power set 2I, but T is not considered a subset of 2I since it may contain
duplicates (it is a multiset). Since I is typically large, the general problem
of finding all common subsets in an arbitrary selection of itemsets is
considered intractable. Therefore input sets in T, and any results derived there
from, are typically assumed to be small. It is an ongoing area of research to
find algorithms which relax this assumption and allow processing of larger sets.
Fixed-Confidence Mining
Commonly, association-finding algorithms attempt to find all sets of elements
which occur in at least a fraction C of the data, where C is a chosen Confidence
Threshold (e.g. 2%). The number of occurrences of a subset is called its
support. Sets whose support exceeds C are called frequent itemsets.
If a set s is frequent, then any subset of s must also be, and most
association-finding algorithms attempt to exploit this fact. Most
association-finding algorithms reduce to a traversal of this subset lattice of I
in some order, extending frequent itemsets and pruning out infrequent sets and
their supersets.
The fixed confidence threshold has little basis in statistics, since some sets
may exceed it simply by random coincidence (thereby defeating the goal of
finding meaningful correlations), and some meaningful associations may occur in
the data without reaching the threshold. However, in practice it does eliminate
vast numbers of insignificant sets.
For a given data set, the set of its frequent itemsets can be described by its
maximal frequent itemsets, which are frequent itemsets S that are not subsets of
any larger frequent itemset T. During mining, finding maximal frequent itemsets
first allows their subsets to be skipped, an important improvement if sets are
large.
K-optimal pattern discovery is a data mining technique
that provides an alternative to the frequent pattern discovery approach that
underlies most association rule learning techniques.
Frequent pattern discovery techniques find all patterns for which there are
sufficient examples in the sample data. In contrast, k-optimal pattern discovery
techniques find the k patterns that optimize a user-specified measure of
interest. The value k is also specified by the user.
Examples of k-optimal pattern discovery techniques include:
k-optimal classification rule discovery
k-optimal subgroup discovery
finding k most interesting patterns using sequential sampling
mining top.k frequent closed patterns without minimum support
k-optimal rule discovery
Apriori
Apriori is a classic algorithm for learning
association rules. Apriori is designed to operate on databases
containing transactions (for example, collections of items
bought by customers, or details of a website frequentation).
Other algorithms are designed for finding association rules in
data having no transactions (Winepi and Minepi), or having no
timestamps (DNA sequencing).
As is common in association rule mining, given a set of itemsets
(for instance, sets of retail transactions, each listing
individual items purchased), the algorithm attempts to find
subsets which are common to at least a minimum number C (the
cutoff, or confidence threshold) of the itemsets. Apriori uses a
"bottom up" approach, where frequent subsets are extended one
item at a time (a step known as candidate generation), and
groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to
count candidate item sets efficiently. It generates candidate
item sets of length k from item sets of length k − 1. Then it
prunes the candidates which have an infrequent sub pattern.
According to the downward closure lemma, the candidate set
contains all frequent k-length item sets. After that, it scans
the transaction database to determine frequent item sets among
the candidates. For determining frequent items quickly, the
algorithm uses a hash tree to store candidate itemsets. This
hash tree has item sets at the leaves and hash tables at
internal nodes . Note that this is not the same kind of hash
tree used in for instance p2p systems
Apriori, while historically significant, suffers from a number
of inefficiencies or trade-offs, which have spawned other
algorithms. Candidate generation generates large numbers of
subsets (the algorithm attempts to load up the candidate set
with as many as possible before each scan). Bottom-up subset
exploration (essentially a breadth-first traversal of the subset
lattice) finds any maximal subset S only after all 2 | S | − 1
of its proper subsets.
Apriori
Algorithm
Association rule mining is to
find out association rules that satisfy the
predefined minimum support and confidence from a
given database. The problem is usually
decomposed into two subproblems. One is to find
those itemsets whose occurrences exceed a
predefined threshold in the database; those
itemsets are called frequent or large itemsets.
The second problem is to generate association
rules from those large itemsets with the
constraints of minimal confidence. Suppose one
of the large itemsets is Lk, Lk = {I1, I2, ,
Ik}, association rules with this itemsets are
generated in the following way: the first rule
is {I1, I2, , Ik-1}⇒ {Ik}, by checking the
confidence this rule can be determined as
interesting or not. Then other rule are
generated by deleting the last items in the
antecedent and inserting it to the consequent,
further the confidences of the new rules are
checked to determine the interestingness of
them. Those processes iterated until the
antecedent becomes empty. Since the second
subproblem is quite straight forward, most of
the researches focus on the first subproblem.
The Apriori algorithm finds the frequent sets
L In
Database D.
Find frequent set
Lk
− 1.
Join Step.
Ck
is generated by joining
Lk
− 1with itself
Prune Step.
Any
(k − 1)
-itemset that is not frequent cannot be
a subset of a frequent
k
-itemset, hence should be removed.
where
(Ck:
Candidate itemset of size
k)
(Lk:
frequent itemset of size
k)
Apriori Pseudocode
Apriori
large 1-itemsets
that appear in more than
transactions}
while
Generate(Lk
− 1)
for
transactions
Subset(Ck,t)
for
candidates
return
Applications
Market Basket Analysis
Consider shopping cart filled with several
items
Market basket analysis tries to answer the
following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?
Given a database of customer transactions,
each transaction is a set of items deduce
association rules.
Co-ocurrences 80% of all customers purchase items a, b, and c together.
Association Rules 60% of all customers who purchase X and Y also buy Z.
Sequential Patterns 60% of customers who first buy X also purchase Y
within two weeks.
Confidence and Support
We prune the set of all possible association rules using two measures of
interest:
Confidence of a rule: X ->Y has confidence c if P(Y|X)=c.
Support of a rule: X->Y has support s if P(XY)=s. Also, support of an item
Other Applications
Direct Marketing
Fraud Detection for Medical Insurance
Floor/Shelf Planning
Web Site Layout
Cross-selling
Frequent Itemsets applications:
Classification
Seeds for constructing Bayesian networks
Web log analysis
Collaborative filteringset XY