Recomsys.net

 

USA       India     Asia       Europe

Technology    Trade   Travel     Shopping   Entertainment

 

Home | Artificial Intelligence| Business Intelligence| Bio-Technology | Data Warehouse | Genetic Programming | Robotics Design| Multi   Agent Systems| Services | Telecommunications | Electric Power |  Oil and Gas | Pharma | Real Estate | Research | Search EngineSpintronics | Solar Power | Software Products| Management | Project Management |Space |

Advertisers | Air Travel Bookings |  Abrasive Tools| Acoustic Components |Adhesives & Sealants| Adhesive Tapes| Airlines | Air Beds & Mattresses |Air Pumps| Air Separators |Air Wrenches| Alarms | Alcohol | Anchors | Apparel | Bags | Bazzar | Books |Cars| CNBC | CCTV Camera | Cell Phones | Cement | Computers | Contact Lenses Coupons for Bags| Department Store | Electronics | Eye Care| Fashion | Furcoats | Furniture | History Channel | Home Products | Hotels | Hollywood | Jewelry | K-Mart |

K-Mart Coupons & Price drops | Industrial Products| Magazines | Office Furniture | Perfumes |  Pet Med | Property | Recom Amazon | Retail Store| Sears | Sears Coupons & Price Drops | Shopping Mall | Shopping Mall Electronics | Our Software Products| Our Services | Contact Us

Our Expertise areas:

SYSTEMS ENGINEERING  TRAINING  e-LEARNING  TELECOMMUNICATIONS  NETWORKING  INFORMATION TECHNOLOGY  HARDWARE ENGINEERING  PRODUCT DESIGN  CONSULTING

Artificial Intelligence

 Association Rules |Clustering |Decision Trees| Naive Bayes| Regression Analysis| Neural Networks| Time Series |Text Mining |

Genetic Programming

Data Mining for Association Rules and Sequent… (Hardcover)
by Jean-Marc Adamo

The book provides a unified presentation of algorithms for association rule and sequential pattern discovery. For both mining problems, the presentation relies on the lattice structure of the search space. All algorithms are built as processes running on this structure. Proving their properties takes advantage of the mathematical properties of the structure. Mining for association rules and sequential patterns is known to be a problem with large computational complexity. The issue of designing efficient parallel algorithms should be considered as critical. Most algorithms in the book are devised for both sequential and parallel execution. Parallel algorithm design takes advantage of the lattice structure of the search space. Partitioning is performed via lattice recursive bisection. Database partitioning is also used as an additional source of parallelism.
Part of the motivation for writing this book was postgraduate teaching. Since the book only assumes elementary mathematical knowledge in the domains of lattices, combinatorial optimization, probability calculus, and statistics, it is fit for use by undergraduate students as well. The algorithms are described in a C-like pseudo programming language. The computations are shown in great detail. This makes the book also fit for use by implementers: computer scientists in many domains as well as industry engineers.
Multi-Threaded Object-Oriented MPI-Based Message Passing Interface: The ARCH Library (The Springer International Series in Engineering and Computer Science)

Multi-Threaded Object-Oriented MPI-Based Message Passing Interface: The ARCH Library presents ARCH, a library built as an extension to MPI. ARCH relies on a small set of programming abstractions that allow the writing of well-structured multi-threaded parallel codes according to the object-oriented programming style. ARCH has been written with C++. The book describes the built-in classes, and illustrates their use through several template application cases in several fields of interest: Distributed Algorithms (global completion detection, distributed process serialization), Parallel Combinatorial Optimization (A* procedure), Parallel Image-Processing (segmentation by region growing). It shows how new application-level distributed data types - such as a distributed tree and a distributed graph - can be derived from the built-in classes. A feature of interest to readers is that both the library and the application codes used for illustration purposes are available via the Internet. The material can be downloaded for installation and personal parallel code development on the reader's computer system. ARCH can be run on Unix/Linux as well as Windows NT-based platforms. Current installations include the IBM-SP2, the CRAY-T3E, the Intel Paragon, PC-networks under Linux or Windows NT. Multi-Threaded Object-Oriented MPI-Based Message Passing Interface: The ARCH Library is aimed at scientists who need to implement parallel/distributed algorithms requiring complicated local and/or distributed control structures. It can also benefit parallel/distributed program developers who wish to write codes in the object-oriented style. The author has been using ARCH for several years as a medium to teach parallel and network programming. Teachers can employ the library for the same purpose while students can use it for training. Although ARCH has been used so far in an academic environment, it will be an effective tool for professionals as well. Multi-Threaded Object-Oriented MPI-Based Message Passing Interface: The ARCH Library is suitable as a secondary text for a graduate level course on Data Communications and Networks, Programming Languages, Algorithms and Computational Theory and Distributed Computing and as a reference for researchers and practitioners in industry.

  Algorithms for Frequent Itemset Mining and Database Sanitization [Paperback]
Yu-Chiang Li (Author)

Data mining techniques have been widely applied in numerous areas and represent an important field of research. In Chapter 1, the research motivation, objectives and contributions are introduced. Chapter 2 introduces background work on data mining, share mining, utility mining, and privacy-preserving data mining. Chapter 3 describes the proposed NFP-growth method for discovering frequent itemsets. Chapters 4 through 6 explain several novel fast algorithms for share mining --- including FSM, EFSM, SuFSM, ShFSM, and DCG --- to efficiently generate all share- frequent itemsets. Furthermore, Chapter 7 presents the Isolated Items Discarding Strategy (IIDS), which can be applied to any existing level-wise share mining or utility mining method to reduce candidates and to improve its performance. Next, Chapter 8 introduces the proposed Maximum Item Conflict First (MICF) algorithm, which has a low sanitization rate and achieves a low misses cost, for hiding all restrictive itemsets. At the end of Chapters 3 through 8, the experimental results and evaluates the performance of the proposed algorithms are provided. Finally, Chapter 9 draws a summary of the dissertation.

Constraint-Based Mining and Inductive Databases: European Workshop on Inductive Databases and Constraint Based Mining, Hinterzarten, Germany, March 11-13, ... / Lecture Notes in Artificial Intelligence) The interconnected ideas of inductive databases and constraint-based mining are appealing and have the potential to radically change the theory and practice of data mining and knowledge discovery. This book reports on the results of the European IST project "cInQ" (consortium on knowledge discovery by Inductive Queries) and its final workshop entitled Constraint-Based Mining and Inductive Databases organized in Hinterzarten, Germany in March 2004.

The 18 articles in this state-of-the-art survey present the latest results in inductive querying and constraint-based data mining and also identify future directions of this newly emerging field lying at the intersection of data mining and database research. The papers address topical sections on foundations of inductive database frameworks, optimizing inductive queries on local patterns, optimizing inductive queries on global patterns, and applications of inductive querying techniques.

Algorithms for Mining Frequent Itemsets

   AIS
Authors: Rakesh Agrawal, Tomasz Imielinski, Arun Swami
When: 1993
Paper: Mining association rules between sets of items in large databases
Where: Preceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C.
There was a real buzz in early 90s about how to emulate the biological immune system in the real world scenarios. The capacity of the immune system to proliferate cells that produce antibodies whenever it detects a high degree of matching with an antigen is, without doubts, fascinating. A series of algortihms were invented and new systems called artificial immune systems were designed. AIS algorithm uses candidate generation to detect the frequent itemsets. The candidates are generated on the fly and are compared with previously found frequent itemsets. The disadvantage of the algorithm is that it generates and counts too many candidate itemsets that turn out to be small. AIS was the first algorithm that introduced the problem of generating association rules.

   SETM
Authors: Maurice Houtsma, Arun Swami
When: 1995
Paper: Set-oriented mining for association rules in relational databases
Where: The 11th International Conference on Data Engineering
This algorithm was actually created by Houtsma and Swami in October 1993 and included in a research report while they were working in the IBM Almaden Research Center but for some reason it was officially released only in 1995. The algorithm also generates candidates on the fly based on the transaction read from the database, just like AIS algorithm. But SETM was more created for SQL computing and uses relational operations. To use standard SQL join operation for candidate generation, SETM separates candidate generation from counting. It first generates the candidates using equi-joins and then it sorts them all and removes the ones that don't meet the minimum support. Like AIS, SETM generates many candidate itemsets that in the end turn out not be frequent.
 

   APRIORI
Authors: Rakesh Agrawal and Ramakrishnan Srikant
When: 1994
Paper: Fast algorithms for mining association rules
Where: Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile
It is by far the most important data mining algorithms for mining frequent itemsets and associations. It opened new doors and created new modalities to mine the data. Since its inception, many scholars have improved and optimized the Apriori algorithm and have presented new Apriori-like algorithms. The authors became living legends in the data mining communities. They both received masters and PhDs from University of Wisconsin, Madison and both worked for IBM. The IBM's Intelligent Miner was created mainly by them. Once coleagues, they now work for competing companies - Agrawal for Microsoft and Srikant for Google. Apriori uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.
 

   DHP - Direct Hashing and Pruning
Authors: J. S. Park, M. Chen, P.S. Yu.
Paper: An effective hash based algorithm for mining association rules.
Where: ACM SIGMOD International Conference on Management of Data
When: May 1995
As its name suggests, DHP uses a hash technique that makes it very efficient for the generation of candidate itemsets, in particular for the large two-itemsets, thus greatly improving the performance bottleneck of the whole process. In addition, DHP employs effective pruning techniques to progressively reduce the transaction database size. It is however a variation of the Apriori algorithm like so many other.
 

   Partition
Authors: Ashok Savasere, Edward Omiecinski, Shamkant Navathe
Paper: An efficient algorithm for mining association rules in large databases
Where: The 21st VLDB Conference
When: 1995
The algorithm is fundamentally different from all the previous algorithms in that it reads the database at most two times to generate all significant association rules. In the first scan of the database, it generates a set of all potentially large itemsets by scanning the database once and dividing it in a number of non-overlaping partitions. This set is a superset of all frequent itemsets so it may contain itemsets that are not frequent. During the second scan, counters for each of these itemsets are set up and their actual support is measured.
 

   ECLAT - Echivalence Class Clustering and Bottom-up Lattice Traversal
Authors: Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li
When: 1997
Paper: New Algorithms for Fast Discovery of Assoctiation Rules
Where: The 3rd International Conference on Knowledge Discovery and Data Mining
It is the first algorithm that uses a vertical data (inverted) layout. ECLAT is very efficient for large itemsets but less efficient for small ones. The frequent itemsets are determined using simple tid-list intersections in a depth-first graph. Few other algorithm that use the same techique are presented in the same paper (MAXECLAT, CLIQUE, MAXCLIQUE, TOP-DOWN) but ECLAT remained the best known.
 

   FP-GROWTH
Authors: Jiawei Han, Jian Pei, Yiwen Yin
When: 2000
Paper: Mining Frequent Patterns without Candidate Generation
Where: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA
The Apriori algorithm has its own disadvantages that made other scholars think of new aproaches to mine frequent patterns. The 2 main downsides are the possible need of generating a huge number of candidates if the number of frequent 1-itemsets is high or if the size of the frequent pattern is big the database has to be scanned repeatedly to match the candidates and determine the support What if we find a way to mine the frequent patterns without candidate generation? This would be a big improvement over Apriori. This is what the frequent-pattern growth (FP-growth) algorithm does. It adopts a divide-and-conquer strategy and a frequent-pattern tree.
 

   TREE-PROJECTION
Authors: Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad
When: 2000
Paper: A tree projection algorithm for generation of frequent itemsets
Where: Journal of Parallel and Distributed Computing
This inovation brought by this algorithm is the use of a lexicograph tree which requires substantially less memory than a hash tree. The support of the frequent itemsets is counted by projecting the transactions onto the nodes of this tree. This improves the performance of counting the number of transactions that have frequent itemsets. The lexicograph tree is traversed in a top-down fashion.
 

   PASCAL
Authors: Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, L. Lakhal
When: 2000
Paper: Mining Frequent Patterns with Counting Inference
Where: ACM SIGKDD Explorations Newsletter
The algorithm is named after the French mathematician Blaise Pascal who invented an early computing device and it is basically an optimization of the Apriori algorithm. The authors introduce the notion of key patterns and show that other frequent patterns can be infered from the key patterns without access to the database. The algorithm finds both frequent and closed sets and it is twice as fast as Close and 10 times as fast as Apriori but is only practical when the pattern length is short.
 

   H-MINE
Authors: Jian Pei, Jiawei Han Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
When: 2001
Paper: H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases
Where: IEEE International Conference on Data Mining
The algorithm introduces the concept of hyperlinked data structure (H-struct) and uses it to dynamically adjust links in the mining process. The authors actually proposed 2 algorithms. The first one, H-mine(Mem), is a memory based, efficient pattern-growth algorithm. This algorithm can be scaled up to very large databases by database partitioning and this is what actual H-mine algorithm deals with.
 

   RELIM (Recursive Elimination)
Author: Christian Borgelt
When: 2005
Paper: Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination
Where: Workshop Open Source Data Mining Software (OSDM'05, Chicago, IL)
As the author mentioned in the paper's abstract, the algorithm is strongly inspired by FP-growth (it's much simpler though) and very similar to H-mine. It doesn't use prefix trees and any other complicated structures. The work is done in one simple recursive function, which can be written with relatively few lines of code.

Algorithms for Mining Maximal Itemsets

   MAX-MINER
Authors: R.J. Bayardo
When: 1998
Paper: Efficiently Mining Long Patterns from Databases
Where: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, United States The algorithm extracts only the maximal frequent itemsets (you know by now that an itemset is maximal frequent if it has no superset that is frequent). By extracting the maximal frequent itemsets and because any frequent itemset is a subset of a maximal frequent itemset, Max-Miner generates all the frequent itemsets. The algorithm combines a levelwise bottom-up traversal with a top-down traversal in order ot quickly find the maximal frequent patterns. Then, all frequent patterns are derived from these ones and one last database scan is carried on to count their support.
 

   DepthProject
Authors: Ramesh Agrawal, Charu Aggarwal, and V.V.V. Prasad
When: August 2000
Paper: Depth first generation of long patterns
Where: The 7th International Conference on Knowledge Discovery and Data Mining
The algorithm finds frequent itemsets by using depth first search on a lexicographic tree of itemsets. The authors claimed and proved that this algorithm is faster than MaxMiner.
 

   MAFIA
Authors: Doug Burdick, Manuel Calimlim, Johannes Gehrke
When: 2001
Paper: MAFIA: A maximal frequent itemset algorithm for transactional databases
Where: Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany
MAFIA is an algorithm for mining maximal frequent itemsets from a transactional database (it has however the option to mine the closed sets as well). It is especially efficient when the itemsets in the database are very long. The search strategy integrates a depth-first traversal of the itemset lattice.
 

   GenMax
Authors: Karam Gouda and Mohammed J. Zaki
When: November 2001
Paper: Efficiently mining maximal frequent itemsets
Where: The 1st IEEE International Conference on Data Mining
GenMax is a backtrack search based algorithm for mining maximal frequent itemsets. It uses progressive focusing to perform maximality checking, and diffset propagation to perform fast frequency computation.

Algorithms for Mining Closed Frequent Itemsets

   CLOSE
Authors: Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
When: 1999
Paper: Efficient mining of association rules using closed itemset lattices
Where: Information System, Vol 24
Close is another Apriori-like algorithm that directly mines frequent closed itemsets. The first of this algorithm is to use bottom-up search to identify the smallest frequent itemset that determines a closed itemset. After finding the frequent k-itemsets, Close compares the support of each set with its subsets at the previous level. If the support of an itemset matches the support of any of its subsets, the itemset is pruned. The second step in Close is to compute the closure of all the itemsets found in the first step. The authors also created a variation of this algorithm, called A-CLOSE, that also generates a reduced set of association rules without having to determine all frequent itemsets, thus lowering the algorithm computation costs.
 

   CLOSET
Authors: Jian Pei, Jiawei Han, Runying Mao
When: 2000
Paper: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Where: Proceedings of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
It is an algorithm for minng closed itemsets that uses an FP-tree strucutre for mining closed itemsets without candidate generation. It also uses a recursive divide-and-conquer strategy and a database projection approach to mine long patterns.
 

   CHARM
Authors: Mohammed Javeed Zaki, Ching-Jiu Hsiao
When: 2002
Paper: CHARM: An Efficient Algorithm for Closed Itemset Mining
Where: SDM - SIAM International Conference on Data Mining, Chicago
It is another algorithm for mining all frequent closed itemsets. There are some innovative ideas developed by this algorithm that are of a certain value. CHARM simultaneously explores both the itemset space and transaction space, rather than only the itermset search space. It also uses a highly efficient hybrid search method that skips many levels of the IT-tree ( itemset-tidset tree) to quickly identify the frequent closed itemsets, instead of having to enumerate many possible subsets.

Reference

Mining association rules between sets of items in large databases - Rakesh Agrawal, Tomasz Imielinski, Arun Swami
Set-oriented mining for association rules in relational databases - Maurice Houtsma, Arun Swami
Fast algorithms for mining association rules - Rakesh Agrawal, Ramakrishnan Srikant
An effective hash based algorithm for mining association rules - J. S. Park, M. Chen, P.S. Yu.
An efficient algorithm for mining association rules in large databases - Ashok Savasere, Edward Omiecinski, Shamkant Navathe
New Algorithms for Fast Discovery of Assoctiation Rules - Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li
Mining Frequent Patterns without Candidate Generation - Jiawei Han, Jian Pei, Yiwen Yin
H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases - Jian Pei, Jiawei Han Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination - Christian Borgelt
A tree projection algorithm for generation of frequent itemsets - Ramesh C. Agarwal, Charu C. Aggarwal, V.V.V. Prasad
Mining Frequent Patterns with Counting Inference - Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, L. Lakhal
MAFIA: A maximal frequent itemset algorithm for transactional databases - Doug Burdick, Manuel Calimlim, Johannes Gehrke
Efficiently Mining Long Patterns from Databases - R.J. Bayardo
Depth first generation of long patterns - Ramesh Agrawal, Charu Aggarwal, and V.V.V. Prasad
Efficiently mining maximal frequent itemsets - Karam Gouda, Mohammed J. Zaki
Efficient mining of association rules using closed itemset lattices - Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets - Jian Pei, Jiawei Han, Runying Mao
CHARM: An Efficient Algorithm for Closed Itemset Mining - Mohammed Javeed Zaki, Ching-Jiu Hsiao

Association Rules main function is to Associate Past Events from the database and predict the future events association with high probability. Rules can be extended any dimension using various Algorithms. Interesting applications can be in the Space Flights, Human Response to various Medicines, Marketing Success/Failure Analysis, Stress Failure Analysis in Civil Structure Design, Deformation Studies of Bridges, Power Grid faults, Aerodynamics Effects on Flights, Radio Coverage Patterns of Cellular Networks, Call Failure Analysis in Telecommunications, Stocks Association in Price Movement and many other fields. We could effectively bring association of stocks where price movement was associated with one another, if one stock started falling others in the Group will start falling most of the time with same % Deviation.

 

Association Rules when used in conjunction with Neural Networks and Time Series created Predictions with Uncanny accuracy. Many times we were surprised with results that predicted values were accurate upto .01% Accuracy. But this is not routine and most of the predictions were within 1% of the Future values and worst case scenario was 5% accuracy.  

 

The aim of association rule mining is to find interesting and useful patterns in a transaction database. The database contains transactions which consist of a set of items and a transaction identifier like market basket. Support is defined on itemsets and gives the proportion of transactions which contain Z. It is used as a measure of significance (importance) of an itemset. Since it basically uses the count of transactions it is often called a frequency constraint. An itemset with a support > a set min. support threshold is called a frequent or large itemset. Confidence is defined as the probability of seeing the rule's consequent under the condition that the transactions also contain the antecedent. Confidence is directed and gives different values for the rules X -> Y and Y -> X.
 

Support is first used to find frequent (significant) itemsets exploiting its down-ward closure property to prune the search space. Then confidence is used in a second step to produce rules from the frequent itemsets that exceed a min. confidence threshold.
A problem with confidence is that it is sensitive to the frequency of the consequent (Y) in the database. Caused by the way confidence is calculated, consequents with higher support will automatically produce higher confidence values even if there exists no association between the items.

 

Association Rules

The input for a typical associations-mining algorithm is a set T of itemsets t, each of which is drawn from a set I of all possible items. Each t is a member of the power set 2I, but T is not considered a subset of 2I since it may contain duplicates (it is a multiset). Since I is typically large, the general problem of finding all common subsets in an arbitrary selection of itemsets is considered intractable. Therefore input sets in T, and any results derived there from, are typically assumed to be small. It is an ongoing area of research to find algorithms which relax this assumption and allow processing of larger sets.

Fixed-Confidence Mining
Commonly, association-finding algorithms attempt to find all sets of elements which occur in at least a fraction C of the data, where C is a chosen Confidence Threshold (e.g. 2%). The number of occurrences of a subset is called its support. Sets whose support exceeds C are called frequent itemsets.

If a set s is frequent, then any subset of s must also be, and most association-finding algorithms attempt to exploit this fact. Most association-finding algorithms reduce to a traversal of this subset lattice of I in some order, extending frequent itemsets and pruning out infrequent sets and their supersets.

The fixed confidence threshold has little basis in statistics, since some sets may exceed it simply by random coincidence (thereby defeating the goal of finding meaningful correlations), and some meaningful associations may occur in the data without reaching the threshold. However, in practice it does eliminate vast numbers of insignificant sets.

For a given data set, the set of its frequent itemsets can be described by its maximal frequent itemsets, which are frequent itemsets S that are not subsets of any larger frequent itemset T. During mining, finding maximal frequent itemsets first allows their subsets to be skipped, an important improvement if sets are large.


K-optimal pattern discovery is a data mining technique that provides an alternative to the frequent pattern discovery approach that underlies most association rule learning techniques.

Frequent pattern discovery techniques find all patterns for which there are sufficient examples in the sample data. In contrast, k-optimal pattern discovery techniques find the k patterns that optimize a user-specified measure of interest. The value k is also specified by the user.

Examples of k-optimal pattern discovery techniques include:

k-optimal classification rule discovery
k-optimal subgroup discovery
finding k most interesting patterns using sequential sampling
mining top.k frequent closed patterns without minimum support
k-optimal rule discovery

 

Apriori

Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).
As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number C (the cutoff, or confidence threshold) of the itemsets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k − 1. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates. For determining frequent items quickly, the algorithm uses a hash tree to store candidate itemsets. This hash tree has item sets at the leaves and hash tables at internal nodes . Note that this is not the same kind of hash tree used in for instance p2p systems
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation generates large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all 2 | S | − 1 of its proper subsets.
 

Apriori Algorithm

Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two subproblems. One is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets with the constraints of minimal confidence. Suppose one of the large itemsets is Lk, Lk = {I1, I2, … , Ik}, association rules with this itemsets are generated in the following way: the first rule is {I1, I2, … , Ik-1}⇒ {Ik}, by checking the confidence this rule can be determined as interesting or not. Then other rule are generated by deleting the last items in the antecedent and inserting it to the consequent, further the confidences of the new rules are checked to determine the interestingness of them. Those processes iterated until the antecedent becomes empty. Since the second subproblem is quite straight forward, most of the researches focus on the first subproblem. The Apriori algorithm finds the frequent sets L In Database D.

  • Find frequent set Lk − 1.

  • Join Step.

    • Ck is generated by joining Lk − 1with itself

  • Prune Step.

    • Any (k − 1) -itemset that is not frequent cannot be a subset of a frequent k -itemset, hence should be removed.

where

  • (Ck: Candidate itemset of size k)

  • (Lk: frequent itemset of size k)

Apriori Pseudocode

Apriori

large 1-itemsets that appear in more than transactions }

while

Generate(Lk − 1)

for transactions

Subset(Ck,t)

for candidates

return

 
Applications
Market Basket Analysis
• Consider shopping cart filled with several items
• Market basket analysis tries to answer the following questions:
– Who makes purchases?
– What do customers buy together?
– In what order do customers purchase items?
• Given a database of customer transactions, each transaction is a set of items – deduce association rules.

• Co-ocurrences – 80% of all customers purchase items a, b, and c together.
• Association Rules – 60% of all customers who purchase X and Y also buy Z.
• Sequential Patterns – 60% of customers who first buy X also purchase Y within two weeks.

 

Confidence and Support
• We prune the set of all possible association rules using two measures of interest:
– Confidence of a rule: X ->Y has confidence c if P(Y|X)=c.
– Support of a rule: X->Y has support s if P(XY)=s. Also, support of an item

 

Other Applications

• Direct Marketing
• Fraud Detection for Medical Insurance
• Floor/Shelf Planning
• Web Site Layout
• Cross-selling
• Frequent Itemsets applications:
– Classification
– Seeds for constructing Bayesian networks
– Web log analysis
– Collaborative filteringset XY