Return to data mining index

Association Analysis


This section will discuss association analysis. The first section is a description what it is and how associations are commonly built. The second section describes some common uses of associate analysis, as well as some benefits and drawbacks to the classic algorithm.

Building Associations

Association analysis is just that, an analysis of associations. In English, this means finding interesting relationships or associations between different elements in data. While similar to the goal of finding patterns in data, which is in general the purpose of data mining, association analysis is looking for associations, which are a specific type of pattern. Association analysis uncovers association rules, which define the presence of one item given another. The items that frequently occur together are known as frequent item sets. For example, a store might analyze purchase records to discover what items are frequently bought together. If the store found that when milk and beer are bough, diapers are also frequently bought, that would represent a frequent itemset, meaning a set of items that are frequently found together. This itemset would be written as $\{Milk, Diapers, Beer\}$. The association rule corresponding to this itemset would say if someone buys diapers and milk, they are likely to also buy beers. The rule can be written more succinctly in the form:

\(\{Milk, Diapers\} \rightarrow \{Beer\}\)

Frequent Itemset Generation

The above would represent an association rule. In order to find these rules, frequent itemsets mus first be found during a process called "frequent itemset generation". A good way to visualize itemsets is in a lattice, which starts with a root node of null , and grows items by one item each level, until the final leaf node is the set of all items. For example, if we had the set $I = \{a, b, c\}$, then the first node would be null, and the second level in the lattice would be single items sets (i.e., $\{a\}$, $\{b\}$, $\{c\}$, etc.). The third level would be 2-item itemsets (i.e., $\{ab\}$, $\{bc\}$, $\{cd\}$, etc.), then next level would be 3-item itemsets, and the final node would be the itemset $\{a, b, c\}$. The following picture illustrates this process.


This lattice enumerates all possible itemsets, however, all of these itemsets are not necessarily frequent. One approach to finding the frequent itemsets is to determine the support count for every possible, or candidate, itemset in the lattice. The support count, combined with the confidence, is the unit of measure used to determine how strong an association rule is. Support is the measure of how often a particular rule is applicable to the data set. In other words, how many records in the data set are covered by that rule, how many records contain the attributes that the rule is declaring some association about. The confidence is a measure of how frequently the items in the rule consequent appear in the rule antecedent. For example, with the rule mentioned above, the support would be the percentage of total records that contain milk, diapers, and beer. If there were 5 total records, and 2 contained the subset $\{Milk, Diapers, Beer\}$, then the support would be $2/5$. Confidence, on the other hand, is the measure of how many times diapers appear in the records that contain mile and beer. For the above example, we already determined there was 2 records that contained the subset $\{Milk, Diapers, Beer\}$. If there were 3 total transactions that contained the set $\{Milk, Diapers\}$, the rules antecedent, then the support would be $2/3$. So, where support is a measure of coverage, confidence is more of a measure of accuracy, or how good the rule is, how frequently the consequent is actually found in records containing the antecedent. Therefore, support is the measure used to determine how frequent an itemset is, since it is the measure of coverage, or frequency within the data set. The formulas for support and confidence are given by the following equations, where $X$ is the rule antecedent, $Y$ is the rule consequent, $N$ is the total number of records in the data set, and $\sigma (X)$ is the mathematical support count of $X$.

\(Support,\ s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}\)
\(Confidence,\ c(X \rightarrow Y) = \frac{\sigma (X \cup Y)}{\sigma (X)}\)

For large data sets, the brute force approach to frequent itemset generation is not ideal, due to it's inherent inefficiencies. The brute force approach considers all $k$-itemsets as potential candidates, and then prunes them based on support. A more common approach to generating frequent itemsets is to use the Apriori principle. This principle states, "If an itemset is frequent, then all of its subsets must also be frequent." 1 From this, it can also be derived that if an itemset is infrequent, all of it's superset will also be infrequent. Using these two principles, two different methods can be used to quickly generate the frequent itemsets in a data set. One is the $F_{k-1} \times F_{1}$ method, and the other is the $F_{k-1} \times F_{k-1}$ method. An algorithm using either method would first make a single pass over the data set and determine the support of each item, i.e. each one-item itemset. Then, it will repeatedly iterate through the data set, generating $k$-itemsets from the $k-1$ itemesets generated in the previous step. $k$ starts at 2, and with each iteration through the data, it is incremented by one until the specified $k$-size itemsets are found. The $F_{k-1} \times F_{1}$ and $F_{k-1} \times F_{k-1}$ methods define different ways for how $k$-size itemsets are generated during each iteration.

The $F_{k-1} \times F_{1}$ method combines frequent $k-1$ sized itemsets with frequent 1-itemsets to form frequent $k$-itemsets. So, as an example, if $k$ was 4, and there was a frequent itemset of $\{A, B, C\}$, and a frequent itemset of $\{D\}$, they would be combined to form the frequent itemset of $\{A, B, C, D\}$. This approach can lead to duplicate itemsets if precautions are not taken. Therefore it is best to keep itemsets sorted in alphabetical, or lexicographical, order as a means to prevent duplicates. The $F_{k-1} \times F_{k-1}$ method combines two $k-1$ itemsets with identical $k-2$ items. So, fore example, if searching for a 4-itemset, the two frequent itemsets $\{A, B, C\}$ and $\{A, B, D\}$ can be combined to form the frequent itemset $\{A, B, C, D\}$. Theses $k-1$ itemsets can be combined because they share the same $k-2$ subset of $\{A, B\}$. After generating frequent $k$-itemsets using either method, the algorithm prunes any itemset that doesn't meet the specified support count. This algorithm will continue to iterate through the data set until no new $k$-itemsets are generated.

Rule Generation

Association rules can be generated once the frequent itemsets are found. Association rules are extracted from frequent itemsets by dividing the itemset in two unique non-empty subsets, for example $X$ and $X-Y$, such that $X \rightarrow X-Y$ satisfies the defined confidence threshold. Since the rules are being extracted from frequent itemsets, the support threshold has already been met. The Apriori principle can also be applied to rule generation. Using the Apriori principle, rules are calculated by level, so the first set of rules generated are all the high-confidence rules that have only a single item in the rule consequent. Then two high confidence rules with single-item consequents can be combined to form a candidate rule with a 2-item consequent. For example, if the first pass through a set of frequent itemsets finds the following two rules, $\{acd\} \rightarrow \{b\}$ and $\{abd\} \rightarrow \{c\}$, these can be combined to form the candidate rule $\{ad\} \rightarrow \{bc\}$.

Low-confidence rules can be pruned according to the following rule: "If a rule $X \rightarrow Y-X$ does not satisfy the confidence threshold, then any rule $X\prime \rightarrow Y-X\prime", where $X\prime$ is a subset of $X$, must not satisfy the confidence threshold as well." 1 Therefore, if the confidence for $\{bcd\} \rightarrow \{a\}$ is low, then all rules containing $a$ in the consequent can be discarded. This would include the rules $\{cd\} \rightarrow \{ab\}$ and $\{d\} \rightarrow \{abc\}$.

Common Uses of Association Analysis

Association analysis requires binary data, since it is based on finding the presences of one item given the presence of another. Therefore, association analysis is commonly used in applications where binary data is collected. The most common example of this is market basket data. With market basket data, each record is a transaction, and each column represents an item that could be bought. For every item the customer did buy, there is a 1 in that column, and a 0 for every item he did not buy. Association analysis can be used in this contect to determine what items are frequently bought together.

Association analysis is also applicable to many other contexts. For example, it can be used to determine party affiliation based of voting records. It can be used to compare documents, maybe discovering associations between certain words used and who wrote it, or the bias of the author. It could be used to determine if an e-mail is Spam, based on the words contained in it. Other applications have included Web mining, network intrusion detection, and bioinformatics. 1

Works Cited

  1. Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Boston: Pearson Addison Wesley, 2005. Print.
Return to data mining index