Associative rules, or beer with diapers



    Introduction to Theory


    Learning on associative rules (hereinafter Associations rules learning - ARL) is, on the one hand, a simple method, and on the other hand, a method quite often applicable in real life to search for relationships (associations) in datasets, or, more precisely, itemests (itemsests). For the first time, Piatesky-Shapiro G [1] spoke in detail about this in the work “Discovery, Analysis, and Presentation of Strong Rules.” (1991) A more detailed topic was developed by Agrawal R, Imielinski T, Swami A in the works “Mining Association Rules between Sets of Items in Large Databases ”(1993) [2] and“ Fast Algorithms for Mining Association Rules. ”(1994) [3] .

    In general terms, ARL can be described as "Who bought x, also bought y." It is based on the analysis of transactions, each of which has its own unique itemset from the set of items. Using ARL alogrithms, the very same “rules” for matching items within one transaction are found, which are then sorted by their strength. Everything, in general, is simple.

    Behind this simplicity, however, are amazing things that common sense did not even suspect.

    A classic case of this cognitive dissonance is described in DJ Power's “Ask Dan!” Article published at DSSResources.com [4] .

    In 1992, Teradata’s retail consulting group led by Thomas Blishock conducted a study of 1.2 million transactions in 25 Osco Drug retailers (no, they didn’t sell drugs or even drugs, or rather, not only drugs. Drug Store is a multi-sized format shops at home). After analyzing all these transactions, the strongest rule was “Between 17:00 and 19:00 most often beer and diapers are bought together.” Unfortunately, such a rule seemed so counterintuitive to the Osco Drug management that they did not begin to put diapers on the shelves next to the beer. Although an explanation for a pair of beer-diapers was completely found: when both members of a young family returned home from work (just about 5 o’clock in the evening), wives usually sent husbands for diapers to the nearest store. And husbands, without hesitation,

    Description of Association rule


    So, we found out that beer and diapers are a good gentleman's set, and what's next?
    Suppose we have a certain dataset (or collection) D, such that$ D = {d_0 ... d_j} $, where d is a unique transaction-itemset (for example, a cash receipt). Inside each d there is a set of items (i - item) , and in the ideal case it is presented in binary form: d1 = [{Beer: 1}, {Water: 0}, {Cola: 1}, {...}], d2 = [{Beer: 0}, {Water: 1}, {Cola: 1}, {...}] . It is customary to describe each itemset through the number of non-zero values ​​( k-itemset ), for example, [{Beer: 1}, {Water: 0}, {Cola: 1}] is a 2-itemset .

    If the dataset is not initially presented in binary form, you can convert it with your own hands ( One Hot Encoding and pd.get_dummies to help you).

    Thus, the dataset is a sparse matrix with values ​​of {1,0}. This will be a binary dataset. There are other types of records - a vertical dataset (for each individual item it shows the transaction vector where it is present) and a transaction dataset (like in a cash receipt).

    OK, the data has converted, how to find the rules?
    There are a number of key (if you want - basic) concepts in ARL that these rules will help us derive.

    Support


    The first concept in ARL is support:

    $ supp (X) = \ frac {\ {t \ in T; \ X \ in t \}} {| T |} $



    , where X is an itemset containing i-items, and T is the number of transactions. Those. in general terms, this is an indicator of the "frequency" of this itemset in all analyzed transactions. But this only applies to X. We are rather interested in the option when we have x1 and x2 (for example) in one itemset. Well, everything is simple here too. Let x1 = {Beer}, and x2 = {Diapers} , so we need to calculate how many transactions this pair occurs.

    $ supp (x_1 \ cup x_2) = \ frac {\ sigma (x_1 \ cup x_2)} {| T |} $


    Where $ \ sigma $ - the number of transactions containing $ x_1 $ and $ x_2 $
    We will deal with this concept on a toy dataset.

    play_set # microdataset, where transaction numbers are indicated, and also in binary form it is presented what was bought on each transaction



    $ supp = \ frac {\ text {Transactions with beer and diapers}} {\ text {All transactions}} = P (\ text {Beer} \ cap \ text {Diapers}) $



    $ supp = \ frac {3} {5} = 60 \% $



    Confidence


    The next key concept is confidence. This is an indicator of how often our rule works for the entire dataset.

    $ conf (x_1 \ cup x_2) = \ frac {supp (x_1 \ cup x_2)} {supp (x_1)} $



    Let us give an example: we want to calculate confidence for the rule “who buys beer, he also buys diapers”.

    To do this, first calculate the support for the rule “buys beer”, then calculate the support for the rule “buys beer and diapers”, and simply divide one into another. Those. we will count in how many cases (transactions) the rule “bought beer” supp (X), “bought diapers and beer” works

    $ supp (X \ cup Y) $


    Doesn’t resemble anything? Bayes looks at all this somewhat puzzled and contemptuously.



    Let's look at our microdataset.

    $ conf (\ text {Beer} \ cup \ text {Diapers}) = \ frac {supp (\ text {Beer} \ cup \ text {Diapers})} {supp (\ text {Beer})} = P (\ text {Diapers} \ mid \ text {Beer}) $



    $ conf = \ frac {3} {4} = 75 \% $



    Lift (support)


    The next concept on our list is lift. Roughly speaking, lift is the relation of “dependency” items to their “independence”. Lift shows how items depend on each other. This is obvious from the formula:

    $ lift (x_1 \ cup x_2) = \ frac {supp (x_1 \ cup x_2)} {supp (x_1) \ times supp (x_2)} $



    For example, we want to understand the relationship between buying beer and buying diapers. To do this, we support the “bought beer and diapers” rules and divide it into the product of the rules “bought beer” and “bought diapers”. In case lift = 1 , we say that the items are independent and there are no rules for joint purchases here. If lift> 1, then the amount by which lift, in fact, is greater than this very unit, and will show us the "strength" of the rule. The more units, the steeper. If lift <1, then this will show that the base rule $ x_2 $ negatively affects the rule $ x_1 $. In another way, lift can be defined as the ratio of confidence to expected confidence, i.e. the ratio of the reliability of the rule when both (well, or whatever you want) elements are bought together to the reliability of the rule when one of the elements was bought (it doesn’t matter, with the second or without).

    $ lift = \ frac {\ text {Confidence}} {\ text {Expected confidence}} = \ frac {P (\ text {Diapers} \ mid \ text {Beer})} {P (\ text {Diapers})} $


    $ lift = \ frac {\ frac {3} {4}} {\ frac {3} {5}} = $ 1.25



    Those. our rule that beer is bought with diapers is 25% more powerful than the rule that diapers are just bought

    Conviction


    In general, Conviction is the “error rate” of our rule. That is, for example, how often they bought beer without diapers and vice versa.

    $ conv (x_1 \ cup x_2) = \ frac {1 - supp (x_2)} {1 - conf (x_1 \ cup x_2)} $



    $ conv (\ text {Beer} \ cup \ text {Diapers}) = \ frac {1 - supp (\ text {Diapers})} {1 - conf (\ text {Beer} \ cup \ text {Diapers})} = \ frac {1 - 0.6} {1 - 0.75} = $ 1.6



    The result of the formula above 1 is better. For example, if the conviction of buying beer and diapers together would be 1.2, this means that the rule “bought beer and diapers” would be 1.2 times (20%) more true than if the coincidence of these items in one transaction would be pure random. A little not an intuitive concept, but it is not used as often as the previous three.

    There are a number of commonly used classical algorithms that allow you to find rules in itemsets according to the concepts listed above - Naive or Bruteforce algorithm, Apriori- algorithm, ECLAT algorithm, FP-growth algorithm and others.

    Bruteforce


    Finding rules in itemsets is generally not difficult; it is difficult to do this effectively. The brute force algorithm is the simplest and, at the same time, the most inefficient way.

    In pseudo-code, the algorithm looks like this:

    ENTRANCE : Dataset D containing a list of transactions
    EXIT : Sets of itemsets$ F_1, F_2, F_3, ... F_q, $ Where $ F_i $- a set of itemsets of size I that occur at least s times in D
    APPROACH :

    1. R - an integer array containing all combinations of items in D, of size$ 2 ^ {| D |} $
    2. for n $ \ in $[1, | D |] do:
    F - all possible combinations from$ D_n $
    Increase each value in R according to the values ​​in each F []
    3. return All itemsets in R$ \ geq $ s

    Сложность брутфорс-алгоритма очевидна:

    Для того чтобы найти все возможные Association rules применяя брутфорс-алгоритм нам необходимо перечислить все подмножества X из набора I и для каждого подмножества X рассчитать supp(X). Данный подход будет состоять из следующих шагов:

    • генерация кандидатов. Данный шаг состоит из перебора всех возможных подмножеств X подмножества I. Данные подмножества называются кандидатами, поскольку каждое подмножество теоретически может являться часто встречаемым подмножеством, то множество потенциальных кандидатов будет состоять из

      $ 2 ^ {| D |} $

      элементов (здесь |D| обозначается число элементов множества I, а

      $ 2 ^ {| D |} $

      часто называется булеаном множества I, то есть множество всех подмножеств I)
    • calculation support . At this step, supp (X) of each candidate X is calculated

    Thus, the complexity of our algorithm will be:

    $ O (| I | * | D | * 2 ^ {| I |}) $

    where

    $ O (2 ^ {| I |}) $

    - number of possible candidates

    $ O (| I | * | D |) $

    - the complexity of calculating supp (X), because to calculate supp (X) we need to iterate over all the elements in I in each transaction

    $ t \ in T $



    In the O-big notation, we need O (N) operations, where N is the number of itemsets.
    Thus, to apply the brute force algorithm, we need$2^i $memory cells, where i is an individual itemset. Those. for storage and calculation of 34 items we need 128GB RAM (for 64-bit integer cells).

    Thus, bruteforce will help us in calculating transactions in a tent with shawarma at the station, but for something more serious it is completely unsuitable.

    Apriori algorithm


    Theory

    Used concepts:

    • A lot of objects (itemset):

      $X \subseteq I = \{x_1, x_2, ..., x_n\}$

    • Many transaction identifiers (tidset):

      $T = \{t_1, t_2, ..., t_m\}$

    • Many transactions:

      $\{(t,\ X):\ t\in T,\ X \in I\}$


    We introduce some more concepts.

    We will consider a prefix tree, where 2 elements X and Y are connected if X is a direct subset of Y. Thus, we can number all subsets of the set I. Figure below:


    So, we will consider the Apriori algorithm.

    Apriori uses the following statement: if$ X \subseteq Y$then $supp(X) \geq supp(Y) $.

    The following 2 properties follow from here:

    • if Y is common, then any subset

      $ X: X \subseteq Y $

      also common
    • if X is rare, then any superset

      $ Y: Y \supseteq X $

      also rare

    Apriori algorithm goes through the prefix tree in levels and calculates the frequency of occurrence of subsets of X in D. Thus, in accordance with the algorithm:

    • rare subsets and all their supersets are excluded
    • supp (X) is calculated for each suitable candidate X of size k at level k



    In the pseudo-notation code, the a priori algorithm is as follows:
    INPUT : Dataset D containing a list of transactions, and$\sigma$- user-defined threshold supp

    EXIT : List itemsets F (D,$\sigma$)

    APPROACH :

    1.$C_1 \leftarrow$[{i} | i in J]
    2. k$\leftarrow$1
    3. while$C_k \neq$1 do:
    4. We consider all support for all candidates
    for all transactions (tid, I)$\in$D do:
    for all candidates X$\in$$C_k$do:
    if X$\in$I:
    X.support ++
    5. # We pull out all the frequent itemsets:
    $F_k$ = {X | X.support> $\sigma$}
    6. # Generate new candidates
    $\forall$ X, Y $\in$$ F_i$, X [i] = Y [i] for 1 $\leq$ i $\leq$ k-1 and X [k] $\leq$Y [k] do:
    I = X$\cup${Y | k |}
    if$\forall$ J $\subseteq$I, | J | = k: J$\in$$F_k$ then
    $C_k+1$$\in$$C_k+1$$\cup$I
    k ++

    Thus, Apriori first looks for all single (containing 1 element) itemsets that satisfy the supp specified by the user, then makes pairs of them according to the principle of hierarchical monotony, i.e. if$x_1$ is common and$ x_2$ is common then and $[x_1, x_2]$is common.

    A clear minus of this approach is that you need to "scan" the entire dataset, count all supp at all levels of the breadth-first search ( wide search )
    This can also eat up RAM on large datasets, although the algorithm in terms of speed is still much more efficient than brute force.

    Implementation in Python

    from sklearn import ... umm ... but there is nothing to import. There are currently no modules for ALR in sklearn. Well, nothing, google or write your own, right?

    A number of implementations walk around the network, for example [here] , [here] , and even [here] .

    In practice, we adhere to the apyori algorithm written by Yu Mochizuki. We won’t provide the full code, those who wish can see it [here] , but we will show the architecture of the solution and an example of use.

    Mochizuki’s solution can be conditionally divided into 4 parts: Data structure, Internal functions, API and Application functions.

    The first part of the module (Data Structure) works with the original dataset. The TransactionManager class is implemented, the methods of which combine transactions into a matrix, form a list of candidate rules and consider support for each rule. Internal functions additionally for support form lists of rules and accordingly rank them. The API logically allows you to work directly with datasets, and the application functions allow you to process transactions and display the result in a readable form. No rocketscience.

    Let's see how to use the module on a real (well, in this case - a toy) dataset.

    Data loading
    import pandas as pd
    # загрузим данные
    dataset = pd.read_csv('data/Market_Basket_Optimisation.csv', header = None)
    # посомтрим на датасет
    dataset.head()




    We see that our dataset represents a sparse matrix, where in the rows we have a set of items in each transaction.

    Replace NaN with the last value inside the transaction, so that later it will be easier to process the entire dataset.

    The code
    dataset.fillna(method = 'ffill',axis = 1, inplace = True)
    dataset.head()



    #создаим из них матрицу
    transactions = []
    for i in range(0, 7501): 
        transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])
    #загружаем apriori
    import apriori
    %%time
    # и обучимся правилам. Обратите внимание, что пороговые значения мы вибираем сами в зависимости от того,
    # насколкьо "сильные" правила мы хотим получить
    # min_support -- минимальный support для правил (dtype = float).
    # min_confidence -- минимальное значение confidence для правил (dtype = float)
    # min_lift -- минимальный lift (dtype = float)
    # max_length -- максимальная длина itemset (вспоминаем про k-itemset)  (dtype = integer)
    result = list(apriori.apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 4, min_length = 2))


    Visualize output

    Code magic
    import shutil, os 
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO
    import json #преобразовывать будем в json, используя встроенные в модуль методы
    output = []
    for RelationRecord in result:
        o = StringIO()
        apriori.dump_as_json(RelationRecord, o)
        output.append(json.loads(o.getvalue()))
    data_df = pd.DataFrame(output)
    # и взгялнем на итоги
    pd.set_option('display.max_colwidth', -1)
    from IPython.display import display, HTML
    display(HTML(data_df.to_html()))




    Итого мы видим:

    1. Пары items
    2. items_base — первый элемент пары
    3. items_add — второй (добавленный алгоритмом) элемент пары
    4. confidence — значение confidence для пары
    5. lift — значение lift для пары
    6. support — значение support для пары. При желании, по нему можно отсортировать

    Результаты логичные: эскалоп и макароны, эскалоп и сливочно-грибной соус, курица и нежирная сметана, мягкий сыр и мед и т.д. — все это вполне логичные и, главное, вкусные сочетания.

    Реализация в R

    ARL тот случай, когда R-филы могут злорадно похихикать (java-филы вообще смотрят на нас смертных с презрением, но об этом ниже). В R реализована библиотека arules, где присутствует и apriori, и другие алгоритмы. Официальную доку можно посмотреть [here]

    Let's look at it in action:

    First, install it (if not already installed).

    Library installation
    install.packages('arules')


    We read the data and transform it into a transaction matrix.
    Read data
    library(arules)
    dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
    dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)


    Let's look at the data:
    summary(dataset)
    itemFrequencyPlot(dataset, topN = 10)

    The graph shows the frequency of individual items in transactions.



    Let's learn our rules:

    In general, the apriori call function looks like this:
    apriori(data, parameter = NULL, appearance = NULL, control = NULL)
    , where

    data is our
    parameter dataset - a list of parameters for the model: minimum support, confidence and lift
    appearance - is responsible for displaying data. It can take the values lhs, rhs, both, items, none , which determine the position of items in the output
    control - it is responsible for sorting the output (ascending, descending, without sorting), and also whether or not to display the progress bar (verbose parameter)

    Let's train the model :

    rules = apriori(data = dataset, parameter = list(support = 0.004, confidence = 0.2))

    And look at the results:

    inspect(sort(rules, by = 'lift')[1:10])

    We will verify that the output has approximately the same results as when using the apyori module in Python:

    1. {light cream} => {chicken} 0.004532729
    2. {pasta} => {escalope} 0.005865885
    3. {pasta} => { shrimp} 0.005065991
    4. {eggs, ground beef} => {herb & pepper} 0.004132782
    5. {whole wheat pasta} => {olive oil} 0.007998933
    6. {herb & pepper, spaghetti} => {ground beef} 0.006399147
    7 . {herb & pepper, mineral water} => {ground beef} 0.006665778
    8. {tomato sauce} => {ground beef} 0.005332622
    9. {mushroom cream sauce} => {escalope} 0.005732569
    10. {frozen vegetables, mineral water , spaghetti} => {ground beef} 0.004399413

    ECLAT Algorithm


    Theory

    The idea of ​​the ECLAT (Equivalence CLAss Transformation) algorithm is to speed up the counting of supp (X). To do this, we need to index our database D so that it allows us to quickly calculate supp (X).

    It is easy to see that if t (X) denotes the set of all transactions where a subset X occurs, then
    t (XY) = t (X)$\cup$t (Y)
    and
    supp (XY) = | t (XY) |
    that is, supp (XY) is equal to the cardinality (size) of the set t (XY)



    This approach can be significantly improved by reducing the size of the intermediate sets of transaction identifiers (tidsets). Namely, we can not store the entire set of transactions at an intermediate level, but only the many differences of these transactions. Let's pretend that

    $X_a = \{x_1, x_2,..., x_{n-1}, a\} X_b = \{x_1, x_2,..., x_{n-1}, b\}$


    Then, we get:

    $X_{ab} = \{x_1, x_2,..., x_{n-1}, a, b\}$


    $diffset(X_{ab})$ this is the set of all transaction id that contain the prefix X_a, but do not contain the element b:

    $d(X_{ab}) =t(X_a)/t(X_{ab})=t(X_a)/t(X_{b})$





    Unlike the Apriori algorithm, ECLAT searches in depth (DFS, [more details here] ). It is sometimes called “vertical” (as opposed to “horizontal” for Apriori).

    INPUT : Dataset D, containing a list of transactions,$\sigma$ - user-defined threshold supp and new element prefix $I \subseteq J$

    EXIT : List of itemsets F [I] (D,$\sigma$) for the corresponding prefix I

    APPROACH:
    1.$F[i] \leftarrow ${}
    2.$\forall$i $\in $J $\in $D do:
    F [I]: = F [I]$\in $ {I $\in ${i}}
    3. Create$D_i$
    $D_i \leftarrow$ {}
    $\forall$j $\in$ J $\in$ D $\in$j> i do:
    C$\cup$ ({i} $\cup${j})
    if | C |$\geq \sigma$ then
    $D_i \leftarrow D_i \in {C,i}$
    4. DFS - recursion:
    Counting$F|I| \in {i}| (D_i, \sigma)$
    F [I]: F [I] $\in$ F [I $\in$i]

    The key concept for the ECLAT algorithm is the I-prefix. At the beginning, an empty set I is generated, this allows us to select all frequency itemsets on the first pass. Then the algorithm will call itself and increase I by 1 at each step until the length specified by the user is reached I.

    To store the values, a prefix tree is used (trie (not tree :)), [more details here]. First, the zero root of the tree is constructed (that is, the empty set I), then, as you go through itemsets, the algorithm registers the items contained in each itesmsets, while the leftmost branch is the child of the zero root and further down. There are as many branches as items are found in itemsets. This approach allows you to write itemset in memory only once, which makes ECLAT faster than Apriori.

    Implementation in Python

    There are several implementations of this algorithm in Python, those who wish can google and even try to get them to work, but we will write our own, as simple and, most important, working as possible.

    import numpy as np
    """
    Класс инициируется 3мя параметрами:
    - min_supp - минимальный support  который мы рассматриваем для ItemSet. Рассчитывается как % от количества транзакций
    - max_items - максимальное количество елементов в нашем ItemSet
    - min_items - минимальное количество элементов ItemSet
    """
    class Eclat:
        #инициализация объекта класса
        def __init__(self, min_support = 0.01, max_items = 5, min_items = 2):
            self.min_support = min_support
            self.max_items = max_items
            self.min_items = min_items
            self.item_lst = list()
            self.item_len = 0
            self.item_dict = dict()
            self.final_dict = dict()
            self.data_size = 0
        #создание словаря из ненулевых объектов из всех транзакций (вертикальный датасет)
        def read_data(self, dataset):
            for index, row in dataset.iterrows():
                row_wo_na = set(row[0])
                for item in row_wo_na:
                    item = item.strip()
                    if item in self.item_dict:
                        self.item_dict[item][0] += 1
                    else:
                        self.item_dict.setdefault(item, []).append(1)
                    self.item_dict[item].append(index)
            #задаем переменные экземпляра (instance variables)
            self.data_size = dataset.shape[0]
            self.item_lst = list(self.item_dict.keys())
            self.item_len = len(self.item_lst)
            self.min_support = self.min_support * self.data_size
            #print ("min_supp", self.min_support)
        #рекурсивный метод для поиска всех ItemSet по алгоритму Eclat
        #структура данных: {Item: [Supp number, tid1, tid2, tid3, ...]}
        def recur_eclat(self, item_name, tids_array, minsupp, num_items, k_start):
            if tids_array[0] >= minsupp and num_items <= self.max_items:
                for k in range(k_start+1, self.item_len):
                    if self.item_dict[self.item_lst[k]][0] >= minsupp:
                        new_item = item_name + " | " + self.item_lst[k]
                        new_tids = np.intersect1d(tids_array[1:], self.item_dict[self.item_lst[k]][1:])
                        new_tids_size = new_tids.size
                        new_tids = np.insert(new_tids, 0, new_tids_size)
                        if new_tids_size >= minsupp:
                            if num_items >= self.min_items: self.final_dict.update({new_item: new_tids})
                            self.recur_eclat(new_item, new_tids, minsupp, num_items+1, k)
        #последовательный вызов функций определенных выше
        def fit(self, dataset):
            i = 0
            self.read_data(dataset)
            for w in self.item_lst:
                self.recur_eclat(w, self.item_dict[w], self.min_support, 2, i)
                i+=1
            return self
        #вывод в форме словаря {ItemSet: support(ItemSet)}
        def transform(self):
            return {k: "{0:.4f}%".format((v[0]+0.0)/self.data_size*100) for k, v in self.final_dict.items()}

    Testing on a toy sample:

    #создадим экземпляр класса с нужными нам параметрами
    model = Eclat(min_support = 0.01, max_items = 4, min_items = 3)

    #обучим
    model.fit(dataset)

    #и визуализируем результаты
    model.transform()




    Meanwhile in real-life ...

    So, the algorithm works. But is it applicable in real life? Let's check.
    There is a real business problem that we received from a large grocery retailer of the premium segment (we will not disclose the name and product names, corporate secret-s): to look at the most frequent sets in grocery baskets.

    Download data from the unloading from the POS system (Point-of-Sale - a system that processes transactions at the checkout)

    Data loading
    df = pd.read_csv('data/tranprod1.csv', delimiter = ';', header = 0)
    df.columns = ['trans', 'item']
    print(df.shape)
    df.head()




    Change the table format to “transaction | list »all item in transaction

    Transformations
    df.trans = pd.to_numeric(df.trans, errors='coerce')
    df.dropna(axis = 0, how = 'all', inplace = True)
    df.trans = df.trans.astype(int)


    df = df.groupby('trans').agg(lambda x: x.tolist())

    df.head()




    Run the algorithm

    model = Eclat(min_support = 0.0001, max_items = 4, min_items = 3)

    %%time
    model.fit(df)

    Data read successfully
    min_supp 9.755
    CPU times: user 6h 47min 9s, sys: 22.2 s, total: 6h 47min 31s
    Wall time: 6h 47min 28s


    model.transform()



    On a regular computer, the solution to this problem took essentially night. On cloud platforms it would be faster. But, most importantly, the customer is satisfied.

    As you can see, implementing the algorithm on your own is quite simple, although it is worth working with efficiency.

    Implementation in R
    Once again, users of R rejoice, for them no dancing with a tambourine is necessary, all by analogy with apriori.

    We start the library and read the data (for acceleration we take the same toy dataset on which apriori was considered):

    Data preparation
    library(arules)
    dataset = read.csv('Market_Basket_Optimisation.csv')
    dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)


    A quick look at the dataset:

    summary(dataset)
    itemFrequencyPlot(dataset, topN = 10)

    The same frequencies as before this



    Rule:

    rules = eclat(data = dataset, parameter = list(support = 0.003, minlen = 2))

    Please note that we only configure support and the minimum length (k in k-itemset)

    and look at the results:

    inspect(sort(rules, by = 'support')[1:10])



    FP-Growth Algorithm


    Theory

    FP-Growth (Frequent Pattern Growth) is the youngest algorithm of our trinity; it was first described in 2000 in [7] .

    FP-Growth offers a radical thing - to abandon the generation of candidates (recall, the generation of candidates is the basis of Apriori and ECLAT). Theoretically, this approach will allow to further increase the speed of the algorithm and use even less memory.

    This is achieved by storing in the memory of the prefix tree (trie) not from combinations of candidates, but from the transactions themselves.

    In this case, FP-Growth generates a header table for each item whose supp is higher than the one specified by the user. This header table stores a linked list of all the nodes of the same type in the prefix tree. Thus, the algorithm combines the advantagesBFS through the header table and DFS through the construction of trie. The algorithm pseudo-code is similar to ECALT, with a few exceptions.

    ENTRANCE : Dataset D containing a list of transactions,$\sigma$ - user-defined threshold supp and prefix $I \subseteq J$

    EXIT : List of itemsets F [I] (D,$\sigma$) for the corresponding prefix I

    APPROACH :

    1. F [i]$\leftarrow${}
    2.$\forall$i $\in$ J $\in$D do:
    F [I]: = F [I]$\in$ {I $\in${i}}
    3. Create$D_i$
    $D_i \leftarrow D_i $ {}
    $H_i \leftarrow$ {}
    $\forall$j $\in$J $\in$ D $\in$j> i do:
    if supp (I$\in$ {i, j})$\geq \sigma$then:
    H$\leftarrow$ H $\in$ {j}
    $\forall$(tid, X) $\in$D $\subseteq$ I $\in$ X do:
    $D_i \leftarrow D_i $$\subseteq$ ({tid, X $\in$H})
    4. DFS - recursion:
    Count F | I$\in${i} | ($D_i, \sigma$)
    F [I]$\leftarrow$ F [I] $\subseteq$ F [I $\subseteq$i]

    Python

    Implementation The FP-Growth implementations in Python were no luckier than other ALR algorithms. There are no standard libraries for it.

    Not bad FP_Growth is presented in pyspark, look [here]
    On gitHub you can also find several solutions of the Neolithic era, for example [here] and [here]

    We will test the second option:

    Installation and Import
    !pip install pyfpgrowth

    import pyfpgrowth


    #Сгенериуем паттерны
    patterns = pyfpgrowth.find_frequent_patterns(transactions, 2)
    #Выучим правила
    rules = pyfpgrowth.generate_association_rules(patterns, 30);
    #Покажем
    rules



    Implementation in R

    In this case, R does not lag behind Python: in such a convenient and native arules library FP-Growth is missing.

    At the same time, as for Python, the implementation exists in Spark - [Link]

    And in fact...


    In fact, if you want to use ARL in your business tasks, we strongly recommend learning Java .

    Weka (Waikato Environment for Knowledge Analysis). It is a free machine learning software written in Java. It was developed at Waikato University in New Zealand in 1993. Weka has both a GUI and the ability to work from the command line. The advantages include the ease of use of the graphical interface - there is no need to write code to solve application problems. To use Weka libraries in Python, you can install the shell for Weka in Python: python-weka-wrapper3. The shell uses the javabridge package to access the Weka API. Detailed installation instructions can be found [here] .

    SPMFThis is an open source data mining library written in Java that specializes in finding patterns in data ( [link] ). It is stated that in SPMF implemented more than 55 algorithms for mining data. Unfortunately, there is no official SPMF shell for Python (at least at the time of this writing)

    Conclusion or once again about efficiency


    In conclusion, let's empirically compare the effectiveness of the runtime metric depending on the density of the dataset and transaction lengths of the dataset [9] .

    By dataset density is meant the ratio of transactions containing frequent items to the total number of transactions.

    Efficiency of Algorithms at Different Dataset Density



    It is obvious from the graph that the efficiency (the less runtime is, the more efficient) the Apriori algorithm decreases with increasing dataset density.

    Under length transaction number refers to items in the itemset

    Efficiency algorithms of varying length datasets



    Obviously, with an increase in transaction length, Apriori also does much worse.

    К вопросу о подборе гиперпараметров


    It's time to talk a little about how to select the hyperparameters of our models (they are the same support, confidence, etc.)

    Since the ARules algorithms are sensitive to hyperparameters, it is necessary to correctly approach the issue of choice. The difference in execution time for different combinations of hyperparameters can vary significantly.

    The main hyperparameter in any ARules algorithm is min support. Roughly speaking, he is responsible for the minimum relative frequency of the associative rule in the sample. Choosing this indicator, you must first focus on the task. Most often, the customer needs a small number of top combinations, so you can set a high value of min support. However, in this case, some outliers (bundle-products , for example) and not get any interesting combinations. In general, the higher you set the caliper value, the more powerful the rules you find, and the faster the algorithm is considered. We recommend setting a lower value during the first run to understand which pairs, triples, etc. goods are generally found in the dataset. Then gradually increase the step, reaching the desired value (set by the customer, for example). In practice, 0.1% (with a very large dataset) will also be a good value for min supp.

    Below we give an approximate graph of the dependence of the execution time of the algorithm on this indicator.



    In general, everything as usual depends on the task and the data.

    Summary


    So, we got acquainted with the basic theory of ARL (“who bought x, also bought y”) and the basic concepts and metrics (support, confidence, lift and conviction).

    We looked at the 3 most popular (and as old as the world) algorithms (Apriori, ECLAT, FP-Growth), envied R users and arules libraries, tried to implement ECLAT ourselves.

    Highlights:

    1. ARLs are the basis of recommendation systems
    2. ARLs are widely applicable - from traditional retail and online retail (from Ozon to Steam), regular purchases of goods and materials to banks and telecoms (connected services)
    3. ARL is relatively easy to use, there are implementations of different levels of development for different tasks.
    4. ARLs are well interpreted and do not require special skills.
    5. At the same time, algorithms, especially classical ones, cannot be called super-efficient. If you work with them out of the box on large datasets, you may need more processing power. But nothing prevents us from finishing them, right?

    In addition to the considered everyday algorithms, there are modifications and branches:

    The CHARM algorithm for finding closed itemsets. This algorithm perfectly reduces the complexity of finding rules from exponential (i.e., increasing with increasing dataset, for example) to polynomial. A closed itemset is an itemet for which there is no superset (i.e., a set that includes our itemset + other items) with the same frequency (= support).

    Here it’s worth a little clarification - until now, we have considered just frequent (frequent) itemsets. There is also the concept of closed (see above) and maximum . The maximum itemset is an itemset for which there is no frequent (= frequent) superset.

    The relationship between these three types of itemsets is shown in the picture below:



    AprioriDP (Dynamic Programming) - allows you to store supp in a special data structure, works a little faster than the classic Apriori

    FP Bonsai - improved FP-Growth with pruning a prefix tree (example of an algorithm with restrictions )

    In conclusion may not mention the gloomy genius of ARL Dr. Christian Borgeltfrom the University of Constance.

    Christian implemented the algorithms we mentioned in C, Python, Java, and R. Well, or almost all. There is even a GUI for its authorship, where in a couple of clicks you can download the dataset, select the desired algorithm and find the rules. This is provided that it works for you.

    For simple tasks, what we examined in this article is sufficient. And if not enough - we urge you to write the implementation yourself!

    References:

    Discovery, analysis and presentation of strong rules. G. Piatetsky-Shapiro. Knowledge Discovery in Databases, AAAI Press, (1991)
    Mining Association Rules between Sets of Items in Large Databases
    Fast Algorithms for Mining Association Rules
    Ask Dan!
    Introduction to arules - A computational environment for mining association rules and frequent item sets
    Dr. Borgelt's Publications
    J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in ACM SIGMOD Record, vol. 29, no. 2. ACM, 2000, pp. 1-12.
    Shimon, Sh. Improving Data mining algorithms using constraints. The Open University of Israel, 2012.
    Jeff Heaton. Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms
    Sources

    Authors: Pavel Golubev , Nikolai Bashlykov

    Also popular now: