Map / Reduce: Real-World Solution - TF-IDF

Published on November 10, 2009

Map / Reduce: Real-World Solution - TF-IDF

    Yesterday I asked a question in my HabraBlog - are people interested in learning what Hadoop is in terms of its real application? It turned out interesting. The matter is short-lived - I wrote the article quite quickly (at least its first part) - at least because I already knew what I was going to write about (because I still remember quite well how I poked around looking for information when I started using Hadoop ) The first article will discuss the basics - but not the ones that people usually talk about :-)

    Before reading the article, I highly recommend that you study at least the first and last sources from the reading list - understanding them or at least reading them practically guarantees that the article will be understood without problems. Here we go?

    What is Hadoop?




    Well, tell me, what's the point of writing about it? This has been said more than once, posts on the topic of Hadoop, HDFS and others have begun to be written several times. Unfortunately, usually it all ended with a rather lengthy introduction and the phrase “To be continued”. So: this is a continuation. To some, the topic covered in this article may seem completely trivial and uninteresting, but the dashing trouble has begun - any complex tasks must be solved in parts. This statement, in particular, we implement in the course of the article. I’ll immediately notice that I’ll try to avoid writing code in the framework of this specific article - it can wait, but you can understand the principles of building programs that work with Map / Reduce “on cats” (in addition, with the current frequency of dramatic changes to the Hadoop API, any code becomes obsolete in about a month).

    When I started to deal with Hadup, the initial understanding of the Map / Reduce ideology became a great difficulty for me personally (I prefer to write this phrase in such a way as to emphasize that this is not a product, but a principle). The essence and value of the method will become clear at the very end - after we solve a simple problem.
    1. The number of words in the case
    2. The number of terms in the case (hereinafter I will understand the term token as a unique term)
    3. The number of documents in the case
    4. How many times each term appears in each document
    5. How many documents each term contains.
    Let's face it - the task is simple and can be solved very easily - I don’t think it will take you more than half an hour to write such a program. Everything becomes a little more complicated if the text becomes larger , the number of words grows unlimited , the number of terms approaches the number of words in the text language, and dictionaries cease to fit in memory. It's time to remember that the first principle of not only developing complex software, but also solving problems in general, has been formulated for quite some time and is being formulated as “divide and conquer”. So, we divide the task into its constituent elements.

    To begin with, we will make the assumption (to simplify, and later we will consider how to solve this problem in a more general case) that our input data is presented in a text file format with a very simple structure:

    <w11> <w12> <w13> ... <w1N>
    <w21> <w22> <w23> ... <w2M>
    ...

    In other words, each document is a set of words that are possible (and very likely) repeated, and each such set of words is located on one line of a text file. This assumption is small - almost every document can be presented in this form.

    Counting the number of words in the corpus is a simple task, and is solved in linear time, however, it may not even have to be solved separately. The same applies to the number of terms - they will be calculated on their own. The most interesting task at this stage is to calculate how many times each term occurs in the body of the text.

    Yes, yes, you are not mistaken, it is she - the most popular and already tortured by all task, the first example in each Hadoop manual is the WordCount program. It is so popular, how simple it is - I won’t even bring it here, you can see it in the official Hadoup tutorial. If very briefly, then at the map step the program generates the following pairs:

    map1:
        <term1> 1
        <term2> 1
        <term3> 1
    map2:
        <term2> 1
        <term3> 1
        <term3> 1
    map3:
        <term1> 1
        ...

    In the reduce step, each reduce-task receives a key (i.e., <term1>, <term2> and so on) and a list of all the values ​​associated with this key obtained from all map tasks (in this case, just a list of units). It will look like:

    <term1> 1,1
    <term2> 1,1
    <term3> 1,1,1
    




    Summing up these units (and in fact - just counting the number of elements in the list) we get the number of occurrences of each term in the corpus:

    the 19283
    to 3432
    from 343
    ...
    ...
    london 14

    This is already something, although the value of this data is not obvious. However, a simple count of the number of lines in the resulting file gives us a number of unique terms. And summing up all the values ​​from the second column, we get the total number of tokens. It seems that they didn’t do anything - but already received several fundamental characteristics of the case.

    Next begins the classic information retrieval. To begin with, based on the results of the work, WordCountwe build a dictionary - that is, a general list of terms of the corpus. Our next task is to establish how often and which dictionary terms are used in each of the documents. To do this, we are already implementing a slightly modified versionWordCount, which considers the number of terms applicable to a particular document. Probably the easiest way to achieve this is to use a key in the results of map tasks that consists of a document identifier (mapper's input key) and the term:

    map1:
        1_the 1
        1_the 1
        1_the 1
        1_to 1
        ...
    map2:
        2_the 1
        2_the 1
        2_from 1
        ...
    map3:
        37_london 1
        ...
    

    Reduce for this task will be identical to the classic one WordCount- it will simply summarize the values ​​with the same key. As a result, we get:

    1_the 3
    1_to 1
    ...
    2_the 2
    2_from 1
    ...
    37_london 1
    

    So what did we do? And we got the so-called term frequency - which is much better known as term frequency , in abbreviated form - tf (t, d) (here t and d mean that the value is considered applicable to a specific document and a specific term). For example, in an article about London, the value of tf for a word londonwill probably be higher than in an article about pig farming (and possibly equal to zero - zero frequency is also a frequency). Probably, it should be noted that we have obtained an unnormalized version of this characteristic, to normalize the obtained values ​​should be divided by the total number of tokens in the case.

    As part of our example, we developed an algorithm for computing one of the most popular statistical characteristics in information retrieval. The value of this method is that it can be expanded to a case of almost any size - and this can be calculated even on one machine (or you can parallelize it to a cluster of one and a half to two thousand nodes). Thus, the answer to the question formulated at the very beginning of the article is as follows: the ideology of Map / Reduce allows you to break a computationally difficult task into small blocks that can be counted separately, and then combine the results. These blocks can be considered in parallel, they can be considered sequentially and it does not matter: the bottom line is that we turned one extremely resource-intensive task into a large number of tasks,

    Perhaps, here I still should pronounce the sacred phrase - "To be continued." In the next post, we will consider the calculation of the second part of tf-idf - namely, the inverse document frequency , after which we will smoothly move on to solving this problem for a real large (nobody will see enough) data set.

    PS A small note that seemed important to me: when writing the Russian version of the article (and it was originally written almost simultaneously in two languages) I tried to write in Russian as much as possible, but I did not translate many stable combinations (such as Map / Reduce) and did not even try to translate the names of the phases of this process, from there appeared map tasks and reduce tasks. Unfortunately, Russian terminology has not quite settled down applicable to this subject, but the great and mighty is so great and powerful that any student can decline the word “task” in cases - not to mention the programmers, who are the target audience of this post.

    If something seemed incomprehensible to you - please write. After you have been working in a certain field for a long time, your brains are “blurred” to some extent, and you take many things for granted. If somewhere there was a place to be - write, and I will correct.

    _________________________________________________________________

    References for home reading:
    1. Yahoo! Hadoop Tutorial - I recommend reading first, because there is simply no better documentation at the moment, including the official site.
    2. Hadoop quickstart guide
    3. Hadoop map / reduce tutorial
    4. Hadoop and Distributed Computing at Yahoo!
    5. Term frequency-inverse document frequency - Wikipedia article.


    The original photo was published under a Creative Commons license: www.flickr.com/photos/antichrist/3427853501

    Update: since UFOs temporarily disabled the ability to create new blogs, published them in algorithms - after all, Hadoop is far from the only Map / Reduce implementation, but not a single line of code is here. When the UFO has mercy, I will create a Hadoop blog and transfer it along with the new articles that are being written.

    Update 2: I said that to be continued? Well, here it is - this is the continuation - read and comment!