MapReduce, or calculations beyond the capabilities of memory and processor (I'll try to keep it jargon-free)

    I have long wanted to write about MapReduce. Everywhere you look it is explained in such an abstruse way that it is terrifying, but in reality it is a very simple and useful approach for many purposes. And implementing it yourself is not that hard either.

    I should say right away: this post is for those who have not yet figured out what MapReduce is. If you already have, there will be nothing useful here for you.

    Let's start with how the idea behind MapReduce occurred to me personally (although I did not know it was called that, and, of course, it came to me much later than it did to Google).

    First I will describe how the idea was born (the approach was wrong), and then how to do it right.

    How to count all the words on Wikipedia (wrong approach)


    The idea was born, as it probably is everywhere, from counting word frequencies when ordinary memory is not enough (counting the frequency of every word on Wikipedia). Instead of "frequency" it should really be "number of occurrences", but for simplicity I'll stick with "frequency".

    In the simplest case, we can create a hash (dict, map, hash, associative array, array() in PHP) and count the words in it.

    $dict['word1'] += 1

    But what do we do when memory for the hash runs out and we have counted only a hundredth of all the words?

    I solved this problem by counting some of the words until memory ran out and saving that hash to disk, literally line by line into a file:

    aardvark | 5
    aachen | 2

    Then a problem arose: how do you merge these files? After all, each of them takes up the entire RAM on its own.




    At first the idea was to take only the 1,000,000 most popular words from each file and combine them; that would fit into RAM and give us at least the top of the list (the most popular words). This worked, of course, but it turned out that millions of the lower-ranked words were lost, and there were far more of those.

    Then came the idea of sorting the files.

    Then we take the 20 sorted files, read the first 1000 lines from each of them (they will contain roughly the same words, since the files are sorted), sum them into a new hash that holds only words starting with something like "aaa...", and save it to new files. We read the next 1000 lines from each and do the same; there, almost all the files will contain words starting with "aab...", and so on.

    This produces a much smaller new file. However, words will still repeat in it. We sort it again, read it 1000 lines at a time, and sum. The result is an almost correct file (some words can still fall outside a 1000-line window); repeat a couple more times... and in the end we get a file with very few errors (but they are still there).

    Dreary and slow, but no better option came to mind.

    The weak point of the wrong approach

    This approach had one weak point, namely merging the original 20 files. How could it be done better?

    The problem stems from the fact that some words are missing from some files, or end up in different 1000-line blocks. That is, if from all 20 files I could take not the first 1000 lines but just one line each, and for the same word, I could merge all 20 files in a single pass.



    How do we do that? Essentially, this is the last step of the MergeSort algorithm: merging sorted lists. If you already know it, skip ahead.

    We take the first line from each of the 20 files and look for the smallest first element (word); it will be the smallest word overall, since the files are sorted. Suppose it is the word "aardvark". Of the 20 lines we have read, we take only those that refer to "aardvark", and only from the files we took them from do we read the next line. Then we again look for the minimum among the 20 current lines, and so on, until we reach the end of all the files.
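
    As a rough sketch, assuming each chunk file contains lines like "aardvark | 5" sorted by word, this single-pass merge could look like this in Python (the helper names here are just for illustration):

    import heapq

    def read_sorted_file(path):
        # Yield (word, count) pairs from a file with lines like "aardvark | 5",
        # assumed to be sorted by word.
        with open(path) as f:
            for line in f:
                word, count = line.rsplit("|", 1)
                yield word.strip(), int(count)

    def merge_counts(paths):
        # Merge all the sorted files in one pass, summing counts for equal words.
        current_word, total = None, 0
        for word, count in heapq.merge(*(read_sorted_file(p) for p in paths)):
            if word != current_word:
                if current_word is not None:
                    yield current_word, total
                current_word, total = word, 0
            total += count
        if current_word is not None:
            yield current_word, total

    Iterating over merge_counts() for all 20 chunk files yields each word exactly once with its total count, while keeping only one line per file in memory at a time.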

    MapReduce in its simplest form


    That is how I almost invented for myself what Google had invented a decade before me and called MapReduce.

    Reinventing the wheel continues to this day.

    So, we have the string: "foo bar baz bar".

    We need to get this output: { foo: 1, bar: 2, baz: 1 }.

    Step one: take the string, split it into words, and emit arrays like these (or rather, tuples):

    [ 'foo', 1 ]
    [ 'bar', 1 ]
    [ 'baz', 1 ]
    [ 'bar', 1 ]

    (From here on I will omit the brackets and quotation marks where things are already clear.) Now take them and sort them:

    bar, 1
    bar, 1
    baz, 1
    foo, 1

    We notice that "bar" appears twice in a row, so we combine the two into a nested form, which technically looks like ["bar", [1,1]]:

    bar, (1,1)
    baz, (1)
    foo, (1)

    Then we simply sum up the second elements of each array and get:

    bar, 2
    baz, 1
    foo, 1

    Exactly what we wanted. The main question is: what was the point of all this, or rather, what did we just do here and why?






    Back to the past


    Imagine we have a computer whose memory can hold only 2 lines and which can perform only one operation on a line per minute. (Stop giggling! Once you have counted all the words on Wikipedia at least once, you will have earned the right to laugh at artificial memory limits: it will not fit no matter how many gigabytes you have, and if it does fit, go count the whole Internet :)).

    From "foo bar baz bar" we can build two files this way:

    file1.txt
    [ 'bar', 1 ]
    [ 'foo', 1 ]

    file2.txt
    [ 'bar', 1 ]
    [ 'baz', 1 ]

    We keep two lines in memory at a time, so we fit within the memory limit. Now, using that step from MergeSort, we can merge these files line by line:

    bar, (1,1)
    baz, (1)
    foo, (1)

    At any moment we hold only two lines in memory, one from each of the two files, and no more.




    Actually, what we did is already MapReduce.

    The step that turns words into arrays with a one, (word, 1), is called the "Map" step.
    The step that sums up the (1,1) is the "Reduce" step.

    The remaining steps (sorting and merging via MergeSort) are done by the algorithm itself.

    Map, Reduce? What are they?



    These steps do not necessarily have to emit ones in the case of "Map" or sum things up in the case of "Reduce". They are simply functions that accept something and return something, depending on the goal.

    In this case, "Map" is a function you wrote that takes a single word and emits (word, 1).

    And "Reduce" is a function you wrote that takes an array (word, (1,1)) and returns (word, 2).

    Put simply, in Python:

    words = ["foo", "bar", "baz"]

    def map1(word):
        return [word, 1]

    arr = ["foo", [1, 1]]

    def reduce1(arr):
        return [arr[0], sum(arr[1])]


    or PHP:

    $words = array("foo", "bar", "baz");

    function map1($word) {
        return array($word, 1);
    }

    $arr = array("foo", array(1, 1));

    function reduce1($arr) {
        return array($arr[0], array_sum($arr[1]));
    }
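
    To see the whole pipeline end to end, here is a minimal in-memory sketch (the sorting and grouping here stand in for what a real MapReduce would do on disk):

    from itertools import groupby
    from operator import itemgetter

    def map1(word):
        return [word, 1]

    def reduce1(arr):
        return [arr[0], sum(arr[1])]

    pairs = sorted(map1(w) for w in "foo bar baz bar".split())
    grouped = ([word, [one for _, one in group]]
               for word, group in groupby(pairs, key=itemgetter(0)))
    print([reduce1(g) for g in grouped])   # [['bar', 2], ['baz', 1], ['foo', 1]]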


    So we have gotten around the memory limit, but how do we get around the speed limit?

    Imagine we have two such computers. We give each of them the original string and tell the first one (more precisely, MapReduce tells it): count only the words in odd positions; and the second one: count only the words in even positions.

    The first produces:

    "foo bar baz bar":
    foo, 1
    baz, 1

    The second produces:

    "foo bar baz bar":
    bar, 1
    bar, 1

    We (more precisely, MapReduce) take the results from both, sort them, and then run them through MergeSort, as above:

    bar, (1,1)
    baz, (1)
    foo, (1)

    Exactly the same result as when a single computer did the counting! Now we (MapReduce) distribute the work to two computers again: we give the first one only the odd lines and the second one only the even lines, and ask each computer to perform the Reduce step (sum up the second numbers).






    These lines are obviously independent of each other, so the result will again be exactly what we need.

    The main thing is that the two computers worked in parallel and therefore twice as fast as either of them alone (not counting the time lost transferring data from one to the other).
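
    As a toy single-machine illustration, with two worker processes standing in for the two computers, the Map step could be parallelized like this:

    from multiprocessing import Pool

    def map1(word):
        # The Map step: each worker handles its share of the words independently.
        return (word, 1)

    if __name__ == "__main__":
        words = "foo bar baz bar".split()
        pool = Pool(processes=2)       # pretend these are two separate computers
        pairs = pool.map(map1, words)  # a real framework would also split the input itself
        pool.close()
        print(sorted(pairs))           # [('bar', 1), ('bar', 1), ('baz', 1), ('foo', 1)]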

    A premature conclusion


    Phew! So, MapReduce is needed to count something that either needs to be counted faster, or that does not fit into memory (or both).

    A more interesting example: sorting by popularity (cascades)


    Suppose we want to count the number of occurrences of each word on Wikipedia and at the same time build a list sorted by popularity, from the most popular word to the least popular.

    Clearly, all of Wikipedia's words will not fit into memory, and afterwards this giant array will not fit into memory for the reverse sort either. We need a cascade of MapReduces: the output of the first MapReduce goes to the input of the second MapReduce.

    To be honest, I do not know whether "cascade" is the correct term as applied specifically to MapReduce. I use the word for myself because it describes better than anything else what has to happen: one waterfall of words falls into a MapReduce and cascades straight into a second MapReduce.

    Okay, we already know how to count the words:

    “foo bar baz foo”

    The Map step we wrote gives:

    foo, 1
    bar, 1
    baz, 1
    foo, 1

    Next, MapReduce combines them (by itself, without you as the programmer) into:

    bar, (1)
    baz, (1)
    foo, (1,1)

    And the Reduce step we wrote produces:

    bar, 1
    baz, 1
    foo, 2

    Now imagine that we have counted all of Wikipedia and this array contains billions and billions of words. Sorting it in memory will not work. So we take another MapReduce, and this time Map will do the following trick:

    [word, 15]   -> map() returns -> [-15, word]
    [word2, 15]  -> map() returns -> [-15, word2]
    [word3, 120] -> map() returns -> [-120, word3]
    [word4, 1]   -> map() returns -> [-1, word4]

    What is this for?



    Before passing anything to your Reduce, MapReduce will sort all these arrays by the first element (which is a negative number). MapReduce can sort even when the entire data set does not fit into memory, and that is the beauty of it. For all of Wikipedia's words you simply cannot call arsort($words), but MapReduce can do it.

    Why is the minus in front of the numbers?

    Because MapReduce always sorts in ascending order, but we need descending order. How do you sort numbers in descending order using only an ascending sort? Multiply them by minus one before sorting, and by minus one again afterwards.

    Ascending positive numbers: 1, 15, 120
    Ascending negative numbers: -120, -15, -1 (exactly the order we need, only with a minus sign, which we then simply strip off by multiplying by -1 again)

    The following will arrive at the input of Reduce:

    -120, (word3)
    -15, (word, word2)   <-- two words on one line: MergeSort grouped everything by the first key!
    -1, (word4)

    Lovely, except that two of our words had a "frequency" of 15 and MergeSort grouped them together. Let's fix that. Now in our Reduce we only need to multiply the first number by -1 and then emit one array for the first row, two arrays for the second, and one again for the third. In fact, depending on which MapReduce implementation you use, you may not be allowed to emit two arrays in the Reduce step, because exactly one array may be required as output; in that case, just do it after your Reduce step, in your own program. We get:

    120, word3
    15, word
    15, word2
    1, word4

    Beauty! Exactly what was needed.
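
    As a sketch of this second, sorting MapReduce (map2 and reduce2 here are just illustrative names, not from any particular framework), the two steps might look roughly like this:

    def map2(pair):
        # pair comes from the first MapReduce, e.g. ["word3", 120]
        word, count = pair
        return [-count, word]   # negate so an ascending sort gives descending popularity

    def reduce2(pair):
        # pair is e.g. [-15, ["word", "word2"]] after grouping by the first key
        neg_count, words = pair
        return [[-neg_count, w] for w in words]   # one output row per word

    # Tiny in-memory illustration; a real framework sorts and groups on disk.
    counted = [["word", 15], ["word2", 15], ["word3", 120], ["word4", 1]]
    mapped = sorted(map2(p) for p in counted)
    grouped = {}
    for neg_count, word in mapped:
        grouped.setdefault(neg_count, []).append(word)
    for key in sorted(grouped):
        for row in reduce2([key, grouped[key]]):
            print(row)   # [120, 'word3'], [15, 'word'], [15, 'word2'], [1, 'word4']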




    Again, keep in mind the main thing we have worked around here: the example has only four lines, but Wikipedia has billions of words that do not fit into memory.

    How to make a simple MapReduce to play with?


    In PHP: the simplest example.
    In Python: the simplest example (see below regarding the Python version).

    In the code I have marked what should go where in order to make a more or less full-fledged MapReduce with black... in the sense of with files and MergeSort. Still, this is a reference implementation, so to speak, meant for playing around and understanding how MapReduce works. It is still MapReduce, it is just that this particular implementation is, memory-wise, no better than an ordinary hash.

    I chose PHP, even though it is not the most sensible choice for this purpose, because almost any programmer can read PHP, which makes it easier to translate into the language you need.

    Hints and cheats


    Yes, I recommend saving the JSON representation of the arrays (json_encode) line by line; there will be fewer problems with spaces in words, with Unicode, with numbers and data types. That is:

    ["foo", 1]
    ["bar", 1]
    ["foo", 1]

    Tip: the last MergeSort step is already implemented in Python: it is heapq.merge(*iterables). So, to merge 10 files of JSON representations it is enough to do:

    import heapq, itertools, json

    # files is a list of filenames, each file holding one JSON array per line, sorted
    items = [itertools.imap(json.loads, open(filename)) for filename in files]
    for item in heapq.merge(*items):
        # .... reduce(item) ....


    Implementing MergeSort in PHP will, I suspect, take fiddling with about fifty lines of code. Unless, of course, someone in the comments suggests a better option.

    In Python, yield and __iter__ will let you do very interesting things with MapReduce! For example, this:

    x = MapReduce()
    for word in "foo bar bar".split():
        x.send((word, 1))
    for word, ones in x:
        print word, sum(ones)


    The MapReduce class you will have to write yourself (I fit it into 24 lines in its simplest working form, maybe fewer by simplifying iter_group, which is the analog of the group_tuples_by_first_element function from the PHP example).
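
    One possible in-memory sketch of such a class (simplified, without the spilling to files that a real version would need):

    from itertools import groupby
    from operator import itemgetter

    class MapReduce(object):
        # Collects (key, value) tuples via send() and iterates over grouped results.
        def __init__(self):
            self.data = []

        def send(self, pair):
            # Accept one (key, value) tuple emitted by your Map step.
            self.data.append(pair)

        def __iter__(self):
            # Sort by key and yield (key, [values]), like the shuffle/grouping phase.
            self.data.sort(key=itemgetter(0))
            for key, group in groupby(self.data, key=itemgetter(0)):
                yield key, [value for _, value in group]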

    A word of caution: this approach is not quite classic MapReduce, and it will be hard to parallelize across many machines (although making it handle data volumes larger than the available memory is quite trivial with this approach). The form map_reduce(source_data, map1, reduce1), where map1 and reduce1 are functions, is more correct.

    The MapReduce implementation in Hadoop is the most popular solution. (I have not tried it; I just know it is the most popular.)

    Afterword


    So there you have it; I hope my jargon-free story about MapReduce turns out to be useful.

    MapReduce is a very useful thing for any large-scale computation. Almost any SQL query, even one over several tables, is not that hard to decompose into MapReduce + join_iterator (more on that another time).

    If I find the strength, in the next post I will describe how to use MapReduce for more interesting tasks than counting mundane words: for example, how to count links across the Internet, word frequencies in the context of other words, people in cities, prices across hundreds of companies, and so on.

    Oh, and one more thing: MapReduce is patented by Google, but rather for defensive purposes (they have officially allowed Hadoop, for one, to use the method). So handle with care.

    Part Two: More Advanced Examples.

    Yoi Haji,
    View, as always, from Habr,
    2010

    (someday I will learn to explain things briefly...)
