A Linguistic Epic Fail, or How to Cram the Entire Dictionary of the Russian Language into Power BI


While solving a linguistic analysis problem in Power BI, and at the same time looking for examples for my previous article, I remembered a task I had tried to solve in Excel several years ago: I needed to load a dictionary of the Russian language into an analytical system for linguistic analysis of a large number of natural-language queries, preferably using standard office tools. The vast majority of people would immediately try to solve this problem in Excel, and I once followed the same path. As the dictionary, I used the Open Corpus of the Russian language (http://opencorpora.org/).

But I was disappointed: the dictionary consists of 300 thousand word forms and more than 5 million entries, a volume that is essentially beyond Excel. Even if you squeeze "only" 1 million rows into it, performing any manipulations with them, let alone calculations, is something only an exceedingly patient person could endure. So this time I decided to set a more suitable tool on the task: Power BI.

    What is Power BI?


I find this product largely undervalued by the professional community. Power BI is a set of business analysis tools created for users who know Excel at a slightly higher level than "summing up a column". If a person can write formulas of average complexity in Excel, they will master Power BI in a couple of evenings.

It is not a single product with some monolithic internal logic, but a system of three components:

[Image: the three components of Power BI]

• Power Query. An ETL tool with its own full-fledged functional programming language, M, for writing queries. To be fair, the average user most likely will not need to program in it at all: most functions are available directly through the menus and wizards of the component's interface. The M language is completely different from the DAX (PowerPivot) query language, yet Microsoft brought them together in one product. From a development standpoint this makes sense: ETL is geared toward acquiring and enriching data (slow), while DAX is geared toward the calculations that help visualize that data (fast). In other words, DAX serves the frontend, and Power Query the backend: the process of extracting and shaping data.
• PowerPivot. An in-memory processing module based on the xVelocity engine. It uses the DAX query language, which is very similar to the Excel formula language.
• The visualization component. It is very useful in systems where data needs to be visualized: on a company website, a technical support portal (for example, as a query cloud), or an internal corporate resource. There are tools that can do this without Power BI, but many of them fall short once the number of records runs into the millions and the data has to be aggregated somehow. Against other tools of this kind, Power BI competes through its simplicity and inexpensive in-memory processing. Of course, if we are talking about terabytes of data, a different approach is needed, and Microsoft has something to offer for such cases too, but that is a topic for a separate article.

The learning curve is very steep at the start: if you are good at Excel, 80% of Power BI's capabilities will open up to you after a short study. It is a very powerful and fairly easy-to-use tool, but only up to a certain point. To use it at full capacity, you need experience and deep knowledge of the M and DAX languages.

    What is Power BI Desktop for?


Who needs it? First of all, any business users who have to process and analyze fairly large volumes of data, where Excel either cannot cope or strains at the limits of its capabilities. I emphasize: Power BI Desktop is designed for a wide range of users solving a very diverse range of tasks. In my case, for example, the task was to normalize 5 million text records in order to subsequently determine keyword frequency.

This is in demand when processing questionnaires, search queries, advertisements, dictations and essays, statistical arrays of various kinds, and so on. Or for solving crossword puzzles...

Another case, with an implementation option, is discussed in Dmitry Tumaykin's article about a "recognizer". It is implemented in classic Excel, but with macros...

Another popular Power BI scenario is calculating the ratio of indicators for the current and previous periods. Say we have pre-aggregated revenue data and need to compare it, day by day, with the previous quarter, year, or a similar period, and we want to insert the result of the comparison into an adjacent column as values rather than formulas. For Excel this would seem to be the simplest of tasks: write a simple comparison formula and drag it across all the cells. But not if the table has several million rows. In DAX this problem is solved even more easily than in Excel, though again only as a post-calculation, as the sketch below illustrates.
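A minimal DAX sketch of such a period comparison, assuming a [Total Revenue] measure and a marked date table 'Dates' (all table, column, and measure names here are hypothetical, not taken from a real model; the semicolon argument separators match the locale of the query shown later in this article):

Revenue PY =
CALCULATE (
    [Total Revenue];
    SAMEPERIODLASTYEAR ( 'Dates'[Date] )  // same dates, shifted one year back
)

Revenue YoY % =
DIVIDE ( [Total Revenue] - [Revenue PY]; [Revenue PY] )  // safe division, blank on zero

Unlike an Excel formula dragged down millions of rows, such measures are evaluated only for the cells actually displayed.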

There are many other practical scenarios for using Power BI, but I think you already get the idea. Of course, none of these tasks is a problem for a programmer who knows, say, Python or R, but there are orders of magnitude fewer such specialists than Excel experts. Excel, however, is limited in its capabilities, which cannot be said of Power BI: its DAX formula language is very similar to Excel's, yet it can process millions and tens of millions of records on the fly. After that, it is just a matter of adding RAM (be it 100 or 300 GB).

Helping technical support process requests


But back to my task. I had to figure out how zero-line technical support could automatically assess the subject of user requests. To begin with, I decided to extract individual word forms and, based on the frequency of their appearance in requests, determine the most important topics that users raise most often.

The source dictionary is a simple text file with a regular structure that looks like this:

[Image: a fragment of the source dictionary file]

For statistical purposes, I needed to determine the initial form of every word form: the nominative singular for nouns, the infinitive for verbs, and so on. In programmers' terms, the task is simple: for each word in the left column, find the form that immediately follows that word's lemma number in the dictionary.
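For reference, a lemma block in the OpenCorpora text dictionary looks roughly like this (an illustrative fragment, not an exact excerpt from the file):

406380
АБАЖУР	NOUN,inan,masc sing,nomn
АБАЖУРА	NOUN,inan,masc sing,gent
АБАЖУРУ	NOUN,inan,masc sing,datv

The first line is the lemma number, and the first word form after it is the initial form we are looking for.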

An average business user with no command of Python, specialized tools, or development skills cannot solve this problem without self-service BI or similarly user-friendly tools. Moreover, if the data is processed for internal needs and contains no confidential information requiring protection, then Power BI in this case is also free*.

*) This refers to Power BI Desktop and the Power BI Service for personal use on the Free tier.

To analyze the data, I needed to add a new column, shifted by one position, to a 5-million-record table in Power Query. First, I tried the classic approach described on the Power BI community portal by Marcel Beug, the author of an original online Power Query guide (itself written in Power Query). The article proposes two different algorithms: one inspired by the ideas of Matt Allington, a well-known Power BI guru and trainer, and the other Marcel's own idea, which uses an additional function. Although I had fully cached the source data to boost performance, both approaches required a tremendous amount of time: by the eighth day the process still had not finished. The source file was 270 MB, while the current volume of processed data had reached 17 TB. I suspect few Power BI users have ever seen such numbers in the load window for a file data source.

[Image: the load window showing 17 TB of data processed]
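For reference, the classic shift-by-one technique in Power Query boils down to a self-merge over index columns. A minimal sketch, with assumed table and column names (this is not the original query from the community article):

let
    Source = ShiftedList,  // assumed source table with a [Word] column
    // Number the rows, then add a second index shifted by one
    AddIndex0 = Table.AddIndexColumn(Source, "Index", 0, 1),
    AddIndex1 = Table.AddIndexColumn(AddIndex0, "NextIndex", 1, 1),
    // Left-join the table to itself so each row sees the following row's [Word]
    Merged = Table.NestedJoin(AddIndex1, {"NextIndex"}, AddIndex1, {"Index"}, "Next", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "Next", {"Word"}, {"NextWord"})
in
    Expanded

On small tables this works instantly; as this story shows, on 5 million rows the self-merge is exactly where the time goes.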

Why the volume swelled so much is unclear: even the Cartesian product of all the records would come to far less than 16 TB. The internal optimizer was clearly not up to par here. And DAX Studio, for example, cannot trace Power Query queries, only DAX ones. Perhaps someone will share their experience troubleshooting Power Query?

Without waiting for the first process to complete, I decided to try solving the problem on another machine with a hand-written DAX query. The query executed... in about 180 seconds, and memory consumption grew only slightly.

[Image: resource consumption while the DAX query ran]

The DAX query source code:

KeyWord =
// For each row, take the first [Word] value that shares the same base word-form number
CALCULATE (
    TOPN (
        1;
        CALCULATETABLE (
            VALUES ( ShiftedList[Word] );
            ALLEXCEPT ( ShiftedList; ShiftedList[Word Nr] )
        )
    ) // TOPN
) // CALCULATE



That is, for each row, the new [KeyWord] column receives the very first value from the [Word] column (which contains all the word-form variants) that has the same base word-form number (the [Word Nr] column). As long as the format of the source file stays the same, the query should run without errors on all subsequent releases of the dictionary.

The Power Query code that shapes the source table into the required format was generated almost automatically and executed in less than a minute:

[Image: the Power Query query that shapes the source table]
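A rough sketch of what such a query might look like, assuming the OpenCorpora text format shown earlier; the file path and step names are illustrative, not taken from the original:

let
    // Read the dictionary as lines of UTF-8 text
    Raw = Lines.FromBinary(File.Contents("C:\Data\dict.opcorpora.txt"), null, null, 65001),
    Source = Table.FromColumns({Raw}, {"Line"}),
    // Lines consisting only of digits are lemma numbers; carry them down to the forms below
    AddNr = Table.AddColumn(Source, "Word Nr",
        each if [Line] <> "" and Text.Select([Line], {"0".."9"}) = [Line] then [Line] else null),
    Filled = Table.FillDown(AddNr, {"Word Nr"}),
    // Drop number rows and blank separators; keep the word form before the tab
    Forms = Table.SelectRows(Filled, each [Line] <> "" and [Line] <> [Word Nr]),
    WithWord = Table.AddColumn(Forms, "Word", each Text.BeforeDelimiter([Line], "#(tab)"))
in
    WithWord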

After the keyword column was built in the PowerPivot interface (about three minutes), searching for any word form in the Power BI interface takes no more than 4 seconds. A control search over the same data in everyone's favorite Notepad++ x64 could take 20 seconds or more. That is not a dig at Notepad++: searching the entire raw data array is harder (and slower) than searching data that has already been marked up.

Incidentally, the DAX query above was not born on the first try; the intermediate variants consumed all available memory, ran for a long time, and ended either with a data error or with an irrelevant result.

[Image: an error produced by one of the intermediate query attempts]

As a result, the saved PBIX file turned out to be 60% smaller than the source text dictionary (112 MB), though more than 4 times the size of a ZIP archive of the same dictionary.

Returning to the contest between Power Query and DAX: the difference in execution time for the same operation in different components shows that Power BI is no crowbar against which, as the saying goes, there is no defense. It has its own character and quirks that must be taken into account when working with it. As with any tool, really. And even the recommendations of recognized gurus should be treated with caution.

As Nobel laureate Richard Smalley is said to have put it, paraphrasing Clarke's first law: "When experts say that something is feasible, they are probably right (they just don't know when). When they say it is impossible, they are most likely wrong."

Test machine specifications:
Processor: Intel Core i7-4770 @ 3.4 GHz (4 cores)
RAM: 16 GB
OS: Windows 7 Enterprise SP1 x64


The finished dictionary in Power BI format can be downloaded here.

And here is a modified online version of the dictionary. You can use it to solve crossword puzzles at your leisure :)

[Image: the online version of the dictionary]

...Incidentally, back then, a few years ago, the task was solved in Excel after all, though not 100%: instead of the full corpus of the Russian language, a frequency dictionary was used for the text analysis. For basic text cleanup, any of the top-100 frequency lists of a few tens of kilobytes available here will do.

Yuri Kolmakov, expert, Data Consolidation and Visualization Systems Department, Jet Infosystems (McCow)


Out of curiosity, a small poll: what is the largest volume of data (by number of records) you have managed to process in Power BI?

