Series: Big Data - Like a Dream. Episode 6. BD (Bolt Data) - Fast Big Data

    In previous episodes: Big Data is not just a lot of data. Big Data is a positive feedback process. The “Obama Button” as the embodiment of rtBD & A. The philosophy of Big Data development. In this episode we mention, for the first time, the new E-ngine - the embodiment of the dreams of IBM, Google, and others.

    Everyone but the lazy (the scriptwriters of our series included) has already voiced an opinion on “What is Big Data?” Today, let's talk not about volumes but about the speed of data flows. The English word “bolt” has so many meanings that it is easy to pick a different reading of the two-letter BD: Bolt Data instead of Big Data. Among those meanings: a lightning strike, to dash off, to blurt out, to speak quickly and indistinctly.

    The fashionable habit of paying attention only to volume (Big) has already led to mass disappointment among the general public. Up steps yet another representative of yet another portal at yet another conference, say one with a resume database: “We have truly huge Big Data! 20 million resumes! Last month we moved to a new 8-64-192-core server with 4-8-32 TB of memory!”
     
    We breathe evenly and picture Ancient Egypt: 20,000 slaves drag huge stone blocks and erect yet another, the 105th, Pyramid of Cheops. Since it is the TASK that determines the solution, and not the SOLUTION that invents a task for itself, for the local Tutankhamun and his “ancient Egyptian resume portal” such a data volume (20 million cards) is a trifle.

    Picture it: scratching his fat belly, MantesumHeops-XXI steps out onto the balcony in the morning and orders: “By evening, find me 5 new foot washers; the lions should have been fed yesterday.” He turns and leaves, and the work begins to boil: each of the 20,000 slaves drops his stone block, grabs 1,000 resumes and skims each one in 20 seconds, and by noon the Chief Eunuch already has 20-30 resumes for interviews. MantesumHeops-XXI and his hungry lions end up satisfied, well-fed and happy. And the slaves get a break from dragging terabytes of stones (“cores”).
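
    A back-of-the-envelope check of the Egyptian approach, using only the numbers from the story (the variable names are purely illustrative):

```python
# Numbers taken from the story; names are illustrative.
SLAVES = 20_000
RESUMES_PER_SLAVE = 1_000
SECONDS_PER_RESUME = 20

# Every slave works in parallel on his own pile of resumes.
total_resumes = SLAVES * RESUMES_PER_SLAVE
wall_clock_hours = RESUMES_PER_SLAVE * SECONDS_PER_RESUME / 3600

print(f"resumes covered: {total_resumes:,}")           # 20,000,000
print(f"time to finish:  {wall_clock_hours:.1f} hours")  # 5.6 hours
```

    So the whole 20-million-card database is covered in under six hours of wall-clock time, which fits “by noon” if the order comes early in the morning.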

    As you can see, the result was achieved on time and without any unnecessary clever words. And whether anyone calls this process Big Data or not, the ancient Egyptians could not care less. So the next time you see yet another cliche, relax and think of Ancient Egypt :-)

    Today is the next “Direct Line” with V.V. Putin. From a technological point of view this task is much more interesting than the resume pyramid (we already touched on it in the earlier episode about the “Obama Button”): for the younger scientific and technical generation, and for curious modern-day Egyptians, it lets us discuss Bolt Data on a real example and talk about linguistics.

    Here is a graph of the reaction (see the translation of “bolt” above: to speak quickly and indistinctly) of hundreds of thousands of Russian-speaking social media users: journalists, politicians, economists, moms, dads, grandmothers and grandchildren:



    Can such a “stream of consciousness” be processed with the help of 20,000 ancient Egyptian slaves? It cannot. After all, only 2-3% of discussions and comments take place in public venues (large groups on VK or FB, text broadcasts by federal agencies and the media); the rest of the “popular outcry” happens on personal accounts, addressed to friends. There are not enough people on Earth to watch each of the billions of Twitter, Facebook, and VKontakte accounts.

    These are the tasks we call rtBD & A - real-time Big Data & Analytics (that is, real-time analytics of large volumes of unstructured data). The “rt” is clear; BD (Big / Bolt Data) is clear too, we have just added a time-limit factor (radio engineering has a corresponding term, “duty cycle”). Let's unpack the A, Analytics, a little. We will leave aside the problem of “Listening” to millions and billions of public messages (we covered such systems in previous episodes) and talk about the problem of “Hearing”, that is, the need to “understand” the language of birds, animals and people.

    This is where a cool system of E-ngine modules comes in handy (the system's real name is different, of course, but we will stick with this one for now; the name does not matter for our series). Given the “live stream” of data generated by millions of people, you need to:

    - Detect the language of the message;
    - Perform linguistic processing of the text;
    - Determine that the text is about “Putin” the person and not about “putina” (for those not in the know, that is the Russian word for the fishing season);
    - Classify the message (assign it to an existing topic or propose a new one);
    - Extract named entities (NER: surnames, settlements, plant names, etc.) using non-dictionary methods (after all, the “Chelyabinsk meteorite” was in neither the dictionaries nor Wikipedia before the event);
    - Determine the tone of the statement (positive, neutral, negative) and, importantly, the tone toward a specific object rather than the overall tone “as is usually done”;
    - and all sorts of little things ...
    - For dessert: the spelling and punctuation of our social media texts - well, you know yourself :-)
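
    To make the shape of such a chain of modules concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the stage functions, the `Message` class, the tiny word lists and the heuristics are hypothetical and are not the E-ngine's actual modules. In particular, telling “Putin” from “putina” by capitalization alone would fail on real social media text, for exactly the literacy reasons from the last bullet.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Message:
    text: str
    language: str = "unknown"
    entities: list = field(default_factory=list)
    tone: str = "neutral"

def detect_language(msg: Message) -> Message:
    # Toy heuristic (assumption): mostly Cyrillic letters means "ru".
    letters = [c for c in msg.text if c.isalpha()]
    cyrillic = [c for c in letters if "\u0400" <= c <= "\u04ff"]
    is_ru = bool(letters) and len(cyrillic) / len(letters) > 0.5
    msg.language = "ru" if is_ru else "other"
    return msg

def find_entities(msg: Message) -> Message:
    # Toy non-dictionary NER (assumption): a capitalized word that does
    # not open a sentence is treated as a named entity.
    sentence_end = {".", "!", "?"}
    prev = "."
    for tok in re.findall(r"\w+|[.!?]", msg.text):
        if tok[0].isupper() and prev not in sentence_end:
            msg.entities.append(tok)
        prev = tok
    return msg

# Hypothetical mini-lexicons; a real system needs far more than this.
POSITIVE = {"great", "good", "отлично"}
NEGATIVE = {"bad", "awful", "ужасно"}

def score_tone(msg: Message) -> Message:
    # Whole-message tone only; object-level tone is out of scope here.
    words = set(re.findall(r"\w+", msg.text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    msg.tone = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return msg

PIPELINE = [detect_language, find_entities, score_tone]

def process(text: str) -> Message:
    msg = Message(text)
    for stage in PIPELINE:
        msg = stage(msg)
    return msg

print(process("Today we watched Putin speak, it was great"))
print(process("Сегодня началась путина, рыбаки довольны"))
```

    The first message yields the entity `Putin` and a positive tone; the second is detected as Russian and yields no entities, because “путина” is written in lowercase. The point of the sketch is only the structure: every message passes through each stage in turn, so the time budget per message is the sum of all the stages.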

    To make the point stick, let's count on our fingers: during the 4 hours of the Direct Line, users of popular public social media (microblogs, social networks, news and comments, forums, blogs, video, reviews) generate about 8-10 million Russian-language (Cyrillic) messages (according to our public real-time social media statistics). That is, to process the stream on the fly you need to keep up with up to 1,000 unstructured messages PER SECOND and thresh that stream through the E-ngine modules.
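
    The finger-counting can be written out as a small calculation. This sketch assumes a perfectly even flow over the 4 hours, which a live event certainly is not; the function name is illustrative.

```python
SECONDS_PER_HOUR = 3600

def average_rate(messages: int, hours: float) -> float:
    """Average number of messages per second over the given window."""
    return messages / (hours * SECONDS_PER_HOUR)

low = average_rate(8_000_000, 4)    # lower estimate from the text
high = average_rate(10_000_000, 4)  # upper estimate from the text

print(f"average load: {low:.0f}-{high:.0f} messages per second")
```

    The average comes out at roughly 560-690 messages per second. Since the flow is bursty rather than even, a design target of up to 1,000 messages per second leaves some headroom for peaks.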

    The average message length across the Russian-speaking Internet is ~1 KB. You can estimate the speed required of E-ngine yourself. For comparison, you can use the figures from presentations of the Compreno system (developed by our friends, the wonderful ABBYY team) - a very powerful and remarkable tool that took thousands of person-years to develop: it takes 5-10 seconds to process 1 KB of text, but the quality of its processing of “bookish language” is very high.
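
    One way to do that estimate is the standard queueing rule of thumb known as Little's law: the number of messages in flight equals the arrival rate times the time spent on each message. The sketch below assumes the peak rate of 1,000 messages per second, one message per worker at a time, and no batching; the function name is illustrative.

```python
import math

def workers_needed(rate_per_sec: float, seconds_per_message: float) -> int:
    """Parallel workers required to keep up (Little's law: L = rate * time)."""
    return math.ceil(rate_per_sec * seconds_per_message)

PEAK_RATE = 1_000  # messages per second, from the text

for latency in (5, 10):  # seconds per 1 KB message, from the text
    print(f"{latency:>2} s per message -> {workers_needed(PEAK_RATE, latency):,} workers")

# Time budget if a single worker had to keep up alone:
budget_ms = 1000 / PEAK_RATE
print(f"single-worker budget: {budget_ms:.0f} ms per message")
```

    At 5-10 seconds per message, keeping up with the stream would take 5,000-10,000 workers running in parallel, while a single worker would have to spend no more than 1 ms on each message.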

    So, the summary of the episode:
    1. We do not fall for the term “Big Data”, already worn out and at times beaten to death - the term clearly awaits the fate of the proud 90s term “Portal”, which turned up in every title, like “Portal of the Evening Dance Club of the village of Podosinoviki”.
    2. With a squint, we appraise the magnificent length of the legs of the new PR person chirping about “our petabytes” of data that nobody needs. What is needed is the data you actually need.
    3. And on time.
    4. Intelligent solutions, methods and algorithms are all the more valuable the faster they work. Not every task can be split among 20,000 ancient Egyptian slaves.

    And between episodes, you can ponder at your leisure the new path of Big Blue: IBM sold its PC division to Lenovo, made friends with Twitter, sent 10,000 employees to retrain as Data Scientists, and recently bought AlchemyAPI (a wonderful E-ngine for several Western languages).

    Against the background of the long-lived and “eternally young” IBM (which throws away the old and swiftly switches to the new), the brief life of the once great and ambitious Sun Microsystems (whose servers were wonderful, by the way, and whose Java is more alive than all the living) is not surprising at all. And now there is fresh news: Nokia, the Finnish company that once led the mobile world (and whose phone business was recently acquired by Microsoft), has decided to pocket the “unsinkable and eternal” Lucent / Alcatel, which even as a pair could not withstand the Chinese.

    Do not linger under beautiful signboards, even ones labeled “Big Data” - these are just pretty, heavily promoted names. Keep moving: solve problems rather than memorize solutions. We wish you constant change and new roads - it is so interesting to give new solutions new names.

    PS Does your company understand how to solve problems like the ones above in a “non-Egyptian way”? Do you feel you have the makings of a Data Scientist and roughly understand how to “detect” the Chelyabinsk meteorite situation in 3 minutes rather than 3 hours (which is how long the press took to react)? Can you algorithmically detect the new tricks of Twitter spam bots? Then you are on one of many, but certainly correct, paths - you have a great future.

    In upcoming episodes: NoSQL or columnar DBMS, where the rumor that “data is running out” comes from, and humanity as a worldwide scavenger.

    Episode 1: Big Data - Like a Dream
    Episode 2: Big Data - Negative or Positive?
    Episode 3: “Obama Button”
    Episode 4: Brain Revolution
    Episode 5: A Great Game. Private Opinion
