Series: Big Data - like a dream. Episode 9: Why IBM was forced to buy the “Alchemists” for $100 million

    In previous episodes: Big Data is not just a lot of data. Big Data is a positive feedback process. The Obama Button as the embodiment of rtBD&A. The Big Data development philosophy. In this episode, we talk about linguistic analytics for high-speed streams of unstructured texts and social media messages, and introduce “Eureka”, our answer to the “Alchemists”.

    The Internet, as society perceives it today, is a set of interconnected messages: personal correspondence in messengers, links between media articles, blog discussions, game chats, themed series on Habr, or, in the worldview of the newer generations, the links a search engine returns for the query "What to do today?"

    If you look closely, it all rests on two foundations: Connections and Topics. We will not discuss the analytics of “connections” (that is a matter for the NSA, whose “electronic omnipotence” the US Senate declined to curb today). But thematic analytics (which recently received a name, Brand Analytics, in a joint Facebook and DataSift press release, and which has existed in Russia for 3 years already as the name of a project) and the various goodies that come with it make a great topic (! :-)) for a new episode.

    To keep this episode short, we list the current “threat level” and, for those who want to dig deeper, links to specific cases that required new solutions and approaches:

    - The volume of messages humanity generates is approaching 20 billion per day; most of this stream is non-public (instant messengers, email).

    - The volume of public Russian-language messages in social media (social networks, Twitter, comments on media sites, blogs, forums, photo and video hosting, review sites, etc.) is 1 billion per month. “Classic”, edited and “literate” media reports make up less than 1% of the total data stream (up to 10 million out of 1 billion).
    Open real-time statistics of social media and media data flows are available at br-analytics.ru/statistics

    - Processing 30-40 million messages per day (1,000 messages per second at peak) requires new data processing techniques and algorithms. Social media streams consist of unstructured, “illiterate” (unlike classic media), loosely connected messages that contain many spelling and punctuation errors and are often ambiguous and multilingual.
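The figures above can be cross-checked with simple arithmetic. The sketch below converts the quoted daily volume into an average per-second rate and compares it with the quoted peak; the function name is ours, and the 30-day month is an assumption made only for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(messages_per_day: float) -> float:
    """Average messages per second for a given daily volume."""
    return messages_per_day / SECONDS_PER_DAY


# Daily volumes quoted in the text.
low = average_rate(30_000_000)
high = average_rate(40_000_000)
peak = 1_000  # peak rate quoted in the text, messages per second

# 1 billion public messages per month, assuming a 30-day month.
monthly_as_daily = 1_000_000_000 / 30

print(f"average: {low:.0f}-{high:.0f} msg/s")
print(f"peak-to-average ratio: {peak / high:.1f}x-{peak / low:.1f}x")
print(f"1 billion per month is about {monthly_as_daily / 1e6:.1f} million per day")
```

Under these assumptions the average load is roughly 350-460 messages per second, so the quoted peak is about two to three times the average, and the monthly figure of 1 billion is consistent with the 30-40 million per day range.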

    Tasks and problems that have to be solved in today's fast-moving world (practical cases from previous years):

    - “Worldwide” campaign (case from October 1, 2013) - a task of the “Operational Sociology” class: real-time monitoring of the reaction to a dynamically changing event that is influenced by popular media personalities and is of interest to most of society; identification of landmark, unpredictable messages that shape how actively the topic spreads in society, so that the structures involved in the discussion (in this case TV channels and mobile operators) can react quickly.

    - “Direct Line with Putin” (case of April 25, 2013) - a task of the Obama Button class: real-time detection of previously unknown active topics and determination of the sentiment of each topic.

    - “Love and hate” on the map of Russia, winter 2014-2015: a study of the emotional state of 35 million social media users in all regions of Russia.

    - Today's case: thematic widgets for websites, part of the MinCult special project for “Museum Night”

    From the feeds (social networks, Instagram photos, YouTube videos):

    We are waiting for you at Museum Night in Lumieres 2.0. We start at 20:00 with a tour of the exhibition "Soviet Photo" from ... t.co/evIDYZVltl
    twitter.com - The Lumiere Center - 1 min. ago

    And yesterday we went to Museum Night))) It was very interesting
    vk.com - Elena Ivanova - 2 min. ago

    Who wants to go to Museum Night today?? Write or call me) there will be company 89260860xxx
    vk.com - Nadezhda Porodzinskaya - 3 min. ago

    In an hour I leave the house for Museum Night) Whoever wants to come too - write)
    vk.com - Daria Klimovich - 3 min. ago

    ... monologues, Lydia Masterkova about Vladimir Nemukhin and about herself. We are waiting for everyone, the entrance ...
    instagram.com - Moscow Museum Of Modern Art - 6 min. ago

    Museum Night in St. Petersburg: quest in Mikhailovsky Castle, St. Petersburg, May 17, 2015
    youtube.com - Today’s News - 3 hours ago.


    Solving problems of this class required completely new approaches and solutions. Over the past 10-20 years, IBM, SAP, Microsoft, Samsung and other giants have spent billions on technologies for processing “classic” texts (media, corporate documents, archival data).

    But those billions and achievements do not help with the new problems. Here the winner is whoever makes decisions faster (see the Big Game episode - megamozg.ru/company/palitrumlab/blog/14154 - about Apple and Twitter fighting over suppliers of unstructured Big Data). Continuing the Big Game, IBM wrote off its earlier spending (unlike SAP, which has already spent two years trying to solve the problems of Russian linguistics through its European centers) and in March acquired AlchemyAPI, a project that already had high-speed technologies for processing billions of texts in several Western languages.

    And now the “commercial break”, or rather, something for those who have long been looking:

    Our “answer to Chamberlain” (which we mentioned in the 6th episode) followed immediately: in May 2015, we spun off the new technologies into a standalone public solution for third-party companies - Eureka Engine (http://EurekaEngine.ru ), a high-load cloud solution and an industrial-grade API that teams, companies and organizations can integrate into their existing or in-development technology stacks.

    Eureka is already at work for RIA Novosti and Samsung, Mail.ru and RosTourism, Atonomy and Brand Analytics, and for agencies and companies in different countries. If you need to process large streams of unstructured data (building thematic storylines for newsrooms, sorting heaps of incoming documents into the right departments, detecting the language of texts, identifying named entities, etc.) - welcome!

    There will always be a solution, right?
