Missed Opportunities of BigData

    More has probably been written about the incredible future of BigData multiplied by artificial intelligence than the collected works of the Strugatsky brothers and Jules Verne combined. All of these authors argue, and not entirely without reason, that the huge amounts of collected data, processed with, for example, Deep Learning, will make it possible to identify every fraudster, prevent dubious transactions, and predict the most profitable markets, while the financial industry itself becomes fully automated under the control of a wise artificial intelligence.

    Perhaps this will come true to some extent. The degree of automation already achieved today would have seemed fantastic 10 years ago. All of that is so... But, as we know, "little things" can bring plenty of surprises. One such trifle is the fact that the lion's share of all the data that could and should be used to fight fraud and forecast markets is text. Billions of lines of written, video, and other data are generated every day, and analyzing them with human operators is practically useless. Someone might object that this is not so and that most of the data are ordinary tables that statistical methods handle well, and at first glance they would be right: banks in the TOP-30 report widespread use of BigData, and Alfa Bank, for example, deals primarily with structured transaction data.

    But even in the analysis of structured data, we will find that all these mountains of numbers rest on separate columns that carry additional meaning: names of goods, names of organizations without any TIN, surnames and other, let's say, "unstructured data".
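
    To make this concrete, here is a minimal sketch in Python of what such "structured" data often looks like in practice. The table fields, company names and the normalization rule are invented for illustration, not taken from any real system.

        import re

        # Hypothetical rows from a perfectly "structured" payments table:
        # the amount and date parse cleanly, but the payee and purpose
        # columns are free text with no TIN or catalog code attached.
        rows = [
            {"amount": 125000.00, "date": "2017-05-12",
             "payee": 'OOO "Stroy-Invest Plus"', "purpose": "payment for cement M500, 20 t"},
            {"amount": 125000.00, "date": "2017-05-13",
             "payee": "StroyInvest+ LLC", "purpose": "cement m-500 (20 tonnes), invoice 17"},
        ]

        def normalize_payee(name):
            """Naive normalization: lowercase, drop legal-form tokens and punctuation."""
            name = name.lower()
            name = re.sub(r"\b(ooo|llc|zao|oao)\b", " ", name)
            name = re.sub(r"[^a-z0-9]+", " ", name)
            return " ".join(name.split())

        # Even this toy example shows the problem: the same counterparty and
        # the same goods are written differently, and simple string cleaning
        # alone does not make the two rows comparable.
        for r in rows:
            print(normalize_payee(r["payee"]), "|", r["purpose"])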

    Another huge layer is data arrays with price lists, advertisements for the sale of apartments, cars, and much more. Here again someone will say: "but almost everywhere there are product catalogs, there are TN FEA, OKVED-2 and much more." And this very objection already contains the answer to many questions: all these reference catalogs are industry-specific and incomplete, they lack full descriptions and assignment rules, and human imagination sometimes knows no bounds. In other areas, such as arrays of contracts, job advertisements and Internet posts, there are no reference catalogs at all.

    What unites all these problems is the recognition that no statistical methods, not even neural networks, can solve them without search and analytical systems for semantic and semiotic analysis. A simple example is the task of combating fraud in mortgage lending or in issuing a car loan for a used car. The set of data one would like to have is, I think, clear to everyone: Is the apartment or car for which the loan is requested already listed for sale? What is the price per square meter in the same or a neighboring building, or the price of a similar car? What are the prices within the settlement, within the metropolitan area, and so on?
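
    As a rough sketch of such a check (all numbers, field names and the 25 percent threshold are hypothetical, and real listings would first have to be parsed out of free text), the comparison itself is simple once the data has been extracted:

        from statistics import median

        # Hypothetical, already-parsed listings for the same and neighboring buildings.
        listings = [
            {"address": "Lenina 10", "area_m2": 54.0, "price": 5_200_000},
            {"address": "Lenina 10", "area_m2": 61.5, "price": 5_950_000},
            {"address": "Lenina 12", "area_m2": 48.0, "price": 4_700_000},
            {"address": "Lenina 12", "area_m2": 70.0, "price": 6_800_000},
        ]

        def median_price_per_m2(items):
            return median(i["price"] / i["area_m2"] for i in items)

        def flag_application(declared_price, area_m2, market_items, tolerance=0.25):
            """Flag the application if the declared price per square meter deviates
            from the local market median by more than the given tolerance."""
            market = median_price_per_m2(market_items)
            declared = declared_price / area_m2
            deviation = (declared - market) / market
            return abs(deviation) > tolerance, deviation

        # A mortgage application claiming 8.5 million for a 55 m2 flat on the same street.
        suspicious, dev = flag_application(8_500_000, 55.0, listings)
        print(f"deviation from market: {dev:+.0%}, flagged: {suspicious}")

    The hard part, of course, is not this arithmetic but getting from raw advertisements to the clean "address, area, price" records it assumes.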

    Downloading data from websites "as is" is not a technically difficult task today. Having obtained such a dump, we end up with millions of records of unstructured information, a database that is BigData in the fullest sense. Analyzing job-offer databases to verify the adequacy of the salary indicated on an income certificate, or assessing the trustworthiness of the younger generation without analyzing social networks, is an impossible task.
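
    A minimal sketch of such an "as is" download (the URL, pagination parameter and page count are placeholders, not a real site or API; the requests library is one common third-party choice):

        import requests  # third-party: pip install requests

        BASE_URL = "https://example.com/listings"  # hypothetical listings site

        def download_raw_pages(pages=3):
            """Pull pages 'as is': no parsing, no cleaning, just raw HTML saved to disk."""
            for page in range(1, pages + 1):
                resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
                resp.raise_for_status()
                with open(f"raw_page_{page:05d}.html", "w", encoding="utf-8") as f:
                    f.write(resp.text)

        if __name__ == "__main__":
            download_raw_pages()

    Everything interesting, and difficult, begins after this step, when those raw pages have to be turned into comparable records.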

    Recently, various government bodies have become increasingly interested in semantic data analysis. One example is the electronic auction placed on the government procurement website in May 2017 for the development of the "analytical subsystem of the AIS FTS", which includes a semantic text analysis subsystem.

    Unfortunately, behind these success stories lies a whole pool of related problems and missed opportunities. Let's try to examine at least some of them.

    The first is the availability of the data itself. The volume and velocity of incoming data today rule out processing by human operators. The consequence is an urgent need on the market for products that solve Data Quality and Data Mining tasks automatically, with a recovery level of at least 80-90 percent at very high processing speed and, no less important, an error rate of no more than 1-1.5 percent. An attentive reader may point out that there are various distributed solutions, such as Hadoop, that address the performance problem. All true, but many forget that such processes are cyclical in nature: what has just been extracted must be fed back into the dictionaries, search indexes, and so on. Data that does not overlap within one stream may overlap with data from another stream. Therefore, the number of parallel branches should be kept to a minimum, while the performance within a single thread should be maximized.
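
    A toy illustration of this cyclical nature (the "discovery" rule and the sample records are invented purely for the sketch): every entity extracted from a record is fed back into a shared dictionary that the processing of later records relies on, which is exactly what fully independent parallel branches would break.

        # Seed dictionary / search index shared by the whole stream.
        known_entities = {"stroyinvest"}

        def extract_and_learn(record, dictionary):
            """Match known entities first, then feed newly seen candidates back."""
            tokens = record.lower().split()
            matched = [t for t in tokens if t in dictionary]
            # Naive discovery rule just for the sketch: treat capitalized tokens
            # as candidate entities and add them to the shared dictionary.
            for raw in record.split():
                if raw[:1].isupper():
                    dictionary.add(raw.lower())
            return matched

        stream = [
            "Invoice from Stroyinvest for cement",
            "Payment to Romashka for delivery",
            # "Romashka" below is recognized only because the previous record
            # has already enriched the shared dictionary.
            "Romashka for second shipment",
        ]

        for rec in stream:
            print(extract_and_learn(rec, known_entities))

    Split the stream into isolated branches and the third record may land in a branch that has never seen "Romashka", which is why extracted results have to be merged back and the branching kept modest.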

    The second is the actual share of data that gets used. According to some Western sources, the share of "dark", or hidden, data in different countries reaches half or more. The main reasons it cannot be used are its poor structuring combined with low quality. Here I want to clarify right away that poor structuring and low quality are two completely different problems. Unstructured data is difficult to decompose into components, to build dependencies on, and to compare, yet it can be absolutely reliable and valid. Invalid, or poor-quality, data can be perfectly structured but fail to correspond to objects in the "real" world. For example, a mailing address can be neatly laid out in fields and yet not exist in nature.
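
    A small sketch of that distinction (the fields, the toy registry and the example addresses are all invented; in Russia a real validity check would go against a reference registry such as FIAS):

        # Perfectly structured, but invalid: every field is filled in neatly,
        # yet house 999 may simply not exist on this street.
        structured_but_invalid = {
            "postcode": "101000",
            "city": "Moscow",
            "street": "Tverskaya",
            "house": "999",
        }

        # Unstructured, but valid: a single free-text string that a human
        # (or a good address parser) can resolve to a real location.
        unstructured_but_valid = "moscow tverskaya str 7 apt 12"

        def is_structured(record):
            """Structure check: are the expected fields present and non-empty?"""
            required = ("postcode", "city", "street", "house")
            return isinstance(record, dict) and all(record.get(f) for f in required)

        def exists_in_registry(record, registry):
            """Validity check: does the (street, house) pair occur in a reference registry?"""
            return (record.get("street"), record.get("house")) in registry

        toy_registry = {("Tverskaya", "7"), ("Tverskaya", "9")}
        print(is_structured(structured_but_invalid))                     # True
        print(exists_in_registry(structured_but_invalid, toy_registry))  # False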

    The third is the lack of competence of Western systems in the semantics of the Russian language. Analysts themselves often overlook this problem when choosing systems for working with data. Solution providers and system integrators warmly promise that the issue is easily resolved, because "our solution is already present in many countries." As a rule, however, they keep quiet about the fact that either those deployments are international organizations working in English or a language of the same Romance group, or the implementation is not fully localized. In our experience, all known attempts on the Russian market to localize semantic search tasks have failed, reaching a quality level no higher than 60-70 percent of what is possible.
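
    One small illustration of why Russian is hard to bolt on afterwards: a single surface form is often shared by several lemmas and grammatical readings. The sketch below uses pymorphy2, a third-party morphological analyzer for Russian, purely to show the ambiguity; it says nothing about any particular vendor's system.

        # pip install pymorphy2
        import pymorphy2

        morph = pymorphy2.MorphAnalyzer()

        # The token "стали" can be a case form of the noun "сталь" (steel)
        # or a past-tense form of the verb "стать" (to become). A pipeline
        # tuned for English has no reason to resolve this kind of ambiguity.
        for parse in morph.parse("стали"):
            print(parse.normal_form, parse.tag, round(parse.score, 3))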

    The fourth is that different participants in a process may have different ideas about the rules for classifying particular entities. This is not just a matter of having several systems within one information landscape: often, within the same system, objects that are essentially the same are described and classified in different ways. And the reason is not carelessness or negligence on the part of some employees. The main reason lies in the context or conditions in which the action took place, in national traditions and differing cultural codes. Under such conditions, it is simply impossible to regulate the rules unambiguously.

    Thus, the task of using big data, artificial intelligence, and so on actually requires a broader view, better captured by the term Data Science. And when designing BigData solutions, separate and equally important attention should be given to data cleaning and extraction. Otherwise, as the well-known saying goes, an automated mess is still a mess.
