“Data mining is now an advantage in the market”: about SmartData conference and big data



    Conferences on the same topic can look completely different. And when a brand-new event is being planned, it is not clear in advance what to expect. If a conference is dedicated to “big and smart data”, will it turn out to be aimed at giant companies, with nothing there for people from small ones? And will it lean so heavily toward data science that people without an academic degree had better stay away?

    In anticipation of the SmartData conference, which will be held for the first time in St. Petersburg on October 21, we decided to find out and put questions to two members of its program committee: Vitaly Khudobakhshov (Odnoklassniki) and Roman “p0b0rchy” Poborchiy. They dispelled many fears, and the conversation turned out to be not only about the conference but also about the state of the industry: what is happening around machine learning, why small companies go into data mining, and why managers buy tickets to a technical conference about big data.

    JUG.ru: Machine learning is in the list of topics on the site, and the field looks to be booming right now. Could the conference talks become obsolete while they are still being prepared?

    Vitaliy: In fact, technologically things are not changing that fast. More importantly, most companies are now “catching up”.

    There is a “cutting edge”, like DeepMind, which no longer tells anyone anything and just does its work. But even they are often not doing anything terribly complicated; thanks to a large budget, they simply do not worry and can afford to get results by banging their heads against the same wall for a long time.

    And, of course, not everyone can afford to invest like that. But at the same time, first, there is now a lot of good, well-developed open source code that can already be used, and second, a lot of information is available. So most people are only just starting to use it. If you look, for example, at NVIDIA's share price over the last three years, it becomes clear that deep learning is only now getting started for real. Just look at demand for graphics cards: cryptocurrencies clearly affected it, but sales of graphics cards for deep learning have already surpassed sales of graphics cards for gaming. And this is a good marker showing that deep learning, for all its buzzword status, is something that really works.

    We at Odnoklassniki first tried deep learning a year and a half ago, and now, when students come to us and ask, “Oh, do you have a graphics card for us to train a net on?”, we pull a Tesla P40 out of a pocket. And while a few years ago a paper that taught a net to play classic Atari 2600 games was published in Nature, a very serious journal, today some MIPT student can write a model that plays the same games better. There is really nothing that complicated about it: one person did it, and now anyone can repeat it. And many can even do it better.

    In other words, objectively, no enormous technological changes are under way; the essentials are already known and accessible, and the question now is how well the audience can digest and adopt all this. And a conference is exactly what helps with that.

    Roman: I want to speak not about neural networks specifically, but about “big and smart data” as a whole: companies differ greatly in how much of the technology they have already adopted. Some things have been in use in some places since, say, 2008, while elsewhere people are, oddly enough, only now discovering them. Everyone seems to be following the articles, yet I see that the real stratification in the industry is very large.




    JUG.ru: Well, probably not everyone follows the articles. There are many small companies that do not aspire to world domination or revolutionary innovation, but do fairly standard things and do not really follow the “cutting edge”. Will they find SmartData useful or not?

    Vitaliy: They are exactly the ones who will.

    Strictly speaking, the conference shows that a particular range of problems can be solved in a particular way. We are, after all, not doing string theory. And perhaps even companies that are not very advanced can adopt some of this and gain some advantage in the market from it.

    Because data mining today is precisely an advantage in the market. It allows you to be better. And for a small company, being better cheaply is very valuable. Look at it this way: you can hire some smart person who will decide whom to sell to. Or you can download a random forest implementation and train a model that will do it better.

    There is a wonderful, very old story about how Amazon came up with its whole recommendations idea. Amazon once had a staff of experts who made recommendations by hand. But then a student came along, wrote a collaborative filtering algorithm, and everyone was fired because the algorithm was simply better.

    And I have a whole series of talks where I showed students and data mining beginners (and not only beginners) how to do item-to-item collaborative filtering on MapReduce in one slide. I show that anyone can do it. And this is what we want to get across: you do not have to be Yandex, Mail.Ru Group or Google. It is, of course, very cool when you are Google. But it is also very cool that we can take the open source code of large companies and take advantage of it. Use such an algorithm in everyday work and show that you can gain an advantage in the market even if your company has five people. That is very much our audience.
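
    To give a feel for how little code item-to-item collaborative filtering needs, here is a minimal sketch. It is not the slide from those talks: it is plain Python that only mimics the map and reduce steps instead of running on Hadoop, the toy data and all function names are hypothetical, and cosine similarity over co-occurrence counts is an assumed choice of similarity measure.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: (user, item) pairs meaning "this user bought this item".
events = [
    ("ann", "book"), ("ann", "lamp"), ("ann", "mug"),
    ("bob", "book"), ("bob", "lamp"),
    ("eve", "book"), ("eve", "mug"),
    ("dan", "lamp"),
]

def group_by_user(events):
    """First map/shuffle stage: collect the set of items seen by each user."""
    baskets = defaultdict(set)
    for user, item in events:
        baskets[user].add(item)
    return baskets

def emit_item_pairs(baskets):
    """Second map stage: emit ((item_a, item_b), 1) for each pair in a basket."""
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            yield (a, b), 1

def reduce_sum(pairs):
    """Reduce stage: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals

def item_similarities(events):
    """Cosine similarity between items, computed from co-occurrence counts."""
    baskets = group_by_user(events)
    co_counts = reduce_sum(emit_item_pairs(baskets))
    item_counts = reduce_sum(
        (item, 1) for items in baskets.values() for item in items
    )
    sims = defaultdict(dict)
    for (a, b), n in co_counts.items():
        score = n / sqrt(item_counts[a] * item_counts[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims

def recommend(sims, item, k=2):
    """Return up to k items most similar to the given one (ties broken by name)."""
    ranked = sorted(sims[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]

similarities = item_similarities(events)
print(recommend(similarities, "book"))
```

    On this toy data, “mug” ranks above “lamp” as a recommendation for “book”, because a larger share of mug buyers also bought the book. On a real cluster, each of the three stages would be a separate MapReduce job over the event log.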

    JUG.ru: Since data science is listed among the topics, I want to clarify: how much of the conference will be “academic”, and how much “industrial”?

    Vitaliy: There are two different storylines. One is big data and smart data in production. This is what people who go to the Joker conference, for example, are used to. And when you work with production every day, not that much of data science is actually applied: linear regression, logistic regression, deep learning.

    The other storyline is what is done in theory, as well as in one-off tasks: do the analytics, get a result, and take it to the boss, who will look at it and make a decision.

    There is a huge gap between what has to be done automatically every day and what can be done once. You need to understand that these are two completely different situations and two different audiences, and that what runs in production all the time was sketched out 30, 40 or 50 years ago, or maybe 100-200 years ago. But at the same time, what will be in production tomorrow is what is being sketched out now.

    We, the program committee, proceed from the practical benefit to the conference audience. And we ask speakers: “What you are going to tell and show us: is it something already in production, or something you just sketched out somewhere?”

    But at the same time, I personally think that both of these stories are important. And we should not dismiss scientific hardcore simply because people who write Java today have never heard of it. Tomorrow's real value is what counts as hardcore today.

    There are different opinions on this within the program committee. The question here is one of presentation: to what extent can hardcore material be made accessible to the public, so that it is understandable not only to the narrow group who come specifically for it, but also to people who, roughly speaking, just want to write their MapReduce job on Hadoop. Even if there are fewer formulas than there could be, the potential value will be clear, and whoever wants more can open the paper and read it.

    JUG.ru: And how do you see the audience of the conference?

    Vitaliy: We see them as real professionals who come from practice, from programming, and want to absorb the culture of data science and data mining, and perhaps introduce it in their own company.

    In addition, many large companies have an R&D department: people who know considerably more than ordinary developers and work on things that do not reach production, or do not reach it right away. I, for example, am in effect in the R&D department at Odnoklassniki. People usually come to me with questions to which nobody knows the answer. Such people are, of course, also our audience, even if they are in the minority. They may not be very interested in production topics, but they do want to hear a talk by Aleksei Potapov or other well-known scientists who are looking ahead in data analysis and artificial intelligence.

    And besides this, our audience also includes managers who want to get into the subject and learn to set tasks. After all, tasks usually come from above. And in order to do any data mining, a manager must understand which tasks can be solved by data analysis and which cannot, and then come to the engineers, to the data miner, and discuss it. For example, banking is an ultra-conservative business where data mining could be useful, but its management currently knows little about data mining. Banks do have algorithmic trading, which is a slightly different story, but in general they are very conservative.

    Some managers have already bought tickets; they told me so themselves. Perhaps not every manager will be interested, because many simply do not understand that this is important. But many do.

    JUG.ru: The mention of managers may come as a surprise to many, because JUG.ru Group conferences are not usually associated with them. Are the talks themselves aimed primarily at techies? Won't they have to be adapted so that managers can follow them?

    Vitaliy: Primarily at techies, of course. But you need to understand that this is still not Joker. There is no Shipilev here with his “now we put on gloves, reach into the guts of the JVM and see what is in there.” We are talking about real cases of using real data. Such tasks have, roughly speaking, an engineering component and a domain component, and we are doing everything we can to make the talks more domain-focused.

    Roman: I would add this: the layer of people who work with machine learning in some way, or who face the problem of truly large volumes of data, is currently much thinner than the layer of Java programmers. So in the case of Java, even if you pick a subset of Java programmers and build a conference around that narrow segment, you can still gather a large audience. In our case, for now it seems more logical to include a wider range of things for different people. Besides, we are still studying the audience; this is our first time. Once we have held the conference, we will see how it went, using the data methods that we happen to know.

    JUG.ru: Roman, many know you as a coach for speakers, and you have already analyzed talks in detail on Habr, explaining things like “laser pointers are evil.” Since you are on the SmartData program committee and review talks, do you, besides the content, also watch out for laser pointers?

    Roman: Yes, I eradicate them wherever I find them; it really is better without them, isn't it? But never mind laser pointers, there are other things. For instance, I see that the wonderful Codeware slide deck about how to format code on slides has been making the rounds. Of course, we will try to make sure that code in all presentations is formatted accordingly. We will also try to make sure that people do not forget to say what problem they are solving and what recommendations they want to give the audience at the end of their talk, and that everything they put on slides is more or less relevant and legible. Those things, yes, of course we will try to do.

    JUG.ru: It has become clearer what will be at the conference. A final question: what will not and cannot be at SmartData?

    Roman: We are trying very hard to weed out bullshit. When it comes to big data, you can say a lot of impressive words without saying anything of substance. We are for specifics.

    Vitaliy: There are two types of inappropriate talks. One is bullshit, and the other is when, in the struggle for the fifth decimal place, people forget about practical benefit, when stacking and blending are done for the sake of stacking and blending rather than to achieve specific goals.

    Both are bad because they are divorced from reality. And we want the conference to be connected with reality.



    SmartData will be held on October 21; tickets are already on sale on the conference website (and they get more expensive over time). Its main topics are:

    • Data and its processing (Spark, Kafka, Storm, Flink)
    • Storage (Databases, NoSQL, IMDG, Hadoop, cloud storage)
    • Data Science (Machine learning, neural networks, data analysis)
