The creator of Open Data Science on Slack, xgboost, and GPUs

    The Open Data Science (ODS) community is already known on Habr for its open machine learning course ( OpenML ). Today we talk with its creator about the history of ODS, its people, and the most popular machine learning methods (as seen on Kaggle and in industry projects). Interesting facts and technical expertise are below the cut.




    Alexey Natekin ( natekin ). Founder of a number of projects related to machine learning and data analysis. Dictator and coordinator of Open Data Science, the largest online community of data scientists in Eastern Europe.

    - How did the idea of creating the community come about? How did it all start? Why did you choose Slack, and who was there at the very beginning?

    Alexey Natekin: It was in 2014, when the third cohort of our free educational program at DM Labs ( DM: Data Mining ) was coming to an end. We had already covered the material and wanted to move on to working together on projects. This was the first iteration of fun tasks: analyzing porn tags, detecting depression from social network data, game data from Dota. From the very beginning the idea was that the projects should be open and involve not only course participants but also more experienced people from outside.

    Having experimented with makeshift chat rooms on VKontakte and a self-hosted WordPress, we found that none of it was suitable for real work. Fortunately, by early 2015 Slack was slowly starting to gain popularity, and we adopted it as the platform for projects and communication. It turned out to be very convenient, and we settled on it almost immediately.

    At that point we didn't think much about wrapping it all in some beautiful ideology, and we simply called the whole thing Open Data Science . The phrase was used by only one conference (the Open Data Science Conference), and the combination of open science and co-education in DS ( Data Science ) - teaching others while learning yourself - was a good fit for what we wanted. After all, Open Data and Open Science already existed; all that remained was to invent and build the missing link.



    We started as a chat for projects, but many expert and technology channels for DS discussions quickly emerged. You could say the focus shifted toward a local Russian-language Stack Overflow: the projects lived their own lives, while most of the activity turned into general discussions.

    Fortunately, we quickly accumulated a critical mass of expertise in the main DS-related areas and technologies, and the first core group of people formed in ODS. At any moment you could ask a question in the area you were interested in and get good advice from someone well versed in it.

    At that time, professional DS communities in the form they exist in now barely existed: they were either the audience of some regular meetup on a single topic, or closed groups tightly bound to a specific place (university students, for example).

    We were in favor of joining forces from the start, so we began to integrate with meetups and various groups: Moscow Independent DS Meetup, ML trainings (which started before ODS, after the SNA Hackathon in St. Petersburg), meetups on R and Big Data, and later DeepHack, Deep Learning Meetup, DS Meetup from Mail.ru, and many others.

    Fun fact: at one of the winter gatherings we learned that the payment for the meetup account had run out, and so that the next batch of Big Data spammers wouldn't snatch it up, we suddenly had to pay for the lapsed MDSM Mail.ru account ourselves - the payments still come off my card :)

    It so happened historically that the most active people with whom we built ODS were also event organizers. So we not only helped each other with speakers, PR, and organizational issues, but also quickly began to invent and run new events and formats ourselves. Among them are DataFest, the largest DS conference in Eastern Europe and the CIS; DS Case Club, a series of events with invaluable content about the real benefits of DS for business; and data breakfasts, with their own unusual but well-loved format.

    And of course we cooperated with companies: with Yandex we ran a series of Data & Science events, and with Sberbank, the Sberbank Data Science Journey. By recent estimates, we have accumulated more than 20 regular events across the CIS.

    Expansion was not long in coming: we shared our experience and launched events and DS development in other cities and countries. First Moscow and St. Petersburg, then Yekaterinburg, Novosibirsk, Nizhny Novgorod, and Kaliningrad, then Ukraine with Kiev, Lviv, and Odessa, Belarus with Minsk, and more and more new cities across the CIS.

    The admin team now has 35 people from 4 countries. Active participants gather for meetups in the USA, Germany, Norway, and Israel, but we are still working on going global. We have 7.5 thousand people from 20 time zones in Slack, more than 3 thousand of whom visit at least once a week. So the global potential is there.

    Do we have analogues and competitors? It would be very cool if there were at least one analogue in the world of the kind we have grown into - we would cooperate with them. Unfortunately, in the USA, which is considered the leader in DS / ML, there is nothing similar to us in spirit, and it is unlikely to appear.

    Meetups there are littered with paid marketing garbage, and local communities are rigidly tied to universities and companies (a separate crowd at Google, a separate one at Amazon, a separate one at Facebook, and so on). At a typical Machine Learning Meetup it is hard to find people who actually do machine learning: out of 100 attendees you usually won't even find 10; the rest are the merely curious, onlookers, evangelists, and PR people. At the serious conferences, on the other hand, the level is truly world-class, and there are almost no random people.

    The AI Researchers Slack, created right after last year's NIPS conference, which gathered 1,000 people in 9 months and where Ian Goodfellow was spotted in the first week, is essentially dead: it has 24 thousand messages in total. We write about 30 thousand messages a week, almost 1.5 million in total. There is KDnuggets, which is more of a DS blogging platform. Arguably the largest DS community lives on Kaggle (I wouldn't be surprised if we have more messages than they do). But we have not yet seen an analogue that combines the platform, the events, and other initiatives such as education.

    - The phrase "Run xgboost" has become a meme, so what is xgboost and why should it be stacked?

    Alexey Natekin: Xgboost is a specific implementation of the gradient boosting algorithm. The algorithm itself is used everywhere as an extremely powerful general-purpose model - both in production, including the search engines of Yandex, Mail.ru, Microsoft, and Yahoo, and on competition platforms like Kaggle. Xgboost is, first of all, very efficiently written code that lets you train models faster and more productively. Second, xgboost adds extra goodies and regularization, so the trees themselves are more stable.

    Xgboost on its own is good enough to use without any tricks. On Kaggle and beyond, it has become the no-brainer solution that you take off the shelf knowing it usually gives a very good result. So: there is, of course, no free lunch ( note : a reference to the No Free Lunch theorem ), but xgboost is worth trying on your next task.
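    To illustrate the "off the shelf" point, here is a minimal sketch of training an xgboost classifier with near-default settings via its scikit-learn interface; the dataset and parameter values are made up for the example.

```python
# A minimal "take it off the shelf" sketch: an xgboost classifier with
# near-default settings, evaluated by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic data standing in for a typical tabular task.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

model = XGBClassifier(
    n_estimators=300,    # number of boosted trees
    max_depth=4,         # depth of each tree
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("mean CV ROC AUC:", scores.mean())
```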

    However, on Kaggle the main goal is to get the best score on the leaderboard. Time and again, that best score came down to dozens of people fighting over the third, fourth, and fifth decimal place, trying to squeeze out at least a little extra accuracy. There has long been a well-known way to scrape up such crumbs: stacking and multi-level ensembles.

    If, instead of one model, you very carefully and competently train a few dozen, and then a couple more dozen on top of their predictions, you can scrape off a little more accuracy. The price is a level of computational cost that no one in their right mind would repeat in practice, but on Kaggle nobody promised sound minds or realistic tasks.

    Thus, the winning solution - the one that took first place - often consisted of several layers of stacked xgboosts. Xgboost, because it is powerful and fast. Stacked, because you need to take first place.
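    For the curious, here is a bare-bones two-level stacking sketch, assuming made-up data and hyperparameters; real competition ensembles are usually much deeper and hand-tuned.

```python
# Level 1: base models produce out-of-fold predictions so the meta-model
# never sees predictions made on the same data the base models trained on.
# Level 2: a meta-model is trained on those predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

base_models = [
    XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05),
    RandomForestClassifier(n_estimators=300, random_state=0),
]

# Out-of-fold class probabilities become the features for the second level.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The second level is often a simple model; in Kaggle ensembles it may be
# yet another xgboost, and the whole construction can be repeated in layers.
meta_model = LogisticRegression().fit(meta_features, y)
```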

    The phrase "stack xgboosts" is, in essence, a mockery of the senseless and merciless nature of Kaggle contests, many of which can be won by brute force with a solution that is terrible from the point of view of practical use - but that nevertheless wins on Kaggle.

    - They say that xgboost shows excellent results in practical applications. Can you give an example where xgboost is really much stronger than its competitors? And is there any rationalization for why this happens on that kind of data?

    Alexey Natekin: It depends on what counts as a competitor. In general, gradient boosting sits under the hood of such a wide range of applications that it is hard to list them all: antifraud / antispam, all kinds of forecasting in financial companies, and the search engines of the largest companies, as I mentioned above. As for Kaggle tasks, for all their toy-like relation to production, most contests where the data is not very sparse and not images are also usually won by gradient boosting plus some kind of ensemble on top, especially when you need to squeeze out extra accuracy.

    You cannot say that boosting is an order of magnitude stronger than its competitors, since in business applications no one will disclose what results these models achieved on real data. And the result of boosting itself is often not head and shoulders above its closest competitors: random forests, neural networks, or SVMs.

    It is just that boosting has the best-engineered implementations and is a very flexible family of methods, so tuning boosting for a specific task is not a big deal. As for the rationalization - why exactly boosting works and what the trick is - I can recommend a couple of my tutorials ( 1 , 2 ) and one by Alexander Dyakonov.
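    As a rough intuition for what those tutorials unpack in detail, here is a bare-bones sketch of gradient boosting for squared loss: each new tree is fit to the residuals (the negative gradient of the loss) of the current ensemble. The function names and parameter values are made up for the illustration.

```python
# Gradient boosting for squared loss, stripped to the core idea.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())    # start from a constant model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict_gbm(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)
```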

    - The announcement of your talk at the SmartData conference notes that people expect the same GPU acceleration from xgboost as from neural networks. Can you explain intuitively why neural networks get such a performance boost?

    Alexey Natekin: How does your average data satanist (as data scientists jokingly call themselves), that Stakhanovite of intellectual labor, think? There are neural networks; cool hardware and top-end video cards with thousands of specially optimized cores have been built for them. And Nvidia itself says: our hardware will advance and accelerate AI, whatever that means. Hence the unrealistic expectation that GPUs can be used in a much wider range of tasks.

    In neural networks, the bulk of the operations performed during both training and prediction are matrix multiplications. These days it is more honest to say that it is work with tensors, but the essence is the same: you need a lot of routine matrix operations.

    GPUs are ideally suited for this, since they offer far more computing cores at a lower cost and power consumption, and the lack of an extended instruction set can safely be ignored.
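    A tiny sketch of that effect, assuming a CUDA-capable GPU and a PyTorch build with CUDA support (the matrix sizes are arbitrary):

```python
# Timing one large matrix multiplication on the CPU and on the GPU.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.time()
c_cpu = a @ b                      # dense matmul on the CPU
cpu_time = time.time() - t0

a_gpu, b_gpu = a.cuda(), b.cuda()  # move the operands to GPU memory
torch.cuda.synchronize()
t0 = time.time()
c_gpu = a_gpu @ b_gpu              # the same matmul across thousands of GPU cores
torch.cuda.synchronize()           # wait for the asynchronous kernel to finish
gpu_time = time.time() - t0

print(f"CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s")
```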

    The flip side is that sorting arrays and doing recursive computations on the GPU is hard, while computing convolutions and multiplying matrices is very cheap. This raises the obvious question: what happens if we try to train decision trees, which are used so universally in practice, on the GPU? And how sensible is that idea by design, as they say?

    - And why doesn't this happen with xgboost and other machine learning algorithms, and what needs to be done to use the GPU for training them?

    Alexey Natekin: The GPU is great for algorithms that are adapted for it. As I already said, you are not going to build indexes on the GPU. For wider GPU use in machine learning tasks beyond neural networks, we need to come up with efficient GPU implementations of the algorithms. Or, more likely, to take inspiration from the existing CPU versions and design new algorithms on their basis that are suited to efficient GPU computation. Or to wait for a broader release of Intel Phi, about which various legends circulate. But that is not a GPU, and a completely different story.
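    For reference, xgboost does ship a GPU-accelerated histogram tree method. A minimal sketch is below, assuming an xgboost build compiled with CUDA support; the exact parameter spelling differs between versions (older releases used tree_method="gpu_hist", while xgboost 2.x uses device="cuda").

```python
# Training xgboost with GPU-accelerated histogram tree construction.
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic data standing in for a large tabular dataset.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.1,
    "tree_method": "hist",  # histogram-based split finding
    "device": "cuda",       # run histogram building on the GPU (xgboost >= 2.0)
}
booster = xgb.train(params, dtrain, num_boost_round=200)
```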

    - And finally, a question about hardware: what specifications should you pay attention to when buying GPUs for machine learning? What are the top Kaggle people currently using?

    Alexey Natekin: In practice, people mostly buy the 1080 Ti, since it has the best combination of price, speed, and 11 gigabytes of memory. Between the Titan X and the 1080 Ti, the latter gets chosen. Datasets stopped fitting into memory long ago anyway, so the more memory, the more you can cram into processing at once. But in general everyone is waiting to see what the next generation of cards brings: as soon as they appear, they will need to be snapped up very quickly.



    If you are as obsessed with machine learning and data analysis as we are, you may find these talks at our upcoming SmartData 2017 conference, to be held in St. Petersburg on October 21, interesting:

