OpenDataScience and Mail.Ru Group will conduct an open machine learning course

    The second run of the OpenDataScience open course on data analysis and machine learning starts on September 6, 2017. This time there will also be live lectures, hosted at the Moscow office of Mail.Ru Group.



    In short, the course consists of a series of articles on Habr (here is the first), reproducible materials (Jupyter notebooks, here is the course's GitHub repository), homework, Kaggle Inclass competitions, tutorials, and individual data analysis projects. Here you can sign up for the course, and here you can join the OpenDataScience community, where all communication during the course will take place (the #mlcourse_open channel in the ODS Slack). For more detail, read on below the cut.





    What makes the course special



    The goal of the course is to help you quickly refresh what you already know and find topics for further study. It is unlikely to work well as your very first course on the subject. We did not set out to create a comprehensive course on data analysis and machine learning; we wanted a course with the best possible combination of theory and practice. So the algorithms are explained in reasonable detail, with the mathematics, and practical skills are reinforced by homework, competitions, and individual projects.


    A big plus of this particular course is its active forum (the OpenDataScience Slack community). In a nutshell, OpenDataScience is the largest Russian-speaking community of data scientists, and it does a lot of cool things, including organizing Data Fest. The community is also very active in Slack, where any participant can get answers to their data science questions, find like-minded people and collaborators for projects, find a job, and so on. A separate channel has been created for the open course, where 300-400 people studying the same material as you will help you master new topics.


    As for the format of the material, we settled on Habr articles and Jupyter notebooks. Live lectures and their video recordings are now being added as well.


    Who is the course for and how to prepare for it


    Prerequisites: you need to know mathematics (linear algebra, analytic geometry, calculus, probability theory, and statistics) at the level of a second-year student at a technical university, and you need to be able to program a little in Python.


    If you lack some of this knowledge or these skills, the first article of the series describes how to review the math and refresh (or acquire) Python programming skills.


    Oh, and knowledge of English, as well as a good sense of humor, won't hurt.



    What the course includes


    Articles


    We bet on Habr and on presenting the material as articles, so that you can quickly and easily find the part you need at any time. The articles are already written; in September-November they will be partially updated, and one more article, on gradient boosting, will be added.


    List of articles in the series:


    1. Exploratory data analysis with Pandas
    2. Visual data analysis with Python
    3. Classification, decision trees, and the nearest neighbors method
    4. Linear classification and regression models
    5. Ensembles: bagging, random forest
    6. Feature engineering and feature selection. Applications to text, image, and geodata processing
    7. Unsupervised learning: PCA, clustering
    8. Learning on gigabytes of data with Vowpal Wabbit
    9. Time series analysis with Python
    10. Gradient boosting

    Lectures


    Lectures will be held at the Moscow office of Mail.Ru Group on Wednesdays from 19:00 to 22:00, from September 6 to November 8. They will cover the theory, broadly following the same plan as the articles. Lecturers will also work through problems live, and the last hour of each lecture will be devoted to practice: students will analyze data themselves (yes, actually write code), with the lecturers helping them. The top 30 participants in the current course ranking will be able to attend each lecture in person. The ranking is determined by homework, competitions, and data analysis projects. The lectures will also be broadcast.



    Lecturers:


    • Yuri Kashnitsky. Research programmer at Mail.Ru Group and senior lecturer at the HSE Faculty of Computer Science, as well as an instructor in HSE's year-long continuing education program in data analysis.
    • Alexey Natekin. Founder of the OpenDataScience community and DM Labs, Chief Data Officer at Diginetica. Formerly head of analytics at Deloitte. Ideological leader of the OpenDataScience community and organizer of DataFest.
    • Dmitry Sergeev. Data Scientist at Zeptolab, lecturer at the Center for Mathematical Finance at Moscow State University.

    You can read about all the authors of the course articles here.


    Homework


    Each of the 10 topics comes with a homework assignment, with one week to complete it. Each assignment is a Jupyter notebook in which you fill in the code and then, based on the results, select the correct answers in a Google form. Homework is the first thing that starts to affect participants' ranking in the course and, accordingly, who gets to attend the lectures in person.


    The course repository currently contains 10 homework assignments with solutions. For the new run of the course, the assignments will be new.


    Tutorials


    One of the creative assignments in the course is to choose a topic in data analysis and machine learning and write a tutorial on it. You can see examples from the previous run here. The experiment was a success: course participants wrote several very solid articles on topics that the course itself did not cover.


    Kaggle Inclass Competitions



    Of course, you get nowhere in data analysis without practice, and competitions are where you can quickly learn something new and learn how to apply it. In addition, the incentives on offer (prize money and ranking points on the "big" Kaggle, or simply rating points in our course) encourage very active study of new methods and algorithms during a competition. The first run of the course offered two competitions with very interesting problems:


    • Identifying an intruder by their behavior on the Internet. Given real data on users' visits to various sites, the task was to determine from the sequence of sites visited within 30 minutes whether the user was a certain Alice or someone else.
    • Predicting the popularity of an article on Habr. Given the text, publication time, and other features of a Habr post, the task was to predict its popularity, measured as the number of times it was added to favorites.
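
    To make the first task more concrete, here is a minimal sketch of one possible way to score a session as "Alice or not" from the sites visited in it. It is an illustration only, not the approach used in the course or by the competition winners: it treats each session as a bag of sites and sums smoothed per-site log-odds (a naive-Bayes-style likelihood ratio with class priors omitted). The site names, the toy sessions, and the function names are all made up for the example.

```python
import math
from collections import Counter


def train_site_log_odds(sessions, labels, alpha=1.0):
    """Return smoothed log-odds per site: Alice's sessions vs. everyone else's.

    sessions: list of lists of site names (one list per session).
    labels:   list of bools, True if the session belongs to Alice.
    alpha:    additive (Laplace) smoothing constant, an assumption of this sketch.
    """
    alice_counts, other_counts = Counter(), Counter()
    for sites, is_alice in zip(sessions, labels):
        (alice_counts if is_alice else other_counts).update(sites)

    vocab = set(alice_counts) | set(other_counts)
    alice_total = sum(alice_counts.values()) + alpha * len(vocab)
    other_total = sum(other_counts.values()) + alpha * len(vocab)

    return {
        site: math.log((alice_counts[site] + alpha) / alice_total)
        - math.log((other_counts[site] + alpha) / other_total)
        for site in vocab
    }


def score_session(sites, log_odds):
    """Sum per-site log-odds; sites never seen in training contribute 0."""
    return sum(log_odds.get(site, 0.0) for site in sites)


# Hypothetical toy data: each session is the list of sites visited in it.
sessions = [
    ["mail.example", "video.example", "mail.example"],
    ["news.example", "shop.example"],
    ["video.example", "mail.example"],
    ["shop.example", "news.example", "news.example"],
]
labels = [True, False, True, False]

log_odds = train_site_log_odds(sessions, labels)
alice_like = score_session(["mail.example", "video.example"], log_odds)
other_like = score_session(["news.example", "shop.example"], log_odds)
print(alice_like, other_like)
```

    A positive score means the session looks more like Alice's training sessions, a negative one more like other users'. This sketch ignores the order of visits and their timing, both of which the task description suggests are available in the real data.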

    Individual projects



    Image from the VKontakte public page "Memes about machine learning for adult men."


    The course runs for 2.5 months, and a lot of activities are planned. Still, be sure to consider completing your own data analysis project from start to finish, following the plan proposed by the instructors but using your own data. Projects can be discussed with fellow participants, and at the end of the course there will be a peer review of the projects.
    Details about the projects will come later, but for now you can think about what data you would like to take in order to "predict something from it." If you have no ideas, that's okay: we will suggest some interesting tasks and datasets for analysis, at different levels of difficulty.


    How do I enroll in a course?


    To participate in the course, fill out this survey and join the OpenDataScience community (in the "How did you find out about OpenDataScience?" field, answer "mlcourse_open"). Most of the communication during the course will take place in the #mlcourse_open channel of the OpenDataScience Slack.


    How the first run of the course went


    The first run took place from February to June 2017. About a thousand people signed up; 520 completed the first homework and 150 the last. The forum was in full swing, several thousand submissions were made in the Kaggle competitions, and course participants wrote a dozen tutorials. Judging by the reviews, participants gained solid experience from which they can go on to dive into neural networks, Kaggle competitions, or the theory of machine learning.


    As a bonus, the top 100 finalists of the course were invited to a meetup at the Moscow office of Mail.Ru Group, featuring three lectures on topics relevant to modern data science:


    • Big data processing with Apache Spark (Vitaliy Khudobakhshov, Odnoklassniki). Video: part 1, part 2;
    • Fundamentals of Neural Networks and Deep Learning (Alexey Ozerin, Reason8.ai). Video;
    • Deep Learning in solving sentiment analysis problems (Vitaliy Radchenko, Ciklum). Video.

    Bonus: taking cs231n together


    One last piece of good news for now: starting in mid-November 2017, right after the introductory machine learning course ends, we will work through one of the best neural network courses together, in the same #mlcourse_open channel in the ODS Slack: the Stanford course cs231n, "Convolutional Neural Networks for Visual Recognition."


    Good luck studying this wonderful discipline, machine learning! And here are these two fellows, for motivation.



    Andrew Ng interviews Andrej Karpathy as part of the Deep Learning specialization.

