How Data Mining Companies Live: Tasks and Research

    Hello, Habr!

    We have finally got around to it. It's time to tell you what our company, DM Labs, does in data analysis besides its educational activities (we have already written about those: 1).

    Over the past year we began working closely with the Fortiss Institute of Robotics at the Technical University of Munich (TUM), where together we teach robots not to kill people. We also released a prototype of an antifraud system, took part in international machine learning conferences and, most importantly, managed to build a strong team of analysts.

    DM Labs now combines three areas: a research laboratory, the development of ready-made commercial solutions, and training. In today's post we describe them in more detail, sum up the past year and share our goals for the future.


    Training


    When we launched the educational track, we wanted to create a program for knowledge exchange between young specialists and experts and, as already mentioned, to help form a Data Science community in Russia.

    This year we graduated our first cohort of students, and we are now running the program for the second intake.

                 2013                               2013/2014
    Students     18                                 25
    Experts      19                                 30+
    Program      Data Mining in Industry            Data Mining in Industry + individual courses in R, Machine Learning, Big Data
    Lectures     60 hours                           Data Mining in Industry: 70+ hours; courses: 80+ hours
    Companies    IBM, EMC, Siemens, fortiss, etc.   The same + Deloitte, Accenture, Odnoklassniki, etc.

    The curriculum has changed a lot, but we realized that the three elements underlying our educational philosophy will not change:

    • Communication with experts.
    • Practice. Students take part in Kaggle competitions and solve problems posed by experts from different fields (1, 2 and 3).
    • Proactivity. We encourage students to share knowledge with each other and to organize internal seminars on various topics, including ones not related to data analysis.


    Besides continuing the curriculum, in 2014 we will run even more educational initiatives:

    • Data Mining Sauna: over the Christmas holidays we invited students and experts to a private petting zoo near St. Petersburg to share ideas informally and discuss research (we will write more about this event soon).
    • We are now preparing a social network analysis hackathon in St. Petersburg.
    • In the coming year we would also like to organize a Data Mining conference.

    Projects


    Once the training track was up and running, project work became a logical continuation and our new direction, Data Mining Projects, because machine learning can solve many interesting problems in a wide range of fields.

    Our team is currently working on various commercial projects, including analysis of financial transaction traffic, anomaly detection in web service log files, prediction of whether users will return, and so on.
    At TechCrunch Moscow we outlined how we can help a company become data-driven.
    We will write about specific project cases and about our product, an antifraud system, in upcoming articles.

    Research


    Project work is good, but a data scientist's soul always asks for more: we want models to be more accurate, algorithms to run faster, and their range of applications to grow. That is how the third direction, Data Mining R&D, was created.

    We are currently working on various tasks related to Gradient Boosting Machines [1, 2, 3]. These algorithms are actively used by companies such as Yahoo!, Yandex (in its MatrixNet), Microsoft and others. Put simply, the main idea of the algorithm is to build a set of decision trees in such a way that with each new tree the combined output of the ensemble becomes more and more accurate. For example, as in this picture:

    Everything seems simple, but there is plenty of room for creativity: how can the same accuracy be reached with fewer trees? What happens if you build a "deep" ensemble? Or an ensemble of semi-"deep" contraptions?
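    The boosting idea described above can be sketched in a few lines. This is only a toy illustration, not our production code: it assumes squared loss, one-dimensional inputs and depth-1 trees (stumps), and the function names, the data and the learning rate are all made up for the example.

```python
# Toy gradient boosting for squared loss on 1-D inputs, built from depth-1 trees (stumps).
# All names and data here are illustrative assumptions.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error on the residuals."""
    best = None
    # The largest x is skipped as a threshold because it would leave the right leaf empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_trees=30, learning_rate=0.5):
    """Fit stumps one by one; each new stump is trained on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    predictions = [base] * len(xs)
    stumps, mse_history = [], []
    for _ in range(n_trees):
        # For squared loss the negative gradient is simply the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return (base, learning_rate, stumps), mse_history


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]  # toy target: y = x^2
model, mse_history = fit_gbm(xs, ys)
print("training MSE after 1, 10 and 30 trees:",
      mse_history[0], mse_history[9], mse_history[29])
```

    The list mse_history holds the training error after each added tree, so it shows directly how many trees a given accuracy costs, which is exactly the quantity the questions above are about.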

    The second important area of work is Data Fusion methods. The idea is to use data from different domains to solve a single problem: text, video, audio, graphs, sensor readings, and various combinations of these. If you simply run the same GBM algorithm on all the data at once, the distributions will be too different and the number of features will be unreasonably large. In general, the reasons why this does not work deserve a separate article.

    One example we encountered in this area was the task of estimating financial risk. This task is usually solved with quantitative information about exchange quotes: from the volatility of a company's stock price, the risk for the next year can be predicted fairly accurately. However, taking into account information from companies' annual financial statements can improve this accuracy.
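    For readers unfamiliar with the term, here is a minimal sketch of one common way to estimate volatility from a price series: the sample standard deviation of log returns, scaled to a year. The function name, the sample prices and the convention of 252 trading days per year are assumptions made for the illustration; the post does not say which estimator is used in practice.

```python
import math


def annualized_volatility(prices, periods_per_year=252):
    """Sample standard deviation of log returns, scaled to one year.

    periods_per_year=252 assumes daily prices and roughly 252 trading days a year.
    """
    if len(prices) < 3:
        raise ValueError("need at least three prices to estimate volatility")
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    mean = sum(log_returns) / len(log_returns)
    variance = sum((r - mean) ** 2 for r in log_returns) / (len(log_returns) - 1)
    return math.sqrt(variance * periods_per_year)


# Hypothetical daily closing prices, for illustration only.
calm = [100.0, 100.5, 100.2, 100.7, 100.4]
nervous = [100.0, 110.0, 100.0, 110.0, 100.0]
print(annualized_volatility(calm), annualized_volatility(nervous))
```

    A number like this becomes a single input feature of the risk model; the open question discussed next is how to combine it with information of a completely different nature.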

    The main question is how to do this most efficiently, so that all the information contained in the data is used. How should models built on different data subspaces be stitched together? Should only the models be stitched, or also some intermediate layers of the representation, similar to what is proposed at D-Wave:
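    Leaving the picture aside, the simplest possible form of "stitching only the models" is a weighted blend of two models' predictions (often called late fusion). The sketch below is neither our method nor the D-Wave proposal; it only shows the baseline that more elaborate schemes have to beat. The two prediction lists stand in for a model trained on price data and a model trained on annual reports, and all numbers are invented.

```python
def fit_blend_weight(pred_a, pred_b, targets):
    """Least-squares weight w for the blend w * pred_a + (1 - w) * pred_b, clipped to [0, 1]."""
    numerator = sum((a - b) * (y - b) for a, b, y in zip(pred_a, pred_b, targets))
    denominator = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    if denominator == 0:
        return 0.5  # identical predictions: every weight gives the same blend
    return min(1.0, max(0.0, numerator / denominator))


def blend(pred_a, pred_b, weight):
    return [weight * a + (1 - weight) * b for a, b in zip(pred_a, pred_b)]


def mse(predictions, targets):
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


# Invented validation data: true risk scores and the outputs of two hypothetical models.
targets = [0.10, 0.40, 0.35, 0.80]
price_model = [0.20, 0.30, 0.50, 0.60]   # e.g. trained on volatility features
report_model = [0.00, 0.50, 0.30, 0.90]  # e.g. trained on annual report text

weight = fit_blend_weight(price_model, report_model, targets)
blended = blend(price_model, report_model, weight)
print(weight, mse(price_model, targets), mse(report_model, targets), mse(blended, targets))
```

    On the data used to fit the weight, the blend is never worse than either model alone, but it treats both models as black boxes. Sharing intermediate representations is the harder and more interesting option.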

    Our research does not end there. For example, we are very interested in the following questions:

    • How to select significant features when there are very many of them: tens or hundreds of thousands?
    • How to search for anomalies in high-dimensional data?
    • How to run a GBM algorithm on a billion points? And on a trillion? This is really a general question for gradient methods where SGD and minibatches cannot be applied (the story with ICA is similar).


    Finally


    It was a year rich in events, good new people and interesting tasks. We hope that 2014 will bring many great ideas and even more energy to bring them to life and to write an article about each of them on Habr. In fact, there is already so much we want to tell you that we decided to run a small survey.

    What would you like to read on our blog?

    • The educational process: events and tasks (41.9%, 81 votes)
    • The training materials we use (61.1%, 118 votes)
    • Research in detail, with math and other hardcore material (56.4%, 109 votes)
    • Research without details: tutorials and manuals (49.2%, 95 votes)
    • Project ideas: where we apply machine learning (67.8%, 131 votes)
    • Processes in the team: what we are doing now and what problems we face (32.6%, 63 votes)
    • Machine learning applications and industry news (67.8%, 131 votes)
