Heritage Health Prize Data Mining Competition Ended
The largest data mining competition since the Netflix Prize has come to an end. Although the official top ten and the winner will be announced only in two months, the results can already be summed up.
The goal was to predict patients' hospitalization over the next year from data on the previous two years of treatment. According to the sponsor, this will allow more attention to be paid to the patients who need it most, thereby saving part of the $30 billion spent annually in the USA on hospitalization.
The $3,000,000 prize announced by the organizers was unattainable because of the accuracy threshold of 0.4 RMSLE (lower is better; the best result achieved was 0.46; the difference between first and hundredth place was 0.008; RMSLE is the root mean squared logarithmic error) and because of the data provided: there was simply not enough information in it to reach that level of accuracy. So in practice the fight was for the $500,000 that went to the best team, for the prize fund of the intermediate finishes, and for invaluable experience.
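To make the metric concrete, here is a minimal sketch of an RMSLE calculation. It assumes the usual convention of adding 1 inside the logarithm so that zero values are well defined; the function name `rmsle` and the sample numbers are made up for illustration and are not taken from the competition data.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error; lower is better."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        # log1p(x) = log(x + 1), so a value of zero is well defined
        diff = math.log1p(p) - math.log1p(a)
        total += diff * diff
    return math.sqrt(total / len(actual))

# Hypothetical days in hospital for five patients (made-up numbers)
actual_days = [0, 0, 1, 3, 0]
predicted_days = [0.1, 0.2, 0.5, 1.0, 0.1]
score = rmsle(predicted_days, actual_days)
```

Because the error is taken on logarithms, being off by one day matters much more for a patient with zero days than for one with ten.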
Despite the complexity of the task, more than one and a half thousand entrants wanted to try their hand. It is said that even two Nobel laureates took part in the competition, but who they were and how well they did is unknown. Since there is no Nobel Prize in mathematics or programming, that leaves medicine (perhaps as a consultant) or economics.
The competition lasted two years and had three intermediate finishes, each with two prizes. Under the terms of the competition, the winners published descriptions of their methods. This did not help their rivals much, though, because the main algorithms are well known: decision trees, Random Forest, Gradient Boosting, gradient descent, Ridge Regression (Tikhonov regularization), and their modifications and combinations. The differences lay in the subtleties of implementation, use, and combination, and in small variations of the algorithms themselves. But there were so many details that it was unclear which of them actually produced the result. In other words, what the winners do is clear; what is not clear is why they do exactly that, and why it works.
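As an illustration of what "combination" can mean in the simplest case, here is a sketch of blending two sets of predictions with a single least-squares weight. This is not the winners' actual procedure; the helper names and all numbers are hypothetical.

```python
import math

def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def best_blend_weight(pred_a, pred_b, target):
    """Least-squares weight w for the blend w * a + (1 - w) * b."""
    num = sum((a - b) * (t - b) for a, b, t in zip(pred_a, pred_b, target))
    den = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    # identical predictions: any weight gives the same blend
    return num / den if den else 0.5

def blend(pred_a, pred_b, w):
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

# Made-up targets and two made-up model outputs
target = [0.0, 0.7, 1.1, 0.0, 1.4, 0.3]
model_a = [0.2, 0.5, 0.9, 0.3, 1.0, 0.4]
model_b = [0.1, 0.9, 1.4, 0.0, 1.1, 0.1]

w = best_blend_weight(model_a, model_b, target)
blended = blend(model_a, model_b, w)
```

On the data used to fit the weight, the blend can never be worse than either model alone, since w = 0 and w = 1 are both allowed. Whether the gain carries over to unseen data is a separate question.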
The winners of the intermediate finishes were as follows:
- First finish: 1. Market Makers 2. Willem Mestrom
- Second finish: 1. Market Makers 2. Edward & Willem
- Third finish: 1. Edward & Willem 2. Crescendo
Oddities began before the third intermediate finish: all three teams almost stopped using the once-a-day check of a model against 30% of the test data, and the leader changed without a fight. The reason was that they were merging into one team, and a merged team could not exceed the limit on the number of models submitted since the start of the competition. Miraculously, they stayed within it.
On the day of the finish, preliminary results for 30% of the test data looked like this.
But the most interesting part was the results on the hidden part, published a few days later, which reflect the true performance of the algorithms.
Summary table for the first 50 places:
The main enemy turned out to be the effect seen most clearly in the Almata team, which took first place on the public leaderboard (the open rating): overfitting. Guided by their leaderboard scores, they extracted all the useful information from the data on which that score was computed, and along with it picked up harmful information specific to that particular set. As a result, the score on unseen data gets worse (or at least does not improve). The outcome was a drop from 1st to 19th place.
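The effect can be reproduced in a toy simulation. The sketch below is a deliberately simplified assumption, not a model of any team's real workflow: it generates many equally good "submissions" (truth plus random noise), scores each on a public 30% and a hidden 70%, and then picks the one with the best public score. All names and parameters are hypothetical.

```python
import math
import random

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def simulate(n_rows=1000, n_submissions=200, public_share=0.3, seed=0):
    """Return a (public_score, private_score) pair for each submission."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    n_public = int(n_rows * public_share)
    results = []
    for _ in range(n_submissions):
        # every submission is equally good: truth plus independent noise
        pred = [t + rng.gauss(0.0, 0.5) for t in truth]
        public = rmse(pred[:n_public], truth[:n_public])
        private = rmse(pred[n_public:], truth[n_public:])
        results.append((public, private))
    return results

results = simulate()
# tuples compare by public score first, so this is the leaderboard leader
best_public, its_private = min(results)
mean_public = sum(p for p, _ in results) / len(results)
mean_private = sum(q for _, q in results) / len(results)
```

Comparing `best_public` with `mean_public`, and `its_private` with `mean_private`, shows the selection effect: the leader's public score is better than average purely by luck, while its hidden score typically sits close to the average, so the apparent advantage does not carry over.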
The winner and the scores of the top 10 participants will be officially announced in early June at the Health Datapalooza IV conference. However, there is almost no doubt about the victory of POWERDOT, a team formed by the merger of the intermediate-finish winners. With the three best results at their disposal, they were able to learn implicitly from the hidden part of the rating, after which it became impossible to compete with them.
But there was something to learn. For me it meant a move from 261st place after the last intermediate finish to 27th in the final standings. It could have been higher, since my understanding of the processes came too late, but next time will be more interesting.
A description of the methods of the intermediate-finish winners (the winning algorithm will probably be a combination of them) can be read here (a lot of math and maneuvers that I still don't understand).
UPD 2013.07.15. As predicted, POWERDOT won with a score of 0.461197. After rule violators who had used multiple accounts were removed, the final table changed. The organizers also promise a second part of the competition, with invitations based on the results of the first.