Predicting the results of the 2018 FIFA World Cup with the random forest algorithm

A sample regression tree for the 2002–2014 World Cup data. The number of goals is used as the response variable.

German machine learning specialists compared three different models for predicting the results of the upcoming 2018 World Cup:
- Poisson regression models;
- random forests methods;
- ranking methods (based on team strength estimated from matches played in 2010-2018 and from bookmakers' odds).
The first two are based on covariate information, while the last relies directly on the estimated current strength of the teams. The researchers concluded that, within this comparison, ranking methods and random forests gave the most accurate forecasts on the training data. A combined approach, in which the team rankings are fed into the random forest as additional covariates, improved the predictive power of the system significantly.
The researchers chose this combination of methods as the final model. Based on its estimates, all matches of the 2018 World Cup were simulated many times. This yields the probabilities for each match, the probability of each team reaching each subsequent stage of the tournament, and the most likely course of the tournament.
The authors note that several successful models for predicting the results of World Cups and European Championships had already been published in the scientific literature. The developers of those models also applied them to the 2018 World Cup.
For example, the model of Zeileis, Leitner and Hornik (2018) gives the highest probabilities of winning the title to Brazil (16.6%), Germany (15.8%) and Spain (12.5%).
The model built by experts at the Swiss bank UBS (Audran, Bolliger, Kolb, Mariscal, Pilloud, 2018) named the most likely winners as Germany (24.0%), Brazil (19.8%) and Spain (16.1%). This statistical model took four factors as input, and the probabilities were computed from 10,000 Monte Carlo simulations.
The random forest method is a fundamentally different approach. The algorithm builds an ensemble of decision trees, combining bagging (training each tree on a bootstrap sample of the data) with the random subspace method (letting each tree see only a random subset of the covariates). It can be applied to classification, regression and clustering problems, which makes it well suited to predicting the matches of the 2018 World Cup. The main idea is that each individual tree may be a rather weak predictor on its own, but averaging over a large number of them gives a good result.
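The idea of combining bagging with random subspaces can be illustrated with a minimal sketch. It is not the researchers' implementation: to stay short it uses depth-1 regression trees ("stumps") instead of full trees, and the covariates (rating difference, host flag) and goal counts are made-up toy values. All function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature_ids):
    """Fit a depth-1 regression tree, searching only the given features."""
    best = None  # (squared error, feature, threshold, left mean, right mean)
    for f in feature_ids:
        for thr in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if not left or not right:
                continue  # not a real split
            lm, rm = mean(left), mean(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # no valid split: predict the sample mean everywhere
        m = mean(targets)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_forest(rows, targets, n_trees=200, n_features=1, seed=0):
    """Train many stumps, each on a bootstrap sample and a feature subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bagging: sample rows with replacement
        feats = rng.sample(range(p), n_features)     # random subspace: subset of covariates
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Average the predictions of all stumps in the ensemble."""
    return mean(lm if row[f] <= thr else rm for f, thr, lm, rm in forest)

# Toy data (invented for illustration): (rating difference, is host) -> goals scored
X = [(-30, 0), (-10, 1), (0, 0), (15, 0), (40, 1), (60, 0)]
goals = [0, 1, 1, 2, 3, 3]

forest = fit_forest(X, goals)
strong = predict(forest, (50, 0))   # clearly stronger than the opponent
weak = predict(forest, (-25, 0))    # clearly weaker than the opponent
print(f"expected goals: strong side {strong:.2f}, weak side {weak:.2f}")
```

Each stump alone is a crude predictor that knows a single threshold on a single covariate, yet the averaged ensemble produces a graded estimate of expected goals.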
The German specialists carefully analyzed all the proposed models and their final predictive power. They then selected the specific predictors that maximize the predictive power of the model. After this preparatory work, they applied the resulting model (random forests + ranking) to the 2018 World Cup data.
For each match, the model gives the expected number of goals scored by each team. Based on this information, the outcomes of all 48 group-stage matches were simulated. The final standings in each group were determined in strict accordance with FIFA regulations. The knockout-stage matches were then simulated in the same way. To account for extra time, the number of goals produced by the program for each team was multiplied by 1.33. If the score was still level after extra time, the program simulated a penalty shootout with a coin toss.
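The knockout-match procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: goals are assumed to be independent Poisson draws around each team's expected value, and the factor 1.33 quoted above is applied by redrawing the score with scaled expected goals, which is only one possible reading of the description. The function names are hypothetical.

```python
import math
import random

EXTRA_TIME_FACTOR = 1.33  # figure quoted in the text; how it is applied here is an assumption

def poisson(rng, lam):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_knockout(rng, lam_a, lam_b):
    """Return 'A' or 'B', the winner of one simulated knockout match."""
    goals_a, goals_b = poisson(rng, lam_a), poisson(rng, lam_b)
    if goals_a == goals_b:
        # Level after regular time: redraw with scaled expected goals (assumed reading)
        goals_a = poisson(rng, lam_a * EXTRA_TIME_FACTOR)
        goals_b = poisson(rng, lam_b * EXTRA_TIME_FACTOR)
    if goals_a == goals_b:
        # Still level: the penalty shootout is a fair coin toss
        return rng.choice("AB")
    return "A" if goals_a > goals_b else "B"

rng = random.Random(1)
n = 5000
# Hypothetical expected goals: 2.0 for team A, 0.5 for team B
favourite_wins = sum(simulate_knockout(rng, 2.0, 0.5) == "A" for _ in range(n))
even_wins = sum(simulate_knockout(rng, 1.2, 1.2) == "A" for _ in range(n))
print(f"favourite wins {favourite_wins / n:.1%}, even match {even_wins / n:.1%}")
```

Because a knockout match always needs a winner, the function never returns a draw; in the group stage the first pair of draws alone would decide the result.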
This procedure was repeated for 100,000 simulations of the entire tournament. From these runs, the probability of each team advancing from its group and of winning the tournament was calculated.
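Turning repeated simulations into probabilities amounts to counting how often each outcome occurs. The sketch below shows this for a hypothetical four-team knockout bracket. The team names, the strength ratings and the rule that a team wins with probability proportional to its rating are all assumptions made for illustration; the actual study simulated goals for the full 32-team tournament.

```python
import random
from collections import Counter

# Hypothetical strength ratings, invented for this example
STRENGTH = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

def play(rng, x, y):
    """Pick the winner of x vs y with probability proportional to rating."""
    p_x = STRENGTH[x] / (STRENGTH[x] + STRENGTH[y])
    return x if rng.random() < p_x else y

def simulate_tournament(rng):
    """Play two semi-finals and a final; return the champion."""
    finalist_1 = play(rng, "A", "D")
    finalist_2 = play(rng, "B", "C")
    return play(rng, finalist_1, finalist_2)

def title_probabilities(n_sims=100_000, seed=0):
    """Estimate each team's title probability as its share of simulated wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in STRENGTH}

probs = title_probabilities()
for team, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{team}: {p:.1%}")
```

The same counting works for any intermediate event, such as reaching the quarter-finals: record the event in every run and divide by the number of runs.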
According to the results of the group stage, the program gave the following picture:

The Russian team has a fairly good chance of reaching the round of 16 (50.4%), but there it would most likely face Spain, which would win with a probability of 87%. The table shows the most probable knockout bracket across the 100,000 simulations.

Russia's overall chances of reaching the quarter-finals are 10.5%, the semi-finals 2.4%, and the final 0.4%.

For the tournament winner, this model produced a result that differs from those of the earlier models. It gave Spain the highest probability of winning (17.8%), followed by Germany, Brazil, France and Belgium.
The paper was published on June 8, 2018 on the preprint server arXiv.org (arXiv:1806.03208v3).