MLBootCamp "Performance Assessment". Very simple and quick solution.
In this article I want to share my idea of solving the MLBootCamp "Performance Assessment" task from Mail.ru. The main advantage of this method is its simplicity and speed of script execution. And although he will not be able to compete exactly with the winners of the competition (congratulations!), But it can be useful in practice if a few tenths of a percent are not critical, or a starting point for further development. The script is written in R.
1. Download and data preparation
Briefly about the competition, here is the task:
“We suggest that you teach a computer to predict how many seconds it will take to multiply two matrices of size m×k and k×n on a given computer system, given how long this task took on other computer systems with different matrix sizes.”
It is worth noting right away that, strictly speaking, this is not quite the case: there is no computer system in the test sample that is absent from the training sample. The mean absolute percentage error (MAPE) is used as the metric.
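For reference, here is a minimal sketch of the metric in R (the names y and y.pred are illustrative):
# MAPE: mean absolute percentage error, in percent
mape <- function(y, y.pred) 100 * mean(abs(y - y.pred) / y)
mape(c(1, 2, 4), c(1.1, 1.8, 4.4))  # -> 10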
And now in a little more detail. In both samples each observation has 951 features. The first three are the dimensions of the multiplied matrices; the rest are parameters of the systems on which the measurements were performed. It turns out that if we set the matrix dimensions aside, the training sample contains only 92 unique configurations of the computing systems used, i.e. the remaining 948 parameters take only 92 distinct combinations. Replacing these 948 parameters with a single one, a configuration identifier, leaves just 4 attributes per observation. We prepare the data for further work by also adding the product of all matrix sizes. The data now contains the features m, k, n, m*k*n and conf (the computer identifier). The training sample additionally contains the target variable time, the time taken to perform the matrix multiplication.
library(quantreg)
library(dplyr)
# load the raw data
y <- read.csv('y_train.csv')[,1]
X.train <- read.csv('x_train.csv', stringsAsFactors = FALSE)
X.test <- read.csv('x_test.csv', stringsAsFactors = FALSE)
# build a table of unique configurations
X.all <- rbind(X.train, X.test)
X.uni <- unique(X.all[,4:951])
# create a data frame of unique configurations and a vector of their names
X.uni <- cbind(conf=paste0('conf', formatC(1:nrow(X.uni), 2, flag = '0')), X.uni)
confs <- unique(X.uni$conf)
# label the training sample with configuration names and drop the unneeded columns;
# convert the integer sizes to numeric and attach the target variable
cols <- c('m', 'k', 'n', 'conf')
X.train <- dplyr::inner_join(X.train, X.uni, by=colnames(X.uni)[2:ncol(X.uni)])[cols]
X.train[,1:3] <- lapply(X.train[,1:3], as.numeric) # int -> numeric
X.train$mkn <- X.train$m * X.train$k * X.train$n
X.train$time <- y
# the same for the test sample
X.test <- dplyr::inner_join(X.test, X.uni, by=colnames(X.uni)[2:ncol(X.uni)])[cols]
X.test[,1:3] <- lapply(X.test[,1:3], as.numeric) # int -> numeric
X.test$mkn <- X.test$m * X.test$k * X.test$n
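Before moving on, a couple of quick sanity checks (a sketch; the expected count of 92 configurations comes from the analysis above):
length(confs)                            # expect 92 unique configurations
summary(as.vector(table(X.train$conf)))  # observations per configuration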
2. Preliminary data analysis and selection of a working model
Now we can look at the data and see the expected dependence of the target variable time on m*k*n. For example, configuration 39 (left) is well approximated by linear regression. However, the linear model does not give a good result for every configuration; see, for example, configuration 11 (right):

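The exploratory plot can be reproduced with something like the following sketch (the configuration id here is illustrative):
# scatter of time against m*k*n for one configuration, with an OLS fit
X.conf <- subset(X.train, conf == 'conf39')
plot(X.conf$mkn, X.conf$time, xlab = 'm*k*n', ylab = 'time, s', main = 'conf39')
abline(lm(time ~ mkn, data = X.conf), col = 'red')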
Noise and outliers make a significant negative contribution to the model, so one option for dealing with them is to remove such observations. The disadvantage of this approach is that you have to decide how far "off" an observation must be before it is removed. Moreover, for some systems there is very little data, and throwing away even the noisy points is wasteful.
An alternative is to use models that are robust to outliers and noise. Examples of such models are quantile and robust regression.
In R they can be found in the quantreg and MASS packages, respectively. Each has its own settings and/or more advanced variations; here these methods were used with default parameters. This is what the plot looks like with all the regressions for the noisy configuration 11:

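For illustration, a sketch of fitting both alternatives with default parameters to one configuration and overlaying them on the scatter plot above (rq() is from quantreg, rlm() from MASS):
library(MASS)
# quantile (median) regression: tau = 0.5 by default
fit.rq <- rq(time ~ mkn, data = X.conf)
# robust regression via an M-estimator
fit.rlm <- rlm(time ~ mkn, data = X.conf)
abline(fit.rq, col = 'blue')
abline(fit.rlm, col = 'darkgreen')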
3. Training and testing the models
The spread of the target variable's values differs widely between systems, so a solution suggests itself: train a stack of models, one per configuration. Let the dependence in each model be slightly more complex: the time spent on matrix multiplication will depend not only on the product of the sizes but also on each size individually, i.e. time ~ m + k + n + mkn:
models <- list()
for (i in 1:length(confs)) {
  # select one configuration
  X.conf <- subset(X.train, conf==confs[i])
  X.conf$conf <- NULL
  # fit the model and add it to the stack
  fit <- rq(time ~ ., data=X.conf)
  models[[confs[i]]] <- fit
}
Now it is easy to compute the predictions by looping over the models and selecting from the dataset the rows that correspond to each individual configuration and its model.
y.test <- rep(1, nrow(X.test))
y.train <- rep(1, nrow(X.train))
# predictions on the training sample
for (i in 1:length(confs)) {
  X.conf <- subset(X.train, conf==confs[i])
  y.train[as.numeric(rownames(X.conf))] <- predict(models[[confs[i]]])
}
# predictions on the test sample
for (i in 1:length(confs)) {
  X.conf <- subset(X.test, conf==confs[i])
  y.test[as.numeric(rownames(X.conf))] <- predict(models[[confs[i]]], newdata = X.conf)
}
Owing to the imperfection of the source data, and hence of the fitted models, the predictions will contain values less than 1, which according to the problem statement should not occur. It turns out to be useful to take the predictions on the training set and replace all such unsuitable values. A simple and effective method is to replace every value below a certain constant, the cut-off, with that constant itself. The plot shows the metric (MAPE) as a function of the cut-off value. The horizontal line is the metric without any replacement; the vertical line marks the cut-off at which the metric reaches its minimum.

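The search itself can be sketched as follows, using the mape() helper defined above (the grid bounds are illustrative):
# replace every prediction below the cut-off with the cut-off itself
# and pick the value that minimises MAPE on the training set
cutoffs <- seq(0.1, 3, by = 0.01)
scores <- sapply(cutoffs, function(cf) mape(y, pmax(y.train, cf)))
cutoff <- cutoffs[which.min(scores)]
y.test <- pmax(y.test, cutoff)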
Such a replacement slightly improves the metric. As practice showed, the optimal cut-off level is approximately the same on the training and test samples.
The obvious drawback of the code above is that the models were built without cross-validation, so neither the metric obtained on the training set nor the score on the public leaderboard can be fully trusted. This shortcoming is easy to remove with a small amount of additional code.
There is, however, a clear advantage in speed (the whole script ran in about half a second for the author), which makes it possible to quickly evaluate many different combinations of models. For comparison, models based on gradient boosting, random forests or neural networks took at least several tens of minutes to build.
An algorithm that chooses between quantile and robust regression for each configuration achieved MAPE = 5.23% on the public leaderboard. The final competition metric is not available, but based on cross-validation a MAPE of about 5.4% is expected.
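That selection can be sketched like this, picking between rq() and rlm() by in-sample MAPE for each configuration (a cross-validated comparison would be sounder; this is only an illustration):
library(MASS)
models <- list()
for (cf in as.character(confs)) {
  X.conf <- subset(X.train, conf == cf)
  X.conf$conf <- NULL
  fit.q <- rq(time ~ ., data = X.conf)   # quantile regression
  fit.r <- rlm(time ~ ., data = X.conf)  # robust regression
  # keep whichever model fits this configuration's training data better
  if (mape(X.conf$time, predict(fit.q)) <= mape(X.conf$time, predict(fit.r))) {
    models[[cf]] <- fit.q
  } else {
    models[[cf]] <- fit.r
  }
}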
4. Conclusion
Thus, training the stack of models and predicting the target variable is very fast, under a second, while good accuracy is achieved. The described method can be improved by paying closer attention to individual configurations: selecting the quantile for each regression, tuning the parameters of the robust methods, removing obvious outliers, choosing a separate cut-off per configuration, and so on, which as a rule barely affects script execution time.
In the author's humble opinion, using quantile (or another outlier-resistant) regression to predict the execution time of computational experiments is practical, accurate and much less computationally expensive than the approach described in the article about a similar competition, where linear regression with outlier removal (reinventing the wheel?) and random forests were used.