R Code Acceleration Strategies, Part 2

Original author: Selva Prabhakaran
A for loop in R can be very slow when used in its plain, unoptimized form, especially on large data sets. There are a number of ways to make such code faster, and you may be surprised by how much.

This article describes several approaches, including simple changes in logic, parallel processing, and Rcpp, that increase speed by several orders of magnitude, making it feasible to process 100 million rows of data or more.

Let's try to speed up code that uses a for loop and a conditional statement (if-else) to create a new column appended to a data set (data frame, df). The code below creates the initial dataset.
# Create the dataset
col1 <- runif(12^5, 0, 2)
col2 <- rnorm(12^5, 0, 2)
col3 <- rpois(12^5, 3)
col4 <- rchisq(12^5, 2)
df <- data.frame(col1, col2, col3, col4)
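
For reference, here is a minimal sketch of the unoptimized baseline that the strategies below are compared against: a plain for loop with if-else, as discussed in Part 1 (the exact baseline code there may differ slightly).
# Baseline: plain for loop with if-else (slow on large data)
system.time({
  output <- character(nrow(df))
  for (i in 1:nrow(df)) {
    if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) {
      output[i] <- "greater_than_4"
    } else {
      output[i] <- "lesser_than_4"
    }
  }
  df$output <- output
})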

In the first part: vectorization, running only the true conditions, ifelse.
In this part: which(), the apply family, byte compilation, Rcpp, data.table, and the results.
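
Since the sections below repeatedly compare against ifelse(), here is a minimal sketch of the vectorized ifelse() approach covered in Part 1 (the exact code there may differ slightly):
# ifelse() version from Part 1 (vectorized)
system.time({
  output <- ifelse((df$col1 + df$col2 + df$col3 + df$col4) > 4,
                   "greater_than_4", "lesser_than_4")
  df$output <- output
})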


Using which()


Using which() to select the rows, you can get to about a third of the speed of Rcpp.
# Thanks to Gabe Becker
system.time({
  want = which(rowSums(df) > 4)
  output = rep("less than 4", times = nrow(df))
  output[want] = "greater than 4"
}) 

# number of rows = 3 million (approximately)
   user  system elapsed 
  0.396   0.074   0.481 


Use the apply function family instead of for loops


We use apply() to implement the same logic and compare it with the vectorized for loop. It improves on the raw for loop as the data grows, but it is still slower than the ifelse() version and the version where the condition check was done outside the loop. It can be useful, but may require some ingenuity for complex business logic.
# apply family
system.time({
  myfunc <- function(x) {
    if ((x['col1'] + x['col2'] + x['col3'] + x['col4']) > 4) {
      "greater_than_4"
    } else {
      "lesser_than_4"
    }
  }
  output <- apply(df[, c(1:4)], 1, FUN=myfunc)  # apply 'myfunc' to each row
  df$output <- output
})


Using apply and the for loop in R

Use byte-compiled functions created with cmpfun() from the compiler package instead of the function itself


This is probably not the best example to illustrate the effectiveness of byte compilation, since the resulting time is slightly higher than for the plain version. However, for more complex functions byte compilation has proven effective, so it is worth trying on occasion.
# byte-compile the function
library(compiler)
myFuncCmp <- cmpfun(myfunc)
system.time({
  output <- apply(df[, c(1:4)], 1, FUN=myFuncCmp)
})


Apply, for loop and byte code compilation

Use Rcpp


Let's take this to a new level. So far we have increased speed and performance through various strategies and found that ifelse() was the most effective. What if we add another zero to the row count? Below we implement the same logic with Rcpp on a data set with 100 million rows and compare the speed of Rcpp and ifelse().
library(Rcpp)
sourceCpp("MyFunc.cpp")
system.time(output <- myFunc(df))  # the Rcpp function is defined below

Below is the same logic implemented in C++ using the Rcpp package. Save the code below as "MyFunc.cpp" in the working directory of your R session (otherwise you will have to call sourceCpp() with the full path). Note that the comment // [[Rcpp::export]] is required and must be placed immediately before the function you want to call from R.
// Source for MyFunc.cpp
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterVector myFunc(DataFrame x) {
  NumericVector col1 = as<NumericVector>(x["col1"]);
  NumericVector col2 = as<NumericVector>(x["col2"]);
  NumericVector col3 = as<NumericVector>(x["col3"]);
  NumericVector col4 = as<NumericVector>(x["col4"]);
  int n = col1.size();
  CharacterVector out(n);
  for (int i = 0; i < n; i++) {
    if ((col1[i] + col2[i] + col3[i] + col4[i]) > 4) {
      out[i] = "greater_than_4";
    } else {
      out[i] = "lesser_than_4";
    }
  }
  return out;
}


Performance: Rcpp and ifelse

Use parallel processing if you have a multi-core computer


Parallel processing with foreach and doSNOW:
# parallel processing
library(foreach)
library(doSNOW)
cl <- makeCluster(4, type="SOCK")  # for a 4-core machine
registerDoSNOW(cl)
condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
# parallelization combined with vectorization
system.time({
  output <- foreach(i = 1:nrow(df), .combine=c) %dopar% {
    if (condition[i]) {
      return("greater_than_4")
    } else {
      return("lesser_than_4")
    }
  }
})
df$output <- output
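
One detail not shown above: when the parallel work is finished, the cluster should be shut down so the worker processes are released. A minimal addition for the SOCK cluster created above:
# release the worker processes when done
stopCluster(cl)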


Delete variables and clear memory as early as possible


Remove unnecessary objects in your code with rm() as soon as possible, especially before long loops. Sometimes calling gc() at the end of each loop iteration can also help. A minimal illustration of the idea follows; the object names in it are purely hypothetical.
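
# hypothetical large intermediate object
temp_matrix <- matrix(rnorm(1e6), ncol = 100)
col_totals <- colSums(temp_matrix)   # keep only the summary we actually need
rm(temp_matrix)                      # drop the large object as early as possible
gc()                                 # optionally ask R to return the freed memory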

Use data structures that take up less memory


data.table is a great example because it does not overload memory, which speeds up operations such as merging data.
library(data.table)
dt <- data.table(df)  # create a data.table
system.time({
  for (i in 1:nrow(dt)) {
    if ((dt[i, col1] + dt[i, col2] + dt[i, col3] + dt[i, col4]) > 4) {
      dt[i, col5 := "greater_than_4"]  # assign value in the 5th column
    } else {
      dt[i, col5 := "lesser_than_4"]  # assign value in the 5th column
    }
  }
})
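
The row-by-row loop above is written that way only to mirror the earlier examples; with data.table the same column would normally be added in a single vectorized assignment with the := operator, roughly as in this sketch:
# idiomatic data.table: build the column in one vectorized step, assigned by reference
dt[, col5 := ifelse((col1 + col2 + col3 + col4) > 4,
                    "greater_than_4", "lesser_than_4")]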


Dataframe and data.table

Speed: Results


Method: speedup, rows in df / elapsed time = rows per second
Original: 1X, 120,000 / 140.15 = 856.2255 rows per second (normalized to 1)
Vectorized: 738X, 120,000 / 0.19 = 631,578.9 rows per second
Only true conditions: 1002X, 120,000 / 0.14 = 857,142.9 rows per second
ifelse: 1752X, 1,200,000 / 0.78 = 1,500,000 rows per second
which: 8806X, 2,985,984 / 0.396 = 7,540,364 rows per second
Rcpp: 13476X, 1,200,000 / 0.09 = 11,538,462 rows per second

The numbers above are approximate and based on random runs. There are no results for data.table(), byte-code compilation, or parallelization, because the outcome varies greatly from case to case depending on how they are used.
