How to speed up working with APIs in R using parallel computing, with the Yandex.Direct API as an example (Part 1)

    The R language is today one of the most powerful and versatile tools for working with data, but as with almost anything, there is a fly in the ointment: by default, R is single-threaded.


    Most likely this will not bother you for quite a while, and you may never even think about it. But if, for example, you face the task of collecting data from a large number of advertising accounts via an API such as Yandex.Direct, multithreading can significantly reduce data collection time, by at least a factor of two or three.




    The topic of multithreading in R is not new and has been raised on Habr repeatedly here, here, and here, but the most recent of those publications dates from 2013, and, as they say, everything new is well-forgotten old. Besides, those articles discussed multithreading for fitting models and training neural networks, while here we will talk about using parallelism to work with an API. I would like to take this opportunity to thank the authors of the articles above: their publications helped me a lot in writing this one.





    The second part of this article, which covers more modern implementations of multithreading in R, is available here.


    What is multithreading


    Single-threaded (sequential) computing is a mode in which all actions (tasks) are performed one after another; the total duration is therefore equal to the sum of the durations of all the operations.


    Multithreaded (parallel) computing is a mode in which the actions (tasks) are performed in parallel, i.e. simultaneously, so the total execution time is no longer equal to the sum of the individual durations.


    To make this easier to grasp, consider the following table:


    Time unit:  1   2   3   4   5   6   7   8   9   10  11  12  13
    Sequential: t1  t1  t1  t2  t3  t3  t3  t3  t3  t4  t4  t4  t4
    Thread 1:   t1  t1  t1
    Thread 2:   t2
    Thread 3:   t3  t3  t3  t3  t3
    Thread 4:   t4  t4  t4  t4


    The first row of the table shows conditional time units; it does not matter here whether they are seconds, minutes, or any other intervals.


    In this example we need to perform 4 operations, each with a different duration. In single-threaded mode all 4 operations run sequentially, one after another, so the total execution time is t1 + t2 + t3 + t4 = 3 + 1 + 5 + 4 = 13.


    In multithreaded mode all 4 tasks run in parallel, i.e. there is no need to wait for the previous task to finish before starting the next one. If we launch our tasks in 4 threads, the total computation time equals the duration of the longest task, in our case t3, which takes 5 time units; accordingly, executing all 4 operations also takes 5 time units.
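
    To see this effect on your own machine, here is a minimal sketch using the base parallel package, with Sys.sleep standing in for the four tasks (the durations are the ones from the table above):

    library(parallel)
    
    # durations of the four tasks, in seconds
    durations <- c(t1 = 3, t2 = 1, t3 = 5, t4 = 4)
    
    # sequential: elapsed time is about 3 + 1 + 5 + 4 = 13 seconds
    system.time(lapply(durations, Sys.sleep))
    
    # parallel on 4 workers: elapsed time is about max(durations) = 5 seconds
    cl <- makeCluster(4)
    system.time(parLapply(cl, durations, Sys.sleep))
    stopCluster(cl)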


    What packages will we use


    For computing in multithreaded mode, we will use the foreach, doSNOW, and doParallel packages.


    The foreach package provides the foreach construct, which is essentially an advanced for loop.
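
    As a quick illustration (a toy sketch, not yet related to the API), foreach returns a value assembled from the iterations, unlike a regular for loop:

    library(foreach)
    
    # squares of 1..3, combined into a vector with c()
    foreach(i = 1:3, .combine = c) %do% i^2
    #> [1] 1 4 9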


    The doSNOW and doParallel packages are essentially twins: they let you create virtual clusters and run parallel computations on them.


    At the end of the article we will use the rbenchmark package to measure and compare the duration of data collection from the Yandex.Direct API with each of the approaches described below.
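
    For reference, a minimal sketch of how rbenchmark is used (on a toy expression, not the API):

    library(rbenchmark)
    
    # run two expressions 100 times each and compare elapsed time
    benchmark(vectorized = sqrt(1:1e5),
              looped     = for (i in 1:1e5) sqrt(i),
              replications = 100,
              columns = c("test", "replications", "elapsed"))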


    To work with the Yandex.Direct API we will use the ryandexdirect package; in this article it serves only as an example, and you can learn more about its capabilities and functions from the official documentation.


    Code to install all the necessary packages:


    install.packages("foreach")
    install.packages("doSNOW")
    install.packages("doParallel")
    install.packages("rbenchmark")
    install.packages("ryandexdirect")

    Task


    We need to write code that requests the list of keywords from any number of Yandex.Direct advertising accounts. The results must be collected into a single data frame with an additional field containing the login of the advertising account each keyword belongs to.


    Our goal is to write code that performs this operation as quickly as possible for any number of advertising accounts.


    Authorization in Yandex.Direct


    To work with the API of the Yandex.Direct advertising platform, you first need to authorize under each account from which you plan to request the list of keywords.


    All the code in this article shows how to work with regular Yandex.Direct advertising accounts. If you work under an agency account, you also need to use the AgencyAccount argument, passing it the agency account login. You can learn more about working with Yandex.Direct agency accounts via the ryandexdirect package here.


    For authorization, run the yadirAuth function from the ryandexdirect package; repeat the code below for each account from which you will request keywords and their parameters.


    ryandexdirect::yadirAuth(Login = "your Yandex advertising account login")

    Authorization in Yandex.Direct through the ryandexdirect package is completely safe, even though it goes through a third-party site. I have already covered the safety of using it in detail in the article "How safe is it to use R packages for working with the API of advertising systems".


    After authorization, a login.yadirAuth.RData file will be created in your working directory for each account, storing that account's credentials. The file name starts with the login specified in the Login argument. If you want to save these files to some other folder rather than the current working directory, use the TokenPath argument; in that case, when requesting keywords with yadirGetKeyWords, you must also pass the TokenPath argument with the path to the folder where you saved the credential files.
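
    For example, a sketch of authorizing all four accounts from the task and keeping the credential files in a separate folder (the folder name "tokens" here is hypothetical):

    logins <- c("login_1", "login_2", "login_3", "login_4")
    
    for (login in logins) {
      ryandexdirect::yadirAuth(Login = login, TokenPath = "tokens")
    }
    
    # later requests must point to the same folder:
    # ryandexdirect::yadirGetKeyWords(Login = "login_1", TokenPath = "tokens")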


    Solution in a single-threaded, sequential mode using a for loop


    The easiest way to collect data from several accounts at once is a for loop: simple, but not the most efficient, since one of the principles of R development is to avoid loops in your code.


    Below is sample code that collects data from 4 accounts using a for loop; in fact, you can use this example for any number of ad accounts.


    Code 1: Processing 4 accounts with a regular for loop
    library(ryandexdirect)
    # vector of logins
    logins <- c("login_1", "login_2", "login_3", "login_4")
    # resulting data frame
    res1 <- data.frame()
    # data collection loop
    for (login in logins) {  
      temp <- yadirGetKeyWords(Login = login)
      temp$login <- login
      res1 <- rbind(res1, temp)
    }
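
    To reproduce the timing below, you can wrap the loop in system.time; a minimal sketch:

    system.time({
      for (login in logins) {
        temp <- yadirGetKeyWords(Login = login)
        temp$login <- login
        res1 <- rbind(res1, temp)
      }
    })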

    Measuring runtime using the system.time function showed the following result:


    Run time:
    user: 178.83
    system: 0.63
    elapsed: 320.39


    Collecting keywords from 4 accounts took 320 seconds. Meanwhile, the informational messages that yadirGetKeyWords prints while running showed that the largest account, which returned 5,970 keywords, took 142 seconds to process.


    Solution using multithreading in R


    As I wrote above, for multithreading we will use the doSNOW and doParallel packages.


    Note that almost every API has its limits, and the Yandex.Direct API is no exception. The Yandex.Direct API reference says:


    No more than five simultaneous API requests on behalf of one user are allowed.

    So, although our example creates 4 threads, with Yandex.Direct you can create up to 5, even if all requests are sent under the same user. That said, it is most efficient to use one thread per processor core; you can determine the number of physical cores with parallel::detectCores(logical = FALSE) and the number of logical cores with parallel::detectCores(logical = TRUE). You can read more about physical and logical cores on Wikipedia.
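
    As a small sketch of choosing a thread count under both constraints (the cap of 5 is specific to Yandex.Direct):

    library(parallel)
    
    physical_cores <- detectCores(logical = FALSE)
    # use one thread per physical core, but never exceed the API limit of 5
    n_threads <- min(5, physical_cores)
    n_threads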


    Besides the limit on concurrent requests, there is a daily limit on the number of points for accessing the Yandex.Direct API. It can differ between accounts, and each request consumes a different number of points depending on the operation performed. For example, requesting a list of keywords costs 15 points per completed request plus 3 points per 2,000 keywords; you can read about how points are charged in the official help. Information about points spent and available, as well as the daily limit, is also shown in the informational messages that yadirGetKeyWords prints to the console:


    Number of API points spent when executing the request: 60
    Available balance of daily limit API points: 993530
    Daily limit of API points: 996000

    Let's look at doSNOW and doParallel in turn.


    The doSNOW package and the peculiarities of multithreaded mode


    Let's rewrite the same operation for multithreaded computation: we will create 4 threads, and instead of a for loop we will use the foreach construct.


    Code 2: Parallel computing with the doSNOW package
    library(foreach)
    library(doSNOW)
    # vector of logins
    logins <- c("login_1", "login_2", "login_3", "login_4")
    cl <- makeCluster(4)
    registerDoSNOW(cl)
    res2 <- foreach(login = logins,              # loop variable
                    .combine = 'rbind',          # function for combining the results of all iterations
                    .packages = "ryandexdirect", # packages to load on the workers
                    .inorder = F) %dopar% {
                      cbind(yadirGetKeyWords(Login = login), login)
                    }
    stopCluster(cl)

    This time, measuring execution time with the system.time function showed the following result:


    Run time:
    user: 0.17
    system: 0.08
    elapsed: 151.47


    We got the same result, but collecting keywords from 4 Yandex.Direct accounts now took 151 seconds, i.e. twice as fast. It was no accident that in the previous example I mentioned how long the largest account took to load (142 seconds): here the total time is almost identical to the processing time of that largest account. The reason is that with foreach we launched data collection in 4 threads at once, collecting from all 4 accounts simultaneously, so the total run time equals the processing time of the largest account.


    A short explanation of code 2: the makeCluster function sets the number of threads; here we created a cluster of 4 processor cores. As I wrote earlier, when working with the Yandex.Direct API you can create 5 threads: no matter whether you need to process 5, 15, 100 or more accounts, you can send at most 5 simultaneous requests to the API.
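
    So, when the number of accounts varies, a reasonable sketch is to size the cluster from the data rather than hard-coding 4:

    # no more workers than accounts, and no more than 5 simultaneous API requests
    cl <- makeCluster(min(5, length(logins)))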


    Next, the registerDoSNOW function registers the created cluster as the parallel backend.


    After that we use the foreach construct; as I said earlier, it is an advanced for loop. The first argument sets the loop variable: in the example above I called it login, and on each iteration it takes the next element of the logins vector, just as we would get in a for loop by writing for (login in logins).


    Next, the .combine argument specifies the function used to combine the results obtained on each iteration; the most common options are:


    • rbind - bind the resulting tables by rows, one under another;
    • cbind - bind the resulting tables by columns;
    • "+" - sum the results obtained on each iteration.

    You can also use any other function, even one you wrote yourself.
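
    For instance, a toy sketch of a self-written combiner that binds rows and drops duplicates:

    library(foreach)
    
    # bind the iteration results by rows, then keep only unique rows
    rbind_unique <- function(...) unique(rbind(...))
    
    foreach(i = c(1, 2, 2), .combine = rbind_unique) %do% data.frame(x = i)
    #>   x
    #> 1 1
    #> 2 2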


    The .inorder = F argument speeds things up a little more when the order in which the results are combined is not fundamentally important; in our case the order does not matter.


    Next comes the %dopar% operator, which runs the loop in parallel mode; if you use the %do% operator instead, the iterations execute sequentially, just like a normal for loop.


    The stopCluster function stops the cluster.


    Multithreading, or rather the foreach construct in multithreaded mode, has some peculiarities: each parallel process effectively starts in a new, clean R session. Therefore, to use self-written functions and objects defined outside the foreach construct itself, you need to export them with the .export argument. This argument takes a character vector of the names of the objects you will use inside foreach.


    Likewise, in parallel mode foreach does not by default see packages attached earlier, so they also have to be passed into foreach via the .packages argument, again as a character vector of names, e.g. .packages = c("ryandexdirect", "dplyr", "lubridate"). That is exactly how, in the code 2 example above, we load the ryandexdirect package on each iteration.
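
    A sketch combining both arguments: a self-written helper exported with .export, and a package passed with .packages (the add_login helper is hypothetical, used only for illustration):

    library(foreach)
    library(doParallel)
    
    # a self-written function defined outside foreach
    add_login <- function(df, login) {
      df$login <- login
      df
    }
    
    cl <- makeCluster(2)
    registerDoParallel(cl)
    res <- foreach(login = c("login_1", "login_2"),
                   .combine = 'rbind',
                   .export = "add_login",        # objects to copy into each worker
                   .packages = "ryandexdirect") %dopar% {
      add_login(yadirGetKeyWords(Login = login), login)
    }
    stopCluster(cl)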


    The doParallel package


    As I wrote above, the doSNOW and doParallel packages are twins, so their syntax is exactly the same.


    Code 3: Parallel computing with the doParallel package
    library(foreach)
    library(doParallel)
    logins <- c("login_1", "login_2", "login_3", "login_4")
    cl <- makeCluster(4)
    registerDoParallel(cl)
    res3 <- foreach(login = logins, .combine = 'rbind', .inorder = F) %dopar% {
      cbind(ryandexdirect::yadirGetKeyWords(Login = login), login)
    }
    stopCluster(cl)

    Run time:
    user: 0.25
    system: 0.01
    elapsed: 173.28


    As you can see, the execution time differs only slightly from the previous parallel computing example using the doSNOW package.


    Speed test of the three approaches considered


    Now let's run a speed test using the rbenchmark package.


    [Benchmark results table: doSNOW and doParallel collect the data roughly twice as fast as the sequential for loop.]


    As you can see, even in a test on 4 accounts the doSNOW and doParallel packages retrieve the keyword data twice as fast as a sequential for loop; if you create a cluster of 5 cores and process 50 or 100 accounts, the difference will be even more significant.


    Code 4: Script comparing the speed of multithreaded and sequential computation
    # attach the libraries
    library(ryandexdirect)
    library(foreach)
    library(doParallel)
    library(doSNOW)
    library(rbenchmark)
    # vector of logins
    logins <- c("login_1", "login_2", "login_3", "login_4")
    # keyword collection function using a for loop
    for_fun <- function(logins) {
      res1 <- data.frame()
      for (login in logins) {
        temp <- yadirGetKeyWords(Login = login)
        res1 <- rbind(res1, temp)
      }
      return(res1)
    }
    # keyword collection function using foreach and the doSNOW package
    dosnow_fun <- function(logins) {
      cl <- makeCluster(4)
      registerDoSNOW(cl)
      res2 <- foreach(login = logins, .combine = 'rbind') %dopar% {
        ryandexdirect::yadirGetKeyWords(Login = login)
      }
      stopCluster(cl)
      return(res2)
    }
    # keyword collection function using foreach and the doParallel package
    dopar_fun <- function(logins) {
      cl <- makeCluster(4)
      registerDoParallel(cl)
      res3 <- foreach(login = logins, .combine = 'rbind') %dopar% {
        ryandexdirect::yadirGetKeyWords(Login = login)
      }
      stopCluster(cl)
      return(res3)
    }
    # run the speed test over the three functions
    within(benchmark(for_cycle  = for_fun(logins = logins),
                     dosnow     = dosnow_fun(logins = logins),
                     doparallel = dopar_fun(logins = logins),
                     replications = c(20),
                     columns = c('test', 'replications', 'elapsed'),
                     order = c('elapsed', 'test')),
           { average = elapsed/replications })
    

    In conclusion, here is an explanation of code 4 above, which we used to test the speed.


    First, we created three functions:


    for_fun - requests keywords from a set of accounts, iterating over them sequentially in a regular for loop.


    dosnow_fun - requests the list of keywords in multithreaded mode using the doSNOW package.


    dopar_fun - requests the list of keywords in multithreaded mode using the doParallel package.


    Then, inside the within construct, we run the benchmark function from the rbenchmark package, give each test a name (for_cycle, dosnow, doparallel), and assign each test its function: for_fun(logins = logins), dosnow_fun(logins = logins), and dopar_fun(logins = logins) respectively.


    The replications argument sets the number of runs, i.e. how many times each function is executed.


    The columns argument specifies which columns to return; in our case 'test', 'replications', 'elapsed' return the test name, the number of runs, and the total run time across all runs.


    You can also add computed columns ({ average = elapsed/replications }): the output will contain an average column that divides the total time by the number of runs, giving the average time each function takes.


    The order argument controls how the test results are sorted.


    Conclusion


    This article describes a fairly universal method for speeding up work with an API, but every API has its limits. In this exact form, with this number of threads, the example suits the Yandex.Direct API; to use it with other services' APIs, first read their documentation on the limits for simultaneously sent requests, otherwise you may get a Too Many Requests error.
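
    As a closing sketch, one simple way to soften such limits is a hypothetical retry wrapper (not part of ryandexdirect) that backs off and tries again when a request fails:

    # retry a failing call a few times, pausing between attempts
    with_retry <- function(f, tries = 3, wait = 10) {
      for (i in seq_len(tries)) {
        res <- try(f(), silent = TRUE)
        if (!inherits(res, "try-error")) return(res)
        Sys.sleep(wait)  # back off before the next attempt
      }
      stop("request failed after ", tries, " attempts")
    }
    
    # usage: with_retry(function() yadirGetKeyWords(Login = "login_1"))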


    The continuation of this article is available here.

