The book "The Art of R Programming. Diving into Big Data"

    Hi, Habr readers! Many people use R for isolated tasks: a histogram here, a regression analysis there, or some other one-off statistical operation. This book, however, is written for those who want to develop software in R. The intended readers' programming skills may range from professional qualification to “I took a programming course in college,” but the key is that they write R code for specific purposes. (A thorough knowledge of statistics is generally not necessary.)

    A few examples of readers who might benefit from this book:

    • An analyst (for example, working in a hospital or government agency) who has to produce statistical reports regularly and develops programs for this purpose.
    • A scientist developing statistical methodology, whether new or combining existing methods into integrated procedures. The methodology needs to be implemented in code so that the research community can use it.
    • Specialists in marketing, legal support, journalism, publishing, etc., involved in the development of code for building complex graphic representations of data.
    • Professional programmers with software development experience assigned to projects related to statistical analysis.
    • Students studying statistics and data processing.


    Thus, this book is not a compendium of the countless statistical methods of the wonderful R package. It is about programming, and it covers programming issues that are rarely found in other books about R. Even fundamental topics are considered from a programming angle. A few examples of this approach:

    • This book contains "Advanced Examples" sections. They usually present complete general-purpose functions rather than isolated pieces of code tied to specific data. Moreover, some of these functions may come in handy in your daily work with R. By studying these examples, you will learn not only how specific R constructs work, but also how to combine them into useful programs. In many cases, I describe alternative solutions and answer the question: “Why was it done this way?”
    • The material is presented with the programmer's perspective in mind. For example, when describing data frames, I not only state that a data frame in R is a list, but also point out the consequences of this fact from a programming point of view. Where it may be useful, the text also compares R with other languages (for readers who know those languages).
    • Debugging plays a crucial role in programming in any language, yet most books on R do not mention the topic. In this book, I devote a whole chapter to debugging tools, apply the “advanced examples” principle, and present fully worked demonstrations of how programs are actually debugged.
    • Multi-core computers are now found in almost every home, and the programming of graphics processors (GPUs) is bringing about a quiet revolution in scientific computing. More and more R applications require very large amounts of computation, and parallel processing has become relevant for R programmers. The book devotes a whole chapter to this topic; in addition to describing the mechanics, it gives advanced examples.
    • A separate chapter explains how to use knowledge of R's internal implementation and other aspects of R to speed up R code.
    • One of the chapters focuses on interfacing R with other programming languages such as C and Python. Once again, special attention is paid to advanced examples and debugging recommendations.
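
    To make the remark about data frames concrete, here is a minimal sketch of what "a data frame is a list" means in practice. The data frame d and its columns are made up for illustration and do not come from the book.

```r
# A small data frame with hypothetical columns
d <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

is.list(d)                     # a data frame is a list whose elements are the columns
length(d)                      # so its length is the number of columns, not rows
col_means <- sapply(d, mean)   # list-style iteration visits one column at a time
d[["height"]]                  # list-style extraction returns a single column
```

    Because the columns are list elements, the usual list tools (sapply(), lapply(), the [[ ]] operator) work on data frames without any conversion.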

    Excerpt. 7.8.4. When should global variables be used?


    There is no consensus in the programming community on the use of global variables. Obviously, the question posed in the title of this section has no single right answer, since it is a matter of personal preference and style. Nevertheless, many programmers believe that the outright ban on global variables advocated by many programming teachers is overly strict. In this section, we examine the possible benefits of global variables in the context of R's structure. The term “global variable” will mean any variable located in the environment hierarchy above the level of the code of interest.

    Global variables are used in R more often than you might expect. Surprisingly, R itself uses them very widely in its internal implementation (both in its C code and in its R functions). For instance, the super-assignment operator <<- is used in many R library functions (although usually to write to a variable just one level up in the environment hierarchy). Multithreaded code and GPU code, which are used to write fast programs (see chapter 16), typically rely on global variables as the main mechanism of communication between parallel actors.
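
    As a rough illustration of how the super-assignment operator finds its target, consider the following minimal sketch. The names make_total, add and total are hypothetical and do not appear in the book; the point is only that <<- updates the nearest existing variable of that name in an enclosing environment, which here is one level up.

```r
make_total <- function() {
    total <- 0                  # local to make_total()
    add <- function(v) {
        total <<- total + v     # writes to total one level up, in make_total()
        invisible(total)
    }
    add(5)
    add(7)
    total                       # the local total has been updated by add()
}
result <- make_total()
```

    If no variable of that name exists in any enclosing environment, <<- creates one at the top level, which is why it is commonly associated with global variables.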

    Now, for concreteness, let us return to an earlier example from section 7.7:

    f <- function(lxxyy) { # lxxyy is a list containing x and y
        ...
        lxxyy$x <- ...
        lxxyy$y <- ...
        return(lxxyy)
    }
    # Set x and y
    lxy$x <- ...
    lxy$y <- ...
    lxy <- f(lxy)
    # Use the new x and y
    ... <- lxy$x
    ... <- lxy$y

    As mentioned earlier, this code can become cumbersome, especially if x and y are themselves lists.

    On the other hand, take a look at an alternative scheme using global variables:

    f <- function() {
        ...
        x <<- ...
        y <<- ...
    }
    # Set x and y
    x <- ...
    y <- ...
    f() # x and y are changed here
    # Use the new x and y
    ... <- x
    ... <- y

    Arguably, the second version is much cleaner, less bulky, and requires no list manipulation. Clear code usually causes fewer problems in writing, debugging, and maintenance.
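
    The two schemes can be compared side by side in runnable form. This is only a sketch: the book elides the actual computations, so the arithmetic below (doubling x, incrementing y) and the names f_list and f_global are placeholders chosen for illustration.

```r
# Scheme 1: pass a list in, get the modified list back
f_list <- function(lxxyy) {
    lxxyy$x <- lxxyy$x * 2      # placeholder computation
    lxxyy$y <- lxxyy$y + 1      # placeholder computation
    lxxyy
}
lxy <- list(x = 10, y = 20)
lxy <- f_list(lxy)              # caller must reassign the result

# Scheme 2: the function modifies variables one level up
f_global <- function() {
    x <<- x * 2
    y <<- y + 1
    invisible(NULL)
}
x <- 10
y <- 20
f_global()                      # x and y are changed here
```

    Both schemes end with the same values; the difference is that the first makes the data flow explicit at the call site, while the second keeps the call short.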

    For these reasons, namely to simplify the code and make it less bulky, we decided to use global variables instead of returning lists in the DES (discrete-event simulation) code given earlier. Let us consider this example in more detail.

    Two global variables were used (both are lists containing various information): the sim variable is associated with the library code, and the mm1glbls variable is associated with the M/M/1 application-specific code. Let's start with sim.

    Even programmers who are wary of global variables agree that their use can be justified if they are truly global, in the sense that they are used widely throughout the program. This is the case for the sim variable in the DES example: it is used both in the library code (in schedevnt(), getnextevnt() and dosim()) and in the M/M/1 code (in mm1reactevnt()). In this particular example, later accesses to sim are read-only, but writes are possible in some situations. A typical example is a possible implementation of event cancellation. Such a situation may arise when modeling an “earlier of the two” scenario: two events are scheduled, and when one of them occurs, the other must be canceled.

    Thus, using sim as a global variable seems justified. However, if we firmly refused to use global variables, sim could be made a local variable inside dosim(). That function would then pass sim as an argument to all the functions mentioned in the previous paragraph (schedevnt(), getnextevnt(), etc.), and each of these functions would return the modified sim.
    For example, line 94:

    reactevnt(head)

    would be changed to the following form:

    sim <- reactevnt(head)

    After that, the following line would have to be added to the application-specific function mm1reactevnt():

    return(sim)

    Something similar can be done with mm1glbls by including in dosim() a local variable named (say) appvars. But if this is done with two variables, they must be placed in a list so that both can be returned from a function, as in the f() example above. And then we get the bulky structure of lists within lists mentioned earlier, or rather, lists within lists within lists.
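
    The nesting described above might look roughly like the following sketch. It is not the book's DES code: react() and run() are hypothetical stand-ins for the application handler and dosim(), and the fields currtime and njobs are assumed purely for illustration.

```r
# Hypothetical stand-in for an application-specific event handler
react <- function(state) {
    state$sim$currtime <- state$sim$currtime + 1    # library-side state
    state$appvars$njobs <- state$appvars$njobs + 1  # application-side state
    state                       # both pieces must travel back together
}

# Hypothetical stand-in for dosim(): all state is local and threaded through calls
run <- function(nsteps) {
    state <- list(sim = list(currtime = 0, evnts = NULL),
                  appvars = list(njobs = 0))
    for (i in seq_len(nsteps)) state <- react(state)
    state
}
final <- run(3)
```

    Every function that touches the state has to accept the outer list, reach two levels down, and return the whole structure, which is the bulkiness the text refers to.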

    On the other hand, opponents of global variables point out that the simplicity of the code comes at a price. They worry that during debugging it is hard to find the places where a global variable changes value, since the change can occur anywhere in the program. It might seem that in a world of modern text editors and integrated development tools, which can find every occurrence of a variable, the problem fades into the background (the original article urging programmers to abandon global variables was published in 1970!). Nevertheless, this factor must be taken into account.

    Another problem critics mention arises when a function is called from several unrelated parts of a program with different values. For example, imagine that f() is called from different parts of the program, and each call site needs its own x and y values rather than a single shared pair. The problem can be solved by creating vectors of x and y values in which each call site of f() in your program has its own element. However, this sacrifices the simplicity that global variables offered.

    These problems arise not only in R but in a more general context as well. In R, however, using global variables at the top level creates an additional problem, since the user typically has many variables at that level. There is a danger that code using global variables could accidentally overwrite a completely unrelated variable of the same name.
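
    A minimal sketch of such an accidental overwrite follows. The function count_rows() is hypothetical and stands for any code that uses <<- for its own bookkeeping under a short, common name.

```r
n <- 100                        # the user's own top-level variable

count_rows <- function(m) {     # hypothetical library function
    n <<- nrow(m)               # meant as internal bookkeeping
    invisible(NULL)
}

count_rows(matrix(0, nrow = 3, ncol = 2))
n                               # the user's value has been silently replaced
```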

    Of course, the problem is easily solved - it is enough to choose long names for global variables that are tied to a specific application. However, environments also provide a reasonable compromise, as in the following situation for the DES example.

    Inside the dosim() function, the line

    sim <<- list()

    can be replaced with the line

    assign("simenv",new.env(),envir=.GlobalEnv)

    This creates a new environment, referenced by the variable simenv at the top level. The environment serves as a container that encapsulates the global variables, which can be accessed through calls to get() and assign(). For example, the lines

    if (is.null(sim$evnts)) {
       sim$evnts <<- newevnt

    in schedevnt() take the form

    if (is.null(get("evnts",envir=simenv))) {
      assign("evnts",newevnt,envir=simenv)

    Yes, this solution is also cumbersome, but at least it is not as complicated as lists inside lists inside lists. And it protects against accidental writing to an extraneous variable at the top level. Using the super-assignment operator still gives less cumbersome code, but this trade-off should be taken into account.
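
    The environment-as-container idea can be tried out with the following sketch. It is a simplified stand-in, not the book's DES code: schedule() is a hypothetical reduction of schedevnt(), and the event fields evnttime and evnttype are assumed for illustration. Two details are assumptions of this sketch rather than part of the excerpt: evnts is initialized to NULL, because get() signals an error for a name that does not exist, and inherits = FALSE is passed so that get() looks only inside simenv.

```r
simenv <- new.env()                       # container for the "global" state
assign("evnts", NULL, envir = simenv)     # initialize so that get() succeeds

evnts <- "unrelated"                      # a top-level variable with the same name

schedule <- function(newevnt) {           # hypothetical simplified schedevnt()
    current <- get("evnts", envir = simenv, inherits = FALSE)
    if (is.null(current)) {
        assign("evnts", newevnt, envir = simenv)
    } else {
        assign("evnts", rbind(current, newevnt), envir = simenv)
    }
    invisible(NULL)
}

schedule(c(evnttime = 1.5, evnttype = 1))
schedule(c(evnttime = 0.7, evnttype = 2))
stored <- get("evnts", envir = simenv, inherits = FALSE)
```

    After the two calls, the events live only inside simenv, and the top-level evnts keeps its original value.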

    As usual, there is no single programming style that provides the best results in all situations. A solution with global variables is another option that should be included in your arsenal of programming tools.

    7.8.5. Closures


    Recall that an R closure consists of a function's arguments and body together with its environment at the time of the call. The fact that the environment is included is exploited in a programming paradigm built on a feature that is also called a closure (the terminology is somewhat overloaded here).

    A closure in this sense is a function that creates a local variable and then creates another function that accesses that variable. That description is rather abstract, so an example will help.

    > counter
    function () {
        ctr <- 0
        f <- function() {
            ctr <<- ctr + 1
            cat("this count currently has value",ctr,"\n")
        }
        return(f)
    }

    Let's check how this code works before diving into the implementation details:

    > c1 <- counter()
    > c2 <- counter()
    > c1
    function() {
            ctr <<- ctr + 1
            cat("this count currently has value",ctr,"\n")
         }
    <environment: 0x8d445c0>
    > c2
    function() {
            ctr <<- ctr + 1
            cat("this count currently has value",ctr,"\n")
         }
    <environment: 0x8d447d4>
    > c1()
    this count currently has value 1
    > c1()
    this count currently has value 2
    > c2()
    this count currently has value 1
    > c2()
    this count currently has value 2
    > c2()
    this count currently has value 3
    > c1()
    this count currently has value 3

    Here, the counter() function is called twice, and the results are assigned to c1 and c2. As expected, these two variables hold functions, namely copies of f(). However, f() accesses the variable ctr through the super-assignment operator, and that variable is the one of that name local to counter(), since it is the first one found on the way up the environment hierarchy. It is part of the environment of f() and, as such, is packaged into what is returned to the caller of counter(). The key point is that on different calls to counter(), the ctr variable is in different environments (in this example, the environments were stored in memory at the addresses 0x8d445c0 and 0x8d447d4). In other words, different calls to counter() create physically different instances of ctr.

    As a result, the functions c1 () and c2 () work as completely independent counters. This can be seen from the example where each function is called several times.
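
    The independence of the two counters can also be checked directly by inspecting the environments attached to c1 and c2. The following sketch repeats the counter() definition from above so that it runs on its own; the calls to environment(), identical() and get() are standard R, but the variable names same_env, ctr1 and ctr2 are introduced here for illustration.

```r
counter <- function() {
    ctr <- 0
    f <- function() {
        ctr <<- ctr + 1
        cat("this count currently has value", ctr, "\n")
    }
    return(f)
}

c1 <- counter()
c2 <- counter()
c1(); c1()                      # advance the first counter twice
c2()                            # advance the second counter once

# Each returned function carries its own environment with its own ctr
same_env <- identical(environment(c1), environment(c2))
ctr1 <- get("ctr", envir = environment(c1))
ctr2 <- get("ctr", envir = environment(c2))
```

    Here same_env is FALSE, and ctr1 and ctr2 hold the separate counts of the two closures.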

