demche June 1, 2019 at 19:55

Entertaining Archaeology: The R Style Guide Under the Magnifying Glass


As you know, code is read much more often than it is written. Style guides exist so that at least someone other than the author can read it. For R, one such guide is Hadley Wickham's.

A style guide is not just a tacit agreement between developers: many of its rules have a curious backstory. Why the arrow `<-` is better than the equals sign `=`, why R old-timers dislike the underscore, how the recommended line length is related to punch cards, and much more: read on.

Disclaimer: R Style Guides

Unlike Python, R does not have a single standard. Accordingly, there is no single guide. In addition to the Hadley guide (or its extended tidyverse version), there are others, such as Google's or Bioconductor's.

However, the Hadley guide can be considered the most common (the built-in RStudio check follows it, for example), which is greatly helped by the popularity of the packages created by Hadley himself (dplyr, ggplot, tidyr, and others from the tidyverse collection).

1. Assignment operator: `<-` vs `=`

All available guides recommend using the non-standard operator `<-` rather than the equals sign `=` familiar from other modern languages. The three other operators (`<<-`, `->`, `->>`) are not even mentioned, nor is `:=`, which existed in earlier versions. So why do we need this non-standard arrow?

History shows its hand: the arrow came to R from S, which in turn inherited it from APL. In APL, it made it possible to distinguish assignment from equality. In R, the equality operator is the standard `==`, so the distinction lies elsewhere. The arrow was the assignment operator from the start, while the equals sign assigned values only to named parameters. In 2001, the equals sign became an assignment operator too, but it never became fully synonymous with the arrow.

What prevents us from treating `=` as a full replacement for the arrow? First of all, `=` works as an assignment operator only at the top level. Inside a function call, for example, everything works as before:

mean(x = 1:5)
# [1] 3
x 
# Error: object 'x' not found
mean(x <- 1:5)
# [1] 3
x
# [1] 1 2 3 4 5

Here, `=` only sets the function parameter, while `<-` also assigns the value to the variable x. We can achieve the same effect by wrapping the assignment in parentheses ~~(no, this is still not Lisp)~~:

mean((x = 1:5))
# [1] 3
x
# [1] 1 2 3 4 5

... or in braces:

mean({x = 1:5})
# [1] 3
x
# [1] 1 2 3 4 5
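The difference between the two calls can also be seen without evaluating anything. The following is a small sketch using base R's quote(); the variable names call_eq and call_arrow are made up for illustration.

```r
# Capture both calls as unevaluated expressions.
call_eq    <- quote(mean(x = 1:5))
call_arrow <- quote(mean(x <- 1:5))

# With `=`, the parser records a named argument called x.
names(call_eq)

# With `<-`, there is no argument name at all ...
names(call_arrow)

# ... and the argument itself is a call to the `<-` function.
identical(call_arrow[[2]][[1]], as.name("<-"))
```

In the first call x is only a label for the argument, which is why no variable x appears afterwards.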

In addition, the arrow takes precedence over the equal sign:

x <- y <- 1
# OK
x = y = 2
# OK
x = y <- 3
# OK
x <- y = 4
# Error in x <- y = 4 : could not find function "<-<-"

The last expression fails because the parser groups it as (x <- y) = 4, and R then tries to evaluate it as a call to a replacement function, roughly

x <- `<-<-`(x, y, value = 4)

In other words, we are attempting an invalid operation: first assigning y to x, and then trying to assign 4 to the result of that assignment. The expression is processed without error only if parentheses change the order of operations: x <- (y = 4).
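The grouping described above can be checked directly on the parse tree. This is a minimal sketch relying only on base R's parse() and deparse(); the names e_fail and e_ok are illustrative.

```r
# Parse (but do not evaluate) the failing and the working expression.
e_fail <- parse(text = "x <- y = 4")[[1]]
e_ok   <- parse(text = "x = y <- 3")[[1]]

# In the failing case the outermost operator is `=`,
# and its left-hand side is the whole call `x <- y`.
as.character(e_fail[[1]])
deparse(e_fail[[2]])

# In the working case `=` is also outermost, but its left-hand side
# is the plain name x and `y <- 3` sits on the right.
deparse(e_ok[[2]])
deparse(e_ok[[3]])
```

Because `<-` binds more tightly, it always ends up nested inside `=`; the only question is on which side.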

2. Spacing

The guide recommends putting spaces around operators (except, of course, square brackets, `:`, `::` and `:::`), as well as before an opening parenthesis. This convention is also part of the GNU coding standards. However, the rule is closely tied to the use of `<-` as the assignment operator. For instance,

x <-1

What is this? Is x less than -1? Or is x being set to 1?

However, an extra space is no better than a missing one. For example:

x <- 0
ifelse(x <-1, T, F)
# [1] TRUE
x <- 0
ifelse(x < -1, T, F)
# [1] FALSE

In the first case, there is no space between `<` and `-`, so together they form the assignment operator: x is set to 1, and the condition is treated as true.
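How R reads the two spellings can be confirmed by parsing them without evaluation. A short sketch, assuming nothing beyond base R; the names assign_expr and compare_expr are illustrative.

```r
# The same characters with and without a space between `<` and `-`.
assign_expr  <- parse(text = "x <-1")[[1]]
compare_expr <- parse(text = "x < -1")[[1]]

# Without the space the tokenizer produces the assignment operator.
as.character(assign_expr[[1]])

# With the space it produces a comparison against the value -1.
as.character(compare_expr[[1]])
deparse(compare_expr[[3]])
```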

3. Names of functions and variables

Style guides disagree on naming: the Hadley guide recommends underscores for all names; the Google guide recommends dot separation for variables and camel case starting with a lowercase letter for functions; Bioconductor recommends lowerCamel for both functions and variables. The R community has no consensus on this, and every possible style can be found:

lowerCamel
period.separation
lower_case_with_underscores
alllowercase
UpperCamel

There is no uniform style even among base R names (for example, rownames and row.names are different functions!). Leaving aside the unreadable alllowercase (only Matlab users could love it), three styles are the most popular: lowerCamel, lower case with underscores, and lower case with dot separation.
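To make the styles listed above concrete, here is a rough sketch that classifies a name with regular expressions. The helper name_styles and its patterns are hypothetical simplifications written for this illustration; they are not the rules used in any published survey, and they ignore digits.

```r
# Hypothetical helper: return every naming style that a name matches.
name_styles <- function(name) {
  patterns <- c(
    alllowercase                = "^[[:lower:]]+$",
    period.separation           = "^[[:lower:]]+(\\.[[:lower:]]+)+$",
    lower_case_with_underscores = "^[[:lower:]]+(_[[:lower:]]+)+$",
    lowerCamel                  = "^[[:lower:]]+([[:upper:]][[:lower:]]*)+$",
    UpperCamel                  = "^([[:upper:]][[:lower:]]*)+$"
  )
  # Test the name against each pattern and keep the styles that match.
  matched <- vapply(patterns, grepl, logical(1), x = name)
  names(patterns)[matched]
}

name_styles("rownames")   # alllowercase
name_styles("row.names")  # period.separation
name_styles("c_mean")     # lower_case_with_underscores
```

A name that fits none of the patterns simply yields an empty character vector.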

Figure: the popularity of different styles for function and parameter names (one name can match several styles). Source: Rasmus Bååth's talk at useR! 2017.

Figure: the same data for 2012. Source: Bååth (2012). "The State of Naming Conventions in R". The R Journal, Vol. 4/2, pp. 74-75.

Dot separation is ominously reminiscent of method calls in object-oriented programming, but it is historically widespread. It is so common that this style can be considered the truly native R one. For example, most base functions use it (and everyone has surely come across data.table and as.factor).

Underscore separation, in contrast, is one of the least popular styles (here Hadley goes against the majority). For many R users, underscores are an annoyance: the popular Emacs Speaks Statistics extension replaces them by default with the assignment operator `<-`. And, of course, ~~almost~~ no one changes the default settings.

However, blaming Emacs ESS is still a case of the tail wagging the dog. There is an older reason: in early versions of R, the underscore was a synonym for the arrow `<-`. For example, in 2000 you could run into this:

# this happened in early versions of R
c <- c(1,2,3,4,5)
mean(c)
# [1] 3
c_mean <- mean(c)
c
# [1] 3

Here, instead of creating a variable c_mean, R assigned the value 3 first to the variable mean and then to the variable c. In modern R, of course, such metamorphoses no longer occur.

Because the underscore is so unpopular, functions named in this style are almost absent from base R:

# in 3.5.1 there are only 25 such functions
grep("^[^\\.]*$", apropos("_"), value = T)

Finally, the lowerCamel style (like camel case in general) becomes hard to read when names get long:

# ouch!
GrossNationalIncomePerCapitaAtlasMethodCurrentUnitedStatesDollars

Thus, when it comes to names, the guide's recommendations cannot be considered clear-cut; in the end, this is a matter of taste (as long as you stay consistent).

4. Curly braces

According to the guide, an opening curly brace should be followed by a new line, and a closing one should sit on its own line (unless else follows it). That is, like this:

if (x >= 0) {
  log(x)
} else {
  message("Not applicable!")
}

Nothing very interesting here: this is the standard K&R indentation style, which goes back to the C language and the famous book by Kernighan and Ritchie, “The C Programming Language” (known as K&R after its authors).

The origins of this style are also fairly clear: it saves lines while preserving readability. On early computers, vertical space was a luxury. For example, C was developed on the PDP-11, whose terminal had only 24 lines. And when the K&R book was printed, this style saved paper!
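In R, the "unless else follows it" clause is more than a matter of taste: at the top level, an else that starts its own line is a syntax error, because the parser considers the if statement finished at the end of the previous line. The sketch below checks this with base R's parse(); the helper parses_ok is a hypothetical name used only here.

```r
# Hypothetical helper: TRUE if the code parses, FALSE on a syntax error.
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# `else` on the same line as the closing brace, as the guide recommends.
same_line <- paste(c(
  "if (x >= 0) {",
  "  log(x)",
  "} else {",
  "  message('Not applicable!')",
  "}"
), collapse = "\n")

# `else` moved to a line of its own.
own_line <- paste(c(
  "if (x >= 0) {",
  "  log(x)",
  "}",
  "else {",
  "  message('Not applicable!')",
  "}"
), collapse = "\n")

parses_ok(same_line)  # TRUE
parses_ok(own_line)   # FALSE
```

Inside a function body or another pair of braces the second form does parse, so the failure applies to top-level code such as scripts and console input.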

5. The 80-character line

The guide recommends a line length of 80 characters. The magic number 80 appears not only in R but in a huge number of other languages (Java, Perl, PHP, and so on). And not only in languages: even the Windows command line is 80 characters wide.

This number first entered programming in 1928, along with the standard IBM punched card, which had exactly 80 data columns. A much more interesting question is why this standard was chosen. After all, punch cards of other lengths (24 or 45 columns) had been used before.

The most popular answer links the length of a punch card to the line length of typewriters. The first typewriters were designed for standard American 8½ x 11 inch paper and could print 72 to 90 characters per line, depending on the margins. So the 80-characters-per-line explanation looks quite plausible, although it is not the final word. It is possible that 80 characters is simply a happy medium in terms of ergonomics.
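Whatever its origin, the 80-character limit is easy to check. Below is a minimal sketch using base R's nchar(); the function long_lines and the sample vector code are made up for illustration, and in practice the lines would come from readLines() on a script.

```r
# Hypothetical helper: indices of lines longer than the limit.
long_lines <- function(lines, limit = 80) {
  which(nchar(lines, type = "chars") > limit)
}

# Three sample lines: short, 81 characters, exactly 80 characters.
code <- c("x <- 1", strrep("a", 81), strrep("b", 80))

long_lines(code)  # 2
```

Only the second line is reported, since a line of exactly 80 characters still fits within the limit.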

6. Indentation: spaces vs tabs

The guide recommends indenting with two spaces rather than a tab. Giving up tabs is understandable: the width of a TAB varies between text editors (anywhere from 2 to 8 spaces). By giving them up, we gain two advantages at once: first, the code looks exactly as we typed it; second, the recommended line length is never violated by accident. In exchange we do, of course, increase the file size (but who wants to bother with such micro-optimizations in 2019?).
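The accidental violation of the line length is easy to reproduce. The sketch below estimates how wide a tab-indented line appears under different tab widths; display_width is a hypothetical helper, and it assumes that tabs occur only at the start of the line, so each one expands to exactly tab_width columns.

```r
# Hypothetical helper: on-screen width of a line with leading tabs.
display_width <- function(line, tab_width) {
  expanded <- gsub("\t", strrep(" ", tab_width), line, fixed = TRUE)
  nchar(expanded)
}

# Two levels of tab indentation followed by 70 characters of code.
line <- paste0("\t\t", strrep("x", 70))

display_width(line, 2)  # 74: fits into 80 columns
display_width(line, 8)  # 86: exceeds 80 columns
```

The same file therefore passes or fails the length check depending on the reader's editor settings, which cannot happen with spaces.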

The spaces vs tabs dispute has a long history and is on a par with holy wars (such as Windows vs Linux, Android vs iOS, etc.). However, we already know who won: according to Stack Overflow research, developers who use spaces earn more than those who use tabs. A more powerful argument than the rules of a style guide, right?

Instead of a conclusion: the rules of style guides may seem strange and illogical. Indeed, why use the arrow `<-` when there is the standard operator `=`? But if you dig deeper, there is some logic behind every rule, often long forgotten.
