vladob February 12, 2013 at 01:01

My experience introducing R, or “I Love R”

From the sandbox

I am a scientist [ more about this here ], one of "the proletariat of mental labor." A physicist by training, I have been working in medical and biological information processing for 30+ years.
I have been working in R for exactly 10 years, having migrated to it after 15 years of close cooperation with Matlab. The root cause of the move to another platform was my own physical migration to the opposite side of the Earth, to Auckland, New Zealand. Here, life pushed me into the arms of R from the very first days, and I have not yet had reason to regret it.

Increasingly, I see bursts of interest in R in Russian-speaking professional circles. There are articles about it here on this reputable resource, too. Below the cut is my first attempt at a Russian-language introduction to R: the first (verbal) part of a presentation I gave to colleagues at Animal Science, Iowa State University, three years ago.
(An aside: how hard it turns out to be to translate yourself...)

In this post

What is R
Where it came from
Why I love it
Why and how do I use it (examples)
Myths and Truth

What is R

First of all, R is a system for statistical and other scientific computing that uses the S programming language.

S is a language written by statisticians for statisticians, as its author John Chambers defines it. Since its inception the language has been very well received and has been tested by generations of very picky statisticians. We can assume that it is widely known and accepted in the global statistical community. A number of critical epidemiological, environmental, and financial models have been implemented in S and are still in use around the world and in many industries. From my point of view as a "writing user", S is a very pleasant alternative to the SAS language.

From my own experience: I was introduced to S and got my first lessons in it in the early 90s from WHO expert statisticians, with whom my research of that time brought me into contact.

By many estimates (not much exaggerated, in my view), R is one of the most successful open source projects, distributed freely from dozens of mirrors around the world under the GNU license.
The authors categorically reject all proposals to commercialize the project, although today there is reason to believe that the number of installed copies of R in the world exceeds the combined number of copies of all other statistical analysis systems.

From the very beginning and to this day, the project has shown a deep respect (bordering on reverence) for stability, user support, code compatibility, and so on: everything I would sum up under the concept of culture.
However, that last sentence belongs more to the subsections that follow.

Where S came from and how it relates to R

Wikipedia will undoubtedly give you far more text.
I will just note what I consider important for understanding the place of S and R in ~~this life~~ in this world.

Bell Laboratories (aka Bell Labs, AT&T Bell Laboratories) is quite famous in the history of science and technology, and of IT in particular. Statistical research there was always taken very seriously and was just as seriously supported by all available computing tools (read: tons of Fortran and Lisp code).

What later became the S language arose in the 70s on the initiative and under the leadership of John Chambers, as a set of scripts that made it easier to “feed” data to Fortran code. That is, the focus was on interactive data manipulation, on compact code that is pleasant to write and read, and on producing decent tables and graphs on various output devices.

The syntax of the language allows the construction of almost arbitrarily complex data structures, and provides means for describing specific statistical problems and objects: statistical tests, models, etc.

In 1984 the language acquired a name and its own “Bible” (the book by Becker and Chambers, S: An Interactive Environment for Data Analysis and Graphics, was published). By default it now contained an almost complete “gentleman's set” of statistics and probability: distributions, random number generators, statistical tests, many standard statistical analyses, matrix operations, etc., not to mention a well-developed system of scientific graphics. Most importantly, it became available to users all over the world at a very reasonable price.

In 1988 (when another book, The New S Language, was published) the language was reworked along OOP lines: everything became an object with very reasonable defaults, open to modification, with self-documenting elements, and so on.

At the same time, the Labs published the source code, and the "Bell Labs" S became free for students and for scientific use. All this was somehow connected with the “dispossession” of AT&T, but I was not very interested in those details.

There were, and probably still are, commercial implementations of the S language. I came across S-Plus and S2000. They were supported at different times by different companies and mainly lived (live?) off the maintenance of applications previously written in S. These post-Bell versions of S introduced a new version of the OOP engine, but for a plain user the change was almost bloodless in terms of compatibility with historical code.

R is the only non-commercial implementation of the S language that is fully independent of the original Bell Labs one.

And, with a unanimity rare in our time and in a way I cannot fathom, the developers of the current versions of commercial S and non-commercial R maintain almost complete compatibility and continuity between them.

And now R

Behind any significant phenomenon in this life there is some charismatic personality. Then again, perhaps that is exactly what defines the significance of a phenomenon.

In the case of R there are three such people.
I have already mentioned John Chambers.

Ross Ihaka, a student and later a researcher at the Department of Statistics of the University of Auckland, chose as the topic of his dissertation (which was carried out at MIT, USA) the feasibility of building a virtual machine (VM) for statistical programming languages. Lisp (Common Lisp, CL) was selected as the intermediate language, and a prototype VM was implemented that "understood" a small subset of SAS and S.
Ross returned to Auckland to finish his dissertation, where he soon met Robert Gentleman and became absorbed in the R project.
Ross did not defend his dissertation, but by now holds degrees from several universities awarded “on the basis of merit”. Last year he was given the title and position of Associate Professor at his home university.

Robert Gentleman, another statistician with a passion for programming and a native of Canada, suggested to Ross, while on a visit to the University of Auckland (he was working in Australia at the time), that they “write some kind of language.”
According to the legend, which I heard from these “founding fathers” themselves, in about a month and in a fit of insane enthusiasm they rewrote almost all of the S commands in CL, including a powerful linear modeling library.

As R's computational engine, following the tradition of the prototype, the well-known, universally recognized, and free BLAS library was chosen (with the option of using ATLAS and others with the same interface).
Paul Murrell, one of Ross's closest friends and also a University of Auckland staff member, went out of his way and wrote from scratch (in C, I think) a graphics engine that fully reproduces the functionality of the one in S.

The result was a free, fully functional package that instantly took its place in teaching at the University of Auckland. It fully matched the descriptions in Chambers's very detailed and high-quality books, which were traditionally published as paperbacks of middling print quality but were cheap and affordable.
Several GNU-oriented activist groups (such as the GIS movement) adopted R as a platform for scientific computing.

But R became truly widely known in bioinformatics when one of its “fathers”, Robert Gentleman, who was working with Affymetrix at the time, duplicated all the functionality of the company's commercial software and launched (not single-handedly, of course) the open source project Bioconductor. Today Bioconductor is the undisputed leader in open source bioinformatics for all the "-omics" (genomics, proteomics, metabolomics, etc.).

Naturally, R became the single interface language for this riot of bioinformatic imagination.

The circle closed when the retired Chambers, creator of the S language, joined the group of active R developers as a full member.

Why I love it (list)

Interactivity, “Programming with data” - my favorite style of work
An elegant (to my taste) language - I love lists, data frames, functional programming, and lambda-like (anonymous) functions. Freedom of expression: you can solve the same problem in ten ways (which softens the feeling of routine)
It “looks soberly at this world” - it rarely “crashes” or “hangs”; it has logical operations on missing data, run-time error handling (try-error), easy exchange with the system through standard I/O, etc.
A complete set of ready-to-use statistical procedures
It is well documented and well maintained - compatibility, continuity, etc.
A humanly pleasant professional community has gathered around it (forums, user conferences, etc.)
A well-documented interface to external libraries and functions written in anything - Fortran, C, Java. Hence a sea of well-documented libraries covering every aspect of statistics and data processing in almost all areas of science, with the main emphasis on bioinformatics / biostatistics; everything is updated regularly and correctly, as long as the author is willing
No mandatory GUI in the "basic configuration" - well, I'm not a "mouse" person!
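
A few of the points in the list are easy to try at the R prompt. The snippet below is only a small sketch in base R; the `animals` data frame and its columns are made up for illustration and do not come from the presentation.

```r
# Hypothetical toy data: a data frame with one missing weight
animals <- data.frame(
  id     = c("a1", "a2", "a3"),
  weight = c(410.5, NA, 388.0)
)

# Logical operations with missing data follow three-valued logic
na_and_false <- NA & FALSE   # FALSE: the answer does not depend on the NA
na_or_true   <- NA | TRUE    # TRUE
na_and_true  <- NA & TRUE    # NA: the answer is genuinely unknown

# Summaries propagate NA unless asked to skip it
mean_all  <- mean(animals$weight)                # NA
mean_seen <- mean(animals$weight, na.rm = TRUE)  # 399.25

# An anonymous ("lambda") function applied over a vector
squares <- sapply(1:4, function(x) x^2)

# Run-time error handling: on failure try() returns a "try-error" object
res    <- try(1:3 + "a", silent = TRUE)   # adding a string is an error
failed <- inherits(res, "try-error")      # TRUE
```

The same checks could be written in several other ways (for example with `tryCatch()` or `vapply()`), which is rather the point of the "ten ways" remark.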

Off the list: I am simply pleased that my main working tool has ... a soul.
Which is, in fact, what I am trying to show in this article.

Why and how do I use it (examples)

I started writing this section, but stopped.
Otherwise, I would never have finished.
Later, perhaps.

Myths and Truth

R is slow

R is “thin”: it uses the BLAS / LAPACK / ATLAS libraries for computation. Try writing something faster than these good old (often Fortran) “workhorses”. All critical functions typically use vector operations and are implemented in C.
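
As a rough illustration of the difference, here is a sketch that computes a sum of squares with an explicit interpreted loop and with a vectorized expression. The helper `loop_sumsq` is a name made up for this example, and the timings depend entirely on the machine and the R build, so none are quoted.

```r
set.seed(1)
x <- rnorm(1e6)

# Explicit loop: every iteration goes through the interpreter
loop_sumsq <- function(v) {
  total <- 0
  for (xi in v) total <- total + xi^2
  total
}

t_loop <- system.time(s1 <- loop_sumsq(x))
t_vec  <- system.time(s2 <- sum(x^2))   # arithmetic runs in compiled code

# Both give the same answer up to floating-point rounding
same <- isTRUE(all.equal(s1, s2))

# Matrix algebra is handed to BLAS
a <- matrix(rnorm(200 * 200), 200, 200)
b <- crossprod(a)                        # computes t(a) %*% a
```

Printing `t_loop` and `t_vec` on your own machine shows how large the gap is there.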

R uses computing resources wastefully, memory in particular

Yes, the developers admit to this sin. But a specialist's working time is now more expensive than hardware. Remove the games from a modern desktop computer and you will have no problems with the majority of real data sets.

Free software cannot be reliable

It can: Fortran, Linux, C, Lisp, Java, etc.

In place of an epilogue

As stated above, this post is actually a translation of my presentation for a fairly specific target audience, which I will briefly describe.

Many “pure” IT companies will have to deal with such people, since food production has long been competing with oil and other energy resources in attracting capital and generating profit. And the capacity of the bioinformatics market in medicine and pharmacology is limited, whatever one may say.

So, my audience is people with a basic education in genetics and breeding, veterinary medicine, and, less often, biology (mainly molecular). Men and women (more of the latter), aged 20-30 and up, who program (!) in Fortran or VB, deftly handle Excel tables of 100k rows / columns, and periodically bring down their Linux computing cluster (500+ cores, 12 TB of shared memory) with their jobs (and their programming), and who from time to time demand another ten terabytes of disk.

The methodological base is an explosive mixture of analysis of variance, as old as the world, with mixed models solved by nothing less than maximum likelihood, brain-melting Bayesian networks, etc.

The data are tables of anywhere from a few to tens of thousands of rows, sometimes with 1-5 phenotype columns but, increasingly, with tens or hundreds of thousands ("K") of columns of variables that are weakly correlated with each other and with the phenotypes.

And yes, they also have a “fine tradition” of looking at everything in terms of kinship (genetics, after all). Kinship is traditionally represented as a relationship (pedigree) matrix with dimensions of, for example, 40,000 x 40,000 (if there are 40,000 animals). Or (so far, fortunately, only as a project) 20,000,000 x 20,000,000, in order to "cover" with a single model all 20 million historical animals in the database (DB2, if you are interested, and even Cobol has not been rooted out everywhere yet ...)
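
To get a feel for these sizes, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the matrix is stored densely in double precision (8 bytes per element); the helper `dense_bytes` is a made-up name, and real pedigree software may well store such matrices quite differently.

```r
# Memory for a dense n x n matrix of doubles (8 bytes per element)
dense_bytes <- function(n) as.numeric(n)^2 * 8

gb_40k <- dense_bytes(40000) / 1e9    # 12.8 (gigabytes)
pb_20m <- dense_bytes(2e7)   / 1e15   # 3.2  (petabytes)
```

Under that assumption, the 40,000-animal matrix fits comfortably in the 12 TB of shared memory mentioned above, while the 20-million-animal one exceeds it by more than two orders of magnitude.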

Recent bioinformatics graduates can be recognized by desks littered with books on (all at once) Fortran, Java, C#, Scala, Octave, and Linux for Dummies. But somehow many of them quickly leave science to become “coders.”

However, I also know of a case of movement in the opposite direction. So R will yet prove useful to many.
