Structural elements of a reliable enterprise R application
Anyone who works with R knows that the language was originally designed as a tool for interactive work. Naturally, techniques that suit step-by-step console work by someone deeply immersed in the subject are of little use when building an application for end users. Getting a detailed diagnosis immediately after an error, inspecting all variables and traces, manually executing pieces of code (perhaps with some variables changed): none of this is available when an R application runs unattended in an enterprise environment. (When we say R, we mostly mean Shiny web applications.)
However, things are not so bad. The R ecosystem (packages and approaches) has matured to the point where a handful of very simple techniques let you elegantly ensure the stability and reliability of user-facing applications. Some of them are described below.
This article continues the series of previous publications.
What makes the problem difficult?
The main class of tasks R is used for is data processing of all kinds. Even a fully debugged algorithm, covered with tests on every side and fully documented, can easily break and produce nonsense if it is fed malformed data at the input.
Data can come from other information systems as well as from users. In the first case you can demand compliance with an API and impose strict requirements on the stability of the data flow. In the second case there is no escaping surprises: a person can make a mistake, submit the wrong file, or write the wrong thing in it. 99% of users work in Excel and prefer to hand the system Excel files, with many sheets and elaborate formatting. That makes the task even harder. A document that looks valid to the eye may be complete nonsense from the machine's point of view. Dates shift (the well-known story: "Excel's designer thought 1900 was a leap year, but it was not"). Numeric values are stored as text and vice versa. Invisible cells, hidden formulas, and much more. In principle, it is impossible to anticipate every pitfall in advance; imagination is not enough. Duplicated records produced by joins against dirty sources are just one example.
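Two of the pitfalls above are easy to reproduce. The following is a minimal base R sketch on made-up data (the `orders` and `customers` tables and their columns are hypothetical, not taken from a real project). It shows how a duplicated key in a reference table silently multiplies rows in a join, and how a numeric column that arrived as text can be checked before conversion.

```r
# Hypothetical inputs: a fact table and a "dirty" reference table
orders <- data.frame(
  customer_id = c(1, 2, 3),
  amount = c("100", "250.5", "n/a"),   # numbers that arrived as text
  stringsAsFactors = FALSE
)
customers <- data.frame(
  customer_id = c(1, 2, 2, 3),         # key 2 is duplicated in the source
  name = c("Ann", "Bob", "Bob ", "Eve"),
  stringsAsFactors = FALSE
)

# The join silently produces more rows than the fact table had
joined <- merge(orders, customers, by = "customer_id")
rows_added <- nrow(joined) - nrow(orders)

# Keys responsible for the duplication
dup_keys <- unique(customers$customer_id[duplicated(customers$customer_id)])

# Rows whose text value cannot be converted to a number
amount_num <- suppressWarnings(as.numeric(orders$amount))
bad_amount_rows <- which(is.na(amount_num) & !is.na(orders$amount))
```

Comparing row counts before and after every join is a cheap check that catches this kind of duplication early.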
We also take the following considerations into account:
- The excellent document "An introduction to data cleaning with R" describes the process of preliminary data preparation. For the steps that follow, we take from it the distinction between two validation phases: technical and logical.
  - Technical validation checks that the data source itself is correct: structure, types, quantitative indicators.
  - Logical validation can have several stages, is carried out in the course of the calculations, and verifies that certain data elements or their combinations satisfy various logical requirements.
- One of the basic rules of user interface design is to give the most complete diagnosis possible when the user makes an error. That is, once the user has uploaded a file, it should be checked as thoroughly as possible and a full summary of all errors returned (preferably with an explanation of what is wrong), rather than failing at the very first problem with a message like "Incorrect input value @ line 528493, pos 17" and demanding a new upload with only that error fixed. This approach reduces the number of iterations needed to produce a correct source file and improves the quality of the final result.
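The "report everything at once" rule can be sketched in base R as follows. This is only an illustration under stated assumptions: the `validate_upload` function and the specific checks are hypothetical, and the column names are borrowed from the data frame example later in this article.

```r
# Hypothetical validator: run every check and collect all problems
# instead of stopping at the first one.
validate_upload <- function(df) {
  problems <- character(0)
  add <- function(msg) problems <<- c(problems, msg)

  required <- c("name", "val", "ship_date")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    add(paste0("Missing columns: ", paste(missing_cols, collapse = ", ")))
  }
  if ("val" %in% names(df)) {
    bad <- which(is.na(suppressWarnings(as.numeric(df$val))))
    if (length(bad) > 0) {
      add(paste0("Column 'val' is missing or not numeric in rows: ",
                 paste(bad, collapse = ", ")))
    }
  }
  if ("name" %in% names(df)) {
    dup <- which(duplicated(df$name))
    if (length(dup) > 0) {
      add(paste0("Duplicated 'name' in rows: ", paste(dup, collapse = ", ")))
    }
  }
  problems  # empty vector means the upload passed all checks
}

upload <- data.frame(
  name = c("a", "b", "a"),
  val = c("1.5", "oops", "3"),
  stringsAsFactors = FALSE
)
report <- validate_upload(upload)
cat(report, sep = "\n")
```

The user sees all three problems after a single upload, instead of discovering them one upload at a time.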
Validation technologies and methods
Let us start from the end. There are a number of packages for logical validation. In our practice, we have settled on the following approaches.
- The already classic dplyr. In simple cases it is convenient to build a pipe with a series of checks and analyze the final result.
- The validate package, for checking technically correct objects against specified rules.
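A pipe with a series of checks might look like the sketch below. The `shipments` table, the rule names, and the fixed reference date are all assumptions made for illustration, and `across()` requires dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical shipments table
shipments <- data.frame(
  name = c("a", "b", "c", "c"),
  val = c(10, -5, NA, 7),
  ship_date = as.Date(c("2019-01-10", "2019-02-01", "2019-02-15", "2030-01-01")),
  stringsAsFactors = FALSE
)

# Each logical rule becomes a logical column (TRUE = row passes);
# the summary then counts violations per rule.
check_summary <- shipments %>%
  mutate(
    val_present = !is.na(val),
    val_non_negative = is.na(val) | val >= 0,
    # reference date is fixed here only to keep the example reproducible
    date_not_in_future = ship_date <= as.Date("2020-01-01"),
    name_unique = !(duplicated(name) | duplicated(name, fromLast = TRUE))
  ) %>%
  summarise(across(val_present:name_unique, ~ sum(!.x)))
```

A result with all zeros means every rule holds; any non-zero column points at the rule to investigate.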
For technical validation, we have settled on the following approaches:
- The checkmate package, with a wide range of fast functions for various technical checks.
- Explicit work with exceptions (see "Advanced R. Debugging, condition handling, and defensive programming" and "Advanced R. Beyond Exception Handling: Conditions and Restarts"), both for performing the full scope of validation in one pass and for keeping the application stable.
- purrr wrappers for exceptions. They are very useful inside a pipe.
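The purrr wrappers mentioned in the last item can be sketched as follows. The `parse_amount` function and the input vector are hypothetical; `safely()` and `possibly()` are the purrr adverbs that turn errors into values.

```r
library(purrr)

# Hypothetical parser that fails on bad input
parse_amount <- function(x) {
  value <- suppressWarnings(as.numeric(x))
  if (is.na(value)) stop("cannot parse '", x, "' as a number")
  value
}

raw <- c("10", "abc", "3.5")

# safely() returns list(result, error) instead of throwing,
# so one bad element does not abort the whole pipe
results <- map(raw, safely(parse_amount))
failed <- map_lgl(results, ~ !is.null(.x$error))
messages <- map_chr(results[failed], ~ conditionMessage(.x$error))

# possibly() substitutes a default value when an error occurs
values <- map_dbl(raw, possibly(parse_amount, otherwise = NA_real_))
```

`safely()` fits the case where the error messages themselves are needed for the user report, while `possibly()` is enough when a placeholder value will do.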
In code that is split into functions, an important element of "defensive programming" is checking the input and output parameters of each function. In dynamically typed languages, type checking has to be done by hand. The checkmate package is ideal for basic types, especially its qtest / qassert functions. For checking a data.frame, we have settled on roughly the following construction (it checks names and types). The trick of merging the name and the type into one string reduces the number of lines in the check.
library(checkmate)       # assertDataFrame(), assertSetEqual()
library(purrr)           # iwalk(), map_chr(), %>%
library(stringi)         # stri_join()
library(futile.logger)   # flog.info()
library(tictoc)          # tic()

ff <- function(dataframe1, dataframe2){
  # get the name of the current function for logging purposes
  calledFun <- deparse(as.list(sys.call())[[1]])
  tic("Calculating XYZ")
  # check the contents of all input data frames (class rather than typeof, so that Date is caught)
  list(dataframe1=c("name :: character", "val :: numeric", "ship_date :: Date"),
       dataframe2=c("out :: character", "label :: character")) %>%
    purrr::iwalk(~{
      # glue_collapse() replaces the deprecated glue::collapse()
      flog.info(glue::glue("Function {calledFun}: checking '{.y}' parameter with expected structure '{glue::glue_collapse(.x, sep=', ')}'"))
      rlang::eval_bare(rlang::sym(.y)) %>%
        assertDataFrame(min.rows=1, min.cols=length(.x)) %>%
        {assertSetEqual(.x, stri_join(names(.), map_chr(., class), sep=" :: "), .var.name=.y)}
      # {assertSubset(.x, stri_join(names(.), map_chr(., typeof), sep=" :: "))}
    })
  …
}
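For the basic types mentioned earlier, the qtest / qassert pair can be sketched like this. The `scale_values` function is hypothetical; the rule strings follow checkmate's compact notation (type letter, length, optional bounds), where an uppercase type letter additionally forbids missing values.

```r
library(checkmate)

# Hypothetical function guarding its arguments with qassert()
scale_values <- function(values, factor, label) {
  qassert(values, "N+")       # numeric, no missing values, length >= 1
  qassert(factor, "N1[0,)")   # single numeric, non-negative
  qassert(label, "S1")        # single non-missing string
  setNames(values * factor, paste0(label, seq_along(values)))
}

ok <- scale_values(c(1, 2), 10, "v")

# qtest() returns TRUE/FALSE instead of throwing an error
is_valid <- qtest(c(1, NA), "N+")     # uppercase type forbids NA
is_lenient <- qtest(c(1, NA), "n+")   # lowercase type allows NA
```

`qassert()` suits guards at the top of a function, while `qtest()` is convenient when the result feeds into a branch or a collected report.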
As for the type-checking function used in the data frame check, you can pick whichever suits the expected data. class was chosen because it reports a date as Date rather than as a number (its internal representation). The question of determining data types is examined in detail in the discussion "'mode' and 'class' and 'typeof' are insufficient".
The choice between assertSetEqual and assertSubset depends on whether the columns must match exactly or only a minimally sufficient set must be present.
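Both choices can be seen in a small base R sketch. The data frame below is made up, and base `setequal()` and `%in%` stand in for the checkmate assertions so that the difference in outcome is visible without raising an error. Taking only the first element of `class()` is an assumption added here to cope with multi-class columns.

```r
df <- data.frame(
  name = c("a", "b"),
  val = c(1.5, 2),
  ship_date = as.Date(c("2019-01-10", "2019-02-01")),
  extra = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Build "name :: type" signatures in two ways
first_class <- vapply(df, function(col) class(col)[1], character(1))
sig_class <- paste(names(df), first_class, sep = " :: ")
sig_typeof <- paste(names(df), vapply(df, typeof, character(1)), sep = " :: ")

expected <- c("name :: character", "val :: numeric", "ship_date :: Date")

# Exact match of columns (the assertSetEqual case): fails because of 'extra'
exact_match <- setequal(expected, sig_class)
# Minimally sufficient set (the assertSubset case): required columns are present
sufficient <- all(expected %in% sig_class)
```

With `typeof`, the date column is reported as `ship_date :: double`, so a rule written as `ship_date :: Date` could never match.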
In practice, this small set of techniques covers most needs.
Previous publication: "R as a lifeline for the system administrator".