Data Science "special forces" built in-house
Practice shows that many enterprise companies struggle to deliver analytics projects.
The point is that, unlike classical projects such as hardware supply or the rollout of vendor solutions, which fit a linear execution model, tasks in advanced analytics (data science) are very hard to formalize as a clear and unambiguous technical specification (TS) detailed enough to hand over to a contractor. The situation is aggravated by the fact that the work requires integrating many different internal IT systems and data sources, and some questions and answers only appear once work with the data begins and the real state of affairs is revealed, which often differs greatly from the picture in the documentation. All this means that writing a competent TS requires preliminary work comparable to half the project: studying and formalizing the real needs and analyzing the data sources, their relationships, structure and gaps. Organizations have practically no employees capable of pulling off work on that scale, so tenders go out with quite raw requirements. At best, the tender is cancelled (sent back for revision) after a round of clarifying questions. At worst, a huge budget and a long timeframe produce something quite different from what the authors of the requirements had in mind, and they are left with nothing.
A sensible alternative is to create data science (DS) teams within the company. Unless you are planning to build the Egyptian pyramids, a team of 2-3 competent specialists can accomplish a great deal. But then another question arises: how to train these specialists. Below I want to share a set of field-tested ideas for quickly preparing such "special forces" armed with R.
This is a continuation of previous publications.
Right now, finding competent specialists on the market is a big problem. It is therefore very useful to consider a strategy of training people who are simply capable and level-headed. The required training has several specific features:
- there is no possibility to study for months; the result must be obtained as quickly as possible;
- it is necessary to put emphasis on the real tasks of the company;
- in industrial DS, there are many more data processing tasks than AI/ML tasks;
- industrial DS is not art-house experimentation but a structured activity that results in stable, working application code.
For all the merits of Coursera, DataCamp, various books and ML programs, no set of courses delivered the required combination of characteristics. They are excellent sources for deepening mastery, but they are a poor fit for a quick start. The main job at a quick start is to point out the paths, swamps and traps; to introduce the range of existing tools; to show how the company's tasks can be solved with the tool; and to throw the student out of the boat into the lake and make them swim.
It is important to show that R is not only a tool but also a community around it. Making heavy use of relevant community material, including presentations, is therefore one of the ways of working with that community. You can even put questions to Hadley on Twitter or GitHub, and worthwhile questions can get comprehensive answers.
After various experiments, a structured approach to delivering the basic material emerged: “Deep Dive Into R”.
Immersion in R
- The optimal course duration is 50 hours (~7-9 days of 6-7 hours each).
- The key goal of the course is the development of practical skills for quickly writing high-quality and efficient code using optimal algorithms.
- Comprehensive demo examples are best created on specific tasks - so you can familiarize yourself with the tools and approaches much faster.
- The large number of topics covered serves to build an overall picture and leave “bookmarks” about what the ecosystem can do.
- Daily breakdown is not a dogma, but planned focus management.
- Within each day, practical tasks of varying degrees of complexity and volume are analyzed to demonstrate and consolidate the material.
Each student receives a practical assignment (“coursework”) from their management in the form of a real task, which they carry out during the immersion and defend at the end of the course.
Briefly about R. Syntax and structure of the language. Basics of using the RStudio IDE for analysis and development. Basic types and data. Interactive calculations and execution of program code. A brief introduction to R Markdown and R Notebook. Principles of working with libraries. Preparing for analytical work: installing the necessary libraries, creating a project. Principles of profiling calculations, finding bottlenecks (excessively slow spots) and eliminating them.
- History and Ecology of R
- RStudio Cheatsheets
- Learning / performance quality criteria: quickly write fast and compact code using optimal algorithms
- Measuring code execution speed:
- Evaluating system performance:
- Ecosystem R, language basics
- Subsetting & slicing
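The quality criterion above (fast, compact code) and the subsetting topic can be illustrated with a minimal base R sketch. The data and the function names `sum_sq_loop` and `sum_sq_vec` are hypothetical, chosen only for illustration; `system.time()` is one simple way to measure speed, and the course material may well use other tools.

```r
# Hypothetical input: 100,000 numbers between 0 and 1.
n <- 1e5
x <- seq_len(n) / n

# Naive version: an explicit loop that accumulates one element at a time.
sum_sq_loop <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value^2
  }
  total
}

# Vectorized version: the whole vector is squared and summed in one call.
sum_sq_vec <- function(v) sum(v^2)

# system.time() reports user, system and elapsed time in seconds.
t_loop <- system.time(res_loop <- sum_sq_loop(x))
t_vec  <- system.time(res_vec  <- sum_sq_vec(x))
print(t_loop["elapsed"])
print(t_vec["elapsed"])

# Both versions agree up to floating-point rounding.
stopifnot(isTRUE(all.equal(res_loop, res_vec)))

# Subsetting and slicing on a small hypothetical data frame.
df <- data.frame(id = 1:5, score = c(10, 25, 7, 40, 18))
high      <- df[df$score > 15, ]   # logical subsetting: rows with score above 15
first_two <- df[1:2, "score"]      # positional slice of one column, returns a vector
```

The timings depend on the machine, so no specific numbers are claimed here; the point is only the habit of measuring before and after a rewrite.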
The concept and ecosystem of the 'tidyverse' packages ( https://www.tidyverse.org/ ). A brief overview of the packages included in it (import / processing / visualization / export). The concept of tidy data as the basis of the working methods in tidyverse. 'tibble' as a data representation format. (Packages from the tidyverse ecosystem). Transformations and data manipulations. Syntax and principles of stream processing (pipe). Group - compute - combine. (Packages
- Tidyverse site
- Tibbles . There are three key differences between tibbles and data frames: printing, subsetting, and recycling rules.
- STAT 545. Cheatsheet for dplyr join functions by Jenny Bryan
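The group - compute - combine pattern named in this block can be sketched as follows. This is a base R illustration so that it runs without extra packages; in the tidyverse the same steps would normally be written with dplyr's `group_by()` and `summarise()`. The `sales` data frame and its columns are hypothetical.

```r
# Hypothetical input: one row per sale.
sales <- data.frame(
  region = c("north", "south", "north", "south", "north"),
  amount = c(100, 80, 50, 20, 30),
  stringsAsFactors = FALSE
)

# 1. Group: split the rows into one data frame per region.
groups <- split(sales, sales$region)

# 2. Compute: summarise each group independently.
summaries <- lapply(groups, function(g) {
  data.frame(
    region = g$region[1],
    total  = sum(g$amount),
    n      = nrow(g),
    stringsAsFactors = FALSE
  )
})

# 3. Combine: stack the per-group summaries back into one data frame.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```

The result has one row per region with the total amount and the number of sales, which is the shape a grouped summary is expected to return.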
Formation of graphical representations by means of ggplot ( https://ggplot2.tidyverse.org/reference/index.html ). Using graphical tools for analyzing business data.
A Gentle Guide to the Grammar of Graphics with ggplot2
Additional links to sample widgets and graphs
Working with strings and enumerated types. Basics of regular expressions. Working with dates. (Packages
Work with strings:
Regular expressions. Online tools
Date and time:
- Entities: period, duration, interval.
- Table of the main functions in Dealing with Dates
- A brief summary of Analytics for industRy: Dates and times
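A minimal base R sketch of the string, regular expression and date topics listed above. The log lines are hypothetical. The last part contrasts a fixed span of days with a calendar step, which is the distinction the period/duration entities are assumed to refer to; the course itself most likely works with dedicated packages rather than base R.

```r
# Hypothetical raw input: two well-formed lines and one broken line.
log_lines <- c(
  "2019-01-31 order=1001 status=OK",
  "2019-02-15 order=1002 status=FAIL",
  "broken line without a date"
)

# Keep only the lines that start with an ISO date (YYYY-MM-DD).
date_pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}"
good_lines <- log_lines[grepl(date_pattern, log_lines)]

# Extract the matched date text and convert it to Date objects.
date_text <- regmatches(good_lines, regexpr(date_pattern, good_lines))
dates <- as.Date(date_text, format = "%Y-%m-%d")

# Extract the order number with a capture group.
order_id <- as.integer(sub(".*order=([0-9]+).*", "\\1", good_lines))

# Difference between two dates as a number of days.
gap_days <- as.numeric(difftime(dates[2], dates[1], units = "days"))

# A fixed span (30 days) versus a calendar step (one month at a time).
plus_30_days <- dates[1] + 30
next_months  <- seq(as.Date("2019-01-15"), by = "month", length.out = 3)

print(data.frame(date = dates, order_id = order_id))
```

Adding 30 days and stepping by a calendar month give different results around month boundaries, which is why date arithmetic deserves its own place in the course.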
Advanced data import. txt, csv, json, odbc, web scraping (REST API), xlsx.
- Working directory: setwd(). Using the rspivot package. rspivot is a Shiny gadget for RStudio for viewing data frames.
- Package and add-in
- Introduction to readr . Run through the specifics of the column specification.
- work with Excel:
- Web scraping on a demo example:
Export data. rds, csv, json, xlsx, docx, pptx, odbc. Basics of R Markdown and R Notebook.
- Creating presentations with R
- Export to PDF via knit -> LaTeX
- Direct export to Word
Basics of programming in R. Creating functions. Variable scope. Viewing objects and their structure. Principles of working "by reference".
- The concept of function. Creating your own functions.
- R functions, R for Data Science. 19 Functions
- The concept of the environment. Advanced R. 7 Environments
- for loop, while loop
- purrr tutorial
- Profvis - Profiling tools for faster R code
- Lazy_evaluation approach, non-standard evaluation package
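The topics of this block (functions, variable scope, environments and working "by reference", loops versus functional style) can be sketched in base R as follows. All names are hypothetical and exist only for illustration; `vapply()` stands in here for the mapping functions that the purrr tutorial covers.

```r
# A function with a default argument.
scale_to_unit <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}
scaled <- scale_to_unit(c(2, 4, 6))

# Scope: assignment inside a function creates a local variable,
# so the global `counter` is left untouched.
counter <- 0
bump_local <- function() {
  counter <- counter + 1
  counter
}
local_result <- bump_local()

# Closure: the inner function keeps access to the environment where it
# was created, and `<<-` modifies `count` in that enclosing environment.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
next_id <- make_counter()
first_id  <- next_id()
second_id <- next_id()

# Environments are passed by reference: the change made inside the
# function is visible to the caller without returning anything.
state <- new.env()
state$hits <- 0
register_hit <- function(env) {
  env$hits <- env$hits + 1
  invisible(env)
}
register_hit(state)

# The same computation as an explicit loop and in functional style.
squares_loop <- numeric(5)
for (i in seq_along(squares_loop)) {
  squares_loop[i] <- i^2
}
squares_fun <- vapply(1:5, function(i) i^2, numeric(1))
```

Ordinary vectors and data frames are copied on modification, so the environment example is the exception that makes "by reference" behaviour visible.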
Approaches to validating intermediate and final results. Principles of collaboration and of building reproducible calculations. Demonstration of shiny applications as a target interface for end users. (Packages
- Defensive programming. Validation of parameters. The checkmate package, speed is everything.
- Validation of input data:
- Logging tools
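Parameter validation in the defensive style can be sketched with base R's `stopifnot()`. This is only an illustration under assumptions: the function `mean_order_value` and its columns are hypothetical, and the course uses the checkmate package, whose assertion functions are faster and give more informative messages.

```r
# Hypothetical helper: mean order amount per customer, restricted to
# customers with at least `min_orders` orders.
mean_order_value <- function(orders, min_orders = 1L) {
  # Validate all parameters before doing any work.
  stopifnot(
    is.data.frame(orders),
    all(c("customer", "amount") %in% names(orders)),
    is.numeric(orders$amount),
    !anyNA(orders$amount),
    is.numeric(min_orders),
    length(min_orders) == 1,
    min_orders >= 1
  )
  counts <- table(orders$customer)
  keep <- names(counts)[counts >= min_orders]
  kept <- orders[orders$customer %in% keep, ]
  tapply(kept$amount, kept$customer, mean)
}

orders <- data.frame(
  customer = c("a", "a", "b"),
  amount   = c(10, 30, 5),
  stringsAsFactors = FALSE
)
result <- mean_order_value(orders)
print(result)

# Bad input stops with an error instead of silently producing a wrong number.
bad <- tryCatch(
  mean_order_value(list(1, 2)),
  error = function(e) conditionMessage(e)
)
print(bad)
```

Failing early at the function boundary keeps errors close to their cause, which matters once the code runs unattended in production.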
Methods and approaches for working with "medium"-sized data. The data.table package. Main functions. Comparative experimental analysis.
Review of additional questions that came up during days 1-8.
Requirements for participants' workstations running Windows 10
- installed R 3.5.2 ( https://www.r-project.org/ )
- installed C++ compiler, included in the Rtools35 toolkit for R ( https://cran.r-project.org/bin/windows/Rtools/ )
- installed IDE RStudio Desktop Open Source License ( https://www.rstudio.com/products/rstudio/download/ )
- installed the latest build of Java SE RE 1.8
- Access to sites containing R package repositories is open:
- R for Data Science by Garrett Grolemund, Hadley Wickham
- R for Data Science: Exercise Solutions by Jeffrey B. Arnold
- Hands-On Programming with R by Garrett Grolemund
- Advanced R by Hadley Wickham
- Handling Strings with R by Gaston Sanchez
- RStudio Cheatsheets
- The proposed sequence for presenting the material is not dogma. There may be digressions and additional materials, including mathematical asides. Everything is driven by the actual topical issues and tasks chosen for the coursework and by the list of common production questions. The most in demand are regression, clustering, text mining and time series algorithms.
- Issues of parallel computing, creating shiny applications, using ML algorithms and external platforms do not fit into the concept of “fast immersion”, but can be a continuation after the start of practical work.
P.S. HR usually has difficulty formulating job requirements.
Here is a possible example as a starting point. Everyone adds to it and adjusts it according to their own expectations.
Data Science (DS): Big Data and Analytics. Job Requirements
- Higher education in a technical or natural science field.
- The availability of certificates for subject courses (Coursera, DataCamp) is welcomed.
- English is an asset (fluent reading of technical literature, confident understanding of unadapted spoken English, speaking at the level of technical communication).
- Experience in DS: at least 1 year.
- Experience of team development under agile methodologies: at least 2 years.
- Experience in the development of user interfaces (WEB + JS).
- Experience in the development of documentation.
- Confident command of the following technologies (at least 30% of the list):
- SQL + No-SQL backend (at least one of the databases of each type).
- Open-source programming languages for DS tasks (Python, R, or both).
- Platforms for storing and processing big data (Hadoop and its derivatives, Spark / Ignite, ClickHouse, Vertica, ELK stack ...)
- Basics of HTML + JS + CSS in the context of the development of web-GUI for end users.
- Confident knowledge of the basics of mathematical statistics and linear algebra.
- Time series (including forecasting and search for anomalies).
- Machine learning, neural networks.
- Text mining and regular expressions.
- Basics of administering Windows and *nix systems.
- Tools and algorithms for processing and visualizing geographic information (ESRI, OpenStreetMap, Yandex Maps, Google Maps, leaflet, ...), working with shapefiles and GPX files.
- Data import into data science tools and their normalization (files, ODBC, REST API, Web crawling).
- Visualization (Tableau, QlikView, Excel, Shiny / Dash).
- Commercial math packages (Wolfram Mathematica, Maple, Matlab).
- Examination and preparation of initial data.
- Development of hypotheses and their verification on the source data.
- Development of mathematical models and their approbation.
- Development of software design solutions.
- Development of WEB-applications and interactive dashboards.
- Development of hard-copy reports.
- Setup, testing, development and maintenance of the analytics environment.
- Updating of project documentation.
Previous publication: “How fast is R for production?”.