Integrating SAS and Greenplum
Introduction
This article may be of interest to those who use SAS ETL tools to build a data warehouse. We recently completed the active phase of a project to move our warehouse storage to the Greenplum database. Before that, SAS datasets served as the database, i.e. the tables were in fact files on the file system. At some point it became clear that data volumes were growing faster than we could scale the file system's performance, and the decision was made to switch to a specialized database.
When we started the project, we could find nothing on the Internet about connecting SAS DIS to Greenplum. In this article I would like to highlight the main points of the transition and the difficulties that came up along the way.
An additional complication was that we could not build the processes from scratch: we had to rework existing ones, otherwise the timeline and cost of the project would have been unacceptable. Historically we use SAS as our ETL tool, in particular SAS Data Integration Studio. Each ETL process there is a so-called job, with input tables, the logic that processes them, and output table(s). The daily warehouse load consists of roughly 800 such jobs. We had to rework them so that they would keep working once their input and output tables moved to Greenplum.
SAS/Access for Greenplum
How does code written in SAS Base work with tables in a database? For the various databases there is a family of SAS products, usually called SAS/Access for <database name>. The connector for Greenplum is an ODBC driver with a special wrapper that makes it usable from SAS. The connector provides two ways to access data (a sketch of both follows the list):
- The database schema is assigned through a SAS libname, and its tables can then be used like ordinary SAS datasets. In this case the SAS code is implicitly translated into instructions for the database; when such translation is impossible (for example, if some SAS function is used that has no Greenplum equivalent), the query is processed on the SAS side instead.
- SQL pass-through. In proc sql you can write code that is handed to the ODBC driver essentially as is.
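To make the two modes concrete, here is a minimal sketch, assuming the SAS/ACCESS to Greenplum (greenplm) engine; the server, database, credentials and table names are placeholders, not the real ones from our project:

    /* 1. Libname access: Greenplum tables behave like SAS datasets */
    libname gp greenplm server="gp-master.example.com" port=5432
            database=dwh schema=public user=etl_user password="XXXX";

    data work.active_clients;                     /* SAS pushes the WHERE down when it can */
       set gp.clients(where=(status = 'ACTIVE'));
    run;

    /* 2. SQL pass-through: the inner query goes to Greenplum as is */
    proc sql;
       connect to greenplm as gpcon (server="gp-master.example.com" port=5432
                                     database=dwh user=etl_user password="XXXX");
       create table work.sales_agg as
       select * from connection to gpcon (
          select client_id, sum(amount) as total_amount
          from dwh.sales
          group by client_id
       );
       disconnect from gpcon;
    quit;

In the libname case the SASTRACE system option is handy for checking in the log which part of a query was actually pushed down to the database.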
In SAS Data Integration Studio only the "SQL Join" transform can generate code that uses the pass-through mechanism; the rest generate ordinary SAS code, which has to be checked (via the log) to see what it ultimately sends to the connector. We came across examples of incorrectly generated code.
Work -> Greenplum Work
By default, jobs in SAS use Work as the place for temporary tables: a directory created on disk at the start of the session, accessible only to the owner of the process and deleted as soon as the session ends. Work holds SAS datasets, to which the full SAS Base language applies, which makes ETL development very fast. This isolation also makes it easy to clean up after failed jobs and to control the amount of disk space used. When jobs were migrated to Greenplum, some (or all) of the intermediate tables moved from Work into Greenplum, and the question arose of where to put these tables in the database.
SAS does not let us work with temporary tables in Greenplum, so one option was to use a single dedicated schema for them. This approach had several significant drawbacks:
- Automatic code generation in SAS DI Studio does not delete tables that are no longer needed. Keeping the full set of ETL helper tables in the database permanently would be too wasteful, so some cleanup procedure would have to be built into every job.
- Name conflicts between jobs running in parallel are possible.
- If anything goes wrong, it is hard to tell which job a given table belongs to.
As a result, another approach was chosen: reproduce the behavior of SAS Work in the database. Each SAS session creates a schema in Greenplum, and all intermediate tables of that session go into it. Some details:
- One such "work" schema corresponds to one SAS session. At the start of the SAS session the schema is created in Greenplum and a SAS library is assigned to it.
- The necessary information, such as the path to SAS Work, the host, the user and the process id, is encoded in the name of the created schema:
      %let LOAD_ETL_WRK_SCH_NAME=work_%substr(%sysfunc(pathname(work)),%sysfunc(prxmatch(/(?<=SAS_work)./,%sysfunc(pathname(work)))),12)_&SYSUSERID._srv%substr(&SYSHOSTNAME.,%eval(%length(&SYSHOSTNAME.)-1))_&SYSJOBID.;
- A cron job runs a cleaner for these temporary Greenplum schemas. For each "work_" schema it checks whether the corresponding SAS session still exists and, if it does not, drops the schema.
In DI Studio an ODBC library is created on the &LOAD_ETL_WRK_SCH_NAME schema, and intermediate tables are assigned to it where necessary; a sketch of the session-level setup follows.
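Roughly, the per-session setup looks like the sketch below; it assumes the greenplm engine, and the connection options and library name are placeholders rather than our production values:

    /* Create the per-session "work" schema in Greenplum via pass-through */
    proc sql;
       connect to greenplm as gpcon (server="gp-master.example.com" port=5432
                                     database=dwh user=etl_user password="XXXX");
       execute (create schema &LOAD_ETL_WRK_SCH_NAME.) by gpcon;
       disconnect from gpcon;
    quit;

    /* Assign a library on that schema; jobs put their intermediate
       tables here instead of into SAS Work */
    libname gp_work greenplm server="gp-master.example.com" port=5432
            database=dwh user=etl_user password="XXXX"
            schema=&LOAD_ETL_WRK_SCH_NAME.;

The cron cleaner then only has to list the Greenplum schemas whose names start with work_, extract the host and process id from the schema name, and drop the schemas whose SAS process no longer exists.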
Data transfer between SAS and Greenplum
With this scheme of operation, one of the main issues is the data transfer rate between SAS and Greenplum. Reading data from Greenplum into SAS always goes through the master node and is limited by the speed of the driver. The rate depends on the width and composition of the transferred table's columns; on average we get about 50 MB/s.
Loading data from SAS into Greenplum is much more interesting. Greenplum supports bulk loading from text files: an external file is declared as an external table (Greenplum reaches it through a special utility installed on the ETL hosts) and is loaded directly onto the segment hosts, bypassing the master, which greatly increases throughput. From the SAS side the process looks like this: the table is unloaded into a csv file, which Greenplum then ingests. It turned out, however, that the speed of this process depends heavily on how fast SAS can unload the table to csv. The file is written at up to 20-30 MB/s (the step is CPU-bound), while Greenplum ingests the csv at over 150 MB/s. For large tables this gave a completely unsatisfactory load speed. We got a speed-up by splitting the loaded table into parts: several parallel threads are launched, each working on its own slice of the table, unloading it to csv and inserting it into Greenplum. This raised the load speed into Greenplum to roughly 90-100 MB/s.
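The fast path described above looks roughly like the sketch below. We assume the "special utility" is gpfdist serving a directory on the ETL host; the host, port, paths and table names are hypothetical:

    /* 1. Unload the SAS table into a csv file in the directory served by gpfdist
          (assumed to be started on the ETL host as: gpfdist -d /data/gpfdist -p 8081) */
    data _null_;
       set work.big_table;
       file "/data/gpfdist/big_table.csv" dlm=';' dsd lrecl=32767;
       put (_all_) (+0);
    run;

    /* 2. Wrap the file in an external table and insert it, bypassing the master */
    proc sql;
       connect to greenplm as gpcon (server="gp-master.example.com" port=5432
                                     database=dwh user=etl_user password="XXXX");
       execute (
          create external table stg.ext_big_table (like dwh.big_table)
          location ('gpfdist://etl-host.example.com:8081/big_table.csv')
          format 'csv' (delimiter ';' null '')
       ) by gpcon;
       execute (insert into dwh.big_table select * from stg.ext_big_table) by gpcon;
       execute (drop external table stg.ext_big_table) by gpcon;
       disconnect from gpcon;
    quit;

The parallel variant runs several such unload-and-insert pairs at once, each over its own slice of the source table; a gpfdist location can also point at a file mask, so the chunks written by parallel threads can be picked up by a single external table.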
ETL primitives
To work in DI Studio we had to rewrite some transforms, since the standard ones generated code that was either suboptimal or outright wrong; these were Table Loader and SCD Type 2 Table Loader. In some places the move of tables to Greenplum forced us to edit the jobs themselves: the standard "Lookup" transform, for example, works inefficiently when both of its input tables are in the database.
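For the Lookup case, one way to avoid pulling both tables through the driver is to push the join down to Greenplum as pass-through SQL; this is only a sketch of the idea, with hypothetical schema, table names and distribution key, not the code DI Studio generates:

    proc sql;
       connect to greenplm as gpcon (server="gp-master.example.com" port=5432
                                     database=dwh user=etl_user password="XXXX");
       execute (
          create table &LOAD_ETL_WRK_SCH_NAME..fact_enriched as
          select f.*, d.client_name
          from &LOAD_ETL_WRK_SCH_NAME..fact_stage f
          left join dwh.dim_client d
            on d.client_id = f.client_id
          distributed by (client_id)
       ) by gpcon;
       disconnect from gpcon;
    quit;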
Instead of a conclusion
This article has covered the main tasks we had to solve during the migration. Much remains beyond its scope: problems with hash joins, tables being locked by user queries, partitioning and compression. If there is interest, we will describe these in more detail in future posts.