How we automated the assembly and roll-out of metadata-model artifacts for repository tables with Jenkins

    It all started when we faced the need to quickly and correctly generate EDWEX, JSON and DDL structures and then roll them out to different environments of our relational databases. By environments I mean the familiar abbreviations: DEV, TST, UAT, PRD.



    At that time we did almost everything by hand: we generated the DDL and assembled the EDWEX and JSON files based on metadata from an Oracle database. There are many input parameters; miss one and the entity is formed incorrectly. And since the whole generation process was sequential and continuous, an error would only surface at the very end. Read on to find out how we automated all of this and got rid of the errors.

    A little about the infrastructure


    Before we can load data into relational database tables, we need to receive it from the source, in whatever format it arrives, for example Excel. Data from the source is transported to Hadoop (the Data Lake) by an in-house mechanism. Hive is installed on top of Hadoop: with its help we can view the contents of the files using an SQL-like syntax.

    To transfer the data into the Data Lake as tables we need JSON files, which are generated from the metadata. To parse the data in Excel format we need EDWEX files. EDWEX (EDW_EXTRACTOR) files are a set of artifacts that contain the set of tables, with their names, and the set of fields for each of those tables, all derived from the metadata. The field set differs depending on the model version and the source ID. DDL generation is needed to create the tables themselves: in the Hive database at the operational data level and in the Greenplum database at the detailed and aggregated data levels. In other words, the data first lands in Hive, is filtered if necessary and is then transferred to Greenplum for further manipulation and for building data marts on top of it.

    Example of EDWEX artifacts
    pack - contains the set of tables
    data - contains the set of fields

    pack.edwex:

    1	Table_1	User Table_1 bid between to_char($fromdt,'yyyymm') and to_char($actualdt,'yyyymm')
    2	Table_2	User Table_2 curbid between to_char($fromdt,'yyyymm') and to_char($actualdt,'yyyymm')
    3	Table_3	User Table_3 bid between to_char($fromdt,'yyyymm') and to_char($actualdt,'yyyymm')


    data.edwex:

    1	1	CHARGE_ID	NUMBER	38	0
    1	2	SVC_ID	NUMBER	38	0
    1	3	VND_ID	NUMBER	38	0
    1	4	PRICE	NUMBER	38	5
    1	5	QUANTITY	NUMBER	38	5
    1	6	BASE	NUMBER	38	5
    1	7	TAX	        NUMBER	38	5
    1	8	TOTAL	NUMBER	38	5
    1	9	TAX_RATE	NUMBER	38	5
    1	10	TAX_IN	VARCHAR	1
    1	11	CHARGE_KIND	VARCHAR	3
    1	12	PRIVILEGE_ID	NUMBER	38	0
    1	13	CHARGE_REF_ID	NUMBER	38	0
    1	14	EBID	NUMBER	38	0
    1	15	INVOICE_ID	NUMBER	38	0
    1	16	ZERO_STATE_ID	NUMBER	38	0
    1	17	USER_ID	NUMBER	38	0
    1	18	BID	NUMBER	38	0
    1	19	QUANTITY_REAL	NUMBER	38	5
    2	1	CURBID	NUMBER	38	0
    2	2	USER_ID	NUMBER	38	0
    2	3	VND_ID	NUMBER	38	0
    2	4	APPBID	NUMBER	38	0
    2	5	SVC_ID	NUMBER	38	0
    2	6	DEBT	NUMBER	38	5
    2	7	INSTDEBT	NUMBER	38	5
    3	1	INVOICE_ID	NUMBER	38	0
    3	2	INVOICE_DATE	DATE
    3	3	INVOICE_NUM	VARCHAR	64
    3	4	INVOICE_NUM_N	NUMBER	38	5
    3	5	BASE	NUMBER	38	5
    3	6	TAX	NUMBER	38	5
    3	7	TOTAL	NUMBER	38	5
    3	8	PREPAID	VARCHAR	1
    3	9	EXPLICIT	VARCHAR	1
    3	10	VND_ID	NUMBER	38	0
    3	11	ADV_PAYMENT_ID	NUMBER	38	0
    3	12	MDBID	NUMBER	38	0
    3	13	BID	NUMBER	38	0
    3	14	USER_ID	NUMBER	38	0
    3	15	ZERO_STATE_ID	NUMBER	38	0
    3	16	ACTIVE_SUM	NUMBER	38	5
    3	17	SPLIT_VND	NUMBER	38	5
    3	18	PRECREATED	VARCHAR	1


    JSON artifact example
    Table.json:
    {
    "metadata": [
    {
    "colOrder":"1",
    "name":"charge_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"true",
    "keyFile":""
    },
    {
    "colOrder":"2",
    "name":"svc_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"3",
    "name":"vnd_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"4",
    "name":"price",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"5",
    "name":"quantity",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"6",
    "name":"base",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"7",
    "name":"tax",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"8",
    "name":"total",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"9",
    "name":"tax_rate",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"10",
    "name":"tax_in",
    "dataType":"STRING",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"11",
    "name":"charge_kind",
    "dataType":"STRING",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"12",
    "name":"privilege_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"13",
    "name":"charge_ref_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"14",
    "name":"ebid",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"15",
    "name":"invoice_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"16",
    "name":"zero_state_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"17",
    "name":"user_id",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"18",
    "name":"bid",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    },
    {
    "colOrder":"19",
    "name":"quantity_real",
    "dataType":"DECIMAL",
    "precision":"0",
    "requied":"false",
    "keyFile":""
    }
      ],
    "ctlPath":"путь до ctl",
    "errPath":"путь до файла логирования ошибок",
    "inPath":"путь для формирования в hdfs",
    "outSchema":"схема БД",
    "outTable":"наименование таблицы",
    "isPartitioned":"параметр партиционирования",
    "sourceId":"id системы источника"
    }


    DDL artifact example
    alter table scheme.GP_000001_TABLE rename to Z_GP_000001_TABLE_20180807;
    create table scheme.GP_000001_TABLE
    (

    `charge_id`   DECIMAL(38,0),   
    `svc_id`   DECIMAL(38,0),   
    `vnd_id`   DECIMAL(38,0),   
    `price`   DECIMAL(38,5),  
    `quantity`   DECIMAL(38,5),  
    `base`   DECIMAL(38,5),   
    `tax`   DECIMAL(38,5),   
    `total`   DECIMAL(38,5),   
    `tax_rate`   DECIMAL(38,5),   
    `tax_in`   STRING,
    `charge_kind`   STRING,
    `privilege_id`   DECIMAL(38,0),
    `charge_ref_id`   DECIMAL(38,0),
    `ebid`   DECIMAL(38,0),   
    `invoice_id`   DECIMAL(38,0),  
    `zero_state_id`   DECIMAL(38,0),
    `user_id`   DECIMAL(38,0),  
    `bid`   DECIMAL(38,0),   
    `quantity_real`   DECIMAL(38,5),
    `load_dttm`   TIMESTAMP,
    `src_id`   SMALLINT,
    `package_id`   BIGINT,
    `wf_run_id`   BIGINT,
    `md5`   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 'path for generation in hdfs'
    TBLPROPERTIES (  'auto.purge'='true',  'transient_lastDdlTime'='1489060201');
    insert into scheme.GP_000001_TABLE(    `charge_id`,
       `svc_id`,
       `vnd_id`,
       `price`,
       `quantity`,
       `base`,
       `tax`,
       `total`,
       `tax_rate`,
       `tax_in`,
       `charge_kind`,
       `privilege_id`,
       `charge_ref_id`,
       `ebid`,
       `invoice_id`,
       `zero_state_id`,
       `user_id`,
       `bid`,
       `quantity_real`,
      `load_dttm`,
      `src_id`,
      `package_id`,
      `wf_run_id`,
      `md5`)

    select     `charge_id`,
       `svc_id`,
       `vnd_id`,
       `price`,
       `quantity`,
       `base`,
       `tax`,
       `total`,
       `tax_rate`,
       `tax_in`,
       `charge_kind`,
       `privilege_id`,
       `charge_ref_id`,
       `ebid`,
       `invoice_id`,
       `zero_state_id`,
       `user_id`,
       `bid`,
       `quantity_real`,
    load_dttm,
    src_id,
    package_id,
    wf_run_id,
    md5 from scheme.Z_GP_000001_TABLE_20180807;


    How to automate


    To solve the problem, we used:

    • Jenkins - as an orchestrator and a tool for implementing the CI process.
    • Python - the language in which all the functionality and the unit tests are implemented.
    • SQL - for accessing the metadata database.
    • Shell scripts - for copying artifacts between directories and for the launch scripts of the multijob projects running on the Jenkins server.

    First of all: we work with a large number of sources, so source IDs are passed as launch parameters via a shell script to SQL functions that identify them. These SQL functions are then executed automatically in the metadata database. Based on the available metadata, they produce a file with a list of new SQL functions, one for each source table, which we can subsequently call from the executable program. The result of executing these generated functions is the DDL, JSON or EDWEX content returned through an output parameter.
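    As an illustration, here is a minimal Python sketch of this step, assuming the metadata lives in Oracle and is reached through cx_Oracle. The connection string and the function names (meta.get_edwex, meta.get_json, meta.get_ddl) are hypothetical; the real functions are generated from the metadata model.

    import cx_Oracle  # Oracle driver; the metadata database is Oracle

    # Hypothetical connection string, for illustration only.
    METADATA_DSN = "user/password@metadata-db:1521/orcl"

    def generate_artifacts(source_id: int, model_version: str) -> dict:
        """Call the metadata functions for one source and collect the generated artifacts."""
        artifacts = {}
        with cx_Oracle.connect(METADATA_DSN) as conn:
            cur = conn.cursor()
            # Each function returns the artifact text through its return (output) parameter.
            for kind, func in [("edwex", "meta.get_edwex"),
                               ("json", "meta.get_json"),
                               ("ddl", "meta.get_ddl")]:
                artifacts[kind] = cur.callfunc(func, cx_Oracle.CLOB,
                                               [source_id, model_version]).read()
        return artifacts

    if __name__ == "__main__":
        print(generate_artifacts(1, "1.0")["ddl"][:200])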

    Then Python comes in: all the executable functionality and the unit tests are written in it. Before any of the artifact roll-out modules is run, unit tests verify that the parameters for the Python scripts have been passed correctly. The tests check not only that the input parameters are valid, but also that they exist in the metadata module, along with the size of the created files and their creation dates. The tests also track how many new artifacts have been created and how many already exist. This way we save server resources, because we only roll out new files and do not install existing models again.
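    A simplified sketch of what such pre-flight checks might look like. The parameter names, file paths and the metadata lookup are assumptions made for the example, not the real test suite.

    import os
    import unittest

    # Hypothetical launch parameters, as they would arrive from the Jenkins job.
    PARAMS = {"source_id": "000001", "model_version": "1.0", "env": "DEV"}

    def source_exists_in_metadata(source_id: str) -> bool:
        # Placeholder for a real query against the metadata database.
        return source_id.isdigit()

    class PreRunChecks(unittest.TestCase):
        def test_required_params_present(self):
            for name in ("source_id", "model_version", "env"):
                self.assertIn(name, PARAMS)
                self.assertTrue(PARAMS[name], f"parameter {name} is empty")

        def test_source_known_to_metadata(self):
            self.assertTrue(source_exists_in_metadata(PARAMS["source_id"]))

        def test_generated_files_not_empty(self):
            # Generated artifacts must exist and be non-empty before roll-out.
            for path in ("out/pack.edwex", "out/data.edwex"):
                if os.path.exists(path):
                    self.assertGreater(os.path.getsize(path), 0)

    if __name__ == "__main__":
        unittest.main()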

    Only after all checks have passed successfully does the Python program run that creates the required artifacts and places the result into the appropriate project folders on the server. Python not only distributes the generated JSON files into directories, it also lays them out in the Data Lake so that the data loads correctly. The generated DDL artifacts are not only stored for later analysis and reuse, they can also be installed immediately into the database according to the new models and structures specified in the metadata module. This lets us create hundreds of tables in a short time without any manual labor.
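    A minimal sketch of this distribution and installation step, assuming the directory layout shown and that the DDL is applied to Hive via beeline; the paths and the JDBC URL are invented for the example.

    import shutil
    import subprocess
    from pathlib import Path

    # Hypothetical layout: where the generator writes artifacts and where the jobs expect them.
    BUILD_DIR = Path("build/000001")
    JSON_TARGET = Path("/project/json/000001")
    DDL_TARGET = Path("/project/ddl/000001")

    def lay_out_artifacts() -> None:
        """Distribute the generated artifacts into the project folders."""
        JSON_TARGET.mkdir(parents=True, exist_ok=True)
        DDL_TARGET.mkdir(parents=True, exist_ok=True)
        for json_file in BUILD_DIR.glob("*.json"):
            shutil.copy2(json_file, JSON_TARGET / json_file.name)
        for ddl_file in BUILD_DIR.glob("*.ddl"):
            shutil.copy2(ddl_file, DDL_TARGET / ddl_file.name)

    def install_ddl(ddl_file: Path, jdbc_url: str) -> None:
        """Apply a DDL script to Hive via beeline (the connection URL is an assumption)."""
        subprocess.run(["beeline", "-u", jdbc_url, "-f", str(ddl_file)], check=True)

    if __name__ == "__main__":
        lay_out_artifacts()
        for ddl in sorted(DDL_TARGET.glob("*.ddl")):
            install_ddl(ddl, "jdbc:hive2://hive-server:10000/default")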

    And where is Jenkins?


    Jenkins comes in when all of these processes need to be controlled visually through an interface.

    This tool was chosen because it:

    • fully covers our requirements for automating assembly and installation
    • lets us design the artifact assembly mechanism with autotesting built in
    • makes it easy even for someone far from programming to run jobs and monitor their execution
    • lets us set up logging so that the result of a run is clear to anyone on the team: either a problem in the build is clearly flagged, or the process completes successfully.

    To solve these tasks we created several multijob projects. We chose this project type because it can launch several jobs at once and run them in parallel. Each job is responsible for its own block of functionality, so the sequential process of producing artifacts was replaced with independent parallel ones. Everything runs separately: the generation of EDWEX, JSON and DDL, the creation of the structures in Hive, and the installation of the table structures in the database. We analyze the result obtained at each stage of artifact formation and launch the subsequent actions only on success (a conceptual sketch of this staging follows below).
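    The same staging logic, expressed as a small Python sketch. In reality the orchestration is done by the Jenkins jobs themselves; the stage functions here are hypothetical stand-ins.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical stage functions; each returns True on success.
    def build_edwex() -> bool: return True
    def build_json() -> bool: return True
    def build_ddl() -> bool: return True
    def install_to_db() -> bool: return True

    def run_pipeline() -> bool:
        # Independent artifact-generation stages run in parallel...
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda f: f(), [build_edwex, build_json, build_ddl]))
        # ...and installation is launched only if every stage succeeded.
        return all(results) and install_to_db()

    if __name__ == "__main__":
        print("pipeline succeeded" if run_pipeline() else "pipeline failed")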

    The Jenkins part is implemented without any special tricks. The jobs take String or Run parameters as input to launch the Python code. A String parameter is simply an input field where a value of type str is entered before launch. A Run parameter can be passed on the fly to another job for execution; all you need to do is specify which project the variable should be picked up from. A separate parameter selects the node to execute on: this is how the split between the DEV, TST, UAT and PRD runtime environments is implemented. Separate jobs commit the generated EDWEX files to SVN with a revision number, so that versions of the modified structures can be tracked.
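    On the Python side the job parameters could be consumed roughly like this; the parameter names are assumptions for illustration only.

    import argparse

    def parse_args() -> argparse.Namespace:
        # Values are supplied by the Jenkins job as String/Run parameters.
        parser = argparse.ArgumentParser(description="Build metadata-model artifacts")
        parser.add_argument("--source-id", required=True, help="source system ID")
        parser.add_argument("--model-version", required=True, help="metadata model version")
        parser.add_argument("--env", choices=["DEV", "TST", "UAT", "PRD"], required=True,
                            help="target environment")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print(f"building artifacts for source {args.source_id}, "
              f"model {args.model_version}, environment {args.env}")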

    Jenkins interface example:



    The result of running the jobs is the creation and installation of the required artifacts, their commit to SVN, and the generation of an HTML report showing whether the unit tests passed and how the assembly and installation of the artifacts went. The jobs can be run individually by hand or automatically, with a chain of execution built in advance.


    Architecture of the assembly and installation mechanism

    To sum up


    A lot of work went into automating artifact generation. Previously you had to log on to the server by hand, run shell scripts and then spend a long time studying and editing the data manually. Now it is enough to press the start button and specify the source system ID, the model version and the target environment. With Jenkins we managed to structure the whole mechanism for assembling and installing artifacts and split it into independent stages. The necessary checks were added before artifact generation and integration are launched. The resulting artifacts are automatically committed to SVN, which makes it easier for the neighbouring teams of system analysts and data modelers to work with them. Checks were implemented to avoid idle runs of artifact generation and to confirm the correctness of the results.

    As a result, we have cut the labour-intensive process of assembling and installing model artifacts from several hours to a couple of minutes. And most importantly, we have eliminated the human-factor errors that inevitably crept into a complex, routine process.
