Universal Soldier: Groovy Transformer in DataStage
The ETL capabilities of IBM DataStage cover a fairly wide range of requirements that arise in data integration tasks, but, sooner or later, there is a need to expand functionality by implementing Parallel Routines in C or creating Java classes that are later used in Java Transformer or Java Client. The rather limited capabilities of the built-in Basic language are outdated a long time ago and cannot be regarded as a serious help (for example, it is impossible to use XML structures, or, for another example, try to write MD5 hashing using Basic. This is possible, but it will take considerable time to develop and debug )
Anyway, I would like to have a fairly flexible tool that allows you to work with the data stream, which does not require recompilation of its source codes and which could be used in the DataStage Client editor. My colleague and close friend was asked to develop the Groovy Transformer. About him and will be discussed in this article.
Why groovy Because this language is quite flexible and has all the features of Java, as it is an add-on for this language, but in addition it offers developers the following advantages:
The basic idea behind Groovy Transformer is to use Groovy code in Java Transformer. Since the code can be written directly in this stage, you can decide for yourself what to do - execute the Groovy code, which is stored in the file, comes from the parameters of the job or which you will write yourself.
Therefore, we need to learn how to create a Java Transformer. Those who already know how this is implemented can skip this section. But I will try briefly, since the documentation on this part is written in sufficient detail.
So, to create a Java transformer, we need to create a class that inherits from the Stage class:
And you need to implement the three most commonly used methods: initialize (), process () and terminate ().
The initialize () method is executed before the stage processes the stream lines and may contain declarations of objects that you intend to use throughout the life of the transformer.
The process () method is executed for each line of the input stream and should contain the logic of your processing.
The terminate () method is executed at the end of the existence of the transformer and it may contain actions to delete temporary objects (yes, I know that there are no destructors in Java, I mean any garbage that you used: files, tables, you never know).
Note for parallel transformer mode: DataStage launches a separate Java machine for each node. In other words, if you have four nodes, then DataStage will launch four JVMs. Since virtual machines are isolated, you do not have acceptable methods for exchanging data between threads running in each of them.
Now we are ready to create a template for our Java transformer:
To read the lines entering and leaving the transformer, you can use the Row object and two methods: readRow () to access the values of the input stream and writeRow () to write to the output.
The Row object also provides the ability to retrieve the metadata of each column and allows you to obtain the values of these columns. The following example demonstrates how you can replace the values of all columns that are of type VarChar with the value “Hello from the Java”, all other columns “push” without changes further:
Note : To compile the transformer class, be sure to import the tr4j.jar library into the class path or in your IDE.
Now we can formulate the requirements for our Groovy Transformer.
Groovy Transformer is a JavaTransformer that compiles Groovy code on the fly. Syntactic sugar has been added to it to facilitate the routine operations that have to be performed when working with the Stage class.
So, our transformer should:
In accordance with these requirements, Groovy Transformer was created, which you can download here: http://geckelberryfinn.ru/fr/GroovyTransformer.html . (Caution! This Java Transformer is also written in Groovy =), there will be problems with decompilation).
Groovy Transformer predefines the following objects:
Thus, there are two ways to create an output stream:
Which method to use depends on the specific situation.
To start using Groovy Transformer in your projects, it would be nice to populate the properties of the Java Transformer stage:
Below I will give some examples of the use of Groovy Transformer:
Some useful links:
Anyway, I would like to have a fairly flexible tool that allows you to work with the data stream, which does not require recompilation of its source codes and which could be used in the DataStage Client editor. My colleague and close friend was asked to develop the Groovy Transformer. About him and will be discussed in this article.
Why groovy Because this language is quite flexible and has all the features of Java, as it is an add-on for this language, but in addition it offers developers the following advantages:
- Native syntax for hashes (associative arrays) and lists, instead of the following Java code
import java.util.* ; … HashMap<String,String> someMap=new HashMap<String,String>() ; someMap.put("Key1", "Value") ; … String valFromMap=someMap.get("Key1") ;
you can use the following Groovy code (java.util. * is already imported by default in it):HashMap someMap=new HashMap(); someMap.Key1= "Value" ; … valFromMap=someMap.Key1;
- Interpreting strings (why is everyone calling it interpolation?):
def NumberOfItems= 10; println "Number of items=$NumberOfItems";
- Dynamic code execution, you can execute code stored in a string variable, in a stream, or in files without compilation:
def GroovyCode='sum=a+b; println "Sum is $sum"'; Binding binding=new Binding(); binding.setVariable("a", 10); binding.setVariable("b", 20); GroovyShell shell = new GroovyShell(binding); shell.evaluate(GroovyCode);
- Native XML and JSON support:
def writer=new StringWriter() ; def xmlOut=new groovy.xml.MarkupBuilder(writer) ; xmlOut.JobsInfo{ Job(name : 'ODS_MOUVEMENT_C') { precedants { precedant "ODS_ECRITURE_C" precedant "ODS_TEC_DEB" } } } println writer;
This example should print the following XML:<JobsInfo><Jobname='ODS_MOUVEMENT_C'><precedants><precedant>ODS_ECRITURE_C</precedant><precedant>ODS_TEC_DEB</precedant></precedants></Job></JobsInfo>
if you want to generate JSON, you need to replace groovy.xml.MarkupBuilder with groovy.json.JsonBuilder.
The basic idea behind Groovy Transformer is to use Groovy code in Java Transformer. Since the code can be written directly in this stage, you can decide for yourself what to do - execute the Groovy code, which is stored in the file, comes from the parameters of the job or which you will write yourself.
Therefore, we need to learn how to create a Java Transformer. Those who already know how this is implemented can skip this section. But I will try briefly, since the documentation on this part is written in sufficient detail.
So, to create a Java transformer, we need to create a class that inherits from the Stage class:
import com.ascentialsoftware.jds.* ;
classMyJavaTransformerextendsStage{
}
And you need to implement the three most commonly used methods: initialize (), process () and terminate ().
The initialize () method is executed before the stage processes the stream lines and may contain declarations of objects that you intend to use throughout the life of the transformer.
The process () method is executed for each line of the input stream and should contain the logic of your processing.
The terminate () method is executed at the end of the existence of the transformer and it may contain actions to delete temporary objects (yes, I know that there are no destructors in Java, I mean any garbage that you used: files, tables, you never know).
Note for parallel transformer mode: DataStage launches a separate Java machine for each node. In other words, if you have four nodes, then DataStage will launch four JVMs. Since virtual machines are isolated, you do not have acceptable methods for exchanging data between threads running in each of them.
Now we are ready to create a template for our Java transformer:
import com.ascentialsoftware.jds.*;
publicclassMyJavaTransformerextendsStage{
publicvoidinitialize(){
trace("Init");
}
publicvoidterminate(){
trace("Terminate");
}
publicintprocess(){
return0;
}
}
To read the lines entering and leaving the transformer, you can use the Row object and two methods: readRow () to access the values of the input stream and writeRow () to write to the output.
The Row object also provides the ability to retrieve the metadata of each column and allows you to obtain the values of these columns. The following example demonstrates how you can replace the values of all columns that are of type VarChar with the value “Hello from the Java”, all other columns “push” without changes further:
publicintprocess(){
Row inputRow=readRow() ;
if (inputRow == null)
//нет больше строк в потокеreturn OUTPUT_STATUS_END_OF_DATA;
Row outputRow=createOutputRow();
for (int i=0;i<inputRow.getColumnCount();i++) {
Object column=inputRow.getValueAsSQLTyped(i);
if (column instanceof java.lang.String)
outputRow.setValueAsSQLTyped(i, “Hello from Java”);
else
outputRow.setValueAsSQLTyped(i, column);
}
writeRow(outputRow);
}
Note : To compile the transformer class, be sure to import the tr4j.jar library into the class path or in your IDE.
Now we can formulate the requirements for our Groovy Transformer.
Groovy Transformer is a JavaTransformer that compiles Groovy code on the fly. Syntactic sugar has been added to it to facilitate the routine operations that have to be performed when working with the Stage class.
So, our transformer should:
- Get Groovy code from the Stage-> Properties tab of the Java Transformer stage;
- Allow access to metadata of input and output streams;
- Allow to manipulate column data by their name (instead of column number).
In accordance with these requirements, Groovy Transformer was created, which you can download here: http://geckelberryfinn.ru/fr/GroovyTransformer.html . (Caution! This Java Transformer is also written in Groovy =), there will be problems with decompilation).
Groovy Transformer predefines the following objects:
An object | Description | Example |
---|---|---|
GTransformer | Object Link to this Stage class. Contains all the methods and attributes of this class. | GTransformer.createOutputRow () |
Outputmatching | HashMap Contains matching column names and their indices. | OutputMatching.get (k); OutputMatching.ID; OutputMatching.LIBL; |
Meta data | HashMap. Contains information about the methodical columns of the input stream. | MetaData.ID.Description; MetaData.ID.Derivation; MetaData.ID.SQLType: MetaData.ID.DataElementName; |
OutputMetaData | HashMap Contains information on methodical output stream | OutputMetaData.ID.Description; OutputMetaData.ID.Derivation; OutputMetaData.ID.SQLType: OutputMetaData.ID.DataElementName; |
InputColumns | HashMap Contains all input stream columns | InputColumns.ID; InputColumns.LIBL; |
OutputRows | List <, HashMap>. List of output stream lines. You can use this object when the number of lines coming out is greater than the number of incoming lines. | HashMap curRow = new HashMap (); outputRows [0] = curRow; outputRows [0] .ID = 0; outputRows [0] .LIBL = "First item"; |
Thus, there are two ways to create an output stream:
- Fill in the list of OutputRows;
- Call the createOutputRow () method, and then writeRow to the GTransformer object.
Which method to use depends on the specific situation.
To start using Groovy Transformer in your projects, it would be nice to populate the properties of the Java Transformer stage:
- Transformer Class Name: groovytransformer.GroovyTransformer
- User's Classpath: /path/to/jar/GroovyTransformer.jar
Below I will give some examples of the use of Groovy Transformer:
- Easy data retrieval from JDBC source.
In this example, we will extract data from the ODS.AXE_LOCAL table using Oracle as a DBMS, although there is a separate stage for it. But, in fact, this code can be used in cases when it is necessary to obtain data from "exotic data sources" (Interbase, FoxPro, VectorWise, Derby, SQLite, H2 etc.).
This Groovy code is quite universal, since it does not use specific information except for SQL queries. In other words, it can be used in any cases.
To run this example, do not forget to add the list of columns with case-sensitive names in the “Output” tab of the transformer.sql = Sql.newInstance( 'jdbc:oracle:thin:@oraclehost:1533/nomdebdd', 'user', 'password', 'oracle.jdbc.OracleDriver' ); sql.eachRow('select id, LIBL_AXE_CENTRAL, SYS_CREATED_DAT as DAT_DEB_VALID from ODS.ODS_AXE_CENTRAL') { it -> GRow=GTransformer.createOutputRow(); i=0; it.toRowResult().each{k,v-> i++; if (v instanceof java.sql.Date) GRow.setValueAsSQLTyped(OutputMatching.get(k),it.getTimestamp(i)); else GRow.setValueAsSQLTyped(OutputMatching.get(k),v) ; } GTransformer.writeRow(GRow); }
- Using regular expressions.
In this example, we will try to extract information about the contract (its number and date) from lines of the form:
“Contract FU-1009 dated 10/01/1960”
“Service Agreement FU-1019 for 10/01/1961”
“ Contact number FU-1001 dated 10/01/1962 ”
and it is advisable for us to get two columns at the output: the number of the contract and the date of its conclusion.
In this case, you can use the following regular expressions:- For contract number: [AZ] + (- [AZ]) + (- [0-9] +)
- For its date: [0-3]? [0-9] (/ | \\.) [0-1]? [0-9] (/ | \\.) (19 | 20)? [0-9 ] {2}
Then for this example, Groovy code might be like this (note that another way to generate the output stream is used):String StrIn=InputColumns.contrat contRegExp="[A-Z]+(-[A-Z])+(-[0-9]+)" datRegExp="[0-3]?[0-9](/|\\.)[0-1]?[0-9](/|\\.)(19|20)?[0-9]{2}"def matchRegEx(str, regExp){ matcher=(str=~regExp); if (matcher.getCount()!=0) return matcher[0][0]; elsereturnnull; } def cont=matchRegEx(StrIn, contRegExp); def dat =matchRegEx(StrIn, datRegExp); HashMap curRow=new HashMap(); OutputRows[0]=curRow; OutputRows[0].contrat=StrIn; OutputRows[0].numero=cont; OutputRows[0].date=dat;
Some useful links:
- Groovy Handbook: http://groovy.codehaus.org/Documentation
- Java Transformer example: http://www.ibm.com/developerworks/data/library/techarticle/dm-1106etljob/index.html