Converting data, or digging deeper into Talend Open Studio

    The morning was quite ordinary and lazy: shower, coffee, a cigarette... Time to get to work.
    On arriving at the office, I was greeted by the news that I was moving to a new project as an ETL engineer (I had no idea what that meant, but oh well). Fine, I thought, let's give it a try. One guy had worked on the project before me, but, as usual, I couldn't count on much help from him. So, let's get started.

    What is ETL? Here's what the wiki says:
    ETL (Extract, Transform, Load) is one of the main processes in data warehouse management, which includes:
    • extracting data from external sources;
    • transforming and cleansing it (data cleansing) so that it fits the business model;
    • loading it into the data warehouse.

    With that more or less clear, the essence of my task was: extract data from a certain .xls document, convert the fields by type (most were defined as string), calculate something (depending on the specific case), and write the whole thing into a database after reducing it to a star schema (you can read what that is on the same wiki).
    So, the task was sorted out. Talend Open Studio was used for the work; I dug up an example of how and what to do, and the long, boring grind of copy-paste began.

    A simplified example of converting data and writing it to the database looks like this:
    • A project is created;
    • A data source is added to it (in our case the .xls document);
    • A data target is added (a table in MySQL);
    • A new Job is created;
    • The source and target are dropped into it, along with a tMap component;
    • In tMap, a conversion is specified for each field (for example, casting to Int or Float, stripping extra characters such as "%", and so on);
    • The Job is run and we look at what came out.

    This is how a simple Job looks visually in TOS:
    [TOS Job screenshot]

    There were cases where I had to write rather long validity checks on the data, parse it, and so on. After several such tasks and the annoying copy-paste, the idea of optimizing the process matured: it turned out that this IDE lets you write your own static public classes (routines). The first class, for data conversion, was written in a couple of minutes, and the amount of manual labor immediately dropped significantly. Where before you had to copy-paste an expression like the following every time, afterwards everything was done with a single routine call:

    obj == null || obj.toString().equals("#") || obj.toString().equals("") ? null :
    Integer.parseInt(StringHandling.EREPLACE(obj.toString().replace(" ", "").substring(0, obj.toString().replace(" ", "").indexOf(".")), "\\xA0", ""))
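To illustrate, here is a minimal sketch of such a routine as a static class. The class name ParseUtils, the method name, and the exact cleanup rules are my assumptions for the example, not the author's actual routine:

```java
// Hypothetical Talend routine: a static helper that turns a messy cell
// value like "1 234.56" or "1\u00A0234" into an Integer, or null for
// empty markers. In tMap the whole conversion becomes one call.
public class ParseUtils {

    public static Integer toIntOrNull(Object obj) {
        if (obj == null) {
            return null; // null check first, to avoid an NPE on toString()
        }
        // strip ordinary and non-breaking spaces
        String s = obj.toString().replace(" ", "").replace("\u00A0", "");
        if (s.equals("#") || s.equals("")) {
            return null; // empty-value markers in the source file
        }
        int dot = s.indexOf('.');
        if (dot >= 0) {
            s = s.substring(0, dot); // drop the fractional part
        }
        return Integer.parseInt(s);
    }
}
```

In a tMap expression this would be called as, say, `ParseUtils.toIntOrNull(row1.someField)`.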


    But, as they say, it's not that simple, especially if you have to perform operations across rows, and all the more so if you need data from the previous row: in Talend Open Studio data is passed row by row, not as a whole array, and getting access to the previous row is not simple at all. In fact, I never found a way other than using buffer components.

    More specifically, the task was the following: we have a set of rows with the fields "Year", "Quarter", "StartsFromJan1", "SomeValue1", "SomeValue2", "SomeValue3", where the StartsFromJan1 field can take the value "Y" (true) or "N" (false).

    If StartsFromJan1 is "Y", the row must be processed according to a set of rules; if "N", it is left as is. For example, the processing can work like this: from the value in the current row we subtract the corresponding value of the previous row (if we are at Q4, we subtract Q3). If that previous row also has StartsFromJan1 == "N", we subtract one more row back, and so on until we reach Q1; but if the row we took has StartsFromJan1 == "Y", we stop at it. For clarity, here is an example with data:

    In our case the first row stays unchanged, from the second we subtract the first, the third we don't touch, and from the fourth we subtract the third and the second.

    While developing the task, the following problems came up:
    • calculating the data when the "StartsFromJan1" field is "Y";
    • sifting out blank rows;
    • minimizing manual work.

    To solve the first problem, the same static classes came to the rescue. I used three one-dimensional arrays (three, because the row where Quarter == Q1 always has "N" in StartsFromJan1), plus an int variable that was incremented after each processed cell and reset to zero after the last cell of a row. Next, a method was created that converted the input value and wrote it into the array; it also reset the array on the transition from Q4 back to Q1. The last method pulled data from the array: it checked whether the input value needed to be recalculated, called the method that appends to the array, and returned either the recalculated or the original value (depending on StartsFromJan1).
    Then, in the tMap component, this method was called with the values of "StartsFromJan1", "Quarter", "someValue"...
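The logic described above can be sketched roughly like this. All names are hypothetical, and the sketch handles a single value column where the real routine kept three arrays, one per column; it is fed one row at a time, in Q1..Q4 order, the way tMap would call it:

```java
import java.util.ArrayList;
import java.util.List;

// Rows with flag "N" hold plain quarterly values and pass through unchanged.
// Rows with flag "Y" hold values accumulated from an earlier quarter, so we
// subtract the raw values of the preceding quarters, walking back until we
// hit the previous "Y" row (which we also subtract, then stop) or run out
// of rows at Q1.
public class QuarterlyDelta {
    private final List<Double> rawValues = new ArrayList<>();
    private final List<Boolean> fromJan1 = new ArrayList<>();

    // Feed one row; returns the corrected quarterly value.
    public double process(String startsFromJan1, double value) {
        boolean cumulative = "Y".equals(startsFromJan1);
        double result = value;
        if (cumulative) {
            for (int i = rawValues.size() - 1; i >= 0; i--) {
                result -= rawValues.get(i);
                if (fromJan1.get(i)) {
                    break; // that "Y" row already covers earlier quarters
                }
            }
        }
        rawValues.add(value);
        fromJan1.add(cumulative);
        if (rawValues.size() == 4) {
            // Q4 processed: reset state for the next year's Q1
            rawValues.clear();
            fromJan1.clear();
        }
        return result;
    }
}
```

For the example above (Q1=10 "N", Q2=30 "Y", Q3=5 "N", Q4=42 "Y"), this yields 10, 20, 5, 7: the second row has the first subtracted, and the fourth has the third and the second subtracted.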

    The problem with empty rows... At first we simply limited the number of rows read from the file, but that solution was temporary, since the amount of data grows over time. The solution came in the form of the tFilterRow component, in which you simply specify a filtering expression, for example Year != null && !Year.equals(""); all rows that met the condition were then run through tMap.
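A plain-Java sketch of such a filter predicate (the helper class and method names are hypothetical; in tFilterRow the boolean expression is typed directly into the component against the row's fields):

```java
// A row is meaningful only when its Year field is present and non-blank.
// Note the && (a || of those two checks would always be true), and
// equals() rather than != for string comparison in Java.
public class RowFilter {
    public static boolean isMeaningful(String year) {
        return year != null && !year.trim().equals("");
    }
}
```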

    After the optimizations described above, the development time for one data set dropped by a couple of orders of magnitude, and the amount of work also shrank considerably, although to a large extent it was all driven by a great unwillingness to keep doing the same long, repetitive work, especially when it comes to copy-paste.
