From daily crashes to stability: Informatica 10 through the eyes of an admin
The ETL component of a data warehouse often sits in the shadow of the warehouse itself and gets less attention than the main database or the front-end components (BI, reporting). Yet in terms of the mechanics of filling the warehouse with data, ETL plays a key role and deserves no less attention from administrators than the other components. My name is Alexander, I currently administer ETL at Rostelecom, and in this article I will try to share a little of what the administrator of one of the best-known ETL systems has to deal with in Rostelecom's large data warehouse.
If you are already familiar with our data warehouse project and with the Informatica PowerCenter product in general, you can skip to the next section.
A few years ago, the idea of a single corporate data warehouse matured at Rostelecom and began to be put into practice. A number of warehouses that solved individual tasks had already been built, but the number of scenarios was growing, support costs were rising, and it became clear that the future was centralized. Architecturally, the solution consists of the warehouse itself, made up of several layers implemented on Hadoop and GreenPlum, plus auxiliary databases, ETL mechanisms, and BI.
Because of the large number of geographically distributed, heterogeneous data sources, a special data upload mechanism was created, and its work is controlled by Informatica. As a result, data packets end up in the Hadoop front-end area, after which the data is loaded through the storage layers in Hadoop and GreenPlum; this loading is controlled by the so-called ETL control mechanism implemented in Informatica. Informatica is thus one of the key elements that keep the warehouse running.
Our warehouse will be described in more detail in one of the following posts.
Informatica PowerCenter / Big Data Management is currently considered a leading product among data integration tools. It is made by the American company Informatica, one of the strongest players in ETL (Extract, Transform, Load), data quality management, MDM (Master Data Management), ILM (Information Lifecycle Management), and more.
PowerCenter, as we use it, is built around an integrated Tomcat application server that hosts the Informatica applications implementing its services:
Domain: the basis for everything else; services, users, and GRID components all work within the domain.
Administrator Console: a web-based management and monitoring tool and, alongside the Informatica Developer client, the main tool for interacting with the product.
MRS, Model Repository Service: the metadata repository, a layer between the database in which the metadata is physically stored and the Informatica Developer client in which development takes place. Repositories store both the description of the data and other information, including information for a number of other Informatica services, for example schedules for launching tasks, monitoring data, and application parameter sets, which in particular make it possible to use the same application with different data sources and targets.
DIS, Data Integration Service: the service in which the main functional processes take place. Applications run in it, and it performs the actual launches of Workflows (descriptions of the sequence of mappings and their interaction) and Mappings (the blocks in which the transformations themselves, the data processing, occur).
GRID configuration: a way of building the installation on several servers so that the load launched by DIS is distributed among the nodes (that is, the servers that are part of the domain). With this option, DIS runs on an additional GRID abstraction layer that combines several nodes instead of running on one specific node, and additional backup MRS instances can also be created. You can even implement high availability, where external calls are served by backup nodes if the primary one fails. So far we have decided against this configuration.
Informatica PowerCenter, schematic
In the first stages of the work, problems regularly appeared in the data supply chain, some of them caused by Informatica itself, which was unstable at the time. I am going to share some of the memorable moments of this saga, our exploration of Informatica 10.
The former Informatica logo
Our team's area of responsibility also includes other Informatica environments, which have their own specifics because of a different load profile; here I will only recall how Informatica developed as the ETL component of the data warehouse itself.
How it happened
In 2016, when we became responsible for Informatica, it had already reached version 10.0, and to the optimistic colleagues who decided to use a product with a .0 minor version in a serious solution, everything seemed obvious: use the new version! In terms of hardware resources, everything was excellent at that time.
Since the spring of 2016, a contractor had been responsible for running Informatica, and according to the system's few users, “it worked a couple of times a week”. It should be explained that the warehouse was de facto at the PoC stage, there were no administrators on the team, and the system constantly crashed for various reasons, after which the contractor's engineer brought it back up.
In the autumn, three administrators joined the team, divided the areas of responsibility among themselves, and began to set up normal operation of the systems in the project, including Informatica. It should be said separately that this product is not widespread and does not have a large community where you can find the answer to any question and solve any problem. Full technical support from Informatica's Russian partner was therefore very important; with its help, both our own mistakes and the bugs of the then-young Informatica 10 were corrected.
The first thing we had to do for the developers, both on our team and on the contractor's side, was to stabilize Informatica itself and to get the web administration console (Informatica Administrator) working.
So we often met Informatica developers
Leaving aside the process of tracking down the causes, the main reason for the crashes was the interaction between the Informatica software and the repository database, which sat on a server that was relatively remote in terms of network topology. This caused delays and disrupted the mechanisms that monitor the state of the Informatica domain. After some tuning of the database, changes to Informatica parameters that made it more tolerant of database delays, an upgrade to Informatica 10.1, and a move of the database to a server closer to Informatica, the problem went away, and we have not seen crashes of this kind since.
One of the attempts to get Informatica Monitor working
With the administration console, the situation was also critical. Because active development was happening directly on a nominally production environment, colleagues constantly needed to analyze mappings and workflows “on the go”. In the new Informatica, the Data Integration Service has no separate tool for such monitoring; instead, a monitoring section (Informatica Administrator Monitor) appeared in the web administration console, where you can observe applications, workflows, and mappings, along with their runs and logs. Periodically the console became completely unavailable, or information about current processes in DIS stopped updating, or errors occurred while loading pages.
Selection of java parameters to stabilize the work
We attacked the problem in many ways: we experimented with parameter changes, collected logs, sent jstack dumps to support, googled actively, and simply observed.
First of all, a separate MRS was created for monitoring, since monitoring later turned out to be one of the main resource consumers in our environments, because mappings are launched very intensively. The Java heap parameters and a number of others were also changed.
As a result, with the next update, to Informatica 10.1.1, we managed to stabilize the console and the monitor; developers began to work more efficiently, and regular processes became more regular.
Our experience of the interaction between development and administration may be of interest. A shared understanding of how everything works, and of what can and cannot be done, is always important when using complex systems. We can therefore safely recommend that you first train the administration team to administer the software and the development team to write code and design processes in the system, and only then send both to work on the result. This really matters when time is not an endless resource. Many problems can be solved even by trial and error, but some require a priori knowledge, and our case confirms the importance of understanding this axiom.
For example, when we tried to enable versioning in MRS (as it turned out, we needed a different version of SVN), after a while we were alarmed to find that the system restart time had grown to several tens of minutes. Once we traced the delayed start to versioning and disabled it, everything was fine again.
Among the notable obstacles associated with Informatica was the epic battle with a growing number of Java threads. At some point the time came for replication, that is, for extending the established processes to a large number of source systems. It turned out that not all processes in 10.1.1 worked well, and after a while DIS became inoperable. Tens of thousands of threads were found, and their number grew especially noticeably during application deployment. Sometimes we had to restart several times a day to restore operation.
Here we have to thank support: the problems were localized relatively quickly and fixed with EBFs (Emergency Bug Fixes), after which everyone felt that the tool really worked.
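Thread growth of this kind can be watched from outside the JVM. The following is only a minimal sketch, assuming a Linux host, where the kernel exposes the thread count of a process in the `Threads:` field of `/proc/<pid>/status`. The function names and the 10,000 limit are hypothetical, chosen for illustration; this is not the check our team actually used.

```python
from pathlib import Path
from typing import Optional


def parse_thread_count(status_text: str) -> Optional[int]:
    """Extract the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None  # field not present


def thread_count(pid: int) -> Optional[int]:
    """Return the current thread count of a process, or None if it is unreadable."""
    try:
        return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        return None  # process is gone, or we are not on Linux


def is_alarming(count: Optional[int], limit: int = 10_000) -> bool:
    """True when the count reached the (hypothetical) alert limit."""
    return count is not None and count >= limit
```

Called periodically with the PID of the DIS Java process, `thread_count` gives a number that can be fed to whatever monitoring system is in use, so a runaway is visible before the service stops responding.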
It still works!
By the time work began in target mode, Informatica looked as follows. Informatica version 10.1.1HF1 (HF1 is HotFix1, a vendor build made up of a set of EBFs), with additional EBFs installed that fix our scaling problems and some others, runs on one of the three GRID servers: 20 x86_64 cores, with storage on a huge, slow array of local disks, which is a server configuration intended for a Hadoop cluster. Another server of the same type hosts the Oracle DBMS used by the Informatica domain and the ETL control mechanism. All of this is monitored by the team's standard monitoring tools (Zabbix + Grafana) from both sides: Informatica itself with its services, and the loading processes running in it. Now both performance and stability, external factors aside, depend on the settings.
Separately, a word about GRID. The environment was built on three nodes with load balancing in mind. During testing, however, it turned out that because of interaction problems between the running instances of our applications, this configuration did not work as expected, so we temporarily abandoned it and removed two of the three nodes from the domain. The scheme itself remained the same: it is still a GRID service, but degenerated to a single node.
Right now, one remaining difficulty is a drop in performance during the regular cleanup of the monitor's data: when processes in the warehouse run at the same time as the cleanup, the ETL control mechanism may malfunction. For now this is handled with a workaround, a manual cleanup of the monitor's data with the loss of all its history. This is not too critical for production in normal day-to-day operation, but the search for a proper solution is still in progress.
Another problem arises from the same situation: sometimes our control mechanism is launched multiple times.
Multiple application launches, leading to a breakdown of the mechanism
When launches are scheduled at times of heavy load on the system, situations sometimes arise that break the mechanism. So far the problem is fixed manually; a permanent solution is being sought.
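For illustration only: a common generic guard against duplicate launches is an exclusive lock taken before the job starts. The sketch below is not the solution used in our project (none has been found yet) and does not touch Informatica itself. It assumes the launch can be wrapped in a script on a single host; the names `try_acquire`, `release`, `run_once` and the lock path are hypothetical.

```python
import os
from typing import Callable


def try_acquire(lock_path: str) -> bool:
    """Atomically create the lock file; return False if it already exists."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for diagnostics
    return True


def release(lock_path: str) -> None:
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass


def run_once(lock_path: str, job: Callable[[], None]) -> bool:
    """Run job only if no other instance holds the lock; report whether it ran."""
    if not try_acquire(lock_path):
        return False  # another launch is already in progress
    try:
        job()
    finally:
        release(lock_path)
    return True
```

A plain lock file has a known weakness: if the process is killed before `release` runs, the stale file blocks all later launches until someone removes it, so a real deployment would need some staleness check on top of this.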
In general, under heavy load it is very important to provide adequate resources, both hardware resources for Informatica itself and for its repository database, and to ensure optimal settings for both. In addition, the question remains of which database layout is better: on a separate host, or on the same host as the Informatica software. On the one hand, a single server is cheaper and practically eliminates possible network interaction problems; on the other hand, the database load on the host is compounded by the load from Informatica.
As with any serious product, Informatica has its curious moments.
Once, while analyzing some incident, I noticed that event times were recorded strangely in the MRS logs.
Time dualism in the MRS logs, “by design”
It turned out that timestamps are written in 12-hour format without AM/PM, that is, without indicating whether it is before or after noon. A ticket was even opened on this subject, and the official response was that this is by design: timestamps in the MRS log are written in this format. So sometimes some intrigue remains about when exactly an ERROR occurred ...
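When such a log has to be read anyway, the missing half of the day can often be recovered from the order of the entries. The sketch below is a hypothetical helper, not an Informatica feature. It assumes the entries are in chronological order, that gaps between neighbouring entries are shorter than 12 hours, and that the half of the day of the first entry is known. The real MRS line format is not shown here, so the function takes bare `HH:MM:SS` strings.

```python
from typing import Iterable, List, Tuple


def to_24h(stamps: Iterable[str], start_pm: bool = False) -> List[Tuple[int, str]]:
    """Convert ordered 12-hour 'HH:MM:SS' stamps to (day_offset, 24-hour stamp).

    Every time the position inside the 12-hour half goes backwards, the log
    must have crossed noon or midnight, so the half of the day is flipped.
    """
    result = []
    pm = start_pm      # assumption: the half of the first entry is known
    day = 0            # days elapsed since the first entry
    previous = -1      # position of the previous entry within its half
    for stamp in stamps:
        hour, minute, second = (int(part) for part in stamp.split(":"))
        # 12:xx belongs to the start of a half, so map hour 12 to 0.
        position = (hour % 12) * 3600 + minute * 60 + second
        if position < previous:
            if pm:
                day += 1   # PM -> AM means midnight was crossed
            pm = not pm
        previous = position
        hour24 = hour % 12 + (12 if pm else 0)
        result.append((day, f"{hour24:02d}:{minute:02d}:{second:02d}"))
    return result
```

For the morning-started sequence `11:58:00`, `12:01:00`, `01:30:00`, `11:59:59`, `12:00:05`, the function returns `11:58:00`, `12:01:00`, `13:30:00` and `23:59:59` on day 0, and `00:00:05` on day 1. If the log is quiet for 12 hours or more, the reconstruction silently goes wrong, which is exactly the intrigue described above.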
Strive for the best
Today Informatica is a fairly stable tool, convenient for administrators and users, and extremely powerful in terms of current capabilities and potential. Functionally it exceeds our needs many times over, and in the project it is de facto used in a way that is not very typical for it. Part of the difficulty lies in how our mechanisms work: a large number of threads are launched in a short period of time, intensively updating parameters and working with the repository database, while the server's CPU is almost fully utilized.
We are now close to switching to Informatica 10.2.1 or 10.2.2, in which some internal mechanisms have been redesigned, and support promises that a number of our current performance and functional problems will go away. From a hardware point of view, we also expect servers that suit us better, with headroom for the near future as the warehouse grows and develops.
Of course, testing, compatibility checks, and possibly architectural changes in the HA GRID part lie ahead. Development within Informatica will go on, because in the short term we have nothing to replace the system with.
And those who remain responsible for this system will certainly be able to bring it to the reliability and performance levels that customers require.
This article was prepared by the Rostelecom data management team.
The current Informatica logo