Success story, or DEV + DEVOPS + OPS

Development teams can be weakly interconnected and work in different directions, not knowing and not wanting to use DevOps. In today's article we will talk about how DevOps practices can be distorted and transformed so that they can be implemented in companies with long-established rules, policies and habits of people.

The material is based on the report-dialogue of Sergey Berdnikov (Golden Crown) and Artem Kalichkin (CFT) from the October DevOops 2017 conference . Under the cut - video and text transcript of the report.

Sergey: I am the head of the operations department in our company. Artem and I started a small revolution-evolution. It all started with a revolution, now moved to the stage of evolution.

Artem: The company has been operating in the financial market since 1992. The work consists of two main parts. The first part is software development, Core Banking, bank accounting and so on. Here we act as a vendor: we developed and sold you a box, and you deployed and operated it.

The second part is processing services. Here we provide services either directly to individuals or through our partners. These are large retail chains, banks, other participants in the financial services market. Here we are working on a full cycle from the idea to the development, implementation and further exploitation.

Sergey and I work in the processing part of our company. About how we came to the story with DevOps in this processing, we will tell.

What happened

Sergey: Our legacy was completely banking. The company initially made banking products, respectively, everything was familiar: the entire infrastructure of SPARC only, its own data centers, the entire kernel is written in Oracle. PL / SQL-code, all sorts of stuff - it is not easy to scale.

We scaled only vertically: we bought a powerful piece of iron, we used it until it became outdated, we replaced it with a new one and took the old one to staging.

Also wrote a lot of Java code. We were backed up through a warm reserve: there is a data center, and the entire copied structure is full of switches, servers, all one-on-one, bolt to bolt.

Here you can see how the whole structure looked from the point of view of the processes. There were separate Dev technical departments, further firewalls with fire. Next was centralized IT, which was engaged in the removal, deployment and so on. That is, the Devs wrote a large instruction, and IT Operations was divided into three divisions:

Application administrators who are engaged in deployment. They did not have a root, there were users on the machines, they could follow the instructions - this is called the Monkey code.
Unix administrators who could and were able to configure, pour iron and so on.
There were individual database specialists. And since the database for us is the main goal, the essence of our existence for a long time, there have been many processes.

Artem: DevOps is not here at this stage yet, we worked according to the regulations, with which Sergey was not pleased.

Sergey: I came to the team of application administrators and I remember what kind of "wonderful times" these were. We could make one application for three weeks. The application comes, they found some mistake in it and canceled it, believing that the fools themselves did not know how to write applications. And the knowledge that is necessary for the correct writing of the application was with me.

People then resorted every other day: “And what did we write wrong?”. I explained that somewhere they did not put a plus or a minus, they forgot a comma. They write an application, I give some expertise on how to work in Oracle, and send it further to the DBA, where specially trained people who know how to execute these applications sit.

And they also work with me, they say: “Why didn’t I indicate the main indicator here, didn’t write a semicolon?”. The application is canceled, the cycle begins anew. Now for such a shame, but what to do, previously worked in this way.

Artem: Then we began to change. There were really many applications. When Sergey came to our team, he became the result of a small evolution and transformation. I was the author of a sufficiently large number of regulations of different types of applications, because I had to somehow survive. In general, the whole transformation took place not because of the hype or fashion, but because of the need to solve specific problems.

For example, configuration changes lead to the fact that my combat broke and did not work correctly. I found out about this after 24 hours or at night: it happened that at six in the evening something rolled over, and until three in the morning it all worked incorrectly.

Passed the installation version, in front of which no one was going and did not discuss what to do if something goes wrong. The well-known famous multipage installation instructions were handed over literally half an hour before the start of work in operation. It was necessary to solve something, and at that time we found a solution - an adaptive implementation of ITIL processes.

We began to check whether everything was correctly rolled after the roll-up to combat, whether the service and the main key indicators work normally. We started dating before installing versions. It was then, in fact, that was the very beginning of DevOps, when the development team, which transmits the distribution kit, began to at least meet with the operation team, to discuss what would be in the night work.

Sergey: And there was something to discuss: we had four pages of installation instructions - execute the command, execute the plan. It was almost impossible to write without errors. We are constantly between the development of swearing that they wrote the wrong, read, and stuff like that. Meetings sometimes turned into hell.

Artem:We tried to move with requests to Confluence, since it was inconvenient to transfer to Word - it was possible to form something wrong. In Confluence, they always planned to post the current version with all the changes.

Inserted a piece of code to roll on the battle. Confluence chewed meta markup, gave out something else incorrect, the admin took the code that turned into noodles and started working with it - it was a disaster.

We understood that no matter how perverted the page instruction was, it all turned into utter nonsense, no matter how it was framed.

There were important prerequisites for prolonged downtime during night work, which led to a poor take-off after installation, jambs, and a huge number of conflicts between development and operation.

Many human errors in the transfer of changes;
Constant search for the guilty;
The speed of removal of new modules up to 3 weeks;
Single points of failure (only vertical scaling), lack of balancing;
Planned downtime during updates for 2 hours.

Transformation prerequisites

Sergey: There were a lot of changes, we were constantly messing up. Every week we gathered, swore, calmed down. This process was repeated forever, looking for the guilty: "It's all the developers who write the curve code, even the Java module can not pass."

Artem: And the developers thought that these were elementary things: an error occurred in the logs - figure it out, google it and figure out what needs to be fixed in the configs.

Sergey: Also released new products for a very long time. This is connected structurally: we created an application, we had to create an application for creating a user on the server, then create a scheme. Then began the football bids. The developers issued the modules, but we cannot use them, we have everything according to the regulations.

Also, we were very long set. The instruction is huge, while you read, while you do, the reel took about two hours. The actions themselves were not performed in a second.

Artem: There were also routine actions, for example, 30 Java modules. In all there is a config, in each config you need to go and make changes. First, you can go crazy doing the same change: you hate yourself and the rest of humanity. Secondly, the probability of making a mistake on the 25th config becomes extremely high.

Sergey: I remember how I got the offer to scale horizontally. And we have 150 modules with different configs: if there is an error in one version of the config, they will install the second one for me, and I will be mistaken in it. We are not robots, after all.

Artem:Planned downtime for 2 hours - this is one of the critical factors of why we began to look for a solution, how to get away from it.

The fact is that we provide financial services, processing services. We work for foreign countries. Then we already worked in 11 time zones, in 2013 we had only one hour window, when the use of the service was minimal, the number of customer calls was minimized, there was a calm, and something could be done.

Conventionally, we could conduct work from one to two at night. Two hours is much more than this window. We were approaching a catastrophe if it were not for the transformation, because now we have no physical windows.

With the answer to all these problems, I had an idea, an attempt to find out what DevOps is.

At that time, our colleague arrived with HighLoad, I was fiddling with the introduction of CMBD, because I needed the configurations to not break and I could manage at least something. He listened to the report of Sasha Titov, who spoke about some Chef. It seems that this is also configuration management.

In 2013, I read about it all, I decided that it was some kind of garbage and not what was needed. I have to control, and they force the code to write there. However, the situation did not change, problems accumulated and I forced myself to sit at home and begin to understand. I thought that there was something in it, some kind of salvation.

And then I discovered the postulates and values that we should have the same environment, the same roll-out and update scenario, that we should test these scenarios, starting with the test environment.

There was an opportunity to minimize the instruction and manual actions, and to automate everything as much as possible not just by separate bash-scripts, in which then another admin will break a leg.

That's when I came up with this idea, with the first declaration of what I want to receive. This is a 2013 document, first created in a company about DevOps.

Sergey: This is the key idea: to reduce the speed of removal of new modules, to reduce the number of errors when moving to a battle. That is, there were specific goals that we wanted to achieve at the first stage of the release of new releases.

There were many arguments against it. For example, the fear that the automation will break everything: it works incomprehensibly, it is terrible to execute someone else’s code, this is a great service, people get money through us. Not serious.

The following security people joined. They walked through us to the fullest: some identical stage! And they have an ideal picture of the world: transfer versions on a flash drive, sign them with a PGP key, and everything will be fine - perfect service. We worked with them for so long to get to the end, only thanks to the project activity something happened to us.

Artem: It was based on values: here we are losing money, this simple one is inadmissible.

The guys and I came up with a way to minimize these losses. Do you have a better option to do this? If not - be silent, if there is - criticize and propose. Nothing to offer? Then we try.

There was a process of persuasion and pushing: we offered to use our ideas on a limited number of systems.

Sergei:We were asked to write out all the risks, as we will release. It was necessary to coordinate with people who could lose money. The programmers also said: “Some code was written, we used to transfer zip normally, instructions were written, some additional code was written for removal?”

Artem: “I write business logic application code, I use frameworks to minimize any unnecessary part of the code. And you ask for more code to write. Take and carry out, in the end, ”- such dialogues were at the start. But nevertheless, all this gradually, through demonstrations and convictions worked.

Sergei:In the first iterations, we made many important decisions in terms of the structure of our company and in terms of technology. First of all, implemented Configuration management. This saved us from the problems with the removal of the wrong configuration with instructions in 10 A4 pages.

Then the operations began to sink, and the application administrators went to the technical directorate with the developers. It gave a sense of command, a feeling of elbow. We began to understand that we were making a product, and we were not fulfilling incomprehensible requests with a desire to reject them - there were specific goals.

Teamwork brings people together when they are sitting next to people, when they see how they work, when they see how we work. We even had a dialogue between the teams: this is the first spark of real DevOps. There was no technology, not some new technology. From the point of view of technology, we thought that nothing would take root at all; we worked differently, in a different world.

The first idea is that Configuration Management should write by itself, there are a lot of developers. Then we remembered everything that we wrote ourselves, and refused - we all had a feil.

Artem:I will correct a colleague. Sergei is not right: everything that we write ourselves worked perfectly in our applied field, in which we are strong. And when they tried to write some of their spiders to automatically build a CMDB or some kind of self-written monitoring systems to control business logic, yes, here comes a system of another class.

At this stage, it turned out that the application administrators moved from the IT Directorate to our technical directorate. As Sergey said, they began to feel the whole business value, elementary due to mercantile things.

We were able to pay them project awards for achievements, it was quite motivating. Since they started the dialogue, the removal of modules decreased from three weeks to more than a week, some progress went even without some deep automatics.

Sergei:At this moment, if we did not understand something with the application, we asked the developer to come up: “Let's decide together and write how the application should sound”.

Artem: And on this, conditionally, command structure, we began to move into the technique.

Sergey: Now let's tell you how we chose the system. It was quite interesting. First of all, we tried Chef, for one simple reason - we knew a guru who knew Chef. Then we tried Puppet, because at that time Oracle announced support for Puppet.

Ansible also tried, but both teams did not like it as a system. There were also security problems: Ansible in 2013 was very different from the current one.

We launched in parallel two different projects with the same functionality. And everything worked cool, there was a feeling that something was wrong and should be left alone. How did we choose?

Artem: Programmers wrote on Chef, admins wrote on Puppet. The idea was: try, then compare, where better and choose. But when we compiled, when as a result time passed, the duality began to create problems, because the amount of code is growing, it continues to grow and everyone likes everything, the developers write and automate.

I gathered everyone and asked what we would write on. Programmers said, “We like Chef a lot.” And admins: "And we are on Puppet!". It was a complete tin. I compared and understood that in the current environment and current parameters there is no objective way to choose this or that product.

In the end, I did, as they like to say in our country, elections with a predictable result. Some "closed" vote among the participants. But there was no falsification; there was a conditional effect on the brain, as a result of which Puppet was chosen. I decided that offended developers would calm down faster than offended admins. There was simply no other selection criteria.

Sergey: At that moment, we thought for a long time how to put a binary. On the slide you can see a photo from our board and meeting. We decided that you need to use some kind of packaging system, and not your bike. Again won the mind.

In fact, we chose not RPM, but IPS - the Solaris package manager. We imported ourselves from the 11th version into the top ten, which stood, and began to use. To refuse samopisny bash-crutches and zip-s at that moment was the right decision.

Artem: This is why it was still important: in the recipe, the result appears in the form of changing the version number, it stretches further and further from the repository and becomes necessary.

When we came to teach DevOps, Chef, all of these things, and we thought: “Now they will tell you how to transmit a binary,” but we were not told about this. The answers are usually such: “Everyone decides in his own way and gets out as he can.” Therefore, we understood that the industry response to this is “42”, as from “Hitchhiker's Guide to the Galaxy”, the answer to the main question of the universe.

Sergei:We also had long disputes on how to build a CI / CD, what it is. Like Configuration Management - one utility, they took it and installed it. And here there are a lot of options and choices, arguing for a long time, the developers made their own system, we made ours in operation for carrying out.

At that point in time, we realized that there was no perfect solution. Just take all that they have worked out and will deliver. The developers had their own system for the assembly, we made our delivery ourselves. There was no perfect choice, and still work with different teams in different ways. There is no ideal.

We also have a large stack, most of our code is stored in the database: all financial processing, in which, unfortunately, the paradigm “the closer to the data, the faster it works”. Oracle sells, Fowler agrees. Financial transactions hang in PS / SQL, we did not find any opensource-product that would help solve our problem with versioning and delivery. Perhaps he was, but we began to write his tool.

Artem: We have, in fact, a big problem, because, as mentioned in the initial slide, Production is a large vertical server. Accordingly, it is terribly expensive and very hard to raise an equally large vertical server on the Stage. This is still half the problem.

The fact is that we need to raise an environment similar in performance and fill it with similar in volume, and, last but not least, cardinality with data in order for our tests on the Stage to pass correctly.

Here were some very hard decisions. First of all, what we did in terms of scaling - we realized that we could not provide the same vertical machines on the Stage as in battle. But at the time point X, we can fix the benchmark indicators of the system performance requests on the Stage environment and compare them with new packages. If they start to change abnormally, it means that something has floated here and something does not work wrong. This is one problem.

Then we discovered a problem with transferring data from the battle to the Stage in order to fill it with the same amount of data. It is impossible that none of the people who do not have documents access to customer data, had access to them.

We do not have the right to pour personal data and bank secrecy of clients in the Stage. To transfer this data, I wrote obfuscation data scripts so that they are not recoverable and incompatible with the real ones. At the same time, it is important that it is impossible to replace all the full names with aaa bbb, because we lose the cardinality of the data, and all our checks on the Stage show incorrect information.

Therefore, we also wrote this script with an eye to the fact that it generated some conditionally random cardinality of these text data so that our queries show an adequate picture comparable to the battle, and we could understand the changes.

We are moving away from the absolute state of performance, we are watching the changes. The situation did not worsen compared to the previous version, which, we believe, was not bad in terms of performance.

This is the second iteration. Probably the most key phrase here is that the project ended here. There was no DevOps project. Here was originally an internal customer - me. I got my result: the instruction was reduced, the errors during the removal of the version were reduced, the combat configuration began to change, controlled through Puppet, it became controllable, understandable. What I wanted was what I got.

Sergey: There have been minor changes with your submission. Again, responsibilities flowed from centralized IT to the technical management.

It became a full OPS with root. This actually helped to accomplish tasks in terms of the removal of new modules. We began to take out modules faster, after three weeks it seemed ideal to us to do it in just three days. The result was tangible: there was a drive, the team itself began to generate ideas on what and how to improve.

What we did from the point of view of related units: we have a team of 200+ people, 150 developers, 6 OPS. There were a lot of escorts, security men. The first is the realization that the best ideal bid that should not be submitted is the best. They began to go to this, and try: if a person has the opportunity to do something without creating an application, everyone is happy. And it is done perfectly fast.

Artem: Here is an example, we make an offer via git. The manager himself comes in and makes an offer for a fight.

Sergey: We found tools like Gitlab, we liked that a person can work with a graphical interface. There is a “download” button, the user may not even understand what the commit actually does.

In this case, we have written scripts to check the content, for example, that pdf is a pdf, file size check, and other logic according to the rules issued by the security officers. People got the opportunity to update these documents without creating an application. The load on ops has decreased.

Another difficulty was how to calculate such moments. In the routine it is not clear how to look for problem areas. Therefore, we invented our scale and called it “Jackals”.

We were inspired by the old picture. We considered that we assigned the number of jackals to each completed application, how boring and tedious it is and do not want to do it. At the end of the month they considered which applications scored the most jackals.

They sat down all division and thought how to get rid of this disgrace. How to make it so that it was not necessary to create applications for this, it was cool and drove the people.

The next stage, where we found ways to automate, is bots. We mastered the API Telegram, began to cut bots for everyone, especially for ourselves. Concluded last triggers.

Business liked: there are situations when something is lying, everyone starts to call and ask what is happening. And so a person could take a telephone, select the “incidents” command and read the latest incidents. People began to conduct more incidents so that no one called and asked about them.

Then we began to write additional features to get information that used to be requests in Jira. The business wants to know whether the transfer has been completed: enters the number, gets the result. This also made life much easier in terms of applications.

Artem: At the same time, we carried out another organizational transformation, but already local, within Sergey’s department. At that time, we became very infected with the idea of an on-call duty-engineer and, thanks to Sergey, managed to build this scheme in the department. There is a seated engineer on incidents, there is a seated engineer on applications, all the others destroy jackals, are engaged in their shooting.

Work with DEV

Sergey: What the division has become involved in: permutations have appeared, people were engaged not only in jackals, but also in other matters. First of all, we had a dialogue with the development. We told new teams what DevOps is, how to prepare it properly, taught CM.

We passed the way from when we ourselves wrote them recipes, then they learned how to correct them, and then write on their own. We also talk about CI, help set up the pipeline and build packages. We help build a secure development environment.

Artem: From the point of view of CI, this is all very important. At the same time, I am a product-owner, I lead projects, I lead development teams. And here is a very interesting case.

On small teams, we combined the functions of operation, that is, Stage and Prod in the person of one unit. On these small projects, small products, teams and infrastructures it turned out very convenient. You see right through how you will roll the fight.

Sergey: When a combat engineer sets up a test environment, he makes it one-on-one, knowing that she will come to him later, and to torment him with it. This is an important psychological factor that you can not freeze, it is better to do everything at once and normal.

What came of it? Here, many say that there is no DevOps department, we believe that we have a DevOps department. What are the main tasks of the department?

He walks, promotes, talks about DevOps. Everyone understands what it is and how to cook it. They tell and show how a new product with a database can be rolled out in five minutes.

Our only limitation is security, coordination of schemes. When we have a virtual machine and the schemes are consistent, then everything takes five minutes. Everything is completely rolling automatically.

Artem: When I had a new product launch in August in a battle, that is, a completely new one, we rolled out 15-20 releases per day without conflict and tension. There is a value here: it's cool when you calmly relax and go to the next.

Sergei:I will talk about pain. We support DRP recovery plan from scratch. And when there was no automation, we copied there almost configs. New modules were constantly added, and the plan needed to be constantly updated. With the advent of DevOps and automation, the plan has declined: we take the latest current version of Git and add plans to it.

This deployment plan becomes fair. This is supported, among other things, by test deployment runs. We do test runs, and in combat we perform a run with switching to the reserve. The whole routine is gone. This, by the way, helped us move the stack a bit.

Previously, we used SPARC Solaris, now x86 has appeared for a simple reason: today nobody writes or tests hipster applications for Sparc. And we use for example Haproxy, together with the developers we saw fixes for Solaris bugs. It got us, no longer wanted to endure. We chose a platform on which everyone is testing products, and now we are working on it normally. This also shifted us towards speeding up the whole process.

Artem: It generally opened the gates to a new world full of wonders. Because the advent of x86 allowed us to use really relevant and useful utilities for our tasks. Moreover, when we got this opportunity, we moved in parallel towards clustering.

In fact, now everything, except for central and nuclear processing, is clustered and works fine for us. We do not steam: idle or not, or it takes a maximum of one minute, even where there is no cluster.

The only place where he stayed and was at least two hours is the migration of central processing schemes. Now it takes eight minutes.

Sergey: On the slide, a new document, one-line application: do a merge in git. No more of those ten A4-sheets.

Deploy new modules takes up to three hours. These are some difficult cases when you need to do something in Oracle, for example, to get a virtual machine. Disappeared errors during the removal. I don’t remember anyone saying something wrong. Of course, there are some rough edges, but they are all minor, not serious, they are corrected very quickly.

What allowed us to achieve success? First of all, we did not start a revolution here and now. I did not say: "We need to implement DevOps in three weeks." We approached this process methodically, agitated for a long time, told people, dropped on our brains, talked about the goals we were achieving.

Artem: I fought back from the questions of the authorities: “Artem, when is DevOps?”. He said that there would be no time limit, we try in Prod, do not ask anything.

Sergei:On the other hand, everything is also very cool. We did not impose technologies used by us on all divisions. The company is big, the neighbors are watching, they say: “Well, yes, great, but we deploy every six months”. They don't need it. We do not walk, do not tell us that we have the only right decision. Somewhere they don’t want to use our stack, for them we wrote down simple bash scripts that allow us to integrate with other commands.

Artem: Here I am convinced that the top can not be forced to implement DevOps. It is unreal for such a project.

Sergey: We analyzed where we are losing the most time: today we are losing the most time for safety.

We are able to work quickly, quickly deploy, but agreeing on a deployment pattern is some kind of hell. Now we looked at it, it reminds the same thing, when there were completely different units of Dev and Ops. Now we have no plan, we think, how to include security in our process so that they can analyze the changes.

Artem: For example, for this you can use Merge, see what has changed in the recipe. Bezopasnik is also an engineer.

Sergey: Often our formal processes do not provide real security. When we conduct an audit, we understand that all procedures have passed, but have not received the desired level of security, and a lot of time and resources have been spent. We constantly find some problems that have occurred due to poor integration of security processes and CI / CD.

From the point of view of OPS, we still have the problem of wasting time on CI and fitting recipes. This thing is already beginning to recruit "jackals." Therefore, we are looking at systems to introduce frameworks for developers, we are looking towards Docker, Kubernetes, so that we can write: “Guys, there are tools, there are no big documents, you can unify the delivery process”.

We want to promote this idea, but security again resists. They say: “Some kind of virtual networks you have, how will the services go without a firewall?”. There are some contradictions, but I think we will win anyway.

Artem:I have my own pain, I would like her to finish. We have a very big problem: we are a company, and we are not the only such company, there are representatives who are in the same situation. We are under the constant control of the regulator, the Central Bank constantly comes with an audit. We pass an audit, we pass an externally independent audit.

It is difficult to blame the auditor, he does work on the basis of standards in which it is written that the physical piece of iron should be a separate, non-virtual machine. No containers.

Not a single international standard has so far moved one iota towards new technologies. There is a black hole. They do not notice that this is a big problem. I can not blame the auditors, they are listening to standards. They have nowhere else to take it, but not a single standard is trying to change in this sense, to transform and move somewhere.

I need to figure out how to make friends with the picture now in the company with these terrible words so that everything is correct and fair.

If you are tired of longrids, we recommend listening to the release of the “Five minutes of PHP” podcast with our friends - Baruch Sadogursky and Vyacheslav Kuznetsov. Trends DevOps, DecSecOps, Kubernetes victory and State of DevOps 2018 report by DORA.

And if you want more cool reports - come to the DevOops 2018 conference. There will be Baruch, Glory, and even John Willis ! All speakers and program - on the site .

Pleasant bonus: until October 1, a ticket for DevOops 2018 can be booked at a discount .

Tags:

Success story, or DEV + DEVOPS + OPS

What happened

Transformation prerequisites

Work with DEV

Also popular now: