
How we learned to outsource without tossing the ball back and forth with the internal IT department

We use both outsourcing and our internal IT department. A single physical server may host one service that external staff are responsible for and another that we are responsible for, and from season to season these services can migrate in-house or move back outside.
The story began when we needed a centralized system with a terminal farm. At the time we had about 10 stores, each maintaining its own database, and the data from them was pulled together into an aggregate report at the end of a period or on request.
Centralization
The scheme described above turned out to be very inconvenient: we constantly had to know what was sitting in which warehouse, plus show the availability of goods on the website in real time. So we centralized the database and switched the stores to working with it over terminal access. We had put this off earlier because of the demands it places on the communication channel.
Naturally, there was the problem of the Internet going down: if the channel degraded or dropped (which is always just a matter of time), the store was left without a till. That is, receipts could still be rung up (the physical till was decoupled from IT for exactly this case), but accounting went to hell until everything was reconciled. We couldn't allow that, so we took the simple route first: channel redundancy. We brought two physical cables from different providers into each store and bought USB "whistles" (3G modems), but overall the situation improved only marginally. Sooner or later, something still went down.
The next step was a distributed database. The master database lives in the data center, and each store keeps its own copy (more precisely, the fragment relating to that particular store). If the Internet channel fails, the store keeps working against its local copy, and once the connection reappears, the main database is updated asynchronously. With a stable channel, updates happen in real time; if the channel drops, the data simply catches up on recovery, and the store just leaves the network for a while. Incidentally, stock availability is computed on the central instance by expectation: for example, at a sales rate of 5 game boxes a day and 8 boxes in the store's stockroom, if the store has been offline over lunch, the site will show not "In stock" but "running out, call". The point is not to screw up in a way where a customer shows up and the game isn't there. The logic is roughly like the sketch below.
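A minimal illustration of the "availability by expectation" idea; the function name, the two-day safety threshold, and the linear sales model here are our assumptions, not the actual production code:

```python
def availability_status(stock: int, sales_per_day: float,
                        hours_offline: float, safety_days: float = 2.0) -> str:
    """Estimate how many units could have sold while the store was offline,
    and soften the website status once the remainder is no longer certain."""
    expected_sold = sales_per_day * hours_offline / 24
    expected_remaining = stock - expected_sold
    if expected_remaining <= 0:
        return "out of stock"
    if expected_remaining < sales_per_day * safety_days:
        # too little safety margin while the store is unreachable
        return "running out, call to confirm"
    return "in stock"

# The example from the text: 5 boxes sold per day, 8 boxes on hand,
# the store offline over lunch: the site shows "call" rather than "in stock".
print(availability_status(stock=8, sales_per_day=5.0, hours_offline=3))
```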
Hardware
The hardware kept changing all this time. The first server sat in a rather cheap data center, but we left it quickly because working with it was impossible. Then we went down the "everything at home" path and deployed the server in the office, which is where the store databases were replicated from. The problem was the same: when the Internet drops in the office, every store is left without IT. Despite having two channels, this happened at least once, when someone decided to dig up the ground near the office to replace some pipes.
Accordingly, the next step was moving to a fault-tolerant data center. We spent quite a while choosing one. We now have 10 physical servers with virtual machines deployed on them. In case of failures, services can migrate between machines, and there is fine-grained load balancing. In season the required capacity jumps sharply, so we rent extra machines and hand them back a month later, which is very convenient economically. There is no dedicated storage array for the database; it runs on one of the servers, where we filled out the drive shelf with SSDs. When we started running into long end-of-period recalculations, we tried a cluster with midrange storage, but for our scale it turned out too expensive so far, and we dropped the idea.
As a result, we became independent of the office infrastructure for all critical processes: if something happens, we lose the designer's work since the last backup and a couple of XLS spreadsheets, and that's it. All commercial processes will be restored almost as soon as new hardware arrives. Speaking of hardware: for the New Year we duplicate every node not only on the server side of the IT structure but also physically. If a terminal fails in a store, an identical pre-configured system unit is already sitting in the stockroom right there; you just plug it in instead of the failed one, without figuring out what went wrong: Windows, a capacitor on the motherboard, or corrupted settings.
Outsourcing
We have our own IT department plus an external outsourcing company that complements it. The most important thing in this arrangement is that we managed to build a setup where there is no tossing the ball back and forth between us and the external staff.
To begin with, we clearly divided up the services. For example, for a given physical server:
- The virtualization system (administration and monitoring outsourced)
- The Windows machines (administration and monitoring outsourced)
- The DBMS on one of the machines (administered and monitored by our own IT staff)
- The 1C database (administered and monitored in-house)
- Database backups (outsourced).
And so on. Then we spelled out a precise price for every node and an SLA for it. For example, if a computer dies in a store:
- At the peak of the season, there must be a pre-configured system unit that just needs to be plugged in instead.
- The rest of the time, the SLA from the catalogue is 4 hours to restore this node.
Services are divided into levels:
- Workplace-level services (for example, supporting a storekeeper's workstation).
- Site-level services (for example, the network printing service in the company office).
- Company-level services (for example, the terminal server in the data center).
Each service has its own indicators: reaction time, availability, cost, penalty. There are a lot of services, so there are a lot of indicators too. The system has been refined over the years, and a plain enumeration of the services runs to about 30 pages. The guys claim this approach got the chaos out of their heads and lets them offer other companies exactly what those companies need. A sketch of what a single catalogue entry might look like is below.
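To make the catalogue idea concrete, here is a minimal sketch of one possible entry; the field names, levels, and numbers are illustrative assumptions, not the real 30-page document:

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    WORKPLACE = "workplace"   # e.g. a storekeeper's workstation
    SITE = "site"             # e.g. network printing in the office
    COMPANY = "company"       # e.g. the terminal server in the DC

@dataclass
class Service:
    name: str
    level: Level
    owner: str                 # "internal IT" or "outsourcer"
    monthly_cost_rub: int      # what the outsourcer is paid for the node
    reaction_time_min: int     # how fast work on an incident must start
    recovery_time_h: float     # how fast the node must be restored
    penalty_schedule: str      # reference to the applicable fine table

# Hypothetical entries in the spirit of the split described above
catalogue = [
    Service("store POS workstation", Level.WORKPLACE, "outsourcer",
            700, 15, 4.0, "standard"),
    Service("terminal server", Level.COMPANY, "outsourcer",
            5000, 6, 1.0, "critical"),
    Service("1C database", Level.COMPANY, "internal IT",
            0, 10, 1.0, "internal"),
]
```

With a structure like this, the monthly invoice is simply a sum of monthly_cost_rub over the entries currently owned by the outsourcer.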
Here is an example of how an active SLA works (note that once legal entities are involved, it becomes possible to set deductions for screw-ups):
- The SLA for a critically important service specifies: operating mode 12/7, reaction time 6 minutes, allowed downtime 1 hour.
- 1 hour of downtime on a critically important service: minus 100% of the monthly cost of the service.
- 2 hours of downtime: minus 200% of the monthly cost of the service.
- Beyond that, the deduction keeps growing linearly but more slowly: 8 hours is 300%, and so on up to a cap of 400%. Note that "the cost of the service" here is what we pay the outsourcers for it, not what the service is worth to the company. For example, if supporting one store computer costs 700 rubles a month, its failure for a whole shift costs them 2100 rubles. The arithmetic is sketched below.
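In code, that penalty schedule might look like this sketch; the interpolation between the named points and the growth rate after 8 hours are our assumptions, since the text only gives the anchor values:

```python
def sla_penalty(downtime_h: float, monthly_cost_rub: float) -> float:
    """Deduction for downtime of a critically important service,
    following the schedule described above."""
    if downtime_h <= 0:
        pct = 0.0
    elif downtime_h <= 2:
        pct = 100.0 * downtime_h                       # 1 h -> 100%, 2 h -> 200%
    elif downtime_h <= 8:
        pct = 200.0 + (downtime_h - 2) * 100.0 / 6     # slower: 8 h -> 300%
    else:
        # keeps growing at the same slow rate, capped at 400% (assumption)
        pct = min(400.0, 300.0 + (downtime_h - 8) * 100.0 / 6)
    return monthly_cost_rub * pct / 100.0

# The example from the text: a store computer costs 700 rubles a month
# to support; a full shift (~8 h) of downtime costs the outsourcer 2100.
print(sla_penalty(8, 700))   # 2100.0
```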
Yes, the store earns no profit during that time (the second till isn't working), but that has nothing to do with IT outsourcing and what we pay for it: the guys simply cannot be held liable for things like that. It took us quite a while to arrive at this logic; we had to put ourselves both in our own place and in theirs to understand it.
Vova (our former admin, who founded bitmanager.ru) says that working this way raised his quality metrics across all his customers. More precisely, he came to understand exactly what retail needs. Sure, at first he banged his head against the wall, but then he learned to control the processes from the inside using a balanced scorecard. Ticket totals are summed up weekly; SLA indicators are calculated monthly across all tickets, and a penalty or bonus is applied where due.
Then it got even more interesting. Depending on the seasonality of the business, the load on our internal IT department changes, and it turned out that parts of the infrastructure can be handed over to the outsourcers, say, in the run-up to New Year's, when internal IT is needed elsewhere, and taken back under our wing once the peak ends, because that saves money. As a result, at the beginning of each month the outsourcers recount all the nodes and services under their responsibility and issue an invoice. This per-service, per-node accounting also means that no department has any "free" hardware or licenses: anything that requires support or money, or sits as an unused asset, is returned "to base", where it can be put to far more efficient use.
I should say they were helped by the fact that early on, when we were only building the relationship, the external support team thought a visiting-administrator service would be enough. But SLA violations meant fines (and compliance meant bonuses), so they adapted their processes to retail support. Now they run like clockwork: every ticket lands in the support panel, gets prioritized, and has the necessary resources allocated to it. The last SLA breach was over a year ago.
P.S. So, shall we keep telling the story of our adventures, or is this IT infrastructure of a medium-sized business already familiar enough to everyone?