How we moved the disk storage of hundreds of bank branches onto one storage system in Moscow without losing local LAN speeds
Given: a bank with a data center in Moscow and many branches.
In the data center there is a fleet of x86 machines and a serious high-end storage array. Each branch has a network with a central server or a mini-cluster (plus a backup server and a low-end storage unit with a disk enclosure). General data is backed up to tape (which goes into a safe in the evening) or to another server sitting next to the first one. Critical data (financial transactions, for example) is replicated asynchronously to the center. The servers run Exchange, AD, antivirus, a file server and so on. There is also data that is not critical for the banking network (it is not the transactions themselves) but is still very important - documents, for example. It is not replicated; instead it is backed up at night, when the branch is closed. Half an hour after the end of the working day, all sessions are terminated and a large copy job begins.
That is roughly how things were arranged before the work started.
The problem, of course, is that all of this slowly piles up technical debt. A good solution would be VDI access (it would remove the need for a large field-service team and make administration much easier), but VDI requires wide channels and low latencies, and that is not always easy to get in Russia because a number of cities lack trunk fiber. Every month the number of unpleasant "pre-emergency" incidents grows, and hardware limitations keep getting in the way.
So the bank decided to do something that looks more expensive if you price it head-on, but greatly simplifies maintaining the server infrastructure in the branches and guarantees the safety of branch data: consolidate all data onto one central storage system. Not a plain one, though, but one backed by a smart local cache.
Study
Naturally, we turned to international experience - problems like this have certainly been solved more than once, for mining companies in particular. The first case was Alamos Gold, which mines gold in Mexico and explores deposits in Turkey.
Its data (and there is a lot of it, especially raw exploration data) has to be moved to the head office in Toronto. The WAN channels are narrow, slow and often weather-dependent, so people ended up writing data to flash drives and discs and shipping them by mail or physical courier. IT support was the proverbial "support over the phone" torture, only with an extra two or three language hops through interpreters. Because the data had to be stored locally anyway, simply increasing WAN speed would not have solved the problem. Alamos avoided the cost of deploying physical servers at every mine by using Riverbed SteelFusion: a specialized Riverbed SFED appliance that combines WAN optimization, Virtual Services Platform (VSP) virtualization and SteelFusion Edge. VSP provided local compute resources. After upgrading the channels and taking a master snapshot of the volume, data could be moved back and forth normally. Return on investment: 8 months. A proper disaster recovery procedure appeared.
We started digging into this solution and found two more case studies from the hardware vendor that described our situation almost exactly.
Bill Barrett Corporation needed to upgrade equipment at remote sites. It first considered a traditional "half rack" solution, which was expensive and still would not solve many of the existing problems; on top of that, it would most likely have required widening the channels to those sites, which would have doubled the cost. High cost was not the only drawback: the IT skills of staff at the remote sites were limited, and the proposed solution required someone on site to manage servers, switches and backup equipment. They went with Riverbed SteelFusion as well; the result came out three times cheaper than the traditional design and took up far less rack space.
The law firm Paul Hastings LLP was expanding, opening offices in Asia, Europe and the USA, and the number of its data centers was growing (four central data centers plus many small ones serving 19 offices). This architecture kept the remote offices running, but each data center needed a manager and one or two analysts, plus physical host servers and tape backup systems. That was expensive, and in some regions data protection was not as reliable as the firm would have liked. They chose the same approach, except that their second motive was security.
Accordingly, we costed this option and a couple of traditional architectures and showed them to the customer. The customer thought about it for a long time, asked a pile of questions, and chose this option on the condition that we first deploy a test "branch" on test databases before buying the hardware for the main project.
Here is what we did
We cannot change the channels, but we can put hardware on the data center side and on the branch side. The storage in the data center is carved into many volumes, and each branch works with 2-4 of them (the mapping is injective: no two branches can work with the same volume). At the branches we retire the disk shelves and some of the servers - they are no longer needed as storage systems and replication controllers. In the data center we install plain Riverbed SteelHead CX traffic optimizers plus virtual SteelFusion Core appliances; a pair of Riverbed SFED (SteelFusion Edge) boxes goes out into the field.
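To make that "injective" constraint concrete, here is a minimal sketch (branch names and LUN IDs are made up, not from the real project) that checks a branch-to-LUN assignment: every branch gets its own 2-4 volumes, and no volume is shared between branches.

```python
# Minimal sketch: validate a branch-to-LUN assignment.
# Branch names and LUN IDs below are hypothetical.

from collections import Counter

branch_luns = {
    "branch-kazan":  ["lun-101", "lun-102"],
    "branch-samara": ["lun-103", "lun-104", "lun-105"],
    "branch-perm":   ["lun-106", "lun-107"],
}

def validate(assignment: dict[str, list[str]]) -> None:
    # Each branch works with 2-4 volumes.
    for branch, luns in assignment.items():
        if not 2 <= len(luns) <= 4:
            raise ValueError(f"{branch}: expected 2-4 LUNs, got {len(luns)}")
    # Injectivity: no LUN may appear under more than one branch.
    counts = Counter(lun for luns in assignment.values() for lun in luns)
    shared = [lun for lun, n in counts.items() if n > 1]
    if shared:
        raise ValueError(f"LUNs assigned to more than one branch: {shared}")

validate(branch_luns)  # passes silently if the assignment is valid
```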
Previously, the servers worked with data stored locally (on local disks or low-end storage). Now they work with data on the central storage through a local LUN projection provided by SteelFusion, while still "thinking" that they are working with local volumes on the branch LAN.
The main piece of branch hardware is the Riverbed SFED (SteelFusion Edge); it consists of three components: SteelHead (traffic optimization and compression), SteelFusion Edge (the data-centralization component) and VSP (virtualization with a built-in ESXi hypervisor).
How it works now
- From the branch's point of view it is all one LAN with the central storage (more precisely, with its pair of volumes). Servers address the central storage as if it were a local instance on the LAN.
- On first access, data blocks start streaming from the center to the branch (slowly, but only once). Running a little ahead: we use the Pin the LUN mode, where the cache (blockstore) equals the size of the LUN, that is, a full copy of the data is pulled from the central storage on first start.
- When any data changes on our side, it is queued for synchronization and immediately becomes available "as if from the center", but is actually served from the locally installed SteelFusion Edge cache.
- Data transferred in either direction is optimized with effective compression, deduplication and protocol conversion (bulky, "chatty" protocols are translated by the appliances into forms suited to narrow, high-latency channels).
- Either way, all the data is always stored on the storage system in the central data center.
- "For dessert", there is a neat block-level prefetch. By reading the contents of blocks and applying knowledge of file systems, SteelFusion can work out what the OS is actually doing (booting, launching an application, opening a document). It can then figure out which blocks will be needed next and fetch them before the OS asks for them, which is much faster than doing everything strictly on demand (a minimal sketch of the idea follows this list).
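Here is a toy model of that prefetch idea, assuming nothing about Riverbed's actual implementation: a read-ahead cache that, when it sees sequential block reads, speculatively fetches the next few blocks from the central store before they are requested. The fetch_from_center callback is a hypothetical stand-in for the real WAN read path.

```python
# Toy read-ahead cache, purely illustrative of the prefetch idea.
from typing import Callable, Dict, Optional

class ReadAheadCache:
    def __init__(self, fetch_from_center: Callable[[int], bytes], window: int = 8):
        self.fetch = fetch_from_center
        self.window = window                 # how many blocks to prefetch ahead
        self.blocks: Dict[int, bytes] = {}   # local blockstore
        self.last_block: Optional[int] = None

    def read(self, block_no: int) -> bytes:
        if block_no not in self.blocks:      # cache miss: go to the center
            self.blocks[block_no] = self.fetch(block_no)
        # Sequential pattern detected: warm the next blocks before the OS asks.
        # (A real implementation would do this asynchronously.)
        if self.last_block is not None and block_no == self.last_block + 1:
            for n in range(block_no + 1, block_no + 1 + self.window):
                self.blocks.setdefault(n, self.fetch(n))
        self.last_block = block_no
        return self.blocks[block_no]
```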
In practice, the software ran almost without delay, and the nightly replication was forgotten like a bad dream. Another feature: the center now really holds all of the branch's data. Where local disks used to accumulate the usual "file dump" of scanned documents, presentations and other things that are not business-critical but dear to ordinary users, all of that now also lives in the center and is just as easy to restore after a failure. More on that and on improved fault tolerance a bit further down. And of course, maintaining one storage system is much easier than a dozen. No more painful phone-only support sessions with the "programmer" in a distant small town.
They are mounted like this:
What happened:
- So-called quasi-synchronous replication (to the branch it looks synchronous, to the center it looks like fast asynchronous replication).
- Quick and convenient recovery points appeared, with granularity down to minutes instead of "at best a day old".
- Previously, a fire at a branch meant losing the local file share and the mailboxes of that city's employees. Now all of that is synchronized too and will survive an accident.
- The new box simplified every branch-recovery procedure: a failed office is redeployed on new hardware in a matter of minutes (a new city is rolled out from ready-made images just as quickly).
- We removed the local servers from the branch infrastructure and retired the small field tape drives used for backup. Instead, Riverbed hardware was purchased, the central storage system was extended with disks, and a new large tape library for backups was installed in another Moscow data center.
- Data security improved thanks to unified, easily controlled access rules and stronger channel encryption.
- If a branch suffers a disaster that wipes out its infrastructure, its virtual servers can simply be started in the data center itself. As a result, RTO and RPO drop sharply.
What happens with a short disconnection?
In the current infrastructure - nothing special. Data keeps being written to the SFED cache, and once the channel is restored it is synchronized. When users ask for something "from the center", they are served from the local cache. It is worth adding that the split-brain problem does not arise, because a LUN is accessed only through the SFED of its particular branch - nobody writes to our volume from the data center side.
What happens when a connection breaks for more than 4 hours?
With a long outage, a moment may come when the local SFED cache fills up and the file system reports that there is no space left to write. The data that needs writing cannot be committed, and the system warns the user or server about it.
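A minimal sketch of that write-buffer behavior (my own simplification, not Riverbed's code): while the WAN is up, writes are acknowledged locally and drained to the center in the background; while it is down, they accumulate in the buffer, and once the buffer is full, further writes fail with a "no space" error until the link returns.

```python
# Simplified model of an edge write buffer during a WAN outage (illustrative only).
from collections import deque

class EdgeWriteBuffer:
    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.pending = deque()          # blocks written locally but not yet shipped
        self.wan_up = True

    def write(self, block) -> None:
        if len(self.pending) >= self.capacity:
            # Local cache is full and the backlog cannot grow: surface the error.
            raise OSError("No space left in local blockstore, write rejected")
        self.pending.append(block)      # acknowledged locally right away
        if self.wan_up:
            self.drain()

    def drain(self) -> None:
        # Ship buffered writes to the central storage (stub for the real WAN path).
        while self.pending and self.wan_up:
            self.pending.popleft()      # pretend it has been committed in the center

    def set_wan(self, up: bool) -> None:
        self.wan_up = up
        if up:
            self.drain()                # channel restored: synchronize the backlog
```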
We rehearsed this "failed branch" scenario in Moscow several times. The recovery time adds up as follows:
- Delivery time of spare parts.
- Initial SFED configuration time (<30 minutes).
- Virtual server OS boot time over the communication channel with an empty cache.
Windows boot time over a channel with 100 ms latency comes out at under 10 minutes. The trick is that you do not have to wait for the entire contents of drive C: to reach the local cache before the OS can boot. The boot is, of course, also sped up by the intelligent block prefetch mentioned above and by the usual Riverbed optimizations.
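For intuition, a back-of-the-envelope comparison (all figures below are hypothetical, not measurements from this project): the boot only needs its working set of blocks, not the whole system volume, so the amount of data actually crossing the WAN is a small fraction of the disk.

```python
# Hypothetical numbers, only to show why boot-over-WAN is feasible:
# fetching the boot working set is orders of magnitude less data than the full volume.

link_mbit_s = 20          # assumed branch WAN bandwidth
full_volume_gb = 200      # assumed size of the system LUN
boot_working_set_gb = 1   # assumed blocks actually read during a Windows boot

def transfer_minutes(size_gb: float, mbit_s: float) -> float:
    return size_gb * 8 * 1024 / mbit_s / 60

print(f"full volume : {transfer_minutes(full_volume_gb, link_mbit_s):8.0f} min")
print(f"working set : {transfer_minutes(boot_working_set_gb, link_mbit_s):8.1f} min")
# Compression, deduplication and prefetch shrink the working-set figure further.
```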
Storage Load
Naturally, the load on the central storage system drops, because most of the work is handled by the appliances in the branches. Here is the manufacturer's chart from its storage-optimization tests - the central array simply receives far fewer requests now:
Explanation of Quasi-Synchronous Replication Modes
Earlier I mentioned Pin the LUN, the mode where the cache equals the LUN. Since the cache in the branch devices is not upgradable (you would have to buy a new appliance or put a second one next to it, and by the time you need that - in 4-5 years - there will most likely be a new generation anyway), you have to account for a sharp growth in the amount of data at a branch. Planned growth is sized and provisioned years ahead; unplanned growth is handled by switching to Working Set mode.
That is the mode where the blockstore (the local cache) is smaller than the branch data; typically the cache is 15-25% of the total data volume. In this mode you cannot operate fully autonomously using the cache as a copy of the central LUN: while the channel is down, writes still proceed but go into the buffer, and once the channel comes back the buffered writes are shipped to the center. A read of a block that is not in local storage returns an ordinary connection error; if the block is present, the data is served. My guess is that in five years, when the amount of data exceeds the branch cache capacity, the admins will not buy more hardware: they will centralize mail, put the file share into Working Set mode, and leave the critical data in Pin the LUN mode.
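To make the difference between the two modes concrete, a small sketch (my own simplification, not vendor code): in Pin the LUN mode every block is guaranteed to be local, so reads always succeed even offline; in Working Set mode a read of a block missing from the blockstore while the channel is down fails with an ordinary connection error.

```python
# Illustrative comparison of the two cache modes; not vendor code.

class BranchLUN:
    def __init__(self, mode: str, local_blocks: set[int], wan_up: bool = True):
        assert mode in ("pin_the_lun", "working_set")
        self.mode = mode
        self.local = local_blocks   # blocks currently present in the blockstore
        self.wan_up = wan_up

    def read(self, block_no: int) -> str:
        if block_no in self.local:
            return f"block {block_no} served from local blockstore"
        if self.mode == "pin_the_lun":
            # With the cache sized to the full LUN, a miss should never happen
            # once the initial copy has been pulled from the center.
            raise RuntimeError("unexpected miss: initial warm-up not finished?")
        if not self.wan_up:
            raise ConnectionError(f"block {block_no} not cached and channel is down")
        self.local.add(block_no)    # fetch from the center and keep it locally
        return f"block {block_no} fetched from central storage"
```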
One more thing: adding a second SFED turns the branch into a failover cluster, which may also come in handy later.
Tests and trial operation
This was the first time we did an integration and storage virtualization of this kind - the project differs from others in that the local blockstores are configured as part of an appliance stack together with traffic optimizers and virtualization servers. I tore the device clusters apart and rebuilt them several times to see what problems could come up along the way. A couple of times, during a major reconfiguration, I triggered a full cache re-warm at the branch, but I found a way to avoid it when it is not needed. Among the pitfalls: completely wiping the blockstore on a particular device is something best practiced on a test rig before working with production data. Also, during those same tests we caught one exotic crash of the Core appliance in the data center, described the situation in detail and sent it to the manufacturer, who sent back a patch.
As for recovery time: the wider the channel, the faster the branch's data is restored.
It is important to note that the Core does not prioritize particular SFEDs: storage operations and channels are shared evenly, so you cannot "fast-lane" data to a failed branch through this appliance stack. Hence another recommendation: keep a small reserve of storage throughput for such cases (in our case the storage headroom is more than enough). You can, however, allocate bandwidth per branch using QoS configurations on the same SteelHead or on other network devices - and, conversely, cap SteelFusion synchronization traffic so that the traffic of centralized business applications does not suffer.
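As an illustration of that kind of policy (the percentages and link sizes are made up, and this is not SteelHead configuration syntax), here is a tiny sketch that splits a branch link between business traffic and SteelFusion synchronization, widening the sync share only outside business hours:

```python
# Hypothetical per-branch bandwidth split; numbers are illustrative, not a real QoS config.

def sync_bandwidth_mbit(link_mbit: float, business_hours: bool) -> float:
    """Cap SteelFusion sync traffic so business applications keep most of the link."""
    share = 0.2 if business_hours else 0.8   # assumed policy: 20% by day, 80% at night
    return link_mbit * share

for link in (10, 20, 50):                    # assumed branch link sizes, Mbit/s
    print(f"{link} Mbit/s link -> "
          f"{sync_bandwidth_mbit(link, business_hours=True):.0f} Mbit/s for sync by day, "
          f"{sync_bandwidth_mbit(link, business_hours=False):.0f} Mbit/s at night")
```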
The second most important thing is resilience to link drops. As I understand it, the customer's security team also spoke up for the project - they really liked the idea of keeping all the data in the center. The admins are happy that they no longer have to fly out to the sites, although, of course, some of the local "anykey" support staff lost work, since the tape drives and servers no longer need servicing.
By itself, this Riverbed-based architecture makes a lot of other things possible: sharing printers between cities, building unusual proxies and firewalls, using other cities' server capacity for heavy computations, and so on. None of that was required in our project (at least for now), so all that is left is to marvel at how many more features there are to explore.
Hardware
SteelFusion Edge: a converged appliance that integrates server, storage, network and virtualization to run branch applications locally. No other branch infrastructure is required.
SteelFusion Core: a storage delivery controller that sits in the data center and talks to the storage array. SteelFusion Core projects centralized data out to the branch offices, eliminates branch-side backups, and enables fast deployment of new branches and disaster recovery.
References
You might also be interested in:
- The basic use of traffic optimizers with a local blockstore.
- A practical traffic-optimization case where LAN speeds in the branches were not required (satellite channels).
- A network detective story about hunting down an anomaly (it was an educational exercise; nobody was hurt).
- Here you can download a general description of the solutions and use cases from the vendor (you will have to fill out a form, though).
- And my email, in case your question is not for the comments, or if you want a first estimate of what a similar solution would cost for you: AVrublevsky@croc.ru
- As soon as the day after tomorrow, November 12, we are holding a webinar on how to cut costs in this part of the IT infrastructure. Everything is covered there in detail, with practical calculations and an overview of different courses of action.