A day in the life of a seasoned admin, or a story about how to tame storage

Today we will talk about the heroic everyday life of admins and storage systems. In this article we tell two real stories of storage deployments and share our experience of implementing and operating storage solutions. The names of the participants are, of course, fictitious.

Story 1. How the admin was tempered


The harsh working week of the administrator Petya has begun. In the evening the next batch of equipment arrived together with the storage system, and users are already groaning about when new storage resources will be allocated to them. So the system administrator, regardless of the weather and the end of the working day, runs to his data center (or server room, whichever you prefer). His main goal is there: the storage system he has read so much about on the manufacturer's website and whose workings he has studied practically from booklets alone. After all, it was he who defended the purchase of this system before his IT director and weighed a thousand pros and cons. Now the moment has come, and happiness is very close.

Long-awaited meeting


Entering the server room, he finds the long-awaited box with the storage system. A few moments later the system is unpacked and gleaming, its logo shining merrily under the LED strip of the general lighting. The administrator knows the entire TIA/EIA-569 standard and has, of course, already made sure that the floor in the server room will withstand this "beast", because the storage system weighs no more and no less than a baby forest elephant; just try lifting it. Petya dreams of seeing the new unallocated space in the web console of the storage system, but a question arises: how to connect it to the existing system, and how to perform the upgrade?

A new hero on stage


Petya is calm: Kolya, a service engineer from the storage manufacturer, comes to his rescue. He is the one who took the important courses on servicing storage systems (or failed them; everything happens for the first time at some point). He carries secret documentation that describes, and sometimes even shows, how to set the mass of jumpers and connect the cables in order to perform the upgrade. Having first run to the nearest market and changed the outlet from a two-phase to a three-phase connection, as the documentation requires, and having successfully connected the storage system to power, Petya and Kolya switch on the new system with sinking hearts. Suddenly they notice that the main system goes into Recovery mode, which means a serious system failure. Yuri Petrovich, Petya's boss, calls and, having heard that everything is going according to plan, goes off into another dimension.

Chipmunks rush to the rescue


Kolya tells Petya that there are no hopeless situations, because there is round-the-clock support, and he will call them at night and ask what to do, as the documentation clearly says in big red letters: “CONTACT SUPPORT, FURTHER ACTIONS MAY CAUSE SERIOUS DAMAGE TO YOUR HARDWARE.” Since there is no mobile reception anywhere except at the security checkpoint, Kolya calls support from there and hears a pleasant English-speaking voice on his phone, which promises to help immediately if Kolya sends several megabytes of debugging information by e-mail. Having written down the reply address of his colleague from sunny India, Nikolai waits for the message to leave the Outbox folder, willing it with all his might to reach the addressee. Petya wastes no time and checks how his systems are working, in case disks have suddenly “dropped off” or something else has gone wrong. Having received the answer, Kolya begins to review the history of his actions and discovers that the upgrade procedure in the documentation politely recommended connecting the cables to different connectors on the storage system, and also contained the message: “WARNING! CABLE PLUGGED INCORRECTLY MAY CAUSE SERIOUS DAMAGE TO YOUR HARDWARE.”

"Eureka!" - exclaims Kolya.

“But how can that be?” - Petya retorts, and continues: “My storage system is like a Boeing aircraft; surely you could provide a sticker and place it next to the right connector, so that nobody mixes it up when connecting.”

The situation soon changes: Kolya completes his work, and the storage system returns to normal mode. All systems work as expected, and the UPS was not even overloaded, since Petya had calculated everything in advance when planning the power consumption of the new equipment.
And now the long-awaited moment has arrived. At 8:00, at the scheduled meeting, Petya reports that the storage system is ready for use.

Shall we review the mistakes?


The story we told at the beginning of the article is taken from a real situation at a large company (the names are made up) that acquired storage systems. In fact, there are many such stories, and we have plenty to tell, if only to show how our compatriots manage to solve complex problems. Indeed, a foreign engineer will often not figure out how to solve a complex problem without clear step-by-step instructions from remote technical support.

Let's try to put all the problems together. Here are some of them:

  1. The human factor - the company has only one experienced administrator on staff.
  2. Engineering infrastructure not ready - the power supply was not prepared, since the three-phase connector required for the storage system was missing.
  3. Lack of qualifications - the service engineer from the storage vendor was not sufficiently qualified.
  4. Ineffective technical support - the remote support chain proved very cumbersome in practice.
  5. “Last mile” - assembling the storage system is complex and poorly documented, which led to the engineer's critical error; according to the warning in the documentation, that error could have led to a shutdown of the storage system.

Do we still not have competent service engineers?


Many will want to answer that there is no such problem, since any self-respecting buyer of storage systems will always train its staff and pay for the vendor's courses. Objectively, though, not every problem can be solved simply by sending your administrator to the storage vendor's courses. There are several reasons for this:

  1. The manufacturer's training courses do not teach the commissioning of storage systems.
  2. The goal of the courses is to instill “user” skills: the trainee should not break the system and should perform the simple duties of keeping the storage system in working condition.
  3. Courses motivate administrators to develop in the direction of one particular storage vendor and form a community of its fans, but they do not form an objective view of storage, i.e. they do not give basic knowledge about the operating principles of storage systems and their internal structure.

In general, almost no vendor training course sets itself the task of preparing an autonomous flying Karlsson with a jet engine and an adjustable wrench, able to help at any moment and replace a broken engine on an airplane in flight.

And what does the storage vendor want?


The vendor has a very specific goal: to form a clear idea of its product, to offer its own wording for numerous terms, such as RAID terminology and many other features, and, most importantly, to gently form the opinion that it is indispensable.

Indeed, in practice any Russian system integrator is completely dependent on the storage vendor, and this must be said honestly. Even the presence of service centers in our country does not change the situation. The reason is simple: the storage technology of foreign vendors is developed not in Russia but abroad. Therefore, real expertise from representatives of global vendors is absent locally in Russia. There are simply no such engineers here who are able to develop storage software or to diagnose and repair complex storage components (controllers, disks).

The situation is changing with the emergence of small commercial organizations that are beginning to produce their own storage systems, and our example is not the only one today. We want to tell the community that it is possible to design storage systems that take the current market conditions in Russia into account, and that we know how to do it.

Lack of qualifications: what do we end up with in practice?


One administrator studied with vendor A, and another with vendor B. They decided to discuss how RAID 60 works and how many disks it should contain, and they could not agree. And when it came to disk configuration, each defended his own system and vendor.


Vendors present information about how their storage systems are built in such a way that the consumer can understand the functional purpose of a particular part of the system, but cannot understand the operating principles of this complex system as a whole.

Consider one simple practical example.


Using a “long-range” optical SFP transceiver requires the administrator to know which operating wavelengths exist and, accordingly, which types of fiber-optic cable the transceiver supports. A simple mistake in choosing the optical cable will cost a lot of time spent searching for the cause of a performance problem in the storage system and contacting the vendor's technical support, when the real reason lies on the surface.
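To make the transceiver example concrete, here is a minimal sketch of the kind of check an administrator could run before patching a link. The wavelength table is a simplified assumption based on typical optics (850 nm short-reach modules on multimode fiber, 1310 nm and 1550 nm long-reach modules on single-mode fiber); the function and table names are hypothetical, and the datasheet of the actual transceiver always takes precedence.

```python
# Simplified, illustrative table: nominal wavelength (nm) -> fiber family.
# This is an assumption for the sketch; always confirm with the datasheet.
TYPICAL_FIBER_BY_WAVELENGTH_NM = {
    850: "multimode",     # short-reach optics
    1310: "single-mode",  # long-reach optics
    1550: "single-mode",  # extended-reach optics
}

# Common cable grades and the fiber family they belong to.
FIBER_FAMILY_BY_GRADE = {
    "OM1": "multimode",
    "OM2": "multimode",
    "OM3": "multimode",
    "OM4": "multimode",
    "OM5": "multimode",
    "OS1": "single-mode",
    "OS2": "single-mode",
}


def check_link(wavelength_nm: int, cable_grade: str) -> str:
    """Return 'ok', a 'mismatch: ...' message, or 'unknown: ...'."""
    expected = TYPICAL_FIBER_BY_WAVELENGTH_NM.get(wavelength_nm)
    actual = FIBER_FAMILY_BY_GRADE.get(cable_grade.upper())
    if expected is None or actual is None:
        return "unknown: check the transceiver and cable datasheets"
    if expected == actual:
        return "ok"
    return (
        f"mismatch: {wavelength_nm} nm optics expect {expected} fiber, "
        f"but cable {cable_grade} is {actual}"
    )


if __name__ == "__main__":
    print(check_link(1310, "OS2"))  # long-reach optics on single-mode fiber
    print(check_link(1310, "OM3"))  # long-reach optics on multimode fiber
```

A check like this does not replace link diagnostics, but it catches the cable mistake described above before anyone opens a support case.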

Thus, in addition to the usual storage tuning skills, basic engineering knowledge of data transfer standards and protocols is required, which, unfortunately, is not taught in full in storage vendors' courses. The reason is mundane: vendors cannot disclose how the components of their storage systems are built, since some of the characteristics claimed in marketing materials might not be confirmed.
In our opinion, many of the problems encountered in servicing storage systems can and should be detected and diagnosed automatically.

How can you protect yourself from such problems?


In our opinion, the only solution to such problems is new knowledge and experience.

That is why we developed an SDK that allows low-level control of I/O operations at the level of SCSI commands. Using Broadcom's Fibre Channel adapters, we obtain all the necessary information about the link status at the data link layer. Almost all commands of the SCSI SPC-3 standard are implemented in our SDK. With it, you can emulate SCSI devices (disk, VTL) and analyze problem areas in the SAN.
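To give a feel for what working "at the level of SCSI commands" means, here is a minimal sketch in plain Python. It is not the SDK described above and uses none of its API; it only builds the 6-byte INQUIRY command descriptor block and decodes the fixed fields of a standard INQUIRY response as laid out in SPC-3. The function names and the sample response bytes are hypothetical.

```python
import struct

INQUIRY_OPCODE = 0x12  # INQUIRY operation code defined by SPC


def build_inquiry_cdb(allocation_length: int = 96,
                      evpd: bool = False,
                      page_code: int = 0) -> bytes:
    """Build a 6-byte INQUIRY CDB (SPC-3 layout).

    Byte 0: opcode, byte 1: EVPD bit, byte 2: page code,
    bytes 3-4: allocation length (big-endian), byte 5: control.
    """
    if not 0 <= allocation_length <= 0xFFFF:
        raise ValueError("allocation length must fit in 16 bits")
    if not 0 <= page_code <= 0xFF:
        raise ValueError("page code must fit in 8 bits")
    if not evpd and page_code != 0:
        raise ValueError("page code must be 0 when EVPD is not set")
    return struct.pack(">BBBHB", INQUIRY_OPCODE, int(evpd),
                       page_code, allocation_length, 0)


def parse_standard_inquiry(data: bytes) -> dict:
    """Decode the fixed fields of standard INQUIRY data (first 36 bytes)."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")

    def text(raw: bytes) -> str:
        return raw.decode("ascii", errors="replace").strip()

    return {
        "peripheral_device_type": data[0] & 0x1F,  # 0x00 = disk, 0x01 = tape
        "vendor": text(data[8:16]),
        "product": text(data[16:32]),
        "revision": text(data[32:36]),
    }


if __name__ == "__main__":
    print(build_inquiry_cdb().hex())
    # Hypothetical response, as an emulated disk might return it.
    fake = bytes(8) + b"ACME    " + b"Virtual Disk    " + b"0001"
    print(parse_standard_inquiry(fake))
```

Sending the CDB to a real device requires a transport (for example an operating system pass-through interface or an HBA driver), which is exactly the part a low-level SDK provides and which this sketch leaves out.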

Does the inquisitive “Russian” engineering mind exist?


If we consider how RAID is organized in the storage systems of various vendors, then the reason for the disputes, in our opinion, is simply a reluctance to understand the essence of the issue. The paper “A Case for Redundant Arrays of Inexpensive Disks (RAID)” by David Patterson, Garth A. Gibson, and Randy Katz describes the principles of RAID and its variants. From it you can learn how RAID is built, without taking the particular engineering decisions of individual storage vendors as axioms. Of course, there are many such particular decisions and differences among storage vendors. Sometimes basic knowledge of how a system works helps you get to the heart of the problem and understand a difficult situation.
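The vendor-independent core of parity-based RAID fits in a few lines. The sketch below is an illustration of single-parity protection, not any vendor's implementation: the parity block is the byte-wise XOR of the data blocks, and any one lost block is rebuilt by XOR-ing the survivors with the parity. The function name and sample blocks are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # One stripe spread over three data disks (sample data).
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
    parity = xor_blocks(data)

    # Disk 1 fails: rebuild its block from the survivors and the parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

Stripe size, parity placement and the number of disks per group are where vendors differ; the XOR relationship itself is the same everywhere.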

What's in a name? Estimating storage capacity


Vendors formulate their principles of RAID operation based on their own commercial benefit. Take, for example, the calculation of storage capacity, where one megabyte is taken to equal one thousand kilobytes of stored information. It is not hard to calculate the losses from such accounting, when the client pays for real gigabytes but is systematically short-changed.
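The size of the gap is easy to work out. The sketch below compares decimal units (powers of 1000) with binary units (powers of 1024); the function names are hypothetical and the 100 TB figure is only a sample.

```python
DECIMAL_TB = 10 ** 12  # bytes in one terabyte counted in powers of 1000
BINARY_TIB = 2 ** 40   # bytes in one tebibyte counted in powers of 1024


def advertised_tb_to_tib(advertised_tb: float) -> float:
    """Convert capacity quoted in decimal TB into binary TiB."""
    return advertised_tb * DECIMAL_TB / BINARY_TIB


def shortfall_percent(prefix_level: int) -> float:
    """Percentage by which a decimal unit falls short of the binary one.

    prefix_level: 1 = kilo, 2 = mega, 3 = giga, 4 = tera.
    """
    decimal_unit = 10 ** (3 * prefix_level)
    binary_unit = 2 ** (10 * prefix_level)
    return (1 - decimal_unit / binary_unit) * 100


if __name__ == "__main__":
    for level, name in [(2, "mega"), (3, "giga"), (4, "tera")]:
        print(f"{name}: {shortfall_percent(level):.2f}% smaller")
    print(f"100 TB = {advertised_tb_to_tib(100):.2f} TiB")
```

The gap grows with each prefix: roughly 4.6% at the megabyte level, 6.9% at the gigabyte level and 9.1% at the terabyte level, so 100 TB quoted in decimal units is about 90.95 TiB.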

Our principle is to evaluate different storage systems on the same scale, i.e. by the result they allow you to obtain. Of course, the set of characteristics that consumers evaluate is different for everyone; it includes total cost of ownership, cost of upgrades, cost of technical support, and so on.

When evaluating storage systems, we recommend using the following approaches:

  1. Use the same load-testing tool for all storage systems (a proprietary tool, fio, etc.).
  2. Record the microcode version installed on the storage system at the time of testing.
  3. Be sure to check the disk microcode versions and record them in the report.
  4. Test with real systems rather than limiting yourself to synthetic tests.
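The first three recommendations can be sketched as a small record that keeps the workload and the microcode versions together, so that two runs are compared only when the workload was identical. This is an illustration under assumptions: the class, the function names, the device paths and the version strings are hypothetical, the workload parameters are sample values, and the code only builds a fio command line without executing it.

```python
from dataclasses import dataclass
from typing import Dict, List


def build_fio_command(target: str, rw: str = "randread",
                      block_size: str = "4k", iodepth: int = 32,
                      runtime_s: int = 60) -> List[str]:
    """Build (but do not run) a fio command line with fixed parameters."""
    return [
        "fio",
        "--name=storage-eval",
        f"--filename={target}",
        f"--rw={rw}",
        f"--bs={block_size}",
        f"--iodepth={iodepth}",
        "--ioengine=libaio",
        "--direct=1",
        "--time_based",
        f"--runtime={runtime_s}",
        "--output-format=json",
    ]


@dataclass(frozen=True)
class RunRecord:
    """What goes into the report next to the measured numbers."""
    array_microcode: str
    disk_microcode: Dict[str, str]
    command: List[str]


def workload_args(command: List[str]) -> List[str]:
    """Everything except the target device, which differs per system."""
    return [arg for arg in command if not arg.startswith("--filename=")]


def comparable(a: RunRecord, b: RunRecord) -> bool:
    """Two runs are comparable only if the workload was identical."""
    return workload_args(a.command) == workload_args(b.command)


if __name__ == "__main__":
    # Hypothetical device paths and version strings.
    run_a = RunRecord("A-1.0", {"disk0": "X1"}, build_fio_command("/dev/sdX"))
    run_b = RunRecord("B-2.3", {"disk0": "Y7"}, build_fio_command("/dev/sdY"))
    print(comparable(run_a, run_b))
```

The default workload here is read-only on purpose; write tests against a raw device destroy the data on it.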

The question arises: which of these approaches can clients actually control? In practice, the client controls none of them when choosing a storage system. The situation is aggravated by the fact that Russian companies are effectively held hostage by the pricing policy of foreign vendors.

Storage vendor myths


As a rule, consumers focus on the external marketing characteristics of storage systems. But what if we look inside the product itself? It turns out there is nothing unusual there: recalling the practice of a certain three-letter vendor 10 years ago, many storage systems used an ordinary Pentium III processor. If you analyze the designs of foreign vendors, the hardware platform of a storage system is always the cheapest and simplest one. There is a widespread myth that a reliable storage solution requires a very sophisticated hardware platform, and that it is this platform that provides high reliability. A number of vendors do design complex digital components for their storage systems, but for other reasons, which often come down to simple cost savings. By the roughest estimates, the total cost of the hardware in a storage system does not exceed 10-15% of its actual price for the end user. The client pays not for the hardware platform but for the software that makes this hardware work.

Many Russian developers are now "dabbling" in PCI Express-based systems, which foreign companies in the storage market have been actively using for more than 10 years. The fact is that progress in storage is determined not by the development of complex digital components but by the creation of simple, multifunctional circuits in which most of the logic lives on the software side.

A vendor's art in designing storage systems lies precisely in creating a universal storage platform (software) that can be moved to any hardware platform at minimal cost.

There is now a new trend among foreign storage vendors: the use of virtualization.

A storage virtualization system is usually positioned as a universal means of access to any other storage systems, one that hides the peculiarities of the client's existing zoo of storage systems and at the same time improves their performance. Of course, the "compatibility matrix" published by storage vendors is also worth mentioning.

Both the first idea, storage virtualization, and the second, the compatibility matrix, are completely false. Imagine a storage vendor keeping in its hangar every system available on the storage market, along with a whole staff of specially trained people who check each driver and each operating system version for compatibility and accurately record the results in a matrix. Given the struggle for financial results and the fierce competition in the market, many vendors are simply unable to support such systems and maintain compatibility matrices. As a result, every client applies the next software update at their own risk.

Consider a case from the practice of one client who has acquired a storage virtualization system.

Story 2. How the administrator tamed the storage system


The storage administrator Petya proceeds to configure his recently installed virtual storage system. Deep expertise and self-confidence allow Petya to carefully configure data access for his servers through the new storage virtualization system. Checking the settings on each server, Petya makes sure that all the “paths” to the disks lead to the new storage virtualization system. Now all services are under control, including the most important ones: e-mail and electronic payment processing.

Then comes the period of peak user load, and data access error messages appear in the operating system log. Long and painful negotiations with technical support show that everything is set up correctly, and that to finally solve the performance problem the storage system must be upgraded and additional solid-state drives (SSDs) purchased. This prospect pleases neither Petya nor his boss, Yuri Petrovich, who long and painfully defended the budget for the storage purchase. “What to do?” - thinks Yuri Petrovich. “I could have chosen a solution from another vendor. It might have been more expensive, but it could have been more reliable, and these problems would not have happened now.”

As a result, the story ends with a gradual migration back to the old solution and the abandonment of the new virtual storage system. And let's not forget that depreciation charges for the new storage system continue, and it remains a burden on the organization's budget.

Why is it possible to compete with foreign storage vendors?


In our opinion, the answer is very simple: foreign vendors use a vendor lock-in strategy, so any architecture of a foreign storage system will have a serious flaw, which means there is a niche for replacing such systems. We keep track of all the changes made by foreign vendors and understand the design of solutions from more than 10 global manufacturers, such as Hitachi, Dell (EMC), HPE, NetApp, etc.

How can you avoid problems during the operation of storage and get the necessary experience in programming storage?


We are launching a school for storage developers. First of all, we invite students of Russian universities, free of charge.

School participants will be able to really learn how to work with the SCSI and NVMe protocols, learn new data protection algorithms, and try this knowledge in practice with virtualization systems based on VMware. In the laboratory sessions we will talk about the methods and principles of load testing, as well as the main criteria for evaluating storage performance. We will also pay attention to data migration and talk about free and effective ways of migrating data between storage systems. Anyone can sign up for our school here.

In the following articles we will talk about machine learning in storage systems, as well as share information about our new storage models.
