How to take control of network infrastructure. Chapter one. Hold

    This article is the first in the series "How to take control of your network infrastructure." The contents of all articles in the series, with links, can be found here.

    I fully admit that there are plenty of companies where a network outage of an hour or even a day is not critical. I, unfortunately or fortunately, never had a chance to work in such places. But, of course, networks differ, requirements differ, approaches differ, and yet, in one form or another, the following list will in many cases in fact be a "must-do".

    So, the initial conditions.

    Are you at a new job, have you been promoted, or have you decided to take a fresh look at your responsibilities? The company network is your area of responsibility. Much of this is new to you, which somewhat justifies the mentoring tone of this article :). But I hope the article can also be useful to any network engineer.

    Your first strategic goal is to learn to resist entropy and maintain the level of service provided.

    Many of the tasks described below can be solved by various means. I deliberately do not touch on the technical implementation, since in principle it often matters less how you solved a particular problem than whether you use the solution at all, and how. Your professionally built monitoring system, for example, is of little use if you do not look at it and do not react to alerts.


    First you need to understand where the greatest risks are.

    Again, it varies. I admit that in one place it will be, say, security issues; in another, issues of service continuity; and somewhere else, perhaps, something different. Why not?

    For definiteness, suppose that it is service continuity (this was the case in all the companies where I worked).

    Then you need to start with the equipment. Here is a list of topics that need attention:

    • equipment classification by criticality
    • redundancy for critical equipment
    • support contracts, licenses

    You should consider possible failures, especially for equipment at the top of your criticality classification. Usually the likelihood of a double failure is neglected, since otherwise your solution and support may become unreasonably expensive; but for truly critical network elements, whose failure can significantly affect the business, it is worth thinking about.


    Suppose we are talking about a root switch in a data center.

    Since we agreed that continuity of service is the most important criterion, it is reasonable to provide "hot" redundancy for this equipment. But that is not all. You also have to decide how long, if the first switch fails, it is acceptable for you to live on the single remaining switch, because there is a risk that it will break too.

    Important! You do not have to resolve this question yourself. You must describe the risks, the possible solutions and their cost to your management or the company's management. They must make the decisions.

    So, if it was decided that, given the small probability of a double failure, running on one switch for 4 hours is in principle acceptable, then you can simply buy the appropriate support contract (under which the equipment will be replaced within 4 hours).

    But there is a risk that it will not be delivered in time. Unfortunately, we once found ourselves in this situation: instead of four hours, the equipment took a week to arrive!!!

    Therefore, this risk also needs to be discussed, and perhaps it will be better for you to buy another (third) switch and keep it as a cold spare or use it for lab purposes.

    Important! Make a table of all the support contracts you have, with their end dates, and add them to the calendar so that at least a month in advance you receive an email reminding you that it is time to start worrying about extending support.
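    The calendar reminder above can also be automated. A minimal sketch in Python, assuming you keep the contract table as a simple list of (equipment, end date) pairs; the device names and dates here are made up:

```python
from datetime import date, timedelta

# Hypothetical support-contract table: (equipment name, contract end date).
CONTRACTS = [
    ("dc1-core-sw1", date(2025, 3, 1)),
    ("dc1-core-sw2", date(2026, 9, 15)),
]

def expiring_soon(contracts, today, horizon_days=30):
    """Return contracts whose support ends within horizon_days of today."""
    deadline = today + timedelta(days=horizon_days)
    return [(name, end) for name, end in contracts if end <= deadline]

if __name__ == "__main__":
    # Run daily (e.g. from cron) and mail the output to the duty engineer.
    for name, end in expiring_soon(CONTRACTS, date.today()):
        print(f"WARNING: support for {name} ends on {end.isoformat()}")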

    You will not be forgiven if you forget to extend support and your equipment breaks the day after it expires.

    Emergency work

    Whatever happens on your network, ideally you should retain access to your network equipment.

    Important! You should have console access to all equipment, and this access should not depend on the operability of the user data network.

    You should also foresee possible negative scenarios and document the necessary actions in advance. The availability of this document is also critical, so it should not only be posted on the department's shared resource but also saved locally on the engineers' computers.

    It must contain:

    • the information required to open a case with the vendor's or integrator's support
    • information on how to reach any piece of equipment (console, management)

    It can, of course, also contain any other useful information, for example, a description of the upgrade procedure for various pieces of equipment and useful diagnostic commands.


    Now you need to evaluate the risks associated with partners. Usually these are:

    • Internet service providers and traffic exchange points (IX)
    • communication channel providers

    What questions should you ask yourself? As with the equipment, you need to consider various emergency scenarios. For internet providers, for example, it could be something like:

    • what if internet provider X stops providing service for some reason?
    • will the remaining providers have enough bandwidth?
    • how good will the connectivity be?
    • how independent are your ISPs, and could a serious problem with one of them cause problems with the others?
    • how many optical entry points does your data center have?
    • what happens if one of the entry points is completely destroyed?

    Regarding entry points: in my practice, in two different companies and two different data centers, an excavator destroyed cable wells, and only by a miracle was our fiber unaffected. This is not such a rare case.

    And, of course, you need not just to ask these questions but, again with management's backing, to ensure an acceptable solution for each situation.


    Next in priority may be backups of equipment configurations. In any case, this is a very important point. I will not list the cases in which you may lose a configuration; it is better to make regular backups and not think about it. Besides, regular backups are very useful for change control.

    Important! Make backups daily. This is not such a large amount of data that it is worth skimping on. In the morning, the duty engineer (or you) should receive a report from the system that clearly indicates whether the backup succeeded; if it failed, the problem should be solved or a ticket created (see the network department's processes).
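    The morning report can be produced by a simple script. A minimal sketch, assuming backups are written as one file per device per day under a naming convention like `<device>_YYYY-MM-DD.cfg` — an illustrative convention, not a standard one:

```python
import os

def backup_report(devices, backup_dir, today):
    """Check that each device has a config backup dated today.

    Assumes backups are saved as <backup_dir>/<device>_YYYY-MM-DD.cfg,
    which is an illustrative convention, not a standard one.
    """
    report = {}
    for device in devices:
        path = os.path.join(backup_dir, f"{device}_{today.isoformat()}.cfg")
        report[device] = "OK" if os.path.exists(path) else "MISSING"
    return report
```

    The resulting dictionary is what the duty engineer sees in the morning; any MISSING entry turns into a ticket.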

    Software Versions

    The question of whether or not to upgrade software is not so straightforward. On the one hand, old versions have known bugs and vulnerabilities; on the other hand, new software means, first, an upgrade procedure that is not always painless and, second, new bugs and vulnerabilities.

    Here you need to find a reasonable balance. A few obvious recommendations:

    • install only stable versions
    • still, do not live on very old software versions
    • make a table showing which software version runs on which equipment
    • periodically read reports on vulnerabilities and bugs in your software versions, and in the case of critical problems consider upgrading

    At this stage, having console access to the equipment, support information and a description of the upgrade procedure, you are in principle ready for this step. The ideal option is to have lab equipment on which you can check the whole procedure, but unfortunately that is a rare luxury.

    For critical equipment, you can ask the vendor's support to assist you with the upgrade.

    Ticket system

    Now you can look around. You need to establish processes for interacting with other departments and within your own.

    Perhaps this is not mandatory (for example, if your company is small), but I would strongly recommend organizing the work so that all external and internal tasks pass through a ticket system.

    A ticket system is essentially your interface for internal and external communications, and you should describe this interface with a sufficient degree of detail.

    Let us consider, for example, the important and frequently encountered task of opening access. I will describe an algorithm that worked perfectly in one of the companies.


    Let us begin with the fact that those requesting access often formulate their wishes in a language incomprehensible to a network engineer, namely in the language of the application, for example, "give me access to 1C".

    Therefore, we have never accepted requests directly from such users.
    And this was the first requirement.

    • requests for access should come from the technical departments (in our case, the unix, windows and helpdesk engineers)

    The second requirement is that

    • this access must be recorded (by the technical department from which we received the request), and what we receive as the request is a link to the page with that recorded access

    The form of the request should be clear to us, that is:

    • the request should state from which subnet to which subnet access is to be opened, as well as the protocol and (in the case of tcp/udp) the ports

    It should also indicate:

    • a description of why this access is being opened
    • whether it is temporary or permanent (and if temporary, for how long)
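    To keep such requests uniform and checkable, the fields above can be captured in a structured record. A minimal sketch; the field names and the 1C example values are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import List, Optional

@dataclass
class AccessRequest:
    """Illustrative machine-readable form of an access request ticket."""
    src_subnet: str             # e.g. the accounting VLAN, "10.1.20.0/24"
    dst_subnet: str             # e.g. the 1C server subnet, "10.2.0.0/24"
    protocol: str               # "tcp", "udp" or "icmp"
    ports: Optional[List[int]]  # required for tcp/udp
    description: str            # why this access is being opened
    permanent: bool
    expires: Optional[str] = None  # ISO date, required if not permanent

def validate(req: AccessRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is complete."""
    problems = []
    for field in ("src_subnet", "dst_subnet"):
        try:
            ip_network(getattr(req, field))
        except ValueError:
            problems.append(f"{field} is not a valid subnet")
    if req.protocol in ("tcp", "udp") and not req.ports:
        problems.append("tcp/udp access must list ports")
    if not req.permanent and not req.expires:
        problems.append("temporary access must have an expiry date")
    return problems
```

    A malformed request bounces back to the technical department with the list of problems instead of landing on a network engineer's desk.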

    And here is a very important point: approvals are required

    • from the head of the department that initiated the access (for example, accounting)
    • from the head of the technical department through which the request reached the network department (for example, helpdesk)

    At the same time, the "owner" of this access is considered to be the head of the department that initiated it (accounting in our example), and he is responsible for keeping the page with the recorded accesses for that department up to date.


    Logs are something you can drown in. But if you want to implement a proactive approach, you need to learn to cope with this data stream.

    Here are some practical recommendations:

    • logs need to be reviewed daily
    • for scheduled review (as opposed to an emergency), you can limit yourself to severity levels 0, 1 and 2, adding selected patterns from other levels if you see fit
    • write a script that parses the logs and discards those matching patterns you have added to the ignore list

    Over time, this approach will let you build up an ignore list of logs that do not interest you and leave only those you really consider important.
    It worked perfectly for us.
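    The filtering script can be as simple as a list of regular expressions. A minimal sketch; the two example patterns are Cisco-style syslog messages chosen for illustration, and your own ignore list will of course be different:

```python
import re

# Hypothetical ignore list: patterns for routine log lines you have already
# triaged and decided not to review daily. It grows over time.
IGNORE_PATTERNS = [
    re.compile(r"%LINEPROTO-5-UPDOWN: .* changed state to up"),
    re.compile(r"%SYS-6-LOGGINGHOST_STARTSTOP"),
]

def filter_logs(lines, ignore=IGNORE_PATTERNS):
    """Drop lines matching any ignore pattern; return what still needs eyes."""
    return [line for line in lines
            if not any(p.search(line) for p in ignore)]
```

    Whatever survives the filter is the morning reading list; anything you decide is noise becomes a new pattern in the ignore list.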


    It is not uncommon for a company to have no monitoring system at all. You can, for example, rely on logs, but equipment may simply "die" without having had time to "say" anything, or the udp packet carrying the syslog message may be lost and never arrive. In general, active monitoring is, of course, important and necessary.

    The two most popular examples in my practice are:

    • monitoring the load of communication channels and critical links (for example, connections to providers). This lets you see a potential service-degradation problem caused by dropped traffic in advance and, accordingly, avoid it.
    • graphs based on NetFlow. They make it easy to spot traffic anomalies and are very useful for detecting some simple but significant types of hacker attacks.
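    Channel-load monitoring of the first kind boils down to comparing samples against a threshold. A minimal sketch, assuming you already collect load values (for example, from periodically polled SNMP interface counters); the 80% threshold is an illustrative default, not a recommendation:

```python
def utilization_alerts(samples, capacity_bps, threshold=0.8):
    """Flag measurements where link load exceeds threshold * capacity.

    samples is a list of (timestamp, bits_per_second) pairs, e.g. computed
    from periodically polled SNMP interface counters.
    """
    return [(ts, bps) for ts, bps in samples
            if bps > threshold * capacity_bps]
```

    A link that keeps appearing in the output is your cue to order more bandwidth before users notice degradation.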

    Important! Configure SMS alerts for the most critical events. This applies to both monitoring and logging. If you have no duty shift, the SMS messages should also arrive outside working hours.

    Think the process through so that it does not wake up all the engineers. We had a duty engineer for this.

    Change control

    In my opinion, it is not necessary to control every change. But in any case, you should be able, when necessary, to easily find out who made particular changes in the network and why.

    Some tips:

    • use the ticket system for a detailed description of what was done within a ticket, for example, by copying the applied configuration into the ticket
    • use the comment features of your network equipment (for example, commit comment on Juniper), where you can record the ticket number
    • use diffs of your configuration backups

    You can make this a process by reviewing all change tickets daily.
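    Diffing two configuration backups needs nothing beyond the standard library. A minimal sketch using Python's difflib; the file-label names are illustrative:

```python
import difflib

def config_diff(old_lines, new_lines, name="config"):
    """Unified diff of two config backups; an empty string means no change."""
    diff = difflib.unified_diff(
        old_lines, new_lines,
        fromfile=f"{name}.yesterday", tofile=f"{name}.today",
        lineterm="",
    )
    return "\n".join(diff)
```

    An empty result for every device means yesterday's backups match today's; anything else goes into the daily change review and should correspond to a ticket.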


    You must formalize and describe the processes in your team. If you have reached this point, at least the following processes should already be running in your team:

    Daily processes:

    • work with tickets
    • work with logs
    • change control
    • daily checklist

    Annual processes:

    • renewal of support contracts and licenses

    Asynchronous processes:

    • reaction to various emergencies

    Conclusion of the first part

    You have noticed that none of this is about configuring the network, about design, about network protocols, about routing, about security... It is everything around that. And these, though perhaps boring, are certainly very important elements of a network unit's work.

    So far, as you can see, you have not improved anything in your network. If there were security vulnerabilities, they remain; if the design was bad, it is still bad. You have not yet applied your skills and knowledge as a network engineer, on which you most likely spent a great deal of time, effort and sometimes money. But first you need to create (or strengthen) the foundation, and only then take up construction.

    How to find and fix problems, and then improve your infrastructure, is the subject of the following parts.

    Of course, none of this has to be done strictly in sequence. Time may be critical; do things in parallel if resources allow.

    And one important addition. Communicate, ask questions, consult with your team. In the end, they are the ones who will do and support all of this.
