How to take control of your network infrastructure. Chapter two: Cleaning and documenting

This article is the second in the series "How to take control of your network infrastructure." The contents of the whole series and links to the other articles can be found here.

Our goal at this stage is to bring order to the documentation and the configuration.
By the end of this process you should have the necessary set of documents and a network configured in accordance with them.

We will not talk about the security audit here; the third part is devoted to it.

Of course, the complexity of this task varies greatly from company to company.

The ideal situation is when:

Your network was built according to a design, and you have a complete set of documents.
A change control and change management process has been implemented for your network.
In accordance with this process, you have documents (including all the necessary diagrams) that give complete information on the current state of affairs.

In this case, your task is quite simple. You need to study the documents and review all the changes that have been made.

In the worst case, you will have:

a network built without a design, without a plan, without coordination, by engineers who lack sufficient qualifications,
with chaotic, undocumented changes, and with a lot of “garbage” and suboptimal solutions.

Your situation is clearly somewhere in between, but unfortunately, on this scale from better to worse, you will most likely be closer to the worse end.

In this case, you will also need the ability to read minds, because you have to work out what the “designers” wanted to do, reconstruct their logic, finish what was left unfinished and remove the “garbage”.
And, of course, you will need to fix their mistakes, change the design (as little as possible at this stage), and update or re-create the diagrams.

This article in no way claims to be complete. Here I will describe only the general principles and focus on some common problems that have to be addressed.

Set of documents

Let's start with an example.

Below are some of the documents that are usually created at Cisco Systems during the design phase.

CR - Customer Requirements (the technical specification).
It is created together with the customer and defines the requirements for the network.

HLD - High Level Design, a high-level design based on the network requirements (CR). The document explains and justifies the architectural decisions made (topology, protocols, equipment selection, ...). The HLD does not contain design details such as the interfaces used or IP addresses, and it does not discuss specific hardware configuration. It is intended rather to explain the key design concepts to the customer's technical management.

LLD - Low Level Design, a low-level design based on the high-level design (HLD).
It should contain all the details needed to implement the project, such as how to connect and configure the equipment. It is a complete guide to implementing the design, and it should provide enough information for the work to be done even by less experienced personnel.

Some things, such as IP addresses, AS numbers and the physical connection scheme (cabling), can be moved out into separate documents, such as the NIP (Network Implementation Plan).

Construction of the network begins after these documents have been created and proceeds in strict accordance with them. The customer then tests the network for compliance with the design.

Of course, different integrators, different customers and different countries may have different requirements for project documentation. But I would like to avoid formalities and consider the issue on its merits. This stage is not about design but about putting things in order, and we need a set of documents (diagrams, tables, descriptions, ...) that is sufficient for our tasks.

In my opinion, there is an absolute minimum without which it is impossible to control a network effectively.

These are the following documents:

a physical connection log (cabling)
a network diagram, or several diagrams, with the essential L2 / L3 information

Physical connections (cabling)

In some small companies, work related to equipment installation and physical connections (cabling) is the responsibility of the network engineers.

In this case, the problem is partially solved by the following approach:

use the interface description to record what is connected to each port
administratively shut down (shutdown) all unused ports on network equipment

This lets you quickly determine what is connected to a port even when there is a problem with the link (when CDP or LLDP is not working on that interface).
You can also easily see which ports are occupied and which are free, which is necessary when planning connections for new network equipment, servers or workstations.
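
The two rules above are easy to check automatically. Below is a minimal sketch in Python, assuming an IOS-style configuration in which each interface block starts with "interface ..." and its settings are indented by one space. The sample configuration, the function names and the device names are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical IOS-style configuration fragment, used only for illustration.
SAMPLE_CONFIG = """\
interface GigabitEthernet0/1
 description srv-db-01 eth0, rack A3
 switchport mode access
!
interface GigabitEthernet0/2
 switchport mode access
!
interface GigabitEthernet0/3
 shutdown
!
interface GigabitEthernet0/4
 description FREE
 shutdown
!
"""


def audit_ports(config_text):
    """Return {interface: {"description": str or None, "shutdown": bool}}."""
    ports = {}
    current = None
    for line in config_text.splitlines():
        match = re.match(r"^interface\s+(\S+)", line)
        if match:
            current = match.group(1)
            ports[current] = {"description": None, "shutdown": False}
        elif current is not None and line.startswith(" "):
            stripped = line.strip()
            if stripped.startswith("description "):
                ports[current]["description"] = stripped[len("description "):]
            elif stripped == "shutdown":
                ports[current]["shutdown"] = True
        else:
            current = None  # an unindented line ends the interface block
    return ports


def undocumented_active_ports(ports):
    """Ports that are enabled but carry no description."""
    return sorted(name for name, p in ports.items()
                  if not p["shutdown"] and p["description"] is None)


ports = audit_ports(SAMPLE_CONFIG)
print(undocumented_active_ports(ports))
```

Every port reported by such a check is either in use and missing a description, or unused and not shut down. Either way it breaks one of the two rules.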

But clearly, if you lose access to the equipment, you also lose access to this information. Moreover, this approach cannot record such important information as what the equipment is, how much power it consumes, how many ports it has, which rack it is in, which patch panels are there, and where (to which rack / patch panel) they are connected. Therefore, additional documentation (not just descriptions on the hardware) is very useful.

The ideal option is to use applications designed for this kind of information. But you can limit yourself to simple tables (for example, in Excel) or show the information you consider necessary on the L1 / L2 diagrams.
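
As a rough illustration of the "simple table" option, the sketch below keeps a connection log as plain records and exports it to CSV, which any spreadsheet can open. All column names and sample values are hypothetical. Choose the fields that matter in your environment.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class CablingRecord:
    # All column names are hypothetical; adapt them to your own inventory.
    device: str
    port: str
    rack: str
    patch_panel: str
    patch_port: str
    remote_end: str


records = [
    CablingRecord("sw-access-01", "Gi0/1", "A3", "PP-A3-1", "12", "srv-db-01 eth0"),
    CablingRecord("sw-access-01", "Gi0/2", "A3", "PP-A3-1", "13", "printer-2f"),
]


def to_csv(rows):
    """Serialize records to CSV text that opens directly in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CablingRecord)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def find_port(rows, device, port):
    """Look up what is connected to a given device port."""
    for row in rows:
        if row.device == device and row.port == port:
            return row
    return None


print(to_csv(records))
print(find_port(records, "sw-access-01", "Gi0/2").remote_end)
```

Because the log lives outside the equipment, it remains available when the switch itself is unreachable.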

Important!

A network engineer can, of course, know the subtleties and standards of structured cabling systems (SCS) quite well: types of racks, types of uninterruptible power supplies, what cold and hot aisles are, how to do proper grounding, ... just as, in principle, he can know particle physics or C++. But it is important to understand that none of this is his area of expertise.

Therefore, it is good practice to have either dedicated departments or dedicated people handle the installation, connection and maintenance of equipment, as well as the physical cabling. For data centers these are usually data center engineers, and for an office it is the help desk.

If your company has such teams, then keeping the physical connection log is not your task, and you can limit yourself to interface descriptions and the administrative shutdown of unused ports.

Network diagrams

There is no universal approach to drawing diagrams.

The most important thing is that the diagrams make it clear how traffic will flow and through which logical and physical elements of your network.

By physical elements we mean:

active equipment
interfaces / ports of active equipment

By logical elements we mean:

logical devices (N7K VDC, Palo Alto VSYS, ...)
VRFs
VLANs
subinterfaces
tunnels
zones
...

Also, unless your network is completely elementary, it will consist of different segments.
For example:

data center
Internet
WAN
remote access
office LAN
DMZ
...

It is reasonable to have several diagrams that give both the general picture (how traffic travels between all these segments) and a detailed view of each individual segment.

Since modern networks can have many logical layers, a good (but not obligatory) approach may be to draw separate diagrams for separate layers. For example, with an overlay design these could be the following diagrams:

overlay
L1 / L2 underlay
L3 underlay

Of course, the most important diagram, without which it is impossible to understand the idea behind your design, is the routing diagram.

Routing diagram

At a minimum, this diagram should show:

which routing protocols are used and where
basic information about routing protocol settings (area / AS number / router-id / ...)
on which devices redistribution takes place
where route filtering and aggregation occur
default route information

An L2 (OSI) diagram is also often useful.

L2 diagram (OSI)

This diagram may show the following information:

which VLANs are used
which ports are trunk ports
which ports are aggregated into an EtherChannel (port channel) or virtual port channel
which STP variants are used and on which devices
basic STP settings: root / backup root, STP cost, port priority
additional STP settings: BPDU guard / filter, root guard, ...

Typical design errors

An example of a bad approach to building a network.

Let's take the simple example of building an office LAN.

Having taught telecommunications to students, I can say that by the middle of the second semester virtually every student has the knowledge (within the course I taught) needed to set up a simple office LAN.

What is so difficult about connecting switches to each other, configuring VLANs and SVI interfaces (in the case of L3 switches), and setting up static routing?

Everything will work.

But at the same time, issues related to the following are often overlooked:

security
redundancy
network scaling
performance
bandwidth
reliability
...

From time to time I hear the claim that an office LAN is something very simple. I usually hear it from engineers (and managers) who do everything except networking, and they say it so confidently that you should not be surprised if the LAN ends up being built by people with insufficient practice and knowledge, and with roughly the mistakes I describe below.

Typical L1 (OSI) design errors

If you are nevertheless also responsible for the structured cabling system (SCS), then one of the most unpleasant legacies you can inherit is careless, poorly thought-out cabling.

I would also classify as L1 errors those related to the resources of the equipment used, for example:

insufficient bandwidth
insufficient TCAM on the equipment (or inefficient use of it)
poor performance (often related to firewalls)

Typical L2 (OSI) design errors

Often, when there is no good understanding of how STP works and what potential problems it brings, the switches are connected chaotically, with default settings and without any additional STP tuning.

As a result, we often end up with the following:

a large STP network diameter, which can lead to broadcast storms
the STP root is determined randomly (based on MAC addresses), and the traffic path is not optimal
ports connected to hosts are not configured as edge ports (portfast), which causes STP recalculation when end stations are turned on or off
the network is not segmented at the L1 / L2 level, so a problem with any switch (for example, a power overload) causes STP to recalculate the topology and stops traffic in all VLANs on all switches (including service-critical segments)
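
To see why the root ends up being "random" with default settings, recall that STP elects the bridge with the lowest bridge ID, comparing the priority first and the MAC address second. The sketch below is a simplified model of that comparison only (it ignores per-VLAN details and everything else STP does). The switch names and MAC addresses are hypothetical.

```python
DEFAULT_PRIORITY = 32768  # default bridge priority in IEEE 802.1D


def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer for comparison."""
    return int(mac.replace(":", ""), 16)


def elect_root(bridges):
    """Return the name of the bridge with the lowest bridge ID.

    `bridges` maps a switch name to (priority, mac). The bridge ID is compared
    as priority first, then MAC address; the lowest value wins.
    """
    return min(bridges,
               key=lambda name: (bridges[name][0], mac_to_int(bridges[name][1])))


# Hypothetical switches, all left at the default priority.
defaults = {
    "core-1": (DEFAULT_PRIORITY, "00:25:b5:00:00:10"),
    "core-2": (DEFAULT_PRIORITY, "00:25:b5:00:00:20"),
    "old-access-7": (DEFAULT_PRIORITY, "00:0a:41:00:00:01"),
}
print(elect_root(defaults))  # an access switch wins purely on its MAC address

# The same switches after the priorities on the core have been lowered.
tuned = dict(defaults)
tuned["core-1"] = (4096, "00:25:b5:00:00:10")
tuned["core-2"] = (8192, "00:25:b5:00:00:20")
print(elect_root(tuned))     # core-1 wins on priority, core-2 is the backup
```

With equal priorities the outcome depends only on which device happens to have the numerically lowest MAC address, which has nothing to do with where the root should be placed in the topology.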

Examples of L3 (OSI) design errors

A few characteristic mistakes of novice network engineers:

frequent (or exclusive) use of static routing
use of routing protocols that are not optimal for the given design
suboptimal logical network segmentation
suboptimal use of the address space, which prevents route aggregation
no backup routes
no redundancy for the default gateway
asymmetric routing after routes reconverge (may be critical with NAT / PAT and stateful firewalls)
MTU problems
after routes reconverge, traffic passes through other security zones or even other firewalls, which causes it to be dropped
poor topology scalability
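
The point about address space and aggregation can be illustrated with Python's standard ipaddress module. This is only a sketch with hypothetical prefixes: contiguous, aligned subnets collapse into one summary route, while subnets allocated ad hoc do not.

```python
import ipaddress


def aggregate(prefixes):
    """Collapse a list of prefixes into the smallest set of covering networks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Hypothetical site with contiguous, aligned subnets.
planned = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
# Hypothetical site whose subnets were allocated ad hoc.
ad_hoc = ["10.1.0.0/24", "10.7.5.0/24", "10.20.1.0/24", "172.16.9.0/24"]

print(aggregate(planned))  # a single /22 can be advertised for the whole site
print(aggregate(ad_hoc))   # nothing collapses, four separate routes remain
```

In the first case the rest of the network needs one route for the site. In the second it needs a route per subnet, and every new subnet adds another.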

Criteria for assessing design quality

When we talk about a design being optimal or suboptimal, we need to understand the criteria by which we evaluate it. From my point of view, the most essential criteria (though not all of them) are listed here, with an explanation as applied to routing protocols:

scalability
For example, you decide to add another data center. How easily can you do it?
manageability
How easy and safe are operational changes, such as advertising a new network or filtering routes?
availability
What percentage of the time does your system provide the required level of service?
security
How well protected is the transmitted data?
cost
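
For the availability criterion it helps to translate percentages into time. The following sketch does only that arithmetic. The function names are hypothetical, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Maximum yearly downtime allowed by a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def measured_availability(uptime_hours, total_hours):
    """Availability in percent over an observation window."""
    return 100 * uptime_hours / total_hours


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_hours_per_year(target):.2f} h/year")
```

For example, a target of 99.9% leaves roughly 8.76 hours of downtime per year, so a single long outage can consume the entire yearly budget.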

Changes

The basic principle at this stage can be expressed by the formula "do no harm."
Therefore, even if you do not fully agree with the design and the chosen implementation (configuration), it is not always advisable to make changes. A reasonable approach is to rank all the identified problems by two parameters:

how easily the problem can be fixed
how much risk it carries

First of all, eliminate whatever is currently reducing the level of service below the acceptable level, for example, problems that lead to packet loss. Then eliminate what is easiest and safest to fix, in order of decreasing risk (from the design or configuration problems that carry the greatest risk to those that carry less).
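
This ordering can be written down as a sort key. The sketch below is one possible reading of the rule: service-affecting problems come first, then the easiest fixes, and among equally easy fixes the riskier problem first. The 1 to 5 scores and the sample problems are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    service_affecting: bool  # currently pushes service below the acceptable level
    ease: int                # 1 = hard to fix ... 5 = easy and safe to fix
    risk: int                # 1 = low risk if left alone ... 5 = high risk


def work_order(problems):
    """Service-affecting problems first, then the easiest fixes;
    among equally easy fixes, the higher-risk problem comes first."""
    return sorted(problems,
                  key=lambda p: (not p.service_affecting, -p.ease, -p.risk))


# Hypothetical findings from the clean-up stage.
findings = [
    Problem("no redundancy for the office default gateway", False, 3, 4),
    Problem("packet loss on the WAN uplink", True, 2, 5),
    Problem("unused ports left enabled", False, 5, 2),
    Problem("STP root not pinned to the core", False, 5, 4),
]

for problem in work_order(findings):
    print(problem.name)
```

The scores themselves are a judgment call. The value of the exercise is that the order of work becomes explicit and can be discussed before anything is changed.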

Perfectionism at this stage can be harmful. Bring the design to a satisfactory state, and then bring the network configuration into line with it.
