znoom October 13, 2016 at 15:37

Not a single gap: how we created a wireless network for 3000 devices

Wireless Society by JOSS7

Over the past ten years, Wi-Fi in the offices of Mail.Ru Group has gone through several changes of equipment, network design approaches, authorization schemes, administrators, and people responsible for keeping it running. The wireless network probably began the way it does in every company: with a few home routers broadcasting some SSID with a static password. For a long time this was enough, but the number of users, the floor area, and the number of access points kept growing, and the home D-Link routers were gradually replaced by Zyxel NWA-3160 units. This was already a relatively advanced solution: one of the access points could act as a controller for the rest and provided a single interface for managing the entire network. The NWA-3160 software offered no deeper logic or automation, only the ability to configure the access points attached to the controller; user traffic was still processed by each device independently. The next equipment change was the move to a Cisco AIR-WLC2006-K9 controller with several Aironet 1030 access points. This was a fully grown-up solution, with lightweight ("brainless") access points and all traffic processed by the wireless network controller. After that came a migration to a pair of AIR-WLC4402-K9 controllers, by which time the network had grown to hundreds of Cisco Aironet 1242AG, 1130AG, and 1140AG access points.

1. Accumulated problems

It was 2011. In a year the company was due to move to a new office, and Wi-Fi was already a sore subject and the most common reason for employee complaints to technical support: low connection speeds (buffering video on YouTube / VK / Pornhub causes serious stress and obviously interferes with work) and dropped connections. Periodic attempts to use Wi-Fi phones failed because roaming did not work. Laptops with built-in Ethernet were becoming rarer (thanks to the arrival of the MacBook Air and the manufacturers' race for thinner cases), and the vast majority of mobile phones already required a constant Internet connection.

The air was constantly busy, and the old access points could not cope with the load. Users started getting disconnected once 25 or more devices were associated with a single access point, and neither 802.11n nor the 5 GHz band was supported. On top of that, the office housed a pile of SOHO routers connected to various network emulators (NetEm tools) for the needs of mobile development.

Logically, little had changed since the move to centralized solutions in 2007-2008: several SSIDs, including a guest one, and several large subnets (/16) into which users landed depending on the wireless network they had authenticated to.

Network security was also poor: the main mechanism for authorizing users on the trusted Wi-Fi networks was a PSK that had not changed for several years. About a thousand devices sat permanently in the same subnet without any isolation, which helped malware spread. Nominal traffic filtering was done with iptables on the *NIX gateway that served as the office NAT. Naturally, any granularity of firewall rules was out of the question.

2. New height

The company's move turned out to be a great opportunity to think through and build the office network from scratch. After fantasizing about an ideal network and analyzing the main complaints, we managed to define what we wanted to achieve:

the highest access point performance available on the market. Ideally, an upgrade to new 802.11 standards should not require replacing all the equipment;
fault tolerance. Authorization servers, Wi-Fi controllers, the switches that access points connect to, firewalls, and routers: everything must be redundant;
the ability to emulate various network conditions (packet loss, delay, bandwidth) over the corporate Wi-Fi. The many cheap Wi-Fi routers in the office, with no centralized control, made it impossible to use the airtime efficiently;
Wi-Fi telephony. Mobile work phones are convenient for some departments: technical support, the administrative department, and so on;
information security. Identification of connected users. Granular access lists: a connected user should be able to reach only the resources needed for their work, not the entire network. Isolation of user devices from each other;
working Bonjour and mDNS services. We have many macOS and iOS users, and Apple services such as AirPlay, AirPrint, and Time Machine were not originally designed to work in large segmented networks;
full wireless coverage of all office premises, from the toilets and the gym to the elevator halls;
a centralized system for locating users and sources of Wi-Fi interference.

There are several approaches to organizing a wireless network, which differ in how the equipment is managed and how user traffic is processed:

A scattering of autonomous access points. Cheap and cheerful: the administrator and the installer place inexpensive home Wi-Fi routers around the premises and, if possible, configure them on different channels. You can even try to make them broadcast the same SSID and hope for some kind of roaming. Each device is independent, and to change the configuration you have to log in to each access point individually.
Partially centralized solutions. There is a single point of control for all access points: the Wi-Fi network controller. It pushes configuration changes to every access point, so the administrator no longer has to visit and reconfigure each device by hand. It may also handle centralized user authorization when clients connect to the network. Otherwise the access points do not depend on the controller: they process user traffic themselves and release it into the wired network.
Centralized solutions. The access points are no longer independent devices at all and hand over both control and traffic processing to the network controller. All user traffic is always forwarded to the controller for processing, and decisions about channel changes, signal strength, the wireless networks to broadcast, and user authorization are made exclusively by the controller. The job of the access points is to serve wireless clients and tunnel frames toward the wireless network controller.

We had a chance to try each of these approaches, and a centralized solution with a single controller suited our new tasks best. Together with the controller we got a single place to apply access lists, and we decoupled client roaming between access points from the address space on the wired side.

3. Equipment selection

At that time (end of 2012) only a few vendors both inspired our confidence and had a product line that met the basic requirements. Besides the obvious Cisco, Aruba made it to live testing. We tested access points of the 93, 105, 125, and 135 series together with a controller. Everything took place in real conditions, with live users: we deployed a network on these access points on several floors of the old office. In terms of performance, the access points fully met our needs at the time. The controller software was also good: many features that with Cisco would require installing additional servers (MSE / WCS / Prime) and buying licenses were implemented directly on the controller (geolocation, collection and display of extended client statistics, heatmap rendering, and showing users on a map in real time). Along with this, there were also disadvantages:

a stateful firewall that could not be disabled (or rather, could be disabled only together with functionality we needed) and had a very modest session limit. In practice, we managed to bring down the Wi-Fi network from a single laptop by running a network scan;
the spectrum analyzer on the access points was used only to generate alerts for the administrator. Cisco, at least nominally, could already react to interference on its own (Event Driven RRM);
MFP (Management Frame Protection) was not implemented at all;
unlike Cisco access points, Aruba access points could not be reflashed and used without a controller.

As a result, we had to return to Cisco: the 5508 controller, the top-of-the-line AP 3602i for the main office areas, and the AP 1262 for connecting external antennas. The 3600 series was interesting at the time for its ability to be upgraded to 802.11ac Wave 1 by attaching an add-on antenna module. Unfortunately, these modules never became compatible with the Russian-made access points with the -R- index, so for full 802.11ac support the access points have to be replaced with the AP 3702 (and, in the future, the 3802).

There are plenty of step-by-step guides online for the initial setup of Cisco Wi-Fi, as well as for planning (and since software version 8, most of the setup “best practices” are available directly in the controller's web UI).

I will focus only on non-obvious and problematic issues that I have encountered.

4. Fault tolerance

The Wi-Fi network controller processes all traffic and is a single point of failure. No controller, no network at all. Making it redundant was therefore the first priority. For some time now, Cisco has offered two different solutions for this:

"N + 1". There are several controllers, each with a completely independent control plane and its own configuration, IP addresses, and set of installed licenses. Access points know the list of controller addresses and the priority of each (primary, secondary, tertiary ...), and when the current controller suddenly fails, the access point reboots and tries to connect to the next one on the list. The user is left without connectivity for a minute or two.
"AP SSO". The primary and backup controllers are paired: they synchronize configuration and the state of connected users, and use the same IP address for the tunnels to the access points. If the primary controller fails, the IP and MAC address that the access points were attached to are quickly and automatically taken over by the backup (loosely similar to how FHRP protocols work). The access points should not even notice the disconnect. In an ideal world, users will not feel that anything broke at all.

The “AP SSO” option looks much more attractive: failover is instantaneous and invisible to the user, no additional licenses are needed, you do not have to keep the second controller's configuration up to date by hand, and so on. In real life, with the 7.3 software that was fresh at the time, things turned out to be less rosy:

both WLCs (Wi-Fi controllers) must be physically close to each other. A dedicated copper port is used for configuration synchronization and heartbeats. In our case the controllers were in rooms on different floors, and the run was right at the limit of the permitted copper cable length;
transparent failover of connected users (“Client SSO”) appeared only in version 7.6. Before that, users were still disconnected from Wi-Fi, albeit briefly;
a mechanism for detecting failures and deciding how the cluster behaves that is, to put it mildly, “strange”. In short: both controllers ping each other once a second over the copper link and check the availability of the default gateway (again, with ICMP ping).

The last point is where the difficulties arose. The essence of the problem: according to the failover decision table, in almost any ambiguous situation the standby controller reboots. Suppose we have the following network diagram:

What happens when C6509-1 is switched off? The active controller loses its uplink and reboots immediately. The standby controller loses contact with the active one and tries to ping the gateway, which (with default VRRP timers) stays unavailable for three seconds until the address moves to C6509-2. After two failed pings of the gateway, that is, within two seconds, the standby WLC also reboots. Twice, in fact. Congratulations: for the next 20-25 minutes we are without Wi-Fi. The same behavior was observed with any first-hop redundancy protocol (FHRP), and the controllers also rebooted spontaneously when the ICMP rate limit was too strict. The problem can be solved either by tuning the FHRP timers so that the address has time to “move” before the standby WLC reboots, or by moving the FHRP master role to the router that the standby WLC is connected to.
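The race between gateway failover and the standby controller's reboot decision can be put into a few lines of Python. This is only a rough model under stated assumptions: the standby probes the gateway once per second and reboots after two consecutive failed probes (the figures from the text). The function name and its parameters are made up for illustration and are not taken from any Cisco software.

```python
def standby_reboots(gateway_outage_s: float,
                    probe_interval_s: float = 1.0,
                    failed_probes_to_reboot: int = 2) -> bool:
    """Return True if the standby controller reboots before the gateway returns.

    Simplified model: probes are sent every `probe_interval_s` seconds, starting
    at the moment the outage begins; a probe fails while the gateway is down.
    """
    consecutive_failures = 0
    t = 0.0
    while t < gateway_outage_s:
        consecutive_failures += 1
        if consecutive_failures >= failed_probes_to_reboot:
            return True  # threshold reached while the gateway is still down
        t += probe_interval_s
    return False  # gateway came back before the threshold was reached


# Roughly three seconds of VRRP convergence, as in the text
print(standby_reboots(3.0))
# Hypothetical sub-second FHRP timers
print(standby_reboots(0.9))
```

In this model the three-second outage triggers a reboot, while a sub-second outage does not, which is the intuition behind tuning the FHRP timers.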

In software 8.0+, the problem was addressed by making the gateway availability check more elaborate and by replacing the ICMP pings with UDP heartbeats in a proprietary format. In the end we settled on a combination of HSRP and software 8.2 and achieved that same failover between controllers that users do not notice.

Also, for fault tolerance, several RADIUS servers (MS NPS) are used, access points within the same room are connected to different access switches, access switches have uplinks to two independent network core devices, etc.

5. Tuning

General recommendations on tuning Wi-Fi performance are easy to find (for example, “Wi-Fi: non-obvious nuances (using a home network as an example)”), so I won't dwell on them. I will only briefly cover the specifics.

5.1. Data rates

Imagine that after the basic configuration of the controller and connecting the first ten access points on a test floor of the unfinished building, we plug in a spectrum analyzer and see that more than 40% of the airtime in 2.4 GHz is already taken. There is not a living soul around: we are in an empty building, with no foreign networks or home Wi-Fi routers. Half of the airtime is occupied by beacons. They are always transmitted at the lowest data rate the access points support, and at high density this is especially noticeable. Every new SSID on the air makes the problem worse. With a minimum data rate of 1 Mbps, just 5 SSIDs on 10 access points within range of one another lead to 100% utilization of the air by beacons alone. Disabling all data rates below 12 Mbps (that is, all the 802.11b rates) changes the picture dramatically.
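A back-of-the-envelope estimate shows why the minimum data rate matters so much. The sketch below is an approximation, not a measurement: it assumes a 250-byte beacon, the default beacon interval of 102.4 ms, all access points audible on one channel, a 192 µs long DSSS preamble at 1 Mbps, and about 20 µs of OFDM preamble and header at 12 Mbps. Contention, inter-frame spacing, and symbol padding are ignored.

```python
BEACON_INTERVAL_US = 102_400  # default beacon interval: 100 TU of 1024 µs each


def beacon_airtime_share(num_aps: int, ssids_per_ap: int, rate_mbps: float,
                         preamble_us: float, beacon_bytes: int = 250) -> float:
    """Fraction of airtime consumed by beacons on one shared channel."""
    # bits divided by Mbit/s gives microseconds
    per_beacon_us = preamble_us + beacon_bytes * 8 / rate_mbps
    beacons_per_interval = num_aps * ssids_per_ap
    return beacons_per_interval * per_beacon_us / BEACON_INTERVAL_US


# 10 access points x 5 SSIDs at 1 Mbps with the long DSSS preamble
slow = beacon_airtime_share(10, 5, rate_mbps=1, preamble_us=192)
# The same beacons at 12 Mbps OFDM
fast = beacon_airtime_share(10, 5, rate_mbps=12, preamble_us=20)
print(f"1 Mbps: {slow:.0%}, 12 Mbps: {fast:.0%}")
```

Under these assumptions the beacons alone need slightly more than the whole beacon interval at 1 Mbps and roughly a tenth of it at 12 Mbps, which matches the effect described above.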

5.2. Radius VLAN assignment

Large L2 domains are fun. Especially on a wireless network. Multicast clogs the air, open peer-to-peer connections within a segment let one infected host attack the others, and so on. The obvious solution was to move to 802.1X. Clients were divided into several dozen groups, each with its own VLAN and its own access lists.

By executive decision, peer-to-peer traffic was forbidden in the trusted SSID. For WLANs with RADIUS authorization, the WLC can combine any number of VLANs into a logical group and hand each user the network segment they need. The user does not have to think about which network to connect to. In our dreams the final scheme consisted of two SSIDs, PSK for guests and WPA2-Enterprise for corporate users, but the dream quickly crashed into harsh reality.
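As an illustration of per-user VLAN assignment, here is a minimal sketch of the decision a RADIUS policy makes. The group names, VLAN IDs, and the fallback VLAN are hypothetical; the three reply attributes are the standard tunnel attributes used for dynamic VLAN assignment with 802.1X (RFC 3580). How the article's MS NPS servers were actually configured is not described in the text.

```python
# Hypothetical mapping from directory group to VLAN ID (illustrative values).
GROUP_TO_VLAN = {
    "developers": 110,
    "support": 120,
    "accounting": 130,
}
DEFAULT_VLAN = 999  # assumed fallback segment for users in no known group


def vlan_attributes(user_groups: list[str]) -> dict[str, str]:
    """Build the RADIUS reply attributes that place a user into a VLAN."""
    # The first group with a known mapping wins; otherwise use the fallback.
    vlan = next((GROUP_TO_VLAN[g] for g in user_groups if g in GROUP_TO_VLAN),
                DEFAULT_VLAN)
    return {
        "Tunnel-Type": "VLAN",
        "Tunnel-Medium-Type": "IEEE-802",
        "Tunnel-Private-Group-ID": str(vlan),
    }


print(vlan_attributes(["support"]))
```

The client always joins the same SSID; only the returned attributes decide which segment it lands in.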

5.3. 30+ SSID

The need for new WLANs appeared immediately. Some devices did not support 802.1X but had to live in semi-trusted segments. Others required p2p, and the rest had especially specific requirements, such as policy-based routing (PBR) of their traffic through a server, or IPv6-only operation.

At the same time, 3602 points allow broadcasting of no more than sixteen SSIDs (and 802.11ac modules, for which there was hope in the future, no more than eight).

But announcing even 16 SSIDs means filling a very substantial share of the airtime with beacons.
AP Groups came to the rescue: the ability to broadcast particular networks only from particular access points. In our case, each floor became a separate group with its own set of SSIDs. If desired, the subdivision can be taken further.
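The bookkeeping behind AP Groups can be sketched as a simple check: the network as a whole may carry many SSIDs, as long as no single group exceeds the per-access-point limit. The group and WLAN names below are invented for illustration; only the limit of sixteen comes from the text.

```python
MAX_SSIDS_PER_AP = 16  # limit quoted in the text for the 3602 access points

# Hypothetical AP groups (one per floor) and the WLANs each one broadcasts.
AP_GROUPS = {
    "floor-3": {"corp", "guest", "voice", "qa-netem"},
    "floor-4": {"corp", "guest", "ipv6-only"},
}


def over_limit(groups: dict[str, set[str]],
               limit: int = MAX_SSIDS_PER_AP) -> list[str]:
    """Return the names of AP groups that broadcast more SSIDs than allowed."""
    return sorted(name for name, ssids in groups.items() if len(ssids) > limit)


def total_ssids(groups: dict[str, set[str]]) -> int:
    """Number of distinct SSIDs across the whole network."""
    return len(set().union(*groups.values()))


print(total_ssids(AP_GROUPS), over_limit(AP_GROUPS))
```

A check like this makes it easy to see that the total number of SSIDs can exceed what any one access point broadcasts.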

5.4. Multicast and mDNS

The next problem follows from the previous section: devices that require multicast and mDNS (the Apple TV being the most common example). All users are split across VLANs and do not see each other's traffic, and keeping a separate mDNS device in every VLAN is somewhat problematic. On top of this, SVI failover on the routers was initially implemented with VRRP, which uses multicast and, by default, sends its authentication key in clear text.

We connect to Wi-Fi, listen to the traffic, craft a hello packet, and become the master. So we add MD5 to VRRP. Now the hello packets are protected to some extent. Protected, and still delivered to every client, like all other multicast traffic within the segment. In other words:

devices that require mDNS do not work properly for us;
traffic that clients do not need (and it was by no means only VRRP hellos) is delivered to them anyway.

The solution to the second problem seemed obvious: disable multicast on the wireless network. The first problem was a bit more complicated at that time (before the release of 7.4). We would have had to bring up a server in each of the relevant VLANs that listened for mDNS requests and relayed them between clients and devices. Such a solution is obviously unreliable and unstable, and it does not fully solve the multicast problem.

Starting with 7.4, Cisco rolled out an mDNS proxy at the controller level. mDNS requests containing a specific “service string” (for example, _airplay._tcp.local. for Apple TV) can now be sent only to interfaces with a specific mDNS profile. This can even be configured separately on each access point, which makes it possible to relay requests from VLANs the controller is not physically connected to, because a single access point is connected there. The functionality works regardless of the global multicast settings, which lets you turn the latter off and safely drop the packets. That is what we did.
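The filtering idea behind such a proxy can be sketched in a few lines. This is a conceptual model only, not the controller's implementation: the profile names, interface names, and bindings are hypothetical, while the service strings are ordinary DNS-SD service types.

```python
# Hypothetical mDNS profiles: profile name -> service strings it lets through.
MDNS_PROFILES = {
    "meeting-rooms": {"_airplay._tcp.local.", "_raop._tcp.local."},
    "printers": {"_ipp._tcp.local."},
}
# Hypothetical binding of interfaces (VLANs) to profiles.
INTERFACE_PROFILE = {"vlan110": "meeting-rooms", "vlan120": "printers"}


def interfaces_for(service: str) -> list[str]:
    """Interfaces that should receive an mDNS query for `service`."""
    return sorted(
        iface for iface, profile in INTERFACE_PROFILE.items()
        if service in MDNS_PROFILES.get(profile, set())
    )


print(interfaces_for("_airplay._tcp.local."))
```

A query for a service that appears in no profile is relayed nowhere, which is what allows general multicast to stay disabled.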

5.5. And again multicast

We turned off multicast. The network load went down. Happiness, it would seem. But then one or two clients appear who still desperately need it. Unfortunately, we could not manage without a crutch. The crutch turned out to be FlexConnect, which is not intended for this purpose at all...

FlexConnect is functionality that lets you attach access points located, for example, in a remote office to the controller for centralized management. The key feature for us here is the ability to use Local Switching on such access points. It exists so that the access points can handle traffic on their own (keep broadcasting the SSID, and so on) when the connection to the controller drops, or when we do not want to push all the traffic from the access point through the controller.

We put a dedicated access point into a FlexConnect group, create a separate SSID in that group, and process all of its traffic locally. On the one hand, this is clearly using the functionality for something it was not designed for; on the other hand, it lets us bring up small, unfiltered wireless L2 domains as needed without touching the main infrastructure.

5.6. Rogue AP

Sooner or later the need arises to defend against an evil twin, since BYOD does not protect the client from itself. All access points embed in their beacons a field that marks them as belonging to the controller. When a beacon with an incorrect field is received, its BSSID is recorded.

At a configured interval, every lightweight access point leaves its channel for 50 ms to collect information about interference, noise, and unknown clients and access points. When a rogue AP is found with an SSID identical to one of the trusted ones, a corresponding entry is created in the table of “enemies”. After that, the device can either be hunted down by people on foot or suppressed by the controller. In the latter case, the controller assigns several access points that are not busy with data transmission to sniff the twin's traffic and to send deauth packets both to the twin on behalf of its clients and to all of its clients on its behalf.

Potentially, this functionality is very interesting and very dangerous at the same time. One misconfiguration and we knock out everything the Wi-Fi controller does not know about within range of our access points.
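The detection step described in this section can be sketched as a simple classification of the beacons heard during an off-channel scan. This is a conceptual model under stated assumptions, not the controller's actual logic: the `Beacon` type, the `has_controller_tag` flag (a stand-in for the controller-specific beacon field), and the SSID names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beacon:
    ssid: str
    bssid: str
    has_controller_tag: bool  # stand-in for the controller-specific beacon field


TRUSTED_SSIDS = {"corp", "guest"}  # hypothetical network names


def classify(beacon: Beacon) -> str:
    """Label a beacon heard during an off-channel scan."""
    if beacon.has_controller_tag:
        return "managed"
    if beacon.ssid in TRUSTED_SSIDS:
        return "evil-twin"  # our SSID announced by a foreign device
    return "unknown"        # some other network: record it, do not touch it


def rogue_table(beacons: list[Beacon]) -> set[str]:
    """BSSIDs that impersonate one of the trusted SSIDs."""
    return {b.bssid for b in beacons if classify(b) == "evil-twin"}


heard = [
    Beacon("corp", "00:11:22:33:44:55", True),
    Beacon("corp", "de:ad:be:ef:00:01", False),
    Beacon("CoffeeShop", "aa:bb:cc:dd:ee:ff", False),
]
print(rogue_table(heard))
```

Keeping "unknown" separate from "evil-twin" reflects the warning above: only devices that imitate a trusted SSID are candidates for suppression, and neighbouring networks are left alone.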

6. Conclusion

The article does not claim to be a guide on how to properly build a wireless network. Rather, these are simply the main problems that we encountered when replacing and expanding office infrastructure.

Today, in the main office alone, we have more than three thousand wireless clients and more than three hundred access points, so some of these solutions may be inapplicable or excessive in other conditions.

P.S. I did not find any mention of WLCCA on Habr. It is a controller configuration analyzer that both points out some problems and gives configuration advice. An invite can be requested here. You upload the output of show run-config (215,000 lines in our case) and get a page with an analysis of everything interesting on the WLC. Enjoy!
