Avery's Laws of Wi-Fi Reliability
Replacing a router:
Manufacturer A: 10% broken
Manufacturer B: 10% broken
P (both A and B are broken): 10% × 10% = 1%
Replacing a router (or firmware) almost always solves the problem.

Adding a Wi-Fi amplifier:
Router A: 90% works
Router B: 90% works
P (both A and B work): 90% × 90% = 81%
An additional router almost always worsens the situation.
After several years of fussing with these technologies (surrounded by a bunch of engineers working on other distributed systems problems, which, as it turned out, have the same limitations), I think I can draw some conclusions. Distributed systems are more reliable when you can get service from one node OR from another. They become less reliable when the service depends on one node AND on another. The numbers combine multiplicatively, so the more nodes you have, the faster reliability falls off.
For an example that has nothing to do with wireless networks, imagine running a web server with a database. If they are on two computers (real or virtual), then your web application works only if the web server AND the database server are both working. In essence, such a setup is less reliable than a system that needs a web server but does not need a database. Conversely, imagine that you set up a fault-tolerant system with two database servers, so that if one goes down, you switch to the other. The database will work if the primary OR the secondary server is operational, and this is much better. But it is still less reliable than not needing a database server at all.
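The web server and database comparison can be checked with a few lines of arithmetic. This is only a sketch: it assumes every machine works 90% of the time (the illustrative figure used throughout this post) and that failures are independent. The function names are made up for the example.

```python
from math import prod

def all_must_work(*reliabilities):
    """AND: the service needs every node, so success probabilities multiply."""
    return prod(reliabilities)

def any_can_work(*reliabilities):
    """OR: the service needs at least one node, so failure probabilities multiply."""
    return 1 - prod(1 - r for r in reliabilities)

# Assumed, illustrative figures: each machine works 90% of the time.
web, db = 0.9, 0.9

web_only = web                                                  # no database needed
web_and_db = all_must_work(web, db)                             # web AND database
web_and_redundant_db = all_must_work(web, any_can_work(db, db)) # web AND (db1 OR db2)

for name, value in [("web only", web_only),
                    ("web AND db", web_and_db),
                    ("web AND (db OR db)", web_and_redundant_db)]:
    print(f"{name}: {value:.3f}")
```

Under these assumptions the redundant database recovers most of the loss, but the setup with no database at all still comes out ahead.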
Back to Wi-Fi. Imagine that I have a router from manufacturer A. Wi-Fi routers are usually so-so, so for the sake of example, suppose its reliability is 90%, and for simplicity let us define that as "it works well for 90% of users, and 10% experience annoying bugs." So 90% of users with a brand A router will be satisfied and will never replace it. The remaining 10% will be unhappy, so they will buy a new router from manufacturer B. That one also works well for 90% of users, and its bugs are unrelated to brand A's, so it will work for 90% of the people who switch. This means that 90% of people with a brand A router are satisfied, and 90% of the 10% who moved to a brand B router are also satisfied. The overall satisfaction level comes out to 99%, even though both routers are only 90% reliable.
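To make the 99% figure concrete, here is the same arithmetic with head counts. The population of 1000 users is an arbitrary assumption for illustration; the 90% figures come from the example above.

```python
users = 1000                          # hypothetical population size

happy_a = users * 90 // 100           # brand A works for 90% of everyone
switchers = users - happy_a           # the unhappy 10% buy brand B
happy_b = switchers * 90 // 100       # brand B works for 90% of the switchers
still_unhappy = switchers - happy_b   # unlucky with both brands

satisfaction = (happy_a + happy_b) / users
print(happy_a, switchers, happy_b, still_unhappy, satisfaction)
```

Only the users who were unlucky with both brands remain unhappy, which is the 10% × 10% = 1% from the slide.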
This applies equally to software (vendor firmware vs openwrt vs tomato) or to software versions (people may not upgrade from v1.0 to v2.0 until v1.0 starts causing them problems). Our project has a v1 router and a v2 router. The first version worked fine for most users, but not for everyone. When the second version came out, we started giving v2 routers to all new users, as well as to those v1 users who complained about problems. When we plotted user satisfaction, we saw that it jumped right after the release of the second version. Excellent! (Especially excellent, because my team was the one developing the v2 router :)) So now we upgrade everyone, right?
Not really, actually. The problem is that we skewed our statistics: we upgraded to v2 only those v1 users who were experiencing problems. We did not "upgrade" v2 users with problems to v1 (and of course, there were such users too). Maybe both routers were 90% reliable; the story above could just as well have played out in reverse. The same phenomenon explains why some people switch from openwrt to tomato and rave about how much more reliable it is, and vice versa. The same goes for Red Hat and Debian, or Linux and FreeBSD, and so on. This "Everything works for me!" phenomenon is well known in the open source world, and it is simple probability: you only have an incentive to switch if you are having problems right now.
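The bias is easy to reproduce in a toy simulation. Everything here is an assumption made for illustration: two firmwares, A and B, each working for a random 90% of users independently of the other, and users who switch only when their current firmware is broken for them. It is not a model of our real user data.

```python
import random

def simulate(users=100_000, reliability=0.9, seed=1):
    """Return, for each switching direction, the share of switchers
    whose new firmware works for them."""
    rng = random.Random(seed)
    stats = {"A->B": [0, 0], "B->A": [0, 0]}  # [switchers, happy after switching]
    for _ in range(users):
        start = rng.choice("AB")
        # Whether each firmware happens to work for this particular user.
        works = {"A": rng.random() < reliability,
                 "B": rng.random() < reliability}
        if works[start]:
            continue  # no problems right now, so no incentive to switch
        other = "B" if start == "A" else "A"
        direction = f"{start}->{other}"
        stats[direction][0] += 1
        stats[direction][1] += works[other]
    return {d: happy / n for d, (n, happy) in stats.items()}

rates = simulate()
for direction, rate in rates.items():
    print(f"{direction}: {rate:.1%} of switchers say the new one is more reliable")
```

Both directions should come out close to 90%, so each camp of switchers honestly reports that the other firmware is better, even though the two are identical by construction.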
But the flip side of the equation is also true, and it matters for mesh networks. When you install multiple routers in a chain, you depend on several routers at the same time, and if any one of them misbehaves your network falls apart. Wi-Fi is notorious for this: a router accepts connections but behaves weirdly (for example, it does not route packets), clients stay attached to that router anyway, and nothing works for anyone. As you add nodes to the chain, the probability of this outcome rises quickly.
Of course, LTE base stations also have reliability issues, and many of them. But they are usually not arranged in a mesh topology, and each LTE station usually covers a much larger area, so you depend on fewer nodes. In addition, each LTE node is usually "too big to fail": when it breaks, it immediately causes problems for so many people that the phone company fixes it quickly. A single faulty node in a mesh network affects only a small area, so problems appear only when you pass through that area, and most of the time there are none. All this leads to the vague impression that "Wi-Fi mesh networks are buggy and LTE is reliable," even if your own mesh node works most of the time. It is all a game of statistics.
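A rough way to see how quickly a chain degrades is to count how many hops it takes before the whole path is more likely broken than working. The sketch below assumes independent nodes at the illustrative 90% figure; the function name is invented for the example.

```python
def hops_until_below(per_node, threshold):
    """Smallest chain length whose end-to-end reliability drops below threshold."""
    if not (0 < per_node < 1) or not (0 < threshold <= 1):
        raise ValueError("expected 0 < per_node < 1 and 0 < threshold <= 1")
    hops, reliability = 0, 1.0
    while reliability >= threshold:
        hops += 1
        reliability *= per_node  # every extra hop is one more AND
    return hops

print(hops_until_below(0.9, 0.5))
```

With these numbers the answer is 7: six hops still work about 53% of the time, and the seventh pushes the chain below a coin flip.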
Solution: buddy system
Let a buddy tell you when you start misbehaving.
Router A: 90% works
Router B: 90% works
P (either A or B works):
1 - (1 - 0.9) × (1 - 0.9) = 99%
In the past 15 years or so, the theory and practice of distributed systems have come a long way. We now basically know how to turn an AND situation into an OR situation. If you have a RAID5 array and one of the disks fails, you take that disk out of service so you can replace it before another one fails. If you have a NoSQL database service with 200 nodes, you make sure no requests are sent to the failed nodes, so that other nodes can take over their work. If one of your web servers is bogged down by bloated Ruby on Rails code, your load balancers redirect traffic to another, less loaded node until the first server recovers.
The same should apply to Wi-Fi: if your router is acting strangely, you need to take it out of service until it is fixed.
Unfortunately, the health of a Wi-Fi router is harder to measure than the health of a database or web server. A database server can easily test itself: just run a couple of queries and make sure the request socket is up. Since web servers are reachable over the Internet, you can run a single test service that periodically queries all of them and signals the need for a reboot when a server stops responding. But by definition, not all nodes of a mesh network are reachable over a direct Wi-Fi link from one place, so a single checking service will not work.
Here is my suggestion, which you could call the "Wi-Fi buddy system." The analogy: you and your friends go to a bar, where you get too drunk and start acting like a moron. Since you are too drunk, you do not necessarily know that you are acting like a moron. It can be hard to tell. But do you know who can tell? Your friends. Usually even if they are drunk too.
Although by definition not all mesh nodes are reachable from one place, you can also say that, by definition, each mesh node is reachable from at least one other mesh node. Otherwise it would not be a mesh, and you have even bigger problems. This hints at how to fix the situation. Each mesh node should from time to time try to connect to one or more neighboring nodes, posing as an end user, and see whether traffic is being routed. If it is, excellent! We tell that node it is doing well and should carry on. If not, bad! We tell that node it had better get back in the car. (Strictly speaking, the safest way to do this is to send only "you are doing well" messages after the poll. A failed node may not be able to receive "you are doing badly" messages, so a node should take itself out of service when the "you are doing well" messages stop arriving.)
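The "only send good news" variant behaves like a watchdog timer. Here is a minimal sketch of that idea, not a real mesh implementation: the class and function names, the 60-second timeout, and the `probe` and `send_ok` callbacks are all hypothetical. A real probe would have to associate with the neighbor as a client and check that traffic is actually routed, which is not shown.

```python
import time

class BuddyWatchdog:
    """Node side: stay in service only while a buddy has vouched for us recently."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed value; tune for a real network
        self.clock = clock
        self.last_ok = clock()

    def heard_ok(self):
        """Called when a buddy's 'you are doing well' message arrives."""
        self.last_ok = self.clock()

    def should_serve(self):
        """False once the good news has stopped for longer than the timeout."""
        return self.clock() - self.last_ok <= self.timeout_s

def check_buddy(probe, send_ok):
    """Buddy side: probe the neighbor as a client and report only success.
    Silence is the failure signal, since a broken node may not hear bad news."""
    if probe():
        send_ok()
        return True
    return False

# Demo with a fake clock so the example runs instantly.
now = [0.0]
node = BuddyWatchdog(timeout_s=60.0, clock=lambda: now[0])

check_buddy(probe=lambda: True, send_ok=node.heard_ok)   # neighbor routes traffic
now[0] = 30.0
print(node.should_serve())   # still vouched for

now[0] = 120.0
check_buddy(probe=lambda: False, send_ok=node.heard_ok)  # probe fails, nothing sent
print(node.should_serve())   # no recent good news: take yourself out of service
```

The design choice is that a node never has to understand that it is broken; it only has to notice that nobody has vouched for it lately.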
In a fairly dense mesh network, where there are always two or more routes between any given pair of nodes, this converts AND behavior into OR behavior. Now adding nodes (ones that can take themselves out of the network when something goes wrong) makes the system more reliable, not less.
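A simplified model shows the effect. Assume a path of several hops where each hop can be served by any one of a few interchangeable nodes, each working 90% of the time, failing independently, and always removing itself when broken. Real meshes are messier than this, and the function name is invented for the example.

```python
def path_reliability(per_node, hops, alternatives):
    """End-to-end reliability when each hop needs at least one of
    `alternatives` candidate nodes to be working."""
    hop_ok = 1 - (1 - per_node) ** alternatives   # OR within a hop
    return hop_ok ** hops                         # AND across hops

for alternatives in (1, 2, 3):
    print(alternatives, f"{path_reliability(0.9, 5, alternatives):.3f}")
```

With one candidate per hop, a five-hop path works about 59% of the time; with two candidates per hop it rises to about 95%, so in this model extra nodes help instead of hurting.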
This gives mesh networks an advantage over LTE, because LTE has less redundancy. If a base station fails, a large area loses its connection, and the phone company has to rush to fix it. If a mesh node fails, we route around the problem and fix it later, at our leisure.
A little math goes a long way!
Is this not enough for you?
You can view all my slides (pdf) about consumer Wi-Fi mesh networks (including detailed speaker's notes) from the Battlemesh v10 conference in Vienna, or watch my presentation on YouTube.
Note.
These so-called "laws" are a special case of more general, and therefore more useful, theorems about distributed systems. But this is the Internet, so I picked one special case and named it after myself. Go ahead, try to stop me.