Distributed Captive Portal in public places and difficulties with Apple
After reading the post about the subway network, I wanted to leave a comment, but decided to write a separate article instead.
We took part in building public networks with a distributed captive portal and stepped on almost every rake along the way, so I want to share the experience.
To begin with, a little theory about how this works and how distributed portals differ from centralized ones. Conceptually, what we usually call a captive portal actually consists of three components:
- Web frontend: interacts with the user, collects information through forms, and shows ads. If we ask the user for personal data or passwords, we should use HTTPS, which means the server needs a proper certificate. If all we ask is to tick a checkbox under the user agreement, plain HTTP is enough.
- The captive portal proper: an agent that receives the information collected by the web frontend, analyzes it, possibly makes clarifying requests on its own behalf (for example, to RADIUS), and reports its decision to the user either directly or through the web frontend. If the decision is positive, the captive portal opens the necessary holes in the firewall for this user. After a configured period of time the holes are closed and the user is sent back to the web frontend. The portal also closes early if the user is inactive. Often the only reason to limit session time is the desire to show the user an advertisement again (if we do not want to act like the subway network and disfigure the design of other sites).
- Firewall: controls each individual user's access to the network. If access is denied for policy reasons, it redirects the user to the web frontend. If access is unavailable for technical reasons (the gateway does not respond), the firewall can be told to redirect the user to a special page: "there is no service, but we are repairing it with all our might".
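The firewall's part of this can be sketched as a single decision per client. This is only an illustration of the three outcomes described above; the class, field, and enum names are hypothetical and do not come from any real product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    FORWARD = "forward"                      # the portal is open for this client
    REDIRECT_TO_FRONTEND = "frontend"        # no access yet: show the portal page
    REDIRECT_TO_NO_SERVICE = "no-service"    # the upstream gateway is down


@dataclass
class Firewall:
    # Hypothetical state: MAC addresses for which the portal agent opened access.
    authorized: set = field(default_factory=set)
    gateway_alive: bool = True

    def decide(self, client_mac: str) -> Action:
        # A technical failure wins: there is no point in showing the portal
        # (or forwarding traffic) when the gateway does not respond.
        if not self.gateway_alive:
            return Action.REDIRECT_TO_NO_SERVICE
        if client_mac in self.authorized:
            return Action.FORWARD
        return Action.REDIRECT_TO_FRONTEND
```

The portal agent would add a MAC to `authorized` on a positive decision and remove it when the session or inactivity timer expires.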
With a centralized captive portal, all three components usually live on the same machine (device), which greatly simplifies the task. The firewall often performs NAT as well, and the captive portal can be implemented as a bunch of scripts that tweak the local iptables. There is an irresistible temptation to fill the network with cheap access points that simply dump all users onto Ethernet or, at best, into a separate VLAN. What are the problems here?
- Security issues. We restrict access to the external channel, but the local network is wide open. Since the network is open, any user can answer ARP on behalf of our default gateway, receive other users' traffic, and engage in phishing. Nothing stops someone from running their own DHCP server and, within a certain radius, scaring users with messages like "your browser is hopelessly outdated". If a router sits between your captive portal and the user, the portal cannot verify that MAC and IP match, with all the ensuing consequences. Wireless clients can talk to each other. You can block client-to-client traffic on a cheap access point, but clients of other access points are still visible over Ethernet.
- Traffic problems. There is a lot of unnecessary traffic on the local network. It is better not to let client traffic past the access point until the captive portal has been opened for that client.
- Scalability issues. With a large number of clients, any of the three portal components can become a bottleneck.
As you may have guessed, the distributed captive portal is designed to solve all these problems. By "distributed" we mean that the components can be placed on different devices. This lets us build a reliable system that provides the necessary level of security and service while scaling well. The problem we have to solve is the interaction between the captive portal components. Where should each component be located?
The firewall should be as close to the client as possible, that is, definitely on the access point. Since there are several access points and each has its own firewall, their work must be synchronized within the area in which clients are expected to roam. Otherwise clients will experience connectivity problems when roaming. In modern networks, synchronizing anything inside such a domain (an RF domain) is the job of appointed arbiters (RF domain managers), and this was solved long ago, independently of the distributed captive portal. For such a system, firewall synchronization is just one more process that must run in a coordinated fashion across the domain, along with (for example) traffic switching.
The location of the web frontend depends heavily on the complexity of its tasks. If you only need to show pages that involve no server-side processing and no complications such as sending SMS, the web server on the access point is quite enough. It, too, is located as close to the client as possible and provides the most efficient interaction. Synchronizing web server content across access points is handled by (surprise) the RF domain manager.
The location of the captive portal depends on the position of the web frontend and on the availability of the access points. Since the captive portal's job is to tweak the firewall, it must have its own representative (agent) on each access point. The web frontend, however, can talk to any one of these agents, because their state is (you guessed it) also synchronized within the domain.
Thus, once a client has passed authorization, the captive portal opens for it in the entire domain at once, and from then on the firewall for this client is configured identically on all access points in the domain.
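As a rough mental model of this synchronization, the domain manager can be thought of as replaying every "open" or "close" decision onto all access points. The sketch below is deliberately naive and in-memory; the class names are hypothetical, and real RF domain managers use their own protocol that is not described here.

```python
class AccessPoint:
    def __init__(self, name):
        self.name = name
        # Clients for which this access point's firewall is open.
        self.open_clients = set()


class DomainManager:
    """Hypothetical arbiter that keeps all firewalls in the domain identical."""

    def __init__(self, access_points):
        self.access_points = list(access_points)

    def open_portal(self, client_mac):
        # One successful authorization opens the client everywhere,
        # so roaming between access points needs no new login.
        for ap in self.access_points:
            ap.open_clients.add(client_mac)

    def close_portal(self, client_mac):
        for ap in self.access_points:
            ap.open_clients.discard(client_mac)
```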
Subtleties
How to talk to the captive portal. We need a mechanism for telling the portal the results of the interaction with the user. In our case, HTTP GET was chosen. When the portal needs to be opened, we send an HTTP GET to any of its agents. The set of parameters passed in the GET depends on the mode in which the portal operates. Here are a few options:
- The portal always opens. The fact can be recorded in the log.
- The portal opens when the GET contains a variable confirming agreement with the terms.
- A username and password are passed in the GET; the portal itself goes to RADIUS with these attributes and opens on receiving ACCEPT.
- A single (universal) attribute is passed in the GET; the portal uses it as both username and password when querying RADIUS and opens on receiving ACCEPT. Obviously, such a user must exist in RADIUS.
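The four modes can be summarized in a short sketch. The parameter names (`agree`, `username`, `password`, `code`) and the `radius_accepts` callback are assumptions made for illustration; the source does not name the actual GET parameters.

```python
from enum import Enum


class Mode(Enum):
    ALWAYS_OPEN = 1
    AGREEMENT = 2
    RADIUS_USER_PASSWORD = 3
    RADIUS_SINGLE_ATTRIBUTE = 4


def should_open(mode, params, radius_accepts):
    """Decide whether to open the portal for one GET request.

    params         -- dict of query parameters from the GET
    radius_accepts -- callable(username, password) -> bool, standing in
                      for a real RADIUS Access-Request
    """
    if mode is Mode.ALWAYS_OPEN:
        return True
    if mode is Mode.AGREEMENT:
        return params.get("agree") == "yes"
    if mode is Mode.RADIUS_USER_PASSWORD:
        user, password = params.get("username"), params.get("password")
        if user is None or password is None:
            return False
        return radius_accepts(user, password)
    if mode is Mode.RADIUS_SINGLE_ATTRIBUTE:
        code = params.get("code")
        # The same value is sent to RADIUS as both username and password.
        return code is not None and radius_accepts(code, code)
    return False
```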
Everything outside this logic has to be implemented in the web frontend. For example, you can ask the user for a phone number, send an SMS, and check the code. Based on the result, create a user in RADIUS with (for example) username = phone_number and password = ip_IP, and then send a GET to the portal with these values.
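From the frontend's side, the final step is just building a GET request for the agent. Below is a minimal sketch of that step. The `/open` path and the parameter names are invented for the example; the source only says that the username, the password, and the variable received at redirect time are passed in the GET.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_open_request(agent_base_url, session_token, username, password):
    """Build the URL the web frontend requests to open the portal.

    session_token is the opaque variable the portal attached to the
    redirect; it is returned unchanged.
    """
    query = urlencode({
        "token": session_token,
        "username": username,
        "password": password,
    })
    return f"{agent_base_url}?{query}"


def parse_open_request(url):
    """What the agent side would recover from such a URL."""
    query = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in query.items()}
```

`urlencode` takes care of escaping, so a phone number such as `+79001234567` survives the round trip instead of having its plus sign decoded as a space.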
How does the portal, on receiving a GET, figure out which user it concerns? When it redirects a user to the web frontend, the portal attaches a rather long variable to the request, and we must return that variable to it intact among the parameters of the request to open the portal.
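The source does not say what is inside that long variable. One way such a value could work, shown purely as an assumption, is a client identifier plus an HMAC signature computed with a key shared by all agents in the domain. Any agent can then verify it, and a modified value is rejected.

```python
import hashlib
import hmac


def make_token(client_mac: str, key: bytes) -> str:
    """Hypothetical redirect variable: '<mac>.<hex signature>'."""
    signature = hmac.new(key, client_mac.encode(), hashlib.sha256).hexdigest()
    return f"{client_mac}.{signature}"


def resolve_token(token: str, key: bytes):
    """Return the client MAC if the token is intact, otherwise None."""
    client_mac, _, signature = token.rpartition(".")
    if not client_mac:
        return None
    expected = hmac.new(key, client_mac.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; bytes are used so that odd input
    # characters cannot raise an exception.
    if hmac.compare_digest(signature.encode(), expected.encode()):
        return client_mac
    return None
```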
Ideally, the access point performs bridging (layer 2 forwarding) between the SSID and a particular VLAN on the wire. That is, the firewall operates at the second (MAC) layer. Since the firewall sees the DHCP offer coming to the client from the depths of your network, it knows the client's IP for certain, answers ARP on the client's behalf, and strictly filters all ARP and DHCP on the wireless segment.
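The idea of learning the MAC-to-IP binding from DHCP and enforcing it afterwards can be sketched as follows. This is a toy model with hypothetical names; a real access point does this on raw frames in the data path.

```python
class BindingTable:
    """Toy model of a layer 2 firewall that trusts only DHCP-learned bindings."""

    def __init__(self):
        self._ip_by_mac = {}

    def learn_from_dhcp(self, client_mac, offered_ip):
        # Called when a DHCP offer for this client arrives from the wired side.
        self._ip_by_mac[client_mac] = offered_ip

    def allow_frame(self, src_mac, src_ip):
        # Drop anything whose source IP does not match what DHCP handed out.
        # This is what stops a client from posing as the default gateway.
        return self._ip_by_mac.get(src_mac) == src_ip

    def proxy_arp_reply(self, target_ip):
        # Answer ARP on behalf of a known client instead of broadcasting
        # the request over the air. Returns the MAC, or None if unknown.
        for mac, ip in self._ip_by_mac.items():
            if ip == target_ip:
                return mac
        return None
```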
Because the access point has no IP address in the user VLAN, users cannot communicate with the access point directly. Sometimes, however, we do need that, namely when the web frontend and the portal are located on the access point itself. In this case, a fictitious address, 1.1.1.1, is used.
What does Apple have to do with it
And why do we, like the subway network, always convince the iPhone that there is no portal?
From the way iPhones behave in wireless networks, I have come to the firm belief that the creators of this megaproduct envisioned only one scenario: a single access point. That is, either at home or in a hipster cafe. In the second case, there is a real chance of meeting a captive portal.
What does an iPhone do when it encounters several access points with one SSID and a captive portal? It tries every one available. On each, it connects, requests an address, checks a random URL from its long list (there used to be just one), realizes there is a captive portal, gives the address back (DHCP release), and disconnects. Since in our case the same SSID is broadcast from every access point on both 2.4 and 5 GHz, everything is multiplied by two. Having come to the logical conclusion "yes, there is an ambush everywhere!", the iPhone connects to one of the access points again and draws its minibrowser. In the terminology of our customers and clients, this process is called "my brand-new iPhone takes forever to connect to your network" and "at home everything flies on my 1000-ruble router". In a coordinated network (as opposed to standalone access points), every connection sends a "we have a new passenger" message to the domain manager, and in the case of MESH, over the mesh in parallel. The whole process takes up to 20 seconds.
What does an iPhone do when it sees the same SSID on 2.4 and 5 GHz at once? You thought you could balance clients between channels, access points, and bands, making the most of client capabilities and network capacity? Not with Apple products! On the network side, hearing requests from the client on both bands, we are entitled to assume that we can steer the client to where we want it by ignoring its requests on the access points where we do not want it to connect. Typically, clients take the hint and connect, for example, on 5 GHz. The iPhone keeps hammering on 2.4 GHz to the bitter end. For such persistent clients there is a separate counter (20 requests in a row by default). This also takes time.
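The steering logic with its "persistence" counter might look roughly like this. Only the default of 20 ignored requests comes from the text above; the class, the band labels, and the exact policy are assumptions for illustration.

```python
from collections import defaultdict


class BandSteering:
    """Ignore 2.4 GHz requests from dual-band clients, up to a limit."""

    def __init__(self, max_ignored=20):
        self.max_ignored = max_ignored
        self._ignored = defaultdict(int)   # MAC -> requests ignored in a row

    def should_answer(self, client_mac, band, seen_on_5ghz):
        # Always answer on 5 GHz, and always answer clients that were
        # never heard on 5 GHz (they probably cannot use it).
        if band == "5" or not seen_on_5ghz:
            self._ignored[client_mac] = 0
            return True
        # A persistent client has used up its counter: give in and let it
        # stay on 2.4 GHz rather than refuse service.
        if self._ignored[client_mac] >= self.max_ignored:
            return True
        self._ignored[client_mac] += 1
        return False
```

With the default setting, a stubborn client is ignored 20 times on 2.4 GHz and answered on the 21st request, which is where the extra connection delay comes from.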
The two processes described above occur not only on the initial connection but also when roaming, if you move far enough away. "Oh, there are new access points here. Well, let's check them..."
What does an iPhone do if it has launched the minibrowser and we (suddenly) need to send the client an SMS? It shows the SMS in a small banner at the top for about 3 seconds. A typical user cannot memorize 6 (six!) digits in that time. The banner slides away, the user taps the SMS, the minibrowser closes, DHCP release, disconnect, welcome to 3G. The user somehow manages to remember the code, goes into settings, connects to the network, enters the phone number, and gets a new SMS. And so on, and so on... In the terminology of customers and users, this is called "your captive portal does not work on my brand-new iPhone" and "they have already fixed this in the subway".
The situation can be fixed by passing the user's MAC to the web frontend (we can do that), remembering there that we have already sent this user an SMS, and asking for the code straight away on the second visit. Cookies will not help, because this minibrowser does not support them.
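A minimal sketch of this "remember by MAC" workaround is below. The class, the page names, and the `send_sms` callback are hypothetical; state is kept in memory, whereas a real frontend would need expiry and persistent storage.

```python
import secrets


class SmsFlow:
    """Remember by MAC that a code was already sent, since cookies do not work."""

    def __init__(self, send_sms):
        self._send_sms = send_sms      # callable(phone, code), assumed to exist
        self._pending = {}             # client MAC -> code we sent

    def page_for(self, client_mac):
        # On a repeat visit from the same MAC, skip the phone number form.
        return "ask_code" if client_mac in self._pending else "ask_phone"

    def request_code(self, client_mac, phone):
        code = f"{secrets.randbelow(10**6):06d}"   # six digits, zero-padded
        self._pending[client_mac] = code
        self._send_sms(phone, code)

    def check_code(self, client_mac, code):
        expected = self._pending.get(client_mac)
        if expected is None or expected != code:
            return False
        del self._pending[client_mac]
        return True
```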
What is the reason for such inadequate behavior? It's simple: the creators of the device set out never to leave you without connectivity.
Suppose you are visiting friends. They have a password-protected network, the kind hosts tell you the password, and voilà, here is the Internet. Your smartphone remembers the network, and on your next visit it connects automatically. But the hosts forgot to pay the provider, and this time you get no further than the router. That is, you did nothing, you did not even pick up the phone, and yet without realizing it you were cut off from the outside world. This is very bad. To avoid it, modern mobile devices run a multi-step check on every connection, the purpose of which is not to leave you without connectivity:
- We cannot get an IP address: disconnect.
- We see no ARP reply from the default gateway: disconnect.
- None of the DNS servers on the list answers: disconnect.
- We request a certain URL from one of our own domains and hope to see a page whose title and body both say "Success".
If the last step succeeds, we assume there is Internet access and drop off 3G. And this is done every time you connect to Wi-Fi. Even at home.
If instead of "Success" we see something else, then there is a captive portal. Time to launch the minibrowser. If the user fails to come to terms with the portal in that window in one go, we disconnect. The problem with the iPhone is that it hopes for the best to the very end. If you ask it to connect to a network that is visible on more than one access point, it will try every option. Time will be wasted. Most other devices, having seen the portal once, assume it is probably everywhere.
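The check sequence described above can be written down as one small function. It is a simplified model of the steps as this article lists them, not Apple's actual implementation; the argument names and result strings are made up.

```python
def classify_network(got_ip, gateway_answers_arp, dns_answers, probe_body):
    """Return 'disconnect', 'internet' or 'captive_portal'.

    probe_body is the body of the test page, or None if the request failed.
    """
    if not got_ip:
        return "disconnect"
    if not gateway_answers_arp:
        return "disconnect"
    if not dns_answers:
        return "disconnect"
    if probe_body is not None and "Success" in probe_body:
        # The expected page came back untouched: real Internet access.
        return "internet"
    # Something answered, but not with the expected page: most likely
    # a portal redirected the request, so the minibrowser is launched.
    return "captive_portal"
```

For example, a redirect to a login page yields `captive_portal`, while the untouched test page yields `internet`.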
The only way to stop this thrashing is to bypass portal detection. This can be done in two ways: by filtering on "User-Agent: CaptiveNetworkSupport", or by letting traffic to a certain list of domains through. In the subway, for example, iMessage works while the portal is still closed.
As a result of the bypass, the device believes it has Internet access, while in reality the network is reachable either partially or not at all. Either way this is very bad, because it effectively leaves the user without connectivity in a way the user cannot see.
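The User-Agent variant of the bypass comes down to one extra branch in the redirect logic. In the sketch below only the "CaptiveNetworkSupport" marker comes from the text; the function names and result strings are hypothetical.

```python
def is_detection_probe(headers):
    """True if the request looks like the iPhone's portal detection probe."""
    return "CaptiveNetworkSupport" in headers.get("User-Agent", "")


def handle_request(headers, portal_open):
    if portal_open:
        return "forward"
    if is_detection_probe(headers):
        # Pretend there is no portal: answer the probe with the page the
        # device expects, so it stops hopping between access points.
        return "fake_success"
    # Ordinary browser traffic is still sent to the web frontend.
    return "redirect_to_frontend"
```

The price is exactly the one described above: the device now considers the network healthy, and the user sees the portal only after opening a regular browser.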
On our equipment, detection is turned off with one command:
ap7131-ABCDEF(config-captive-portal-XXXXX)#bypass ?
  captive-portal-detection  Captive portal detection requests(e.g., Apple
                            Captive Network Assistant
ap7131-ABCDEF(config-captive-portal-XXXXX)#bypass captive-portal-detection