A Few Words About Path MTU Discovery Black Hole
In lieu of an introduction
Every true system administrator (or anyone acting as one) sooner or later faces a moment of truth: he has to set up a router on a computer running GNU/Linux. Those who have been through this know that there is nothing complicated about it and that a couple of commands are enough. So our admin finds these commands, types them into the console and proudly goes to tell the users that everything is working. Not so fast: the users say that their favorite sites do not open. After spending some part of his life digging into the details, he finds that most sites behave as follows:
1. When you open a page, only the title loads and nothing else;
2. The page hangs in this state indefinitely;
3. The browser status bar shows the whole time that the page is loading;
4. Ping and traceroute to the site work fine;
5. A telnet connection to port 80 also works fine.
The discouraged admin calls the provider's technical support, but they quickly brush him off, advising him to try setting up the router on Windows and, if that does not work either, to ... buy a hardware router.
I think this situation is familiar to many. Some have been in it themselves, some have friends who have, and some have met such admins on forums and elsewhere. So, if this is your situation: congratulations! You have run into a Path MTU Discovery Black Hole. This article explains why this happens and how to solve the problem.
Terms needed to understand the article
MTU (Maximum Transmission Unit): the maximum size (in bytes) of a packet that can be transmitted at the data link layer of the OSI model. For Ethernet it is 1500 bytes. If a larger packet arrives (for example, from Token Ring), its data is split into packets no larger than the MTU (i.e. no more than 1500 bytes). Splitting a packet to fit a smaller MTU is called fragmentation, and it is expensive for the router.
PMTU (Path MTU): the smallest MTU among all the links between the source and the destination.
PMTU discovery: a technique for determining the PMTU, designed to reduce the load on routers. It is described in RFC 1191 (1990). The idea is that the hosts set the DF (don't fragment) flag on the packets they send, which forbids fragmentation. A node whose outgoing link has an MTU smaller than the packet then drops the packet and sends back an ICMP Destination Unreachable message, which carries the MTU of that link. The sending host reduces the packet size and sends again. This repeats until the packet is small enough to reach the destination host without fragmentation.
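As a rough illustration of the procedure just described, here is a minimal Python sketch that simulates it over a list of link MTUs. The function name discover_pmtu and the example MTU values are made up for this illustration; real stacks also cache the result and cope with ICMP errors that carry no MTU value, which this sketch ignores.

```python
def discover_pmtu(link_mtus, first_hop_mtu):
    """Simulate PMTU discovery over a path given as a list of link MTUs.

    Every attempt sends one packet with DF set. The first link that is too
    small "answers" with its own MTU, and the sender retries at that size.
    Returns (pmtu, number_of_attempts).
    """
    size = first_hop_mtu
    attempts = 0
    while True:
        attempts += 1
        # First link on the path that cannot carry a packet of this size.
        bottleneck = next((mtu for mtu in link_mtus if mtu < size), None)
        if bottleneck is None:
            return size, attempts      # the packet reached the destination
        size = bottleneck              # MTU value reported in the ICMP error


# A path resembling the test network: one 1400-byte link in the middle.
print(discover_pmtu([1500, 1400, 1500], 1500))  # prints (1400, 2)
```

Each additional, ever smaller bottleneck on the path costs one more dropped packet and one more ICMP round trip.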
MSS (Maximum Segment Size): the largest chunk of data that TCP will send to the other end of the connection in one segment. It is calculated by the following formula:
MSS = interface MTU - IP header size (20 bytes) - TCP header size (20 bytes). Usually this gives 1460 bytes. When a connection is established, each side can announce its own MSS, and the smaller of the two values is used.
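The arithmetic is simple enough to check in a few lines of Python. The function names below are illustrative only, and the header sizes assume IPv4 and TCP headers without options.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options


def mss_for_mtu(mtu):
    """MSS a host would announce for an interface with the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def effective_mss(local_mtu, remote_mtu):
    """The smaller of the two announced values limits the connection."""
    return min(mss_for_mtu(local_mtu), mss_for_mtu(remote_mtu))


print(mss_for_mtu(1500))          # 1460
print(effective_mss(1500, 1400))  # 1360
```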
DF (Don't Fragment) flag: a bit in the flags field of the IP packet header which, when set to one, forbids fragmenting the packet. If a packet with this flag is larger than the MTU of the next hop, it is discarded and the sender receives an ICMP error: "fragmentation needed but don't fragment bit set".
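To make the position of the DF bit concrete, here is a small sketch that reads the flags from a raw IPv4 header. The header is built by hand purely for illustration (the checksum is left at zero and the addresses are simply the ones used later in the test network); ip_flags is a hypothetical helper name.

```python
import struct

DF_MASK = 0x4000      # "don't fragment" bit in the 16-bit flags/offset field
MF_MASK = 0x2000      # "more fragments" bit
OFFSET_MASK = 0x1FFF  # fragment offset, counted in 8-byte units


def ip_flags(header):
    """Return (df, mf, fragment_offset_in_bytes) for a raw IPv4 header."""
    (flags_and_offset,) = struct.unpack('!H', header[6:8])
    return (bool(flags_and_offset & DF_MASK),
            bool(flags_and_offset & MF_MASK),
            (flags_and_offset & OFFSET_MASK) * 8)


# Hand-made 20-byte header: version/IHL, TOS, total length, id, flags+offset,
# TTL, protocol (6 = TCP), checksum (not computed here), source, destination.
header = struct.pack('!BBHHHBBH4s4s',
                     0x45, 0, 1500, 5177, DF_MASK, 64, 6, 0,
                     bytes([192, 168, 0, 1]), bytes([172, 16, 5, 3]))
print(ip_flags(header))  # prints (True, False, 0)
```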
Test bench
This problem is best studied in practice (but not under time pressure, with the boss yelling in your ear). For this purpose I built the test network shown in Fig. 1.

Fig. 1. Test network.
This is a simplified version of the global network. Roles:
1. The computer named deb-serv-03 is our Linux router. Note that the MTU on its eth2 interface is reduced to 1400 bytes;
2. deb-serv-05 - client on the local network;
3. deb-home - a router located at the provider;
4. deb-serv - a web server on the Internet with which we want to exchange data. From www.site.local, which is hosted on it, we fetch a page 5.9 KB in size.
Of course, in reality the chain is much longer, but for an illustrative example this is enough. All computers on this network run Debian GNU/Linux 5.0 Lenny. I monitor the situation at different points of the network with tcpdump.
Normal PMTU discovery
First, let's see what happens on the network when a page is opened, and how packets travel from the web server. We look at the TCPDUMP #1 output (on eth0 of deb-serv):
1 IP 172.16.5.3.48547 > 192.168.0.1.80: Flags [S], seq 2947128725, win 5840, options [mss 1460...], length 0
2 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [S.], seq 757312786, ack 2947128726, win 5792, options [mss 1460...], length 0
3 IP 172.16.5.3.48547 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0
4 IP 172.16.5.3.48547 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117
5 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], ack 118, win 181, options [...], length 0
6 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1:2897, ack 118, win 181, options [...], length 2896
7 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
8 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1:1349, ack 118, win 181, options [...], length 1348
9 IP 192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1349:2697, ack 118, win 181, options [...], length 1348
10 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
I list only the first 10 packets and have cut everything superfluous from the standard tcpdump output. Let's go through it:
1. In lines 1 to 3 we see the TCP connection setup: the parties exchange SYN, SYN-ACK and ACK packets. Note the options field, namely the MSS parameter the parties exchange. It is 1460 bytes on both sides, so the largest packet they will send each other is 1460 (MSS) + 20 (TCP header) + 20 (IP header) = 1500 bytes.
2. In line 4, deb-serv-05 sends the request for the web page. Line 5 acknowledges receipt of this packet.
3. In line 6 we see the response being sent (that is, a piece of the web page). Probably because of how pcap captures on this interface (most likely segmentation offload), tcpdump sees one packet of 2948 bytes, while two packets of 1500 bytes each actually go onto the wire. The more detailed tcpdump output shows that the DF flag is set on this packet (more precisely, on both packets):
IP (tos 0x0, ttl 64, id 5177, offset 0, flags [DF], proto TCP (6), length 2948)
192.168.0.1.80 > 172.16.5.3.48547: Flags [.], seq 1:2897, ack 118, win 181, options [nop,nop,TS val 86620459 ecr 4922429], length 2896
4. When these data packets reach deb-serv-03, they are discarded because they do not fit into the link with MTU 1400 and cannot be fragmented (DF flag). In response, an ICMP type 3 code 4 message is generated: "ICMP 172.16.5.3 unreachable - need to frag (mtu 1400)", which we see in line 7 (line 10 shows the message for the second packet). This message carries the required MTU.
5. In lines 8 and 9 we see deb-serv, having learned that MTU = 1400, sending the same piece of the web page in packets of 1400 bytes. These packets reach deb-serv-05, which acknowledges them, and this repeats until the entire page has been transmitted. None of the subsequent packets will be larger than 1400 bytes.
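One detail that may look puzzling is why the capture shows 1448 and 1348 bytes of data rather than 1460 and 1360. Every data segment here also carries the TCP timestamp option (the nop,nop,TS part of the verbose output), which takes 12 bytes of the space counted by the MSS. A short sketch of that arithmetic, with helper names that are purely illustrative:

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options
TS_OPTION = 12   # two NOPs (1 byte each) + timestamp option (10 bytes)


def payload_per_packet(mtu, tcp_options=TS_OPTION):
    """Bytes of TCP data that fit into one packet of the given size."""
    return mtu - IP_HEADER - TCP_HEADER - tcp_options


def ip_length(payload, tcp_options=TS_OPTION):
    """Total IP length of a segment that carries `payload` bytes of data."""
    return payload + IP_HEADER + TCP_HEADER + tcp_options


print(payload_per_packet(1500))  # 1448, as in the first attempt
print(payload_per_packet(1400))  # 1348, after the ICMP message arrives
print(ip_length(2896))           # 2948, the oversized packet seen by tcpdump
```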
This example demonstrates the Path MTU discovery procedure described in RFC 1191. I have shown it in simplified form in Figure 2.

Figure 2. The procedure for determining PMTU.
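For readers who want to see where the MTU value travels, here is a hedged sketch that extracts it from the first 8 bytes of an ICMP message, following the layout given in RFC 1191 (type, code, checksum, two unused bytes, next-hop MTU). The sample message is packed by hand with a zero checksum, and next_hop_mtu is a hypothetical helper name.

```python
import struct

ICMP_DEST_UNREACHABLE = 3
ICMP_FRAG_NEEDED = 4


def next_hop_mtu(icmp_message):
    """Return the MTU from a 'fragmentation needed' message, else None."""
    icmp_type, code, _checksum, _unused, mtu = struct.unpack(
        '!BBHHH', icmp_message[:8])
    if (icmp_type, code) != (ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED):
        return None
    return mtu


# Hand-packed header resembling the message sent by deb-serv-03.
sample = struct.pack('!BBHHH', 3, 4, 0, 0, 1400)
print(next_hop_mtu(sample))  # prints 1400
```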
Encountering the Path MTU Discovery Black Hole
Now imagine that a new specialist has joined the provider and decided (for example, to protect against ICMP floods) to block ICMP packets from passing through deb-home, which is now in his charge. Let's see what happens:
TCPDUMP #1 output (on eth0 of deb-serv):
1 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [S], seq 1723325723, win 5840, options [mss 1460...], length 0
2 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [S.], seq 2482933888, ack 1723325724, win 5792, options [mss 1460...], length 0
3 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0
4 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117
5 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], ack 118, win 181, options [...], length 0
6 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:2897, ack 118, win 181, options [...], length 2896
7 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
8 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
9 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
10 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
TCPDUMP #2 output (on eth0 of deb-serv-03):
1 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [S], seq 1723325723, win 5840, options [mss 1460...], length 0
2 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [S.], seq 2482933888, ack 1723325724, win 5792, options [mss 1460...], length 0
3 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0
4 IP 172.16.5.3.57925 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117
5 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], ack 118, win 181, options [...], length 0
6 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
7 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
8 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1449:2897, ack 118, win 181, options [...], length 1448
9 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
10 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
11 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
12 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
13 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
14 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
15 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
16 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
17 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
18 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
19 IP 172.16.250.2 > 192.168.0.1: ICMP 172.16.5.3 unreachable - need to frag (mtu 1400), length 556
20 IP 192.168.0.1.80 > 172.16.5.3.57925: Flags [.], seq 1:1449, ack 118, win 181, options [...], length 1448
As you can see, the result is quite predictable. The first 6 lines of each output are exactly the same as in a normal transfer (see the description in the previous example). But then the differences begin. ICMP 3:4 is generated on deb-serv-03 as before (lines 7, 9, 11, 13, 15, 17 and 19 in TCPDUMP #2), but deb-serv never receives it and keeps sending 1500-byte packets (lines 6 onward in TCPDUMP #1 and lines 6, 8, 10, 12, 14, 16, 18 and 20 in TCPDUMP #2). The interval between retransmissions grows each time (I dropped the timestamps from these examples, but this is how the TCP retransmission mechanism works). As a result, no data larger than the PMTU can get through. Alas, TCP does not know this and keeps sending packets sized for the MSS chosen when the connection was established. This situation is called a Path MTU Discovery Black Hole. I have tried to show it in simplified form in Fig. 3.

Fig. 3. A black hole in PMTU discovery.
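The growing gaps between retransmissions can be pictured with a small sketch of exponential backoff. The 1-second initial timeout and the 120-second cap are illustrative assumptions, not values measured in this test network; real TCP stacks derive the timeout from the measured round-trip time.

```python
def retransmit_schedule(initial_rto, retries, max_rto=120.0):
    """Times (seconds after the first send) at which an unacknowledged
    segment is sent again, doubling the timeout after every attempt."""
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto
        times.append(elapsed)
        rto = min(rto * 2, max_rto)
    return times


# In a black hole every retransmission is dropped as well, so the sender
# just keeps waiting longer and longer.
print(retransmit_schedule(1.0, 5))  # prints [1.0, 3.0, 7.0, 15.0, 31.0]
```

This is why the page appears to hang rather than fail: the connection stays open while the sender retries ever more rarely.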
This problem is not new at all: it was described in RFC 2923 in 2000. Nevertheless, it keeps turning up with enviable persistence at many providers. And it is the provider who is to blame for this situation: there is no need to block ICMP type 3 code 4. What is more, providers usually do not want to listen to the "voice of reason" (that is, to customers who understand what the problem is).
Solving the PMTU Problem
We will not call technical support; instead, we will try to solve the problem with our own means.
Linux developers, who also know about this problem, have provided a special target in iptables. A quote from man iptables:
TCPMSS
This target allows to alter the MSS value of TCP SYN packets, to control the maximum size for that connection (usually limiting it to your outgoing interface's MTU minus 40 for IPv4 or 60 for IPv6, respectively). Of course, it can only be used in conjunction with -p tcp. It is only valid in the mangle table. This target is used to overcome criminally braindead ISPs or servers which block "ICMP Fragmentation Needed" or "ICMPv6 Packet Too Big" packets. The symptoms of this problem are that everything works fine from your Linux firewall/router, but machines behind it can never exchange large packets:
1) Web browsers connect, then hang with no data received.
2) Small mail works fine, but large emails hang.
3) ssh works fine, but scp hangs after initial handshaking.
Workaround: activate this option and add a rule to your firewall configuration like:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
-j TCPMSS --clamp-mss-to-pmtu
--set-mss value
Explicitly set MSS option to specified value.
--clamp-mss-to-pmtu
Automatically clamp MSS value to (path_MTU - 40 for IPv4; -60 for IPv6).
These options are mutually exclusive.
As you can see, they wrote quite a lot and even described the typical symptoms. They also called this behavior of providers "criminally braindead", and I fully agree with them. Let's see how this target works in our example. Add the recommended rule, here with an explicit MSS value, to deb-serv-03:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
And look at what happens. TCPDUMP #1 output (on eth0 of deb-serv):
1 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [S], seq 1484543117, win 5840, options [mss 1360...], length 0
2 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [S.], seq 2230206317, ack 1484543118, win 5792, options [mss 1460...], length 0
3 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0
4 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117
5 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], ack 118, win 181, options [...], length 0
6 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 1:2697, ack 118, win 181, options [...], length 2696
7 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1349, win 2184, options [...], length 0
8 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 2697:5393, ack 118, win 181, options [...], length 2696
9 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [FP.], seq 5393:6380, ack 118, win 181, options [...], length 987
10 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 2697, win 2908, options [...], length 0
TCPDUMP #3 output (on eth0 of deb-serv-05):
1 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [S], seq 1484543117, win 5840, options [mss 1460...], length 0
2 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [S.], seq 2230206317, ack 1484543118, win 5792, options [mss 1360...], length 0
3 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1, win 1460, options [...], length 0
4 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [P.], seq 1:118, ack 1, win 1460, options [...], length 117
5 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], ack 118, win 181, options [...], length 0
6 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 1:1349, ack 118, win 181, options [...], length 1348
7 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 1349:2697, ack 118, win 181, options [...], length 1348
8 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 1349, win 2184, options [...], length 0
9 IP 172.16.5.3.33792 > 192.168.0.1.80: Flags [.], ack 2697, win 2908, options [...], length 0
10 IP 192.168.0.1.80 > 172.16.5.3.33792: Flags [.], seq 2697:4045, ack 118, win 181, options [...], length 1348
Let's go through both outputs:
1. Lines 1-3 show the already familiar TCP connection setup. But note the MSS values. In TCPDUMP #1 the SYN from deb-serv-05 arrives at deb-serv with MSS 1360, while TCPDUMP #3 shows the same packet leaving deb-serv-05 with MSS = 1460. This is exactly what the rule with --set-mss 1360 does: it rewrites the MSS value in packets passing through. The value in the SYN-ACK travelling back is rewritten as well.
2. In lines 4 and 5 of both outputs we again see the GET request being sent and acknowledged.
3. In line 6 of TCPDUMP #1 and lines 6 and 7 of TCPDUMP #3 we see data packets being sent, but now no packet exceeds 1400 bytes. The same strange glitch shows up in TCPDUMP #1, where one large packet is visible, while TCPDUMP #3 shows two packets arriving.
4. The rest of the exchange follows the usual TCP rules, and no packet ever exceeds 1400 bytes.
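What the rule does to a passing SYN can be sketched in Python on the raw TCP options bytes. This is only an illustration of the idea, not the kernel's code: clamp_mss is a hypothetical helper, the sample options are built by hand, the sketch only ever lowers the value, and a real implementation would also have to update the TCP checksum.

```python
import struct

OPT_END, OPT_NOP, OPT_MSS = 0, 1, 2


def clamp_mss(options, limit):
    """Return a copy of raw TCP options with the MSS lowered to `limit`.

    An MSS that is already at or below the limit is left untouched.
    """
    out = bytearray(options)
    i = 0
    while i < len(out):
        kind = out[i]
        if kind == OPT_END:
            break
        if kind == OPT_NOP:        # NOP is a single byte without a length
            i += 1
            continue
        if i + 1 >= len(out):
            break                  # truncated option, stop parsing
        length = out[i + 1]
        if length < 2 or i + length > len(out):
            break                  # malformed option, stop parsing
        if kind == OPT_MSS and length == 4:
            (mss,) = struct.unpack_from('!H', out, i + 2)
            if mss > limit:
                struct.pack_into('!H', out, i + 2, limit)
        i += length
    return bytes(out)


# MSS 1460, then SACK-permitted, NOP and window scale, as in a typical SYN.
syn_options = struct.pack('!BBH', OPT_MSS, 4, 1460) + bytes([4, 2, 1, 3, 3, 7])
clamped = clamp_mss(syn_options, 1360)
print(struct.unpack_from('!H', clamped, 2)[0])  # prints 1360
```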
The behavior of MSS is shown in simplified form in Fig. 4. I did not show the data exchange, since it is the same as in the normal case.

Fig. 4. Changing MSS on the fly.
Although man iptables describes two options, I have used only one so far. Which one you need depends on the situation. All situations fall into two types:
1. Sites open normally on your router, but clients on the local network have problems.
In this case, the smallest MTU on the whole path is on your router. This is usually caused by an encapsulation protocol such as PPPoE or PPTP. The best option here is --clamp-mss-to-pmtu, which automatically sets the smallest suitable MSS on all transit packets.
2. Sites do not open either on your router or on clients in the local network.
In this case, the smallest MTU is somewhere at the provider, and it is hard to find with standard tools. For this purpose I wrote a small Python script (without worrying much about PEP 8 or foolproofing) that helps determine the required MSS for this situation:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import socket
import subprocess
import sys
import time

# Fully qualified name of the web server used for the test. Pick one of the
# sites that definitely does not work.
HOST = 'www.site.local'
# How long (in seconds) to wait for a response from the site. Too small a
# value may cause false positives; too large a value makes the script slow.
TIMEOUT = 25.0
# Number of bytes to request from the web server to be sure that it really
# works. It is recommended to set this larger than the MTU.
BUF = 3000
# MTU of the interface facing the Internet.
MTU = 1500
# MSS is searched for in the range from MTU-LIM-40 to MTU-40. Values larger
# than MTU are not allowed, and values above 100-200 are not recommended,
# because they may make the script run for a long time.
LIM = 100
# Delay (in seconds) between requests to the site. A non-zero value is
# recommended on a slow link.
TRY_TIME = 0


def set_mss(mss, action='A'):
    """Append ('A') or delete ('D') an OUTPUT rule that rewrites MSS in SYNs."""
    return subprocess.call([
        'iptables', '-t', 'mangle', '-%s' % action, 'OUTPUT', '-p', 'tcp',
        '--tcp-flags', 'SYN,RST', 'SYN', '-j', 'TCPMSS', '--set-mss', str(mss)])


def check_connection(host):
    """Return the number of bytes received from host (0 on error or timeout)."""
    sock = socket.socket()
    # Set the timeout before connecting so that connect() cannot hang either.
    sock.settimeout(TIMEOUT)
    try:
        sock.connect((host, 80))
        request = 'GET / HTTP/1.1\r\nHost: %s\r\n\r\n' % host
        sock.sendall(request.encode('ascii'))
        answer_size = len(sock.recv(BUF))
    except OSError:
        answer_size = 0
    finally:
        sock.close()
    return answer_size


def main():
    mss = MTU - 40
    if not check_connection(HOST):
        # The normal MSS does not work: start from the lowest candidate.
        mss = MTU - 40 - LIM
        set_mss(mss)
        if not check_connection(HOST):
            set_mss(mss, 'D')
            print("Error: Too small LIM")
            sys.exit(1)
        else:
            # Raise MSS one byte at a time until data stops arriving.
            while check_connection(HOST):
                time.sleep(TRY_TIME)
                set_mss(mss, 'D')
                if mss >= MTU - 40:
                    print("Error in determining MSS")
                    sys.exit(1)
                mss += 1
                set_mss(mss)
            # The current value failed, so the previous one is the answer.
            set_mss(mss, 'D')
            mss -= 1
    print('MSS = %d' % mss)


if __name__ == '__main__':
    main()
    sys.exit(0)
The script must be run with superuser privileges. It works as follows:
1. Try to fetch a certain amount of data from the site with the normal MSS value.
2. If that fails, lower the MSS in the iptables OUTPUT chain to MTU - 40 - LIM.
3. If the data still cannot be fetched, report an error saying that LIM is too small.
4. Increase the MSS step by step, looking for the point where data stops arriving, then print the last working MSS value.
5. If we reach MSS = MTU - 40, report an error saying that the MSS cannot be determined. This should not happen, because step 1 performs the same check; if the results disagree, that is a reason to investigate.
Once you have the required MSS, put it into the corresponding rule. You can do without the script and lower the MSS by eye, but it is better to find the exact value: there is less overhead when sending packets.
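The script walks through the candidates one byte at a time, and every failed probe costs a full TIMEOUT. If that turns out to be too slow, the same search could be done by bisection. This is a different technique from the one in the script, sketched here under the assumption that whenever some MSS works, every smaller one works too; the works callback is a stand-in for "set the rule, then call check_connection".

```python
def find_max_mss(works, low, high):
    """Largest MSS in [low, high] for which works(mss) is true.

    Returns None if even `low` fails. Assumes that works() is monotonic:
    if a value works, every smaller value works too.
    """
    if not works(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always advances
        if works(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Stand-in probe for illustration: pretend the path only carries MSS <= 1360.
probes = []


def fake_probe(mss):
    probes.append(mss)
    return mss <= 1360


print(find_max_mss(fake_probe, 1300, 1460))  # prints 1360
print(len(probes))                           # far fewer probes than 160
```

With the default LIM = 100 this needs around seven probes instead of up to a hundred, at the price of relying on the monotonicity assumption.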
On forums you can often find advice to lower the MTU on some interface. You need to understand that this is not a panacea, and the result depends on which interface you lower it on. If we lower the MTU on an interface of one of the TCP connection's endpoints, this will have an effect, since the announced MSS will then match the smaller packet size. But if we lower it not on an endpoint but on one of the transit routers, then without enabling the --clamp-mss-to-pmtu option there will be no effect.
I hope this article helps you solve this kind of problem, both at home and for your friends and acquaintances. Once again I appeal to the providers' specialists: DO NOT BLOCK ICMP TYPE 3 CODE 4 UNLESS ABSOLUTELY NECESSARY. It creates problems for your colleagues.