Linux QoS: U32 Filter

It so happened that the U32 filter in the Linux kernel traffic control subsystem is considered simple and straightforward, and therefore does not need detailed documentation. For example, in LARTC (Linux Advanced Routing and Traffic Control) there are only a few paragraphs about it. But in fact, U32 is much more complicated and interesting, but it is not as simple to use as it may seem. Under cat is an article on this filter with usage examples and detailed explanations.


Mapping


And so, the main function of the U32 filter is that it takes a block of data from the packet and compares it with the given value. If the values ​​match, then some actions are performed on the package. The data block in the package is set by the following parameters:

  • The size of the data block. Defined by parameters u32 / u16 / u8. Intuitively, numbers are the length of a block in bits. The kernel operates with 32bit blocks. In tc, you can set the block length to 8, 16 and 32 bits.
  • Bit mask. This is necessary if you need to check not the entire data block, but only its individual bits. Naturally, the length of the mask should match the length of the block. A mask is applied to the block using the bitwise AND operation, and the result of this operation is already compared with the set value.
  • Offset from the start of the packet. All offsets are aligned on a 32-bit boundary. Not everything is as obvious as it seems, and we will talk about this later, but for now this simplification is enough. An offset of zero, in most cases, corresponds to the beginning of the network layer header, i.e. top of the IPv4 / IPv6 packet.


For example, take the following tc command example:

tc filter add                         \
dev eth0                              \
parent 1:                             \
pref 10                               \
protocol ip                           \
u32                                   \
match u32 0xc0a80100 0xffffff00 at 12 \
classid 1:8


We add a filter of type u32 with a priority of 10 to discipline 1: 0 of eth0. And if the comparison is successful, then the package will fall into class 1: 8. The seventh line is especially interesting to us, we will consider it in detail:

  • match u32 - sets the beginning of the matching condition and the size of the data block to be taken from the packet.
  • 0xc0a80100 is the value with which we will compare the bits from the packet.
  • 0xffffff00 is a bitmask that will be superimposed on the data from the packet, and the result will already be compared with what we set. If our value and the masked data block are equal, then certain actions are performed. Most often, this action will be the classification of the package.
  • at 12 - offset from the beginning, at which the beginning of the data block is located. If this parameter is not specified, then the offset is considered equal to zero. At the very beginning of the filter, the zero boundary is considered the beginning of a network layer header, such as IPv4. Negative offsets are also possible.


With these actions we check the source address (this can be seen if you look at the format of the IPv4 header), and if it belongs to the 192.168.1.0/24 subnet, then we send the packet to the 1: 8 class. Of course, constantly digging into the RFC is too tedious, and it’s difficult to keep all the biases in mind, so tc provides syntactic sugar for commonly used cases. For example, our example becomes much clearer if you write it like this:

tc filter add               \
dev eth0                    \
parent 1:                   \
pref 10                     \
protocol ip                 \
u32                         \
match ip src 192.168.1.0/24 \
classid 1:8


You can specify several “match” parameters in one command, and the match will be successful only if all conditions are met. Let's send all packets with a source address of 192.168.1.0/24 and a ToS value of 0x10 (interactive traffic) to class 1: 1:

tc filter add               \
dev eth0                    \
parent 1:                   \
pref 10                     \
protocol ip                 \
u32                         \
match ip src 192.168.1.0/24 \
match ip tos 0x10 0x1e      \
classid 1:1


Check if everything works correctly:
#смотрим наш фильтр
~$ tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1
filter parent 1: protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 1911 success 0)
  match c0a80100/ffffff00 at 12 (success 0 )
  match 00100000/001e0000 at 0 (success 0 )
#если задать ключ -p, то tc будет некоторые смещения преобразовывать
#в более удобочитаемый вид, а другие - скрывать
~$ tc -s -p f ls dev eth0
filter parent 1: protocol ip pref 10 u32
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1
filter parent 1: protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 3413 success 0)
  match IP src 192.168.1.0/24 (success 0 )  (success 0 )
#пингуем что-нибудь с заданным адресом источника и значением ToS
$ ping -f -I 192.168.1.1 -Q 0x10 www.ya.ru
PING ya.ru (93.158.134.3) from 192.168.1.1 : 56(84) bytes of data.
--- ya.ru ping statistics ---
107 packets transmitted, 107 received, 0% packet loss, time 619ms
rtt min/avg/max/mdev = 4.492/5.240/7.560/0.536 ms, ipg/ewma 5.842/5.403 ms
#смотрим, отрабатывает ли фильтр как положено
~$ tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32 
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 354903 success 107)
  match c0a80100/ffffff00 at 12 (success 107 ) 
  match 00100000/001e0000 at 0 (success 107 )


As you can see, everything works as it should - packages get into the desired class. But if you carefully look at the output of tc, then some points are still unknown to us. For example, these cryptic fh values ​​are 800 :: 800. These are the so-called handles (tracing paper from the English “handle”) - filter identifiers inside U32.

Each handle of an individual filter consists of three hexadecimal numbers and is unique within the U32 filter space for an interface. In our case, this is the handle 800 :: 800. We are now interested in only the last digit - the number of the added filter. If you do not specify it, then the system itself assigns it - the next after the oldest, starting from 0x800. The filter number ranges from 0x001 to 0xfff. You can manually specify the filter number with the appropriate parameter:

tc filter add               \
dev eth0                    \
parent 1:                   \
pref 10                     \
protocol ip                 \
handle ::1                  \
u32                         \
match ip src 192.168.1.0/24 \
match ip tos 0x10 0x1e      \
classid 1:1


Filters are executed in the order of their numbers.

Binding


Let us add two filters:

tc filter add               \
dev eth0                    \
parent 1:                   \
pref 10                     \
protocol ip                 \
handle ::1                  \
u32                         \
match ip src 192.168.1.0/24 \
match ip tos 0x10 0x1e      \
classid 1:1
tc filter add               \
dev eth0                    \
parent 1:                   \
pref 10                     \
protocol ip                 \
handle ::2                  \
u32                         \
match ip src 192.168.1.0/24 \
match ip tos 0x08 0x1e      \
classid 1:2


Both filters will have the same prefix - the first two numbers of the handle (800: 0). If you imagine them written down one after another in the order of the number - the last number of the handle, you get a list of filters:

800: 0 list:
:: 1 - ip src 192.168.1.0/24 tos 0x10 -> classid 1: 1
:: 2 - ip src 192.168.1.0/24 tos 0x08 -> classid 1: 2


Since in the filters of one list the handles differ only in the last numbers, the list itself can be identified by the first two numbers of handles - by their prefix (800: 0). As previously mentioned, filters are executed in the order they are followed. If the match was successful in any of the filters, some action is taken and U32 exits. If we reach the end of the list and have not received a single successful match, then the U32 filter will return the unclassified packet to the top.

u32 simple flow chart

In addition to classification, the action can be a transition to another list of filters to check the package. This is done using the "link" parameter instead of the "classid" (even if the "classid" parameter is specified, it is still ignored), for example, like this:

tc filter add               \
dev eth0                    \
parent 1:                   \
prio 10                     \
protocol ip                 \
handle ::2                  \
u32                         \
match ip src 192.168.1.0/24 \
link 1:


If the packet has a source address from the subnet 192.168.1.0/24, then a check starts on the filter list with a handle of 1: 0. If there are no successful comparisons in it, then we will return to the previous 800: 0 list and continue checking on it. In turn, filters from the 1: 0 list can transition to other lists, and those to the third, etc. And so up to seven transitions (for whom this is not enough, you can change the macro substitution TC_U32_MAXDEPTH in the kernel source).



It should also be taken into account that splitting a large number of filters into many lists and organizing such transitions on average will not be much faster than checking against one large list. But, as a rule, binding is used in conjunction with another U32 mechanism - hashing.

Hashing


Before that, we looked at filter lists. But in reality they are only part of a larger structure called a hash table. The hash table, in this case, is a one-dimensional array of cells (English array of buckets), each of which stores one filter list.

In handles, just the first number is the hash table number, and the second is the cell number. Hash table numbers range from 0x000 to 0xfff, and cell numbers range from 0x00 to 0xff. The number of cells can be from 1 to 256, and it must be a power of two (you cannot set another value - tc will give an error message). The hash table with the number 0x800 is called the root and consists of one cell, it is created automatically. A package check always starts by looking at the list in cell 800: 0.

You can create additional hash tables like this:

tc filter add     \
dev eth0          \
parent 1: pref 10 \
protocol ip       \
handle 1:         \
u32 divisor 1


Where “handle 1:” sets the hash table number, and “divisor 1” - the number of cells in it (in this case, the hash table with number 1 has only one cell in which the list of filters 1: 0 will be located).

We will expand our example with classification by source address and ToS field using jumps through lists:

#добавляем классовую дисциплину prio с восемью дочерними классами
~$ tc q add \
dev eth0 \
root \
est 0.1s 10s \
handle 1: \
prio bands 8
#добавляем хэш-таблицу 1: с одной ячейкой
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle 1: \
u32 \
divisor 1
#добавляем сопоставление по айпи-адресу источника
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle ::1 \
u32 \
match ip src 192.168.1.0/24 \
link 1:
#добавляем сопоставление по полю ToS в хэш-таблицу 1:
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle ::1 \
u32 \
ht 1: \
match ip tos 0x08 0x1e \
classid 1:3
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle ::2 \
u32 \
ht 1: \
match ip tos 0x10 0x1e \
classid 1:1
#для того, чтобы показать, что происходит возврат в прежний
#список, добавим ещё одно правило
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle ::2 \
u32 \
match ip src 192.168.1.0/24 \
classid 1:7
#просмотрим наши фильтры
~$ tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32
filter parent 1: protocol ip pref 10 u32 fh 1: ht divisor 1
filter parent 1: protocol ip pref 10 u32 fh 1::1 order 1 key ht 1 bkt 0 flowid 1:3 (rule hit 0 success 0)
  match 00080000/001e0000 at 0 (success 0 )
filter parent 1: protocol ip pref 10 u32 fh 1::2 order 2 key ht 1 bkt 0 flowid 1:1 (rule hit 0 success 0)
  match 00100000/001e0000 at 0 (success 0 )
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1
filter parent 1: protocol ip pref 10 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1: (rule hit 33900 success 0)
  match c0a80100/ffffff00 at 12 (success 0 )
filter parent 1: protocol ip pref 10 u32 fh 800::2 order 2 key ht 800 bkt 0 flowid 1:7 (rule hit 3583 success 0)
  match c0a80100/ffffff00 at 12 (success 0 )
#а теперь проверим работу, послав пакеты с различными значениями
#ToS из подсети 192.168.1.0/24, а затем проверив значения счётчиков
~$ ping -fc10 -I 192.168.1.1 -Q 0x08 8.8.8.8
~$ ping -fc15 -I 192.168.1.1 -Q 0x10 www.kernel.org
~$ ping -fc25 -I 192.168.1.1 -Q 0xaa www.habrahabr.ru
#смотрим ещё раз счётчики
~$ tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32 
filter parent 1: protocol ip pref 10 u32 fh 1: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 1::1 order 1 key ht 1 bkt 0 flowid 1:3 (rule hit 50 success 10)
  match 00080000/001e0000 at 0 (success 10 ) 
filter parent 1: protocol ip pref 10 u32 fh 1::2 order 2 key ht 1 bkt 0 flowid 1:1 (rule hit 40 success 15)
  match 00100000/001e0000 at 0 (success 15 ) 
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1: (rule hit 578192 success 0)
  match c0a80100/ffffff00 at 12 (success 50 ) 
filter parent 1: protocol ip pref 10 u32 fh 800::2 order 2 key ht 800 bkt 0 flowid 1:7 (rule hit 547850 success 25)
  match c0a80100/ffffff00 at 12 (success 25 ) 


But using the “link” parameter, you can only go to the filter lists located in cell number 0. How, then, can check the lists in other cells? To do this, the hashkey parameter and the hash mechanism are used. The meaning of hashing in U32 is that the system receives the number of the cell to the list in which it is based on the data from the packet. This is necessary to reduce the number of checks. Unlike a list, the viewing time of which linearly depends on the number of filters in it, hashing is performed for a constant time (moreover, a very short one). Using large numbers of filters, hashing can achieve an order of magnitude greater performance than using lists and transitions alone.



We rewrite our previous example using hashing:

#добавляем хэш-таблицу с номером 1: на 32 ячейки
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle 1: \
u32 \
divisor 32
#номер ячейки в данном случае совпадает со значением ToS
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle 1:08:1 \
u32 \
ht 1:08 \
match u32 0 0 \
classid 1:3
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle 1:10:1 \
u32 \
ht 1:10 \
match u32 0 0 \
classid 1:1
#все пакеты с адресом источника 192.168.1.0/24 отправляем
#на проверку в хэш-таблицу 1: в ячейку равным значению полю ToS
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle ::1 \
u32 \
match ip src 192.168.1.0/24 \
link 1: \
hashkey mask 0x001f0000 at 0
#правило для проверки того, что неклассифицированные пакеты
#возвращаются обратно в корневую таблицу
~$ tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle ::2 \
u32 \
match ip src 192.168.1.0/24 \
classid 1:7
#посылаем пакеты с различным значением ToS
~$ ping -fc10 -I 192.168.1.1 -Q 0x08 8.8.8.8
~$ ping -fc30 -I 192.168.1.1 -Q 0x10 www.kernel.org
~$ ping -fc50 -I 192.168.1.1 -Q 0xaa www.habrahabr.ru
#и проверяем значения счётчиков фильтров
~$ tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32
filter parent 1: protocol ip pref 10 u32 fh 1: ht divisor 32
filter parent 1: protocol ip pref 10 u32 fh 1:8:1 order 1 key ht 1 bkt 8 flowid 1:3 (rule hit 10 success 10)
  match 00000000/00000000 at 0 (success 10 )
filter parent 1: protocol ip pref 10 u32 fh 1:10:1 order 1 key ht 1 bkt 10 flowid 1:1 (rule hit 30 success 30)
  match 00000000/00000000 at 0 (success 30 )
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1
filter parent 1: protocol ip pref 10 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1: (rule hit 30135 success 0)
  match c0a80100/ffffff00 at 12 (success 90 )
    hash mask 001f0000 at 0
filter parent 1: protocol ip pref 10 u32 fh 800::2 order 2 key ht 800 bkt 0 flowid 1:7  (rule hit 27250 success 50)
  match c0a80100/ffffff00 at 12 (success 50 )


Consider the hashing algorithm in our example. This algorithm was introduced in kernel version 2.6 and has not yet been changed (in the kernel version 2.4, the algorithm is different).

  • Take a 32-bit word at the “at” offset specified in the “hashkey” parameter. Since our offset is zero, the first 32 bits of the IP packet are taken.
  • Apply a bitmask from the same parameter to it. The bit mask for us is "0x001f0000", which corresponds to the position of the ToS field.
  • Shift the result in the direction of the least significant bits by n-bits. The value of n is calculated from the mask and is equal to the number of least significant zero bits in it. In our mask, the number of least significant zero bits is 16, which means that the masked double word will be shifted to these 16 bits. After that, the ToS value will appear in the lower five digits.
  • Apply a 0xff mask to the resulting value. In this case, it will not change anything. In other cases, it will reset all octets except the minor one.
  • And finally, if the target table has less than 256 cells, a bit mask equal to the number k is superimposed on the result. The value of k is one less than the cells in the hash table that will be transitioned to. Our target table has 32 cells, which means the value of k will be 31. After applying this mask, our result is guaranteed to be no more than the number of cells.


Unfortunately, tc does not provide any means to simplify the writing of offsets in this case, so it will not work to write something like “hashkey ip tos”. Another difficulty is determining which name box to put filters in. There is only one way - to manually read the hash (in tc there is the “sample” parameter, which allows you to automatically calculate the hash in order to place the filter in the desired cell, but the hashing algorithm is still not suitable for kernel 2.4 and for more recent kernels).

Offsets


Everything would be simple if the headers had a fixed length. But, unfortunately, this is not so - headers may have additional optional elements, which makes it very difficult to compare the fields of headers of the next levels. Fortunately, U32 also provides for this. This function is called “header offsets” and is designed to find out the offset of the next header from the package itself.

But first, we’ll take a look at two new concepts of the U32 filter: constant offset (for brevity, we will designate it as “permoff”) and temporary displacement (we will designate it as “tempoff”). The permoff value is always added to all offsets during the next and further transitions made using the “link” parameter, and is used to calculate the new permoff and tempoff values. The tempoff value is added only to the indicated offsets only during the next transition if the keyword “nexthdr +” is used in the parameters and is not affected by further transitions. When you return back to the previous list, the values ​​are "rolled back".

Consider a small hypothetical example. Let permoff be zero when executing the list 1: 0, and we have a successful match; and the new permoff value became equal to 20, while we moved on to the execution of the 2: 5 list. In this case, the value 20 will be added to all the indicated offsets in the filters of the 2: 5 list. If in the 2: 5 list we will have another successful match with the transition to the 3:12 list, and the constant offset will be changed to 8, then the new the permoff value will become 28. And when executing the list 3:12, the value 28 will already be added to all offsets. If there are no successful matches in the list 3:12, then U32 will go back to the list 2: 5 and permoff will again be 20. If and there will not be any successful comparisons, and we will return to the list 1: 0, then permoff will again be equal null zero. In fact, we simply shift the zero mark, with respect to which we further indicate the offsets. The permoff and tempoff values ​​are initially zero.

Take a look at the RFC and look at the IPv4 header format. Fortunately for us, it has a special field, the value of which is equal to the length of the title in double words. And we can use this information for that. to calculate the position of the next level header.

It is done like this:
tc filter add                              \
dev eth0                                   \
parent 1:                                  \
pref 10                                    \
protocol ip                                \
u32                                        \
match ip src 192.168.1.0/24                \
match protocol 0x01 0xff                   \
link 1:                                    \
offset at 0 mask 0x0f00 shift 6 plus 0 eat 


Consider in detail the last line and what happens when this happens:
  • offset - indicates that we are changing the permoff or tempoff values ​​when matching is successful.
  • A single word is taken at the offset specified in the “at” parameter. In our case, it is zero, so the first 16 bits of the packet will be taken. The value of the parameter “at” is reduced to a multiple of two.
  • A bitmask is superimposed on the value received from the packet. We have it equal to 0x0f00, which corresponds to the position of the IPv4 header length field. Suppose that we have the value 5 in this field (in binary representation 0101). In binary format, the result of previous operations will be equal to "0000.0101.0000.0000".
  • We know that the value of the IP header length field is equal to the length of the header in double words. So, it is necessary to somehow convert the obtained value into bytes. The matter is complicated by the fact that as a result, the necessary bits are additionally offset. We shift the value so that it appears in the lower octets - we shift it to the lower bits by 8 bits and get "0000.0000.0000.0101". Now we multiply this by 4, since the size of the title is indicated in double words. Multiplication is replaced by a shift in the direction of the higher bits by 2 bits. After the shift, we get "0000.0000.0001.0100". We got the right value. Combine both shift operations into one. As a result, in order to get the length of the header in bytes, we will need to shift the initial masked value by six bits in the direction of the least significant bits. This is done using the “shift” parameter.
  • The value of the plus parameter is added to the obtained value. In our case, this is 0, so nothing is added. You could not even specify this parameter.
  • The eat parameter says that the computed value is added to the permoff value. Otherwise, the tempoff value will be computed.
  • As a result, when checking filters from the 1: 0 list, the zero offset will correspond to the beginning of the next header, and we can build filters based on matching ICMP header fields, for example, checking message type values. If you change the tempoff value, then you must explicitly specify it in the called list, using the "nexthdr +" keyword to specify the offsets.


After that, we can add to the list 1: 0 filters with matching by type of ICMP messages. For example, ICMP packets of type Echo-Request will be sent to class 1: 8.

~$tc f add \
dev eth0 \
parent 1: \
pref 10 \
protocol ip \
handle 1::1 \
u32 \
ht 1: \
match u8 0x08 0xff \
classid 1:8
#посмотрим на счётчики фильтров
~$ sudo tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32 
filter parent 1: protocol ip pref 10 u32 fh 1: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 1::1 order 1 key ht 1 bkt 0 flowid 1:8  (rule hit 0 success 0)
match 08000000/ff000000 at 0 (success 0 ) 
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1:  (rule hit 48553 success 0)
  match c0a80100/ffffff00 at 12 (success 0 ) 
  match 00010000/00ff0000 at 8 (success 0 ) 
    offset 0f00>>6 at 0  eat
#отправим несколько эхо-реквестов
~$ ping -fc5 -I 192.168.1.1 www.ixbt.com
#и снова взглянем на значения счётчиков
~$ tc -s f ls dev eth0
filter parent 1: protocol ip pref 10 u32 
filter parent 1: protocol ip pref 10 u32 fh 1: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 1::1 order 1 key ht 1 bkt 0 flowid 1:8  (rule hit 5 success 5)
  match 08000000/ff000000 at 0 (success 5 ) 
filter parent 1: protocol ip pref 10 u32 fh 800: ht divisor 1 
filter parent 1: protocol ip pref 10 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1:  (rule hit 149972 success 0)
  match c0a80100/ffffff00 at 12 (success 5 ) 
  match 00010000/00ff0000 at 8 (success 5 ) 
    offset 0f00>>6 at 0  eat 


That's all. In the original you will find some more information - a dry guide and a list of comparisons (the same syntactic sugar).

Original - The u32 filter. - the author, unfortunately, is unknown to me, but everything points to Russell Stewart.

Also popular now: