How Kad Network Search Works

Quite often there are complaints that there is no keyword search in the Mainline DHT bittorrent. Typically, requests to add such a search on the BitTorrent Inc. forums ignored or received the traditional answer that DHT does not allow searching by keywords, but only by individual keys. This is true in principle, but there is such a way out of this unfortunate situation as creating an inverted keyword index for each node.

Actually, this is how the search in the Kad Network, the implementation of a distributed hash table, based on the fairly widely used Kademlia protocol, works. Kad Network is used by programs such as eMule, iMule, aMule and MLDonkey to search for file hashes by keywords and file sources by their hashes.

A general description of the search process on the Kad Network.
(For a better understanding, it’s useful to imagine how Kademlia works)

So, each node in the Kad Network has a 128-bit ID, which is randomly generated every time you enter the network. For hashing files and words, an algorithm based on MD4 is used, which gives the distribution of key hashes in the same space as the ID of all nodes. Each node is responsible for storing several types of information. Firstly, each node stores a map of sources, i.e. Information about network members who are willing to share a specific file. A map is a hash of a file as a key and a list of sources as a value. The node source map contains contact information on the hashes of all files whose hash matches or is close to the ID of the node (XOR metric is used). So, having a hash of a file, you can find nodes with IDs that are close to the hash of the file (a specific search mechanism roughly corresponds to which is used in Kademlia) and ask them for a list of sources for downloading the file. Secondly, if there is no file hash, then you can search by keywords. To do this, the name of the published file is divided into several consecutive words, and each word is hashed separately. In order to search for these keywords, each node has a second map, which contains hashes of words close to the ID of the node itself, as keys and file hashes and their names as values.

Now in more detail about how the search is carried out specifically in eMule.

When the search process is initiated, the corresponding object is created and entered into the search map. Each search query has a target key (file hash, word hash) that is searched on the Kad Network. Searches are performed using two parallel processes.
The first process searches for nodes with IDs that are close to the hash key of the desired object. To find the necessary nodes, the searching node takes 50 nodes with the IDs closest to the key of the object to be searched from its routing table and places them in the temporary map. Then, the searching node starts accessing these 50 nodes in order to get the nodes closer to the key of the desired object. Only 3 nodes are interrogated at a time. The searching node sends a KADEMLIA_REQ message with such parameters as the object key (word hash, file hash or node ID) and the number of requested nodes (i.e. how many nodes the requested node should return as an answer). The last parameter depends on the type of search query. If this is a node search, then the parameter is 11, if it is a source search or a word search, then - 2. When a node receives such a message, it responds with a KADEMLIA_RES message with the requested number of contacts (nodes). When the searching node receives this message, it adds the received contacts to its routing table and to the temporary map of the current search. Then the searching node again polls the three nodes closest to the object key, but only if their ID is closer to the key than the ID of the nodes that provided their contact information. This procedure is repeated until there are new nodes in the temporary map whose ID is closer to the key of the object being searched. If the nodes do not respond to the KADEMLIA_REQ request, then they are deleted. who provided their contact information. This procedure is repeated until there are new nodes in the temporary map whose ID is closer to the key of the object being searched. If the nodes do not respond to the KADEMLIA_REQ request, then they are deleted. who provided their contact information. This procedure is repeated until there are new nodes in the temporary map whose ID is closer to the key of the object being searched. If the nodes do not respond to the KADEMLIA_REQ request, then they are deleted.

The second process sends the nodes that responded with a KADEMLIA_RES message, starting with the closest, the corresponding search request, depending on what is being searched. It can be SEARCH_REQ to search for sources or files by keywords, PUBLISH_REQ to publish sources or keywords, etc. These requests use an additional flag, which takes the values ​​0 or 1, to determine what exactly is being requested - sources or hashes with names by keywords. When a node receives such a search query, it must process it (search for the requested information in its keyword map or source map) and respond with the SEARCH_RES message with the selected results. For each type of result, there is a limit value, upon reaching which the search stops. To search for sources or search by keywords, this limit is 300 received original objects. Also, the search for sources or files stops if 45 seconds have passed since the start of the search.

Useful articles:
Petar Maymounkov, David Mazières "Kademlia: A Peer-to-peer information system based on the XOR Metric".
Detailed description of the Kad Network protocol:
David Mysicka "Reverse Engineering of eMule: An Analysis of the Implementation of Kademlia in eMule".
Stefan Schmid, Thomas Locher "When KAD meets BitTorrent - Building a Stronger P2P Network."

Also popular now: