Kademlia (DHT) - A Practical Guide

    We will look at DHT through the lens of one particular implementation, Kademlia. DHT stands for distributed hash table and is a way to build a decentralized information-exchange network. Everything described below works in a client for ED2K networks on the Android platform and as a daemon on Linux. Implementation details follow.

    Resource


    Each host has its own identifier. In addition, any resource in the DHT, whether a keyword or a file, also has an identifier. Hash values serve as identifiers: SHA1 in torrents, MD4 in ED2K. The only difference is the length, 160 bits versus 128. So, to publish or find anything on the network, you need the hash of the resource in question.

    To publish a file, Kademlia reuses the hash from the ED2K network, though with a caveat. While the file hash itself is serialized simply as a sequence of bytes, the KAD identifier is serialized as a sequence of four 32-bit words. This is a peculiarity of Kademlia.

    We have to rotate the bytes:

    /**
     * the plain hash is serialized as a sequence of bytes, but the KAD id
     * is stored as four 32-bit words in network byte order, so on both
     * load and save we reverse the bytes within each 4-byte portion
     */
    @Override
    public ByteBuffer get(ByteBuffer src) throws JED2KException {
        for (short i = 0; i < value.length; ++i) {
            byte b = src.get();
            value[(i / 4) * 4 + 3 - (i % 4)] = b;
        }
        return src;
    }

    @Override
    public ByteBuffer put(ByteBuffer dst) throws JED2KException {
        for (short i = 0; i < value.length; ++i) {
            dst.put(value[(i / 4) * 4 + 3 - (i % 4)]);
        }
        return dst;
    }
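
    To illustrate the effect of the index arithmetic, here is a minimal standalone sketch (not part of the project's classes) that applies the same transform to a 16-byte hash and prints the result:

    // Standalone illustration: each 4-byte word of the id is byte-reversed.
    public class KadIdRotationDemo {
        public static void main(String[] args) {
            byte[] plain = new byte[16];
            for (int i = 0; i < 16; ++i) plain[i] = (byte) i; // 00 01 02 ... 0f
            byte[] kad = new byte[16];
            for (int i = 0; i < 16; ++i) {
                kad[(i / 4) * 4 + 3 - (i % 4)] = plain[i];
            }
            // prints: 03 02 01 00 07 06 05 04 0b 0a 09 08 0f 0e 0d 0c
            for (byte b : kad) System.out.printf("%02x ", b);
            System.out.println();
        }
    }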
    

    Serialization in ED2K networks
    Historically, all primitives longer than one byte are serialized in the same order as on the original target platform, namely LITTLE ENDIAN (the most significant byte at the highest address). I suspect this happened because the eDonkey/eMule client was originally developed only for Windows and the corresponding x86 architecture. The exception was the MD4 hash, serialized as a plain sequence of bytes. The IP address is stored in a 32-bit unsigned integer, LITTLE ENDIAN, and written to the packet as such. Kademlia support was added later, most likely by other people. For some reason they decided that network byte order should apply, perhaps because it is (or is considered) the standard for TCP/IP networks. But they applied the new byte order only to the IP address: when written to KAD packets it is converted to BIG ENDIAN, while everything else remains as it was. The new KadId primitive was added not as a hash but as an array of four 32-bit numbers, which is why reading and writing a KAD identifier requires the transformations shown above.
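
    A small sketch of that byte-order difference: the same IPv4 address written the way eDonkey/eMule packets store it (little endian) and the way KAD packets do (network byte order):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // The address 1.2.3.4 as a 32-bit integer, serialized both ways.
    public class IpByteOrderDemo {
        public static void main(String[] args) {
            int ip = (1 << 24) | (2 << 16) | (3 << 8) | 4; // 1.2.3.4

            ByteBuffer ed2k = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            ed2k.putInt(ip); // bytes on the wire: 04 03 02 01

            ByteBuffer kad = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
            kad.putInt(ip);  // bytes on the wire: 01 02 03 04

            System.out.printf("ed2k: %s%nkad:  %s%n",
                    toHex(ed2k.array()), toHex(kad.array()));
        }

        static String toHex(byte[] a) {
            StringBuilder sb = new StringBuilder();
            for (byte b : a) sb.append(String.format("%02x ", b));
            return sb.toString().trim();
        }
    }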

    To publish keywords, file names are split into words and each word is hashed; the resulting hashes are published. A hash can be viewed either as a large big-endian integer (most significant bytes at the lowest addresses, i.e. at the beginning) or simply as an array of bytes.
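
    A rough sketch of the keyword extraction step. MD4 is not in the stock JDK, so this assumes the BouncyCastle provider; the tokenization details (separator set, three-character minimum) are assumptions here, not a statement of eMule's exact rules:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.Security;
    import org.bouncycastle.jce.provider.BouncyCastleProvider;

    // Split a file name into words and publish an MD4 hash per word.
    public class KeywordHashDemo {
        public static void main(String[] args) throws Exception {
            Security.addProvider(new BouncyCastleProvider());
            MessageDigest md4 = MessageDigest.getInstance("MD4");
            String fileName = "Ubuntu 20.04 desktop amd64.iso";
            for (String word : fileName.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (word.length() < 3) continue; // assumed minimum keyword length
                byte[] hash = md4.digest(word.getBytes(StandardCharsets.UTF_8));
                System.out.printf("%-10s -> %s%n", word, toHex(hash));
            }
        }

        static String toHex(byte[] a) {
            StringBuilder sb = new StringBuilder();
            for (byte b : a) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }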

    The operation of the DHT rests on the ability to compute a distance between hashes, which lets a lookup approach its target step by step, shrinking the distance each time. In the code below, MD4.HASH_SIZE is the hash length in bytes: 16, i.e. 128 bits.

    Just xor:

    // returns the distance between the two nodes
    // using the kademlia XOR-metric
    public static KadId distance(final KadId n1, final KadId n2) {
        assert n1 != null;
        assert n2 != null;
        KadId ret = new KadId();
        for(int i = 0; i < MD4.HASH_SIZE; ++i) {
            ret.set(i, (byte)(n1.at(i) ^ n2.at(i)));
        }
        return ret;
    }
    

    A comparator of two hashes relative to some reference hash. Typically the reference is the host's own hash.

    // returns -1 if: distance(n1, ref) < distance(n2, ref)
    // returns 1 if: distance(n1, ref) > distance(n2, ref)
    // returns 0 if: distance(n1, ref) == distance(n2, ref)
    public static int compareRef(final KadId n1, final KadId n2, final KadId ref) {
        for (int i = 0; i != MD4.HASH_SIZE; ++i) {
            int lhs = (n1.at(i) ^ ref.at(i)) & 0xFF;
            int rhs = (n2.at(i) ^ ref.at(i)) & 0xFF;
            if (lhs < rhs) return -1;
            if (lhs > rhs) return 1;
        }
        return 0;
    }
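
    As a usage example, the comparator plugs straight into a sort; here knownIds and target are assumed to be in scope, not names from the project:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative use of compareRef: order known ids by closeness to a target.
    List<KadId> nodes = new ArrayList<>(knownIds);
    nodes.sort((a, b) -> KadId.compareRef(a, b, target));
    KadId closest = nodes.get(0); // the id nearest to target by the XOR metric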
    

    This is, so to speak, the workhorse distance function: it computes the distance exponent used to decide which K-bucket a node belongs to.

    // returns n in: 2^n <= distance(n1, n2) < 2^(n+1)
    // useful for finding out which bucket a node belongs to
    public static int distanceExp(final KadId n1, final KadId n2) {
        int bt = MD4.HASH_SIZE - 1;
        for (int i = 0; i != MD4.HASH_SIZE; ++i, --bt) {
            assert bt >= 0;
            int t = (n1.at(i) ^ n2.at(i)) & 0xFF;
            if (t == 0) continue;
            assert t > 0;
            // we have found the first non-zero byte
            // return the bit-number of the first bit
            // that differs
            int bit = bt * 8;
            for (int b = 7; b >= 0; --b)
                if (t >= (1 << b)) return bit + b;
            return bit;
        }
        return 0;
    }
    

    Network connection


    First of all, the client generates its own identifier. This is a random MD4 hash (SHA1 for torrents). Data such as the address, port and the like can be mixed in during generation to reduce the chance of collisions.
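
    A minimal sketch of such generation; the specific mixing scheme shown here is an assumption for illustration, not the eMule algorithm:

    import java.nio.ByteBuffer;
    import java.security.SecureRandom;

    // Generate a 16-byte (MD4-sized) node id: random bytes, optionally folded
    // together with the local address and port.
    public class NodeIdDemo {
        public static byte[] generateId(byte[] ip, int port) {
            byte[] id = new byte[16];
            new SecureRandom().nextBytes(id);
            byte[] salt = ByteBuffer.allocate(ip.length + 4)
                    .put(ip).putInt(port).array();
            for (int i = 0; i < salt.length; ++i) {
                id[i % id.length] ^= salt[i]; // fold address/port into the id
            }
            return id;
        }
    }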

    One point that looks puzzling at first is how the hash a node assigns to itself relates to the resources it provides. The answer: in no way. Choosing the hash at random means the client will be responsible for resources whose hashes happen to be close to its own; other clients will come to it to publish and to search. The client will likewise publish its own resources on other nodes, listing itself as a source.

    Although the DHT is a decentralized network, joining it requires knowing at least one node that is already connected. Knowing the address of such a node, the client sends a special bootstrap request and receives a list of nodes in response. It can then send bootstrap requests to those nodes, and so on. There are also websites from which you can download files with sets of nodes in ED2K format.
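
    The loop can be sketched roughly as follows; Endpoint, sendBootstrap, routingTable and TARGET_NODES are hypothetical placeholders for the client's real types, not the project's API:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // A schematic bootstrap: query a seed node, feed the returned nodes into
    // the routing table, and keep querying until the table is populated.
    void bootstrap(Endpoint seed) {
        Deque<Endpoint> pending = new ArrayDeque<>();
        Set<Endpoint> seen = new HashSet<>();
        pending.add(seed);
        while (!pending.isEmpty() && routingTable.size() < TARGET_NODES) {
            Endpoint ep = pending.poll();
            if (!seen.add(ep)) continue;            // skip nodes already queried
            for (Endpoint n : sendBootstrap(ep)) {  // ask for its known nodes
                routingTable.tryAdd(n);
                pending.add(n);                     // and query them in turn
            }
        }
    }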

    Routing table


    The routing table exists, in particular, to select the nodes closest to a given hash. The table contains K-buckets. A K-bucket is really nothing more than a container of structures, each describing a host. The table is usually illustrated in the form of a tree.

    The routing table itself can be represented simply as a list of K-buckets sorted in descending order of distance from our identifier.

    Initially the table contains no K-buckets; they are created as nodes are added.

    Let N (numBuckets) be the number of K-buckets created so far. A bucket's number is computed from the aforementioned distanceExp function as 128 - 1 - distanceExp, so nodes closer to our hash land closer to the tail of the list.

    Each K-bucket at a position up to N-2 holds nodes at one exact distance exponent n from our hash. The K-bucket numbered N-1 (the last one) holds not only nodes at its own distance n but also all nodes that are closer, in other words, all remaining nodes. The range of n is [0..127]. This is clearer in the code of the bucket lookup function (below).

    Algorithm for adding a node


    1. We are looking for the K-bucket number for the new node. It’s easier to illustrate this with code.

      public int findBucket(final KadId id) {
          int numBuckets = buckets.size();
          if (numBuckets == 0) {
              // lazily create the very first bucket
              buckets.add(new RoutingTable.RoutingTableBucket());
              ++numBuckets;
          }
          // the last bucket also absorbs all nodes closer than its own distance
          int bucketIndex = Math.min(KadId.TOTAL_BITS - 1 - KadId.distanceExp(self, id), numBuckets - 1);
          assert (bucketIndex < buckets.size());
          assert (bucketIndex >= 0);
          return bucketIndex;
      }
      

    2. If the K-bucket has free space, the node is added.
    3. If the K-bucket has no free space, we try to free some using per-node metrics such as last-update time, number of unanswered requests and so on. If a slot opens up, the node is added.
    4. If the K-bucket still has no free space after all attempts, we try to split it. If the split fails, the node is ignored. A sketch of the whole procedure follows the list.
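
    A schematic version of steps 1 through 4, assuming a bucket capacity K (10 in eMule's KAD, 8 in the BitTorrent DHT); liveCount, evictStaleNode, canSplit and splitLastBucket are illustrative helpers, not the project's actual API:

    // Try to place a node according to the procedure above.
    public boolean addNode(final NodeEntry e) {
        RoutingTableBucket bucket = buckets.get(findBucket(e.getId())); // step 1
        if (bucket.liveCount() < K) {              // step 2: free slot - just add
            bucket.addNode(e);
            return true;
        }
        if (bucket.evictStaleNode()) {             // step 3: drop a stale entry
            bucket.addNode(e);
            return true;
        }
        if (canSplit(bucket)) {                    // step 4: try to split
            splitLastBucket();
            return addNode(e);                     // re-route into the grown table
        }
        return false;                              // no room - the node is ignored
    }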

    K-bucket split


    A K-bucket can only be split if it is the last container in the list and it is not the last possible K-bucket. In theory there can be up to 128 K-buckets for MD4, but the very last one can never be used, since hashes equal to the node's own hash are not processed. The principle is simple: nodes at distance exponent n stay in the current container, and all the others move to the new one. After the split, the table thus grows by one K-bucket.
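
    A minimal sketch of the operation, using the same illustrative bucket type as above; getLiveNodes, clear and addNode are assumed helper methods:

    import java.util.ArrayList;
    import java.util.List;

    // Split the last bucket: entries that still map to the old index stay,
    // everything closer moves into a freshly appended bucket.
    private void splitLastBucket() {
        int lastIndex = buckets.size() - 1;
        RoutingTableBucket last = buckets.get(lastIndex);
        RoutingTableBucket fresh = new RoutingTableBucket();
        buckets.add(fresh);                        // the table grows by one bucket
        List<NodeEntry> old = new ArrayList<>(last.getLiveNodes());
        last.clear();
        for (NodeEntry e : old) {
            // findBucket now resolves against the grown table: nodes at the old
            // bucket's exact distance stay, everything closer lands in `fresh`
            if (findBucket(e.getId()) == lastIndex) last.addNode(e);
            else fresh.addNode(e);
        }
    }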

    The routing table is a list of K-buckets, and a K-bucket is a list of structures describing nodes in the DHT network. The implementation can be arbitrary; you could, for example, keep all of this in a database table. The table is never compacted: if the nodes of some K-bucket drop off until the container is completely empty, the bucket still remains in the table. Nodes can be removed, for example, after being unreachable for some time.

    Table update


    There is not much to examine in detail: the update is an internal process meant to keep the routing table fresh. From time to time a K-bucket in need of an update is selected, a random hash is generated that belongs to that K-bucket but matches none of the existing ones, and a search for it is started. A typical trigger condition: the bucket was last updated more than 15 minutes ago.
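
    Generating such a target can be sketched as follows; bit 0 here is the most significant bit of the 128-bit identifier, and the helper names are illustrative, not from the project:

    import java.security.SecureRandom;

    // Build a random refresh target for the bucket at index i: copy the first
    // i bits of our own id, flip bit i, leave the remaining bits random.
    public class RefreshTargetDemo {
        static byte[] refreshTarget(byte[] self, int i, SecureRandom rnd) {
            byte[] t = new byte[16];
            rnd.nextBytes(t);                        // random tail
            for (int b = 0; b < i; ++b) setBit(t, b, getBit(self, b));
            setBit(t, i, !getBit(self, i));          // forces distanceExp == 127 - i
            return t;
        }
        static boolean getBit(byte[] a, int b) {
            return ((a[b / 8] >> (7 - b % 8)) & 1) != 0;
        }
        static void setBit(byte[] a, int b, boolean v) {
            int mask = 1 << (7 - b % 8);
            if (v) a[b / 8] |= mask; else a[b / 8] &= ~mask;
        }
    }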

    Publish and Search


    Publishing and searching are one and the same process, at the end of which either a publication to the found nodes or a search request is performed. Its essence is to approach, by successive iterations, nodes whose identifiers are close to the identifier of the desired resource. By the logic of the network, those nodes will hold information about keywords and files whose hashes are close to their own.

    1. The first 50 or so closest nodes are extracted from the table to form a poll list.
    2. A Kad2Req request is sent: simply a request for nodes even closer to the target than the queried node. The queried node looks in its own routing table and responds with nodes closer to the requested hash than itself.
    3. The answers are fed back into the poll list and the cycle starts again.
    4. Nodes close enough to the target are asked whether they have the desired resource, or the publication is performed. There is a notion of a tolerance zone: how far a hash may differ from the target and still allow a direct search or publication to start.
    5. Step 2 repeats until a stopping condition is reached: enough nodes found, everyone polled, and so on.

    Step 4 is a separate process and could run in parallel with the main one, but in the implementation presented here it starts separately once the main process finishes. A schematic version of the whole lookup is sketched below.
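
    In this sketch NodeEntry, routingTable.closest and sendKad2Req are illustrative placeholders rather than the project's real API, and parallel requests, timeouts and the tolerance-zone check are omitted for brevity:

    import java.util.*;

    // A sketch of steps 1-5: keep polling the closest known nodes, feeding
    // every answer back into the candidate set.
    List<NodeEntry> lookup(final KadId target) {
        TreeSet<NodeEntry> results = new TreeSet<>(
                (a, b) -> KadId.compareRef(a.getId(), b.getId(), target));
        Deque<NodeEntry> toQuery = new ArrayDeque<>(routingTable.closest(target, 50));
        Set<KadId> queried = new HashSet<>();
        while (!toQuery.isEmpty()) {
            NodeEntry node = toQuery.poll();
            if (!queried.add(node.getId())) continue;         // ask each node once
            for (NodeEntry closer : sendKad2Req(node, target)) { // Kad2Req
                results.add(closer);
                toQuery.add(closer);                          // feed back into the poll
            }
            // stop conditions (enough close nodes, everyone polled) go here
        }
        return new ArrayList<>(results); // closest first: search/publish on these
    }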

    Structure of publishing keywords and sources
    Without further ado, here is the table structure used to store keywords and sources. This structure is used in an aggregator running on a separate host.

    create table kad.sources (
      kad_id character(32) not null
      , host inet not null
      , port_tcp int not null default 0
      , port_udp int not null default 0
      , packet bytea not null
      , last_update timestamp not null default current_timestamp
      , total_updates int not null default 0
      , source_type int not null
      , constraint sources_pk primary key (kad_id, host, port_tcp, port_udp)
    );
    create table kad.keywords (
      kad_id character(32)
      , file_id character(32)
      , host inet not null
      , packet bytea not null
      , last_update timestamp not null default current_timestamp
      , total_updates int not null default 0
      , constraint keywords_pk primary key(kad_id, file_id, host)
    );
    

    As you can see, a source is published against a file hash one-to-one. A keyword is published one-to-many, against the hashes of the files whose names contain it. The tables are denormalized for convenience; you could instead keep some kind of key table as a master over sources/keywords.

    Conclusion


    A few words about the architecture. DHT support is implemented as a separate tracker: an asynchronous UDP server running in a dedicated thread. It can be launched separately from the client application, which is convenient for testing; in fact, this tracker currently runs as a daemon on a separate machine. Requests to the network are organized as RPC calls through an RPC manager, which solves the problem of bounding the wait for a response, makes it possible to flag unresponsive nodes, and so on.

    Logically, outgoing requests are grouped into a manager (an algorithm), and an observer is created for each request; all of it is launched, as mentioned, through the RPC manager. More details can be found in the project source code.

    A quirk of Kademlia is that it is not always possible to determine exactly which request a host is answering when several processes run at the same time and send several requests to the same node simultaneously. Node-search processes, for example, may well overlap, and here you have to resort to some tricks.

    The BitTorrent DHT handles transactions more competently: when sending a request, the client includes a special transaction_id block that must be returned to it. This not only identifies the transaction precisely, but also gives the network a little protection.
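
    A sketch of that matching scheme, with the request type left generic since the actual payload does not matter here:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Every outgoing request registers a fresh id; a response is attributed
    // to a request only if it echoes that id back.
    public class TransactionTable<T> {
        private final AtomicInteger next = new AtomicInteger();
        private final Map<Integer, T> pending = new ConcurrentHashMap<>();

        public int register(T request) {
            int tid = next.incrementAndGet();
            pending.put(tid, request);
            return tid; // the caller writes this id into the outgoing packet
        }

        public T match(int echoedTid) {
            return pending.remove(echoedTid); // null means unsolicited or late
        }
    }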

    I did not cover publishing and searching notes, since I did not implement support for that Kademlia feature.

    I hope I managed to present the material without confusing anything.
