Comparing C and C++ Performance with a Huffman Compression Example
Introduction
When someone asks on an IT forum whether programming language X is faster than language Y, it usually unleashes a stream of emotion and is considered an improper question, on par with asking about someone's religion or preferred political party. Indeed, a language is a way of expressing thoughts and ideas, in this case the ideas of a software system. A language is neither fast nor slow; it can be more or less concise, more or less precise. Speed is determined not so much by the language as by the final code that the language's compiler generates, or by the speed of the interpreter in the case of an interpreted language.
But that is all philosophy. In practice there is usually a concrete software development task, and you really can implement it in a dozen different programming languages. So although this is a "religious question" when discussed in public, it often arises in the head of an IT specialist facing a specific task: "How much time will it take me to implement the task in language X, and what characteristics, including performance, will the resulting software have, compared with an implementation in language Y?" There is, of course, no exact answer; the specialist relies on personal experience and answers something like "with 95% probability, written in assembler, this task will run faster than in PHP". But honestly, that experience is rarely based on exact numbers from real tasks the specialist implemented himself. Who in their right mind would write complex software first in PHP and then rewrite it in assembler, only to measure the characteristics? People mostly limit themselves to synthetic tests such as sorting an array, building and traversing a binary tree, and the like.
As a specialist who writes 90% of the time in C++, I often stumble upon "holy war" threads comparing this language with others, one of them being its progenitor, the C language. On quora.com, for example, people frequently ask "Is C++ faster?" (which is an improper question, as I explained above), or "Why are the Linux kernel and a ton of GNU utilities written in C and not in C++?" (which is an entirely proper one). I answered the second question for myself like this:
- Mastering the C language requires an order of magnitude less effort, which means more people can participate in developing such software.
- Complex actions that are potentially costly in memory or speed will likely take more lines of code in C and require effort from the author, so any non-optimality in the program is easier to notice during writing or review. A C++ program can be much more concise and superficially easy to understand, but noticing that an overloaded "+" operator hides, say, the launch of a spacecraft to the moon is much harder.
Since the C language is (for the most part) a subset of C++, in the course of my everyday tasks I regularly decide whether to express some piece of logic "more in the C style" (working with raw pointers, clearing memory via memset, passing context through void*), or type-safely in the C++ style (pointers wrapped in unique_ptr/shared_ptr, memory cleared by properly written constructors, context passed as a typed object: either a pointer to a base class with virtual functions, or as a template parameter).
Task
To answer this question a little more thoroughly, I decided to write yet another (yes, also somewhat synthetic) test: data encoding with the Huffman method. The article "The Huffman Algorithm on Fingers" (https://habrahabr.ru/post/144200/) gave me the idea.
First I implemented the encoding in pure C. As you may remember, it requires a priority queue, because you need to quickly find characters ordered by their repetition counts in order to build the coding tree. I will omit the algorithmic details, referring the reader to the link above. It would all have ended there, and there would have been no article, because I implemented the coding only as algorithmic practice. But along the way I noticed how quickly a C program compiles compared with a C++ source of similar size. I mentioned this to a colleague at work, suggesting that C++ compilation probably includes many more optimization passes, so code written in C++ should probably be faster: the magic of the greatest gurus in the field of optimizing compilers must be at work there. Grinning, the colleague replied: "Check it."
So I rewrote the Huffman coding in C++. For the purity of the experiment I did not change the fundamental approach; for example, I did not plug in a custom memory allocator, which can be done both in C (more "hand-rolled") and in C++ (more "natively"). What, then, constitutes the "C++-ness"?
Priority queue
The first thing that it is logical to express through templates in C++ is the priority queue. In C it is represented by a structure whose main element is a pointer to an array of pointers to data-carrying nodes:
struct priority_queue
{
    // A number of active (data-carrying) nodes currently in the queue
    unsigned int size;
    // A total number of nodes in the "nodes" array
    unsigned int capacity;
    // An array of pointers to nodes
    struct priority_queue_node** nodes;
};
The data node can be of any type, but its first member must be the following structure:
struct priority_queue_node
{
    unsigned int weight;
};
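For illustration, a data-carrying node might look like this (a hypothetical sketch of mine, not code from the article; any structure whose first member is priority_queue_node will do):

struct my_node
{
    struct priority_queue_node pq_node; // must come first
    int payload;                        // anything may follow
};

Because pq_node is the first member, a pointer to my_node can be safely cast to struct priority_queue_node* and back, the classic C trick for intrusive containers.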
Since the queue does not manage memory for the nodes themselves, it does not need to know what a node really consists of. All it requires for its work is to read the weight: ((struct priority_queue_node*) node_ptr)->weight. Adding a node to the queue, taking the possible memory reallocation into account, looks somewhat cumbersome:
int priority_queue_push(struct priority_queue* queue, struct priority_queue_node* node)
{
    if (queue->size >= queue->capacity)
    {
        unsigned int new_capacity = queue->capacity * 2;
        if (new_capacity == 0)
            new_capacity = 1;
        struct priority_queue_node** new_nodes = (struct priority_queue_node**) malloc(sizeof(struct priority_queue_node*) * new_capacity);
        if (! new_nodes)
        {
            return 0;
        }
        memcpy(new_nodes, queue->nodes, sizeof(struct priority_queue_node*) * queue->size);
        if (queue->capacity)
            free(queue->nodes);
        queue->nodes = new_nodes;
        queue->capacity = new_capacity;
    }
    queue->nodes[queue->size++] = node;
    heapify(queue);
    return 1;
}
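Creating the queue and destroying it with full error handling is similarly verbose. A sketch of what such functions might look like (my illustration; the actual code is in the archive linked at the end of the article):

struct priority_queue* priority_queue_create(unsigned int initial_capacity)
{
    struct priority_queue* queue = (struct priority_queue*) malloc(sizeof(struct priority_queue));
    if (! queue)
        return 0;
    queue->nodes = (struct priority_queue_node**) malloc(sizeof(struct priority_queue_node*) * initial_capacity);
    if (! queue->nodes)
    {
        free(queue);
        return 0;
    }
    queue->size = 0;
    queue->capacity = initial_capacity;
    return queue;
}

void priority_queue_destroy(struct priority_queue* queue)
{
    // The queue does not own its nodes, so only the queue's own storage is freed
    free(queue->nodes);
    free(queue);
}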
All of this is, expectedly, a lot of lines of code compared with the C++ version. The C++ version of the queue looks like this (note the data layout):
template <typename T>
class priority_queue
{
    struct node
    {
        unsigned int m_weight;
        T m_data;
    };
    using node_ptr = std::unique_ptr<node>;
    std::size_t m_capacity;
    std::size_t m_size;
    std::unique_ptr<node_ptr[]> m_nodes;
    void heapify() noexcept;
    void increase_capacity();
public:
    explicit priority_queue(std::size_t capacity = 16);
    // …
};
The arrangement of a node's elements in memory is the same as in the C version: first comes the weight, then the node's payload, which, as will be shown below, consists of an integer scalar (the height of the tree contained in that node) and a pointer to the root of that tree. The nodes themselves are also owned by the priority queue: the m_nodes pointer points to an array of pointers to the nodes.
For comparison, pushing a new element into the queue now looks more concise and type-safe (memory reallocation is factored into a separate increase_capacity method, which does not change the essence):
template <typename U>
void push(unsigned int weight, U&& obj)
{
    if (m_size >= m_capacity)
        increase_capacity();
    m_nodes[m_size++].reset(new node({weight, std::forward<U>(obj)}));
    heapify();
}
void increase_capacity()
{
    const auto new_capacity = m_capacity ? m_capacity * 2 : 1;
    std::unique_ptr<node_ptr[]> new_nodes(new node_ptr[new_capacity]);
    for (auto src = m_nodes.get(), dest = new_nodes.get(); src != m_nodes.get() + m_size; ++src, ++dest)
        *dest = std::move(*src);
    m_nodes = std::move(new_nodes);
    m_capacity = new_capacity;
}
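The article does not show the rest of the interface, but the code further below uses size(), top(), top_weight() and pop(). A plausible sketch of those members (an assumption on my part; the real ones are in the linked archive):

std::size_t size() const noexcept { return m_size; }
unsigned int top_weight() const noexcept { return m_nodes[0]->m_weight; }
T& top() noexcept { return m_nodes[0]->m_data; }

void pop()
{
    // Move the last node into the root slot (freeing the old root), then restore the heap
    if (--m_size)
        m_nodes[0] = std::move(m_nodes[m_size]);
    else
        m_nodes[0].reset();
    heapify();
}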
Character Tree (Coding Tree)
Later, pieces of the symbol tree will be inserted into this queue and then combined, in accordance with the repetition count of each symbol, so that an encoding tree is obtained at the end. In C this tree is represented by the simplest structures with raw pointers; the base one holds an identifier of the concrete type:
#define NODE_TYPE_TERM 1
#define NODE_TYPE_NODE 2

struct char_node_base
{
    int type;
};

struct char_node_terminal
{
    struct char_node_base base;
    char c;
};

struct char_node
{
    struct char_node_base base;
    struct char_node_base* left;
    struct char_node_base* right;
};
And in order to put the root of such a tree into the priority queue, a structure is defined whose first member is the priority_queue_node required by the queue, storing the node's weight:
struct char_node_root
{
    struct priority_queue_node pq_node;
    int height;
    struct char_node_base* node;
};
In C++ all of this is expressed somewhat more elegantly:
struct char_node_base
{
    virtual ~char_node_base() = default;
};

using char_node_ptr = std::unique_ptr<char_node_base>;

struct char_node_terminal : char_node_base
{
    const unsigned char m_c;
    char_node_terminal(char c) noexcept : m_c(c) {}
};

struct char_node : char_node_base
{
    char_node_ptr m_left;
    char_node_ptr m_right;
};

struct nodes_root
{
    int m_height;
    char_node_ptr m_node;
};
Here you can see a key advantage of C++: nothing special needs to be done to destroy this tree correctly. Just destroy the root, and the smart pointers will do everything themselves. In C, a fair amount of code is written for this task.
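For example, a minimal sketch (my illustration, not code from the article):

{
    auto root = std::unique_ptr<char_node>(new char_node);
    root->m_left.reset(new char_node_terminal('a'));
    root->m_right.reset(new char_node_terminal('b'));
}   // root leaves scope: the whole tree is freed recursively, no user code needed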
Queue filling and tree building
This step hardly differs between the C and C++ implementations. First, the number of repetitions of each character in the input data block is counted and a 256-byte table is filled. Then micro-trees, each consisting of a single terminal node, are placed into the priority queue. Then the nodes are combined by repeatedly extracting from the queue the two nodes closest to its top and inserting in their place an intermediate node whose children are the two just extracted.
In C (here without error checking) it looks like this:
static struct priority_queue* build_priority_queue(
    char* buffer, unsigned int size)
{
    unsigned char table[256];
    memset(table, 0, sizeof(table));
    for (unsigned int i = 0; i < size; ++i)
        if (table[(unsigned char)buffer[i]] != 255)
            ++table[(unsigned char)buffer[i]];
    struct priority_queue* queue = priority_queue_create(16);
    for (unsigned short i = 0; i < 256; ++i)
    {
        if (table[i])
        {
            struct char_node_root* node = (struct char_node_root*) malloc(sizeof(struct char_node_root));
            struct char_node_terminal* term = (struct char_node_terminal*) malloc(sizeof(struct char_node_terminal));
            term->base.type = NODE_TYPE_TERM;
            term->c = (char)i;
            node->node = (struct char_node_base*) term;
            node->height = 0;
            node->pq_node.weight = table[i];
            priority_queue_push(queue, (struct priority_queue_node*) node);
        }
    }
    return queue;
}
static struct char_node_root* queue_to_tree(struct priority_queue* queue)
{
    while (priority_queue_size(queue) > 1)
    {
        struct char_node_root* node1 = (struct char_node_root*) priority_queue_pop(queue);
        struct char_node_root* node2 = (struct char_node_root*) priority_queue_pop(queue);
        struct char_node_base* int_node1 = node1->node;
        struct char_node_base* int_node2 = node2->node;
        struct char_node* join_node = (struct char_node*) malloc(sizeof(struct char_node));
        join_node->base.type = NODE_TYPE_NODE;
        join_node->left = int_node1;
        join_node->right = int_node2;
        int new_weight = node1->pq_node.weight;
        if (new_weight + node2->pq_node.weight <= 65535)
            new_weight += node2->pq_node.weight;
        else
            new_weight = 65535;
        node1->pq_node.weight = new_weight;
        if (node1->height > node2->height)
            ++node1->height;
        else
            node1->height = node2->height + 1;
        free(node2);
        node1->node = (struct char_node_base*) join_node;
        priority_queue_push(queue, (struct priority_queue_node*) node1);
    }
    return (struct char_node_root*) priority_queue_pop(queue);
}
In C++ it is even shorter and prettier, even though any memory allocation errors will be handled correctly thanks to exceptions and smart pointers:
void fill_priority_queue(
    const unsigned char* buffer,
    std::size_t buffer_size,
    queue_t& queue)
{
    unsigned char counts_table[256]{};
    for (auto ptr = buffer; ptr != buffer + buffer_size; ++ptr)
        if (counts_table[*ptr] != 255)
            ++counts_table[*ptr];
    for (unsigned short i = 0; i != 256; ++i)
        if (counts_table[i])
            queue.push(counts_table[i], nodes_root {0, char_node_ptr(new char_node_terminal(i))});
}
void queue_to_tree(queue_t& queue)
{
    while (queue.size() > 1)
    {
        auto old_root1_node = std::move(queue.top());
        const auto old_root1_weight = queue.top_weight();
        queue.pop();
        auto old_root2_node = std::move(queue.top());
        const auto old_root2_weight = queue.top_weight();
        queue.pop();
        auto joined_node = std::unique_ptr<char_node>(new char_node);
        joined_node->m_left = std::move(old_root1_node.m_node);
        joined_node->m_right = std::move(old_root2_node.m_node);
        const auto new_weight = std::min(old_root1_weight + old_root2_weight, 65535U);
        const auto new_height = std::max(old_root1_node.m_height, old_root2_node.m_height) + 1;
        queue.push(new_weight, nodes_root {new_height, std::move(joined_node)});
    }
}
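Putting the two stages together might look like this (a sketch of mine, assuming queue_t is the priority_queue<nodes_root> shown above):

using queue_t = priority_queue<nodes_root>;

nodes_root build_tree(const unsigned char* buffer, std::size_t size)
{
    queue_t queue;
    fill_priority_queue(buffer, size, queue);
    queue_to_tree(queue);                 // exactly one element remains
    auto root = std::move(queue.top());
    queue.pop();
    return root;
}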
Coding table
The next step is building, from the encoding tree, a table in which each character is assigned the sequence of bits that reflects the path through the newly constructed tree to that character. The code that traverses the tree and "stuffs" the bit sequence is almost identical in the two versions. Of greatest interest is how this bit sequence is stored and operated on. After all, once the table is built, the part everything was started for begins: walking the input buffer and, for each character found there, taking its bit sequence and appending it to the resulting output sequence.
In the C version a bit sequence is represented by a trivial structure consisting of a pointer to an array of bytes and a counter of the bits actually filled in. There are no special operations for working with it. During construction of the coding table, a structure is created for each character, and the bit sequence found for the current character is copied into it. The codebook is thus simply an array of these structures, one per character.
struct bits_line
{
    unsigned char bits_count;
    unsigned char* bits;
};
static int build_encoding_map_node(struct char_node_base* node, struct bits_line* bits_table, unsigned char* bits_pattern, int bits_count)
{
    if (node->type == NODE_TYPE_TERM)
    {
        unsigned char index = (unsigned char)((struct char_node_terminal*)node)->c;
        bits_table[index].bits_count = bits_count;
        bits_table[index].bits = (unsigned char*) malloc(bytes_count_from_bits(bits_count + 1));
        if (! bits_table[index].bits)
            return 0;
        memcpy(bits_table[index].bits, bits_pattern, bytes_count_from_bits(bits_count));
        return 1;
    }
    static const unsigned char bit_mask[] = {1, 2, 4, 8, 16, 32, 64, 128};
    bits_pattern[bits_count >> 3] &= ~bit_mask[bits_count & 7];
    if (! build_encoding_map_node(((struct char_node*)node)->left, bits_table, bits_pattern, bits_count + 1))
        return 0;
    bits_pattern[bits_count >> 3] |= bit_mask[bits_count & 7];
    if (! build_encoding_map_node(((struct char_node*)node)->right, bits_table, bits_pattern, bits_count + 1))
        return 0;
    return 1;
}
In the C++ version it is more convenient to represent the bit sequence as a full-fledged class that not only manages its resources but also supports the operation of appending another bit sequence.
using unique_bytes_ptr = std::unique_ptr<unsigned char[]>;

class bit_ostream
{
    std::size_t m_capacity;
    unsigned long m_bits_count = 0;
    unique_bytes_ptr m_data;
public:
    explicit bit_ostream(std::size_t initial_capacity = 0) noexcept
        : m_capacity(initial_capacity)
    {
    }
    bit_ostream& push(const unsigned char* bits, unsigned long const bits_count)
    {
        if (bits_count == 0)
            return *this;
        const auto new_bits_count = m_bits_count + bits_count;
        if (covered_bytes(new_bits_count) + 1 > m_capacity || m_bits_count == 0)
        {
            decltype(m_capacity) new_capacity = m_capacity * 2;
            const auto cov_bytes = static_cast<decltype(m_capacity)>(covered_bytes(new_bits_count) + 1);
            if (new_capacity < cov_bytes)
                new_capacity = cov_bytes;
            unique_bytes_ptr new_data(new unsigned char[new_capacity]);
            std::memcpy(new_data.get(), m_data.get(), covered_bytes(m_bits_count));
            m_capacity = new_capacity;
            m_data = std::move(new_data);
        }
        unsigned char* curr = m_data.get() + (m_bits_count >> 3);
        if ((m_bits_count & 7) == 0)
        {
            // It's all simple when the current output size is a whole number of bytes
            std::memcpy(curr, bits, covered_bytes(bits_count));
        }
        else
        {
            // Otherwise each source byte is split across two destination bytes
            const unsigned char shift = m_bits_count & 7;
            for (auto bytes_count = covered_bytes(bits_count); bytes_count > 0; ++curr, ++bits, --bytes_count)
            {
                unsigned short val = static_cast<unsigned short>(*bits) << shift;
                val |= static_cast<unsigned short>(*curr & g_bits_fill_mask[shift]);
                *curr = static_cast<unsigned char>(val & 0xff);
                *(curr + 1) = static_cast<unsigned char>(val >> 8);
            }
        }
        m_bits_count += bits_count;
        assert(covered_bytes(m_bits_count) <= m_capacity);
        return *this;
    }
    bit_ostream& push(const bit_ostream& other)
    {
        return push(other.data(), other.bits_count());
    }
    bit_ostream& clear_tail() noexcept
    {
        if (m_bits_count & 7)
            m_data.get()[m_bits_count >> 3] &= g_bits_fill_mask[m_bits_count & 7];
        return *this;
    }
    unsigned long bits_count() const noexcept { return m_bits_count; }
    bool empty() const noexcept { return ! m_bits_count; }
    unsigned char* data() noexcept { return m_data.get(); }
    const unsigned char* data() const noexcept { return m_data.get(); }
};
template <typename T>
constexpr inline std::size_t covered_bytes(T bits_count) noexcept
{
    return (bits_count >> 3) + (bits_count & 7 ? 1 : 0);
}
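The encoding loop itself is not shown in the article; with these pieces it boils down to something like the following (my sketch, assuming the coding table is an array of 256 bit_ostream objects indexed by character):

void encode_buffer(const unsigned char* buffer, std::size_t size,
                   const bit_ostream* table, bit_ostream& out)
{
    // For every input byte, append its code bits to the output stream
    for (auto ptr = buffer; ptr != buffer + size; ++ptr)
        out.push(table[*ptr].data(), table[*ptr].bits_count());
}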
Putting it all together
So, all the components of the encoding procedure have been discussed above; let me briefly recap the sequence of steps. It is important to mention that, for the performance comparison that follows, this sequence is divided into stages. I measured the stages inside the program itself in the fastest way I know: by reading the CPU cycle counter with the rdtsc instruction.
- Remember the point in time ts1 with rdtsc.
- Fill the priority queue with characters based on the number of repetitions.
- Build a character tree using this queue. Delete the queue itself.
- Calculate the number of cycles t1 elapsed since ts1 and remember the next point ts2.
- Build the coding table by traversing the coding tree. Destroy the coding tree.
- Calculate the number of cycles t2 elapsed since ts2 and remember the next point ts3.
- Actually encode the input stream, replacing each input character with its bit sequence from the encoding table.
- Calculate the number of cycles t3 elapsed since ts3.
In addition to the per-stage cycle counts, there is also an overall timer that measures the entire encoding time of the input buffer, using the POSIX clock_gettime function.
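For illustration, the measuring scaffold might look roughly like this (my sketch; __rdtsc comes from the compiler's x86 intrinsics header, and the author's actual code may differ in details):

#include <cstdint>
#include <cstdio>
#include <time.h>       // clock_gettime, CLOCK_MONOTONIC (POSIX)
#include <x86intrin.h>  // __rdtsc with gcc/clang

int main()
{
    timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);    // wall-clock time for the whole run

    const std::uint64_t ts1 = __rdtsc();
    // ... stage 1: fill the queue, build the tree ...
    const std::uint64_t t1 = __rdtsc() - ts1;  // cycles spent in stage 1
    // ... t2 and t3 are obtained the same way ...

    clock_gettime(CLOCK_MONOTONIC, &end);
    const double seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    std::printf("t1 = %llu, total %.3f seconds\n", (unsigned long long)t1, seconds);
}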
Measurements and Optimization
An eager reader who has waded through so much code above is already fidgeting in his chair, wondering, "So what came of it?" Let's run both versions, compiled with gcc 5.4.0 at optimization level -O3, on a file about 31 MB in size. Note that the packer can encode the input file in blocks of different sizes; by default a block is 64 KB. That is, 31 MB / 64 KB blocks are encoded one after another, and all the time indicators are summed.
> build-c/pack-c -m3 ../sample-1.dat data-c.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1053432 (0.754 seconds), t1 = 209957, t2 = 31023, t3 = 811377.
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1182005 (0.846 seconds), t1 = 228527, t2 = 52680, t3 = 894081
You can ignore the "-m3" option; it is just a switch that enables test mode.
Well, that's not much fun. The C++-ness was not free after all: a performance loss of about 12%, with all three stages taking longer than in the C version. What if we choose a smaller block size, say 1 KB?
> build-c/pack-c -m3 -b1024 ../sample-1.dat data-c.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 9397894 (6.731 seconds), t1 = 5320910, t2 = 1943422, t3 = 2094688.
> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 11586220 (8.3 seconds), t1 = 6399593, t2 = 3125111, t3 = 1663035
Naturally everything slowed down, because the encoding tree now has to be rebuilt far more often. But C pulled ahead again — this time by as much as 23%!
“By eye” optimization
So what is wrong with the C++ implementation? It is the same compiler, the same optimizer. The cycle counts above show that the biggest contribution comes from the stages where the bit manipulation starts. The bit_ostream class is fine in itself, but is its push method, reused here to compile the table, really a good fit? Putting an array of bits into an initially empty object should be much simpler than all the code in that method. Moreover, a coding table made of 256 instances of this class takes up much more space than the 256 bits_line structures of the C version. Let's make a dedicated variant of this class for the coding table.
class small_bit_ostream
{
    unique_bytes_ptr m_data;
    unsigned short m_bits_count = 0;
public:
    small_bit_ostream& push(const unsigned char* bits, const unsigned short bits_count)
    {
        const auto cov_bytes {covered_bytes(bits_count)};
        m_data.reset(new unsigned char[cov_bytes]);
        std::memcpy(m_data.get(), bits, cov_bytes);
        m_bits_count = bits_count;
        return *this;
    }
    unsigned long bits_count() const noexcept { return m_bits_count; }
    bool empty() const noexcept { return ! m_bits_count; }
    unsigned char* data() noexcept { return m_data.get(); }
    const unsigned char* data() const noexcept { return m_data.get(); }
};
Simple. Beautiful. Nothing extra. Does it buy us anything? (I will not repeat the unchanged C version's numbers here.)
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1173692 (0.84 seconds), t1 = 229942, t2 = 46677, t3 = 890323
> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 11198578 (8.02 seconds), t1 = 6404650, t2 = 2752852, t3 = 1641317
For the large block the improvement is within the margin of error. For the small block it is a bit more noticeable: now C++ is only 19% behind. The t2 indicator shows that filling the table has indeed become faster.
Profiling
Let's start by checking how the CPU caches are doing, by running both versions under valgrind with the cachegrind tool.
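Note that the branch statistics shown below require branch simulation to be enabled; the invocation might look like this (a sketch using the file names from above):

valgrind --tool=cachegrind --branch-sim=yes build-c/pack-c -m3 ../sample-1.dat data-c.dat

Here is a quick summary for the C version.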
==2794== I refs: 2,313,382,347
==2794== I1 misses: 14,482
==2794== LLi misses: 1,492
==2794== I1 miss rate: 0.00%
==2794== LLi miss rate: 0.00%
==2794==
==2794== D refs: 601,604,444 (472,330,278 rd + 129,274,166 wr)
==2794== D1 misses: 3,966,884 ( 2,279,553 rd + 1,687,331 wr)
==2794== LLd misses: 7,030 ( 3,034 rd + 3,996 wr)
==2794== D1 miss rate: 0.7% ( 0.5% + 1.3% )
==2794== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==2794==
==2794== LL refs: 3,981,366 ( 2,294,035 rd + 1,687,331 wr)
==2794== LL misses: 8,522 ( 4,526 rd + 3,996 wr)
==2794== LL miss rate: 0.0% ( 0.0% + 0.0% )
==2794==
==2794== Branches: 299,244,261 (298,085,895 cond + 1,158,366 ind)
==2794== Mispredicts: 8,779,093 ( 8,778,920 cond + 173 ind)
==2794== Mispred rate: 2.9% ( 2.9% + 0.0% )
And here is the output for the C++ version with the same parameters:
==2994== I refs: 2,464,681,889
==2994== I1 misses: 2,032
==2994== LLi misses: 1,888
==2994== I1 miss rate: 0.00%
==2994== LLi miss rate: 0.00%
==2994==
==2994== D refs: 633,267,329 (491,590,332 rd + 141,676,997 wr)
==2994== D1 misses: 3,992,071 ( 2,298,593 rd + 1,693,478 wr)
==2994== LLd misses: 8,292 ( 3,173 rd + 5,119 wr)
==2994== D1 miss rate: 0.6% ( 0.5% + 1.2% )
==2994== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==2994==
==2994== LL refs: 3,994,103 ( 2,300,625 rd + 1,693,478 wr)
==2994== LL misses: 10,180 ( 5,061 rd + 5,119 wr)
==2994== LL miss rate: 0.0% ( 0.0% + 0.0% )
==2994==
==2994== Branches: 348,146,710 (346,241,481 cond + 1,905,229 ind)
==2994== Mispredicts: 6,977,260 ( 6,792,066 cond + 185,194 ind)
==2994== Mispred rate: 2.0% ( 2.0% + 9.7% )
Notice that the cache hit rates, for both data and instructions, are no worse in the C++ version than in the C code, and sometimes even better; branch prediction actually works better. So why does it still lose slightly? Obviously because it simply executes more instructions: 2,464 million versus 2,313 million. That gives approximately the performance difference observed on large blocks.
Analysis with the callgrind tool shows that a lot of instructions are spent working with the heap: malloc and free. But on top of that, the C++ version also makes instruction-heavy calls to the new and delete operators. Are they always needed? Take the bit-sequence operations discussed above: they are implemented on top of the unique_ptr smart pointer, whose memory is allocated with new[]. Recall that this operator internally calls the C malloc and then initializes every object in the created array; in our case, it fills the array with zeros. Why would this program need that? The bit_ostream class, immediately after obtaining the array, fills it with bits, appending them at the end, and keeps a counter of the bits written. It does not need the byte array to be pre-cleared at all. So let's allocate with malloc, wrapped in a unique_ptr with a custom deleter:
struct free_deleter
{
    void operator()(void* p) const noexcept { std::free(p); }
};

template <typename T>
inline T* allocate_with_malloc(std::size_t size)
{
    T* res = static_cast<T*>(std::malloc(sizeof(T) * size));
    if (! res)
        throw std::bad_alloc();
    return res;
}

template <typename T>
using unique_malloc_array_ptr = std::unique_ptr<T[], free_deleter>;

template <typename T>
inline unique_malloc_array_ptr<T> unique_allocate_with_malloc(std::size_t size)
{
    return unique_malloc_array_ptr<T>(allocate_with_malloc<T>(size));
}

// Typedefs for byte arrays
using unique_bytes_ptr = unique_malloc_array_ptr<unsigned char>;

inline unique_bytes_ptr allocate_bytes(std::size_t size)
{
    return unique_bytes_ptr(unique_allocate_with_malloc<unsigned char>(size));
}
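Inside bit_ostream the replacement is then mechanical (a sketch of the changed line):

// before: new[] zero-initializes the whole array
// unique_bytes_ptr new_data(new unsigned char[new_capacity]);
// after: malloc-backed and intentionally uninitialized
unique_bytes_ptr new_data = allocate_bytes(new_capacity);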
Let's check this assumption on the same file, encoding with large and small blocks (as before, the C version is unchanged, so I will not repeat its numbers).
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1042665 (0.746 seconds), t1 = 250480, t2 = 45393, t3 = 740163
> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 11068384 (7.93 seconds), t1 = 6488100, t2 = 2694562, t3 = 1501027
Well, on large blocks the C++ implementation has overtaken C! On small ones everything is still worse, though better than in the previous experiment. The instruction count for the whole program became 2,430 million instead of 2,464, and the number of data accesses dropped from 633 million to 536. Understandably, on small blocks the new implementation performs almost as before: the construction of the coding tree dominates there, and its code did not change.
A couple more drops
Let's turn to the priority queue, which was so beautifully and concisely implemented in C++. Like everything else, it uses smart pointers to manage memory: the main m_nodes pointer points to an array of pointers to nodes. During any mutating operation the individual node pointers are shuffled around, as a rule by the expression ptr1 = std::move(ptr2). What is "buried" in it? The ptr1 pointer must check whether it already points at something and, if so, delete that resource; the ptr2 pointer must be reset to null after its resource is taken away. Yes, it is a handful of instructions, hardly worth talking about. But! In all the queue's operations it is known precisely when and what points where, so these checks and zeroings are unnecessary. And copying a raw pointer takes one (!) instruction.
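One way to express this change (my sketch of the idea, not necessarily the author's exact code) is to keep owning semantics at the queue's boundary but store plain node* internally, so that heapify permutations cost a single pointer copy each:

template <typename T>
class raw_priority_queue
{
    struct node { unsigned int m_weight; T m_data; };
    std::size_t m_capacity = 0;
    std::size_t m_size = 0;
    std::unique_ptr<node*[]> m_nodes;   // owns the array of raw pointers
public:
    ~raw_priority_queue()
    {
        // Ownership of the nodes is released in exactly one place
        for (std::size_t i = 0; i != m_size; ++i)
            delete m_nodes[i];
    }
    // push/pop/heapify shuffle plain node* values:
    //   node* tmp = nodes[i]; nodes[i] = nodes[j]; nodes[j] = tmp;
    // - no null checks, no zeroing, one mov per copy
};

With raw pointers inside the queue, the numbers become: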
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1008990 (0.722 seconds), t1 = 221001, t2 = 44870, t3 = 736557
> build-cpp/pack-cpp -m3 -b1024 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 31160081 bytes. Total ticks = 10683068 (7.65 seconds), t1 = 6101534, t2 = 2689178, t3 = 1505929
Now on large blocks C++ is 4.3% faster, and on small ones it is only 13.6% slower. The instruction count is down to 2,413 million, with 531 million data accesses.
Conclusion
What conclusions did I reach in the course of this comparative analysis built around the Huffman coding problem?
- At first I implemented both versions of the program head-on, without thinking about any particular optimizations. But it so happened that the C version immediately came out faster, while the C++ one had to be "brought up" to it through study and profiling. In the end I got a solution that is, on average, just as fast. (I think the coding-tree construction stage could also be "polished" so that it does not lag in speed.) The point I want to make is that writing the program in C "led me" along the path of producing fast code by itself, while in the C++ case further analysis was needed.
- For anyone who knows C++ well, it is much more convenient to write in it than in C. Programs come out more concise and free of many potential errors that are easy to make in C. While writing and debugging the C program, I ran into many more cases (compared with C++) of improper pointer use, access to freed memory, and incorrect type conversion. In a well-typed C++ program the rule "if it compiles, it is correct" applies to the maximum extent. And destroying complex structures together with the exception mechanism is simply a superpower: not a single line of user code is needed (except perhaps to print an error message), yet the program handles such situations correctly out of the box.
- Noticing during profiling that something works slowly is very hard, because it is a subjective judgement. I had two implementations and could compare equivalent stages. In reality there will be only one implementation, because at the start the language "XYZ" was chosen for it. How do you tell whether it is fast or slow? There is nothing to compare against. If I had written only the C++ implementation and measured at the end that it processes 31 MB in 0.85 seconds, I would immediately have considered that the gold standard of speed!
- As for my personal preferences when writing programs in the future, the approach is this. If I need to write a number cruncher whose work consists mainly of millions of similar operations, it is better to write it in the C style, so as to see and implement every step myself, even the most insignificant one: in millions of repetitions, every extra instruction causes a drawdown. If I am writing complex control logic with very diverse actions that do not reduce to "repeat computations A, B, C a million times", it is better to harness the full power of typed C++. The running time of such a program no longer comes down to "how long will it take to chew through a terabyte of data"; the control logic depends on many external factors and usage scenarios, and there is little point in talking about exact speed characteristics. But the out-of-the-box correctness of that logic will be a hundred times more valuable, because nobody wants to spend eternity picking bugs out of the C version, only to end up with a "rigid" piece of code that cannot be changed or adapted to new needs because everything crumbles wherever you touch it, and the debugging starts all over again.
P.S. The source code can be downloaded here: http://corosan.ru/data-exp/haffman_pack_for_article.tar.bz2
Update 1. At the request of commenters, I checked the packers compiled with the clang 3.8.0 compiler. The results are as follows.
C version:
> build-c/pack-c -m3 ../sample-1.dat data-c.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1139373 (0.816 seconds), t1 = 206305, t2 = 29559, t3 = 902493.
Non-optimized C++ version:
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1417576 (1.01 seconds), t1 = 223441, t2 = 53057, t3 = 1134400
Optimized C++ version:
> build-cpp/pack-cpp -m3 ../sample-1.dat data-cpp.dat
File packing is done. Read 31962362 bytes, written 32031809 bytes. Total ticks = 1174028 (0.84 seconds), t1 = 215059, t2 = 59738, t3 = 892821
The relative balance of power does not change; in absolute terms, clang generates slightly slower code than gcc.