korisk April 11, 2013 at 01:01

A few details about std :: string

I recently became interested in implementing std :: string in libstdc ++. Not in connection with the adoption of the new standard, but to understand. Fortunately, the requirements for the string type have not changed much.

The main tool for code analysis is undoubtedly the method of peering, but in order to narrow the gaze and make the procedure more exciting, you can implement the “tracer” idiom for the line in the “C ++ Templates: The Complete Guide”. Tracing allows you to identify ~~suspicious~~ interesting operations on strings.

As you know, std :: string is an alias for

std::basic_string

and nothing prevents us from identifying

std::basic_string

. In X, you can define several static counters and iterate them in the constructor, destructor, and other methods. By performing various operations on such a string, it will be possible to trace the effectiveness of the applied algorithms in terms of the number of operations.
Also in g ++ for

std::string a(«entrails»);

expression

std::cout << reinterpret_cast(*((void**)(&a)));

will display the contents of the string. Those. std :: string - is, in fact, a pointer to char.
In general, these and other ~~shocking~~ details under the cut.

Structure

std :: string is declared as:

  typedef basic_string    string;

 basic_string

, in turn, is an alias for:

template, typename _Alloc = allocator<_CharT> >  class basic_string;

The definition of basic_string is done in the files c ++ / bits / basic_string.h c ++ / bits / basic_string.tcc.

If you remove all the methods of the class, it will remain

not so much

template
class basic_string{
public:
     static const size_type    npos = static_cast(-1);
private:
    struct _Rep_base{
        size_type            _M_length;  //по умолчанию инстанцируется в size_t
        size_type            _M_capacity;
        _Atomic_word   _M_refcount;
    };
    struct Rep:_Rep_base{
        static const size_type         _S_max_size; //(((npos - sizeof(_Rep_base))/sizeof(_CharT)) - 1) / 4
        static const _CharT            _S_terminal;   //терминальный символ. Инициализируется величиной _CharT()
        static size_type                    _S_empty_rep_storage[]; // Массив заполняемый нулями длиной (sizeof(_Rep_base) + sizeof(_CharT) + sizeof(size_type) - 1) /sizeof(size_type)
    };
    struct _Alloc_hider : _Alloc{    
        _Alloc_hider(_CharT* __dat, const _Alloc& __a) : _Alloc(__a), _M_p(__dat) { } 
       _CharT* _M_p; // указатель на массив символов. 
      };  
      mutable _Alloc_hider      _M_dataplus; //единственный не статический член класса string
};

, which in turn are expressed in a single pointer. Moreover, the pointer does not indicate the beginning of the allocated piece of memory, but as shown in the figure, _M_p indicates the beginning of the _CharT array.
Such a tricky structure is achieved by dividing the initialization into two stages. Inside the std :: string constructor, the allocate () method of the allocator is first called, then construct (). Allocate () using malloc () allocates the necessary piece of memory, and construct () calls placement new with the desired offset relative to the beginning of the selected area. The offset is determined at the compilation stage, so destruct () and deallocate () have no problems calculating the source address.
As written in basic_string.h, such a construction procedure is implemented so that the variable std :: string is a pointer to an array of string elements. Then gdb is able to display the contents of std :: string.

The picture seems to be self-documenting, however, I note that _M_refcount takes 4 bytes, but is aligned to 8 (for x86-64). Lines 1) and 2) will occupy the same amount of memory. X is a tracer class, but more on that later. First, a few words about the static members of the std :: string class.
It would seem that npos needs no introduction. But ... this is not the maximum possible length of std :: string; it is only the maximum value of type std :: string :: size_type. The maximum length of the string is at least four less and it is determined by the static member:
_S_max_size;

quote from c ++ / bits / basic_string.tcc

  template
    const typename basic_string<_CharT, _Traits, _Alloc>::size_type
    basic_string<_CharT, _Traits, _Alloc>::
    _Rep::_S_max_size = (((npos - sizeof(_Rep_base))/sizeof(_CharT)) - 1) / 4;

Why it was chosen a quarter - I could not find out.
_S_terminal - the end of line character is initialized by the constructor without parameters when the program starts.

Reference counting

Now we come to a separate topic of link counting. Counting string references is mentioned in the standard, but is not a requirement.
The _S_empty_rep_storage variable is the memory block that all empty lines created in the program refer to.
The number of links is stored in the _M_refcount variable field.
_M_refcount:

-1: the line has one link, increasing the number of links is not possible;
0: normal value. there is only one line with this content;
n> 0: there are n + 1 rows with this content. (when working in a multi-threaded program, locks are required for such lines)

Despite the fact that _M_refcount with a value of -1 should prohibit "lazy" string construction, there is no method in the public interface std :: string that allows you to enable or disable link counting. There is a macro definition _GLIBCXX_FULLY_DYNAMIC_STRING that seems to be intended for this, but it didn’t work in my tests, I didn’t go deeper.

(Upd: _GLIBCXX_FULLY_DYNAMIC_STRING have nothing to do with refcounting. This is a simple optimization so as not to allocate memory for empty lines. You need to disable it if you have several instances of libstdc ++ in the process (this happens on Windows))

In general, these links simply transfer memory allocation from the call of the constructor to the moment of the first write to the line and sometimes make themselves felt when working in multi-threaded mode. Helgrind and drd give nontrivial hints about this.

Tracer

Now let's see how well the methods of the std :: string class implement their actions. And we will measure the quality of actions in the number of operations performed on line characters. To do this, we will use the tracer idiom. It is intended for debugging containers and is involved in creating a class of counting operations performed on it. By instantiating a container of such classes, one can easily calculate the number of comparison operations, for example, when sorting or when performing any other action on the container. Well, std :: string more or less falls under the definition of a container, so the string can be examined using this idiom. You can estimate which constructors, methods, algorithms are more effective in a particular situation. Actually, here is our

tracer X

struct X{
        char v; //value
        X():v(){ X::c[ctor]++; };
        X(char x):v(x){ X::c[ctor]++; };
        X(X const& x):v(x.v) { X::c[copy]++; };
        ~X(){ X::c[dtor]++; };
        X& operator=(X const& x) { X::c[assgn]++; v=x.v; return *this; };
        X& operator=(char x) { X::c[assgn]++; v = x;   return *this; };
        X& operator()(char x) { X::c[cast]++; v = x;   return *this; };// X a = X('c');
        operator char() const { X::c[cast]++; return v; };             // char x = a;
        bool operator==( X const & x) const { X::c[eql]++; return v == x.v; };
        bool operator==( char x) const { X::c[eql]++; return v == x; };
        bool operator<( X const& x) const { X::c[lss]++; return v < x.v; };    
        bool operator<( int x) const { X::c[lss]++; return v < x; };    //for boost
        enum { ctor=0, copy, assgn, dtor, lss, eql, cast, last };    
        static int c[]; //counters
        static ::std::string n[]; //names of the counters
        static void show(const ::std::string &title, const ::std::string& end);
        static void head(const ::std::string &header);
};

Structure X has one regular field of class char v; - its meaning. Two static ones are an array of counters and an array of names for these counters. Calls of constructors, destructors, assignments, comparisons (==, <) and conversions to / from char are counted. Only seven counters. The two static methods show and head are used to visualize the results. Show displays the values of the counters and resets them. Head shows the names of the counters.
To calibrate the tracer, let's see how it counts operations in elementary situations:

operation	ctor	copy	assgn	dtor	lss	eql	cast
{	1	0	0	0	0	0	0
X a;	1	0	0	0	0	0	0
a = '1';	0	0	1	0	0	0	0
X b = a;	0	1	0	0	0	0	0
X c (a);	0	1	0	0	0	0	0
c ('e');	0	0	0	0	0	0	1
c = b;	0	0	1	0	0	0	0
(c = b) = 'e';	0	0	2	0	0	0	0
a <b;	0	0	0	0	1	0	0
a == b;	0	0	0	0	0	1	0
[] (X y) -> X {return yv;} (b);	1	1	0	2	0	0	0
a = [] (X y) -> X {return y;} (b);	0	2	1	2	0	0	0
}	0	0	0	3	0	0	0

Tracer Calibration Code

        X::head();
        X::show("{");
{       //single operations
        X a;
        X::show("X a;");
        a = '1';
        X::show("a = '1';");
        X b = a;
        X::show("X b = a;");
        X c(a);
        X::show("X c(a);");
        c('e');
        X::show("c('e');");
        c = b;
        X::show("c = b;");
        (c = b) = 'e';
        X::show("(c=b)='e';");
        c < b;
        X::show("a < b;");
        a == b;
        X::show("a == b;");
        #if __cplusplus > 199711L
       [](X y)->X{return y.v;}(b);
       X::show("[](X y)->X{return y.v;}(b);");
        a = [](X y)->X{return y;}(b);
        X::show("a = [](X y)->X{return y;}(b);");
        #endif
}
        X::show("}");

Because calibration is performed at the beginning of the application, the constructor of the static terminal symbol _S_terminal falls into this table. Its constructor is counted in the first line.
The following table shows the construction of a pair of arrays:

operation	ctor	assgn	dtor
{	0	0	0
X a [10];	10	0	0
std :: copy (ch, ch + 10, a);	0	10	0
X b [10] = {'o'};	10	0	0
std :: copy (a, a + 10, b);	0	10	0
}	0	0	20

Code for Arrays X

{
        X::show("{");
        X a[10];
        X::show("X a[10];");
        std::copy(ch,ch+10,a);
        X::show("std::copy(ch,ch+10,a);");
        X b[10] = { 'o' };
        X::show("X b[10] = {'o'};");
        std::copy(a,a+10,b);
        X::show("std::copy(a,a+10,b);");
}
        X::show("}");

Now let's move on to the lines. Let's call our trace line like this

typedef std::basic_string xs;

The construction of empty lines, as expected, does not perform operations on characters.

operation	ctor	copy	assgn	dtor	lss	eql	cast
{	0	0	0	0	0	0	0
const xs s1;	0	0	0	0	0	0	0
xs s2 (s1);	0	0	0	0	0	0	0
xs s3 = s1;	0	0	0	0	0	0	0
}	0	0	0	0	0	0	0

Code for empty string constructors

{
        X::head();
        X::show("{");
        const xs s1; 
        X::show("const xs s1;");
        xs s2(s1);
        X::show("xs s2(s1);");
        xs s3 = s1; 
        X::show("xs s3 = s1;");
}
        X::show("}");

The following table speaks for itself. It was embarrassing that when specifying the length (s2), the number of operations is much less than in the case of s1. Moreover, the strings constructed on the basis of s1 and s2 behaved differently. For s6, lazy design worked, for s5 it didn't. The filling constructor surprised me so much - additional copying and destructors came from somewhere. (Upd: they came from passing a class by value)

operation	ctor	copy	assgn	dtor	eql
{	0	0	0	0	0
xs s1 ((X *) "1234567890");	eleven	0	eleven	eleven	eleven
xs s2 ((X *) "1234567890", 7);	0	0	8	0	0
xs s3 (10, 'a');	1	3	eleven	4	0
xs s4 (s1.begin (), s1.begin () + 7);	0	0	8	0	0
xs s5 (s2);	0	0	0	0	0
s5 [0] = 'a';	0	0	9	0	0
xs s6 (s1);	0	0	eleven	0	0
s6 [0] = 'a';	0	0	1	0	0
xs s7 (s1,1,5);	0	0	6	0	0
xs s8 = [] () -> xs {xs a ((X *) "ololo"); return a;} ();	6	0	6	6	6

Now a few basic operations. Unexpectedly, the result for resize was significantly worse than for reserve:

operation	ctor	copy	assgn	dtor
int a = s1.size ();	0	0	0	0
s1.resize (100);	1	3	102	4
s2.reserve (100);	0	0	8	0
s1.erase ();	0	0	1	0

Here are a few more selected operations

operation	ctor	copy	assgn	dtor	lss	eql	cast
s1.insert (0, s2);	0	0	8	0	0	0	0
s1.insert (0, s2,2,4);	0	0	5	0	0	0	0
s3 = s1;	0	0	0	0	0	0	0
s3 == s1;	0	0	0	0	2	0	0
s2! = s1;	0	0	0	0	1	0	0
s1> s2;	0	0	0	0	2	0	0
equal (s1.begin (), s1.end (), s2.begin ());	0	0	0	0	0	1	0
mismatch (s1.begin (), s1.end (), s2.begin ());	0	0	0	0	0	1	0
copy (s1.begin (), s1.end (), out_it);	0	0	0	0	0	0	eleven
std :: sort (s1.begin (), s1.end ());	0	10	24	10	29th	0	0
std :: swap (s1, s2);	0	0	0	0	0	0	0
s3 = s1 + s2;	0	0	20	0	0	0	0
s1 + s2;	0	0	20	0	0	0	0
s3 + = s1;	0	0	27	0	0	0	0
s3.substr (10);	0	0	16	0	0	0	0
s4 = s3.substr (10);	0	0	16	0	0	0	0
s4 = s3.find (s2);	1	3	2	4	26	8	0
}	0	0	0	0	0	0	0

And their code

{
	X::head();
	X::show("{");
	xs s1((X*)"1234567890");
	X::show("xs s1((X*)\"1234567890\");");
	xs s2((X*)"1234567890",7);
	X::show("xs s2((X*)\"1234567890\",7);");
	xs s3(10,'a');
	X::show("xs s3(10,'a');");
	xs s4(s1.begin(), s1.begin()+7);
	X::show("xs s4(s1.begin(), s1.begin()+7);");
	xs s5(s2);
	X::show("xs s5(s2);");
	s5[0]='a';
	X::show("s5[0]='a';");
	xs s6(s1);
	X::show("xs s6(s1);");
	s6[0]='a';
	X::show("s6[0]='a';");
	xs s7(s1,1,5);
	X::show("xs s7(s1,1,5);");
	#if __cplusplus > 199711L
	xs s8 = []()->xs{xs a((X*)"ololo");return a;}();
	X::show("xs s8=[]()->xs{xs a((X*)\"lol\");return a;}();");
	#endif
//----------------------------------------------------------------
	int a = s1.size();
	X::show("int a = s1.size();");
	s1.resize(100);
	X::show("s1.resize(100);");
	s2.reserve(100);
	X::show("s2.reserve(100);");
	s1.erase();
	X::show("s1.erase();");
	s1.insert(0,s2);
	X::show("s1.insert(0,s2);");
	s1.insert(0,s2,2,4);
	X::show("s1.insert(0,s2,2,4);");
//----------------------------------------------------------------
	X::show("s3 = s1;");
	s3 == s1;	
	X::show("s3 == s1;");
	s2 != s1;
	X::show("s2 != s1;");
	s1 > s2;
	X::show("s1 > s2;");
        std::equal(s1.begin(),s1.end(),s2.begin());
        X::show("equal(s1.begin(),s1.end(),s2.begin());");
        std::mismatch(s1.begin(),s1.end(),s2.begin());
        X::show("mismatch(s1.begin(),s1.end(),s2.begin());");
        std::copy(s1.begin(),s1.end(),out_it);
        std::cout << std::endl; 
        X::show("copy(s1.begin(),s1.end(),out_it);");
        std::sort(s1.begin(),s1.end());
        X::show("std::sort(s1.begin(),s1.end());");
        std::swap(s1,s2);
        X::show("std::swap(s1,s2);");
        s3 = s1 + s2;
        X::show("s3 = s1 + s2;");
        s1 + s2;        
        X::show("s1 + s2;");
        s3 += s1;       
        X::show("s3 += s1;");
        s3.substr(10);  
        X::show("s3.substr(10);");
        s4 = s3.substr(10);     
        X::show("s4 = s3.substr(10);");
        s4 = s3.find(s2);
        X::show("s4 = s3.find(s2);");
}
	X::show("}");

What can be learned from this? Well, firstly, to use reserve, not resize, to avoid assignments and filling constructors, maybe something else. The code works and gives a similar result for g ++ 4.7.2, intel 13.0.1, clang 3.2, which was checked here.
After leaving the scope of the string, destructors for characters are not called. Perhaps there other operations are skipped (for example, memcmp is used instead

operator>()

in the loop). But the string is not a full container. For strings, the tracer method provides a rough estimate of the number of operations. For clean containers, this rating should be stricter.

That’s almost all.
No one was hurt during the study. Sources were
this reference , files:
/usr/include/c++/4.7.2/strings
/usr/include/c++/4.7.2/bits/stringfwd.h
/usr/include/c++/4.7.2./bits/basic_string .h
/usr/include/c++/4.7.2./bits/basic_string.tcc
and gdb.

PS.

As a postscript ~~and illustration of the metaprogramming identity~~ , I’ll add that our tracer can be used to estimate the cost of parsing a line in boost :: spirit. On the github, the source of the tracer is laid out, along with an example calc5.cpp from boost :: spirit.

Upd1:
Thanks to Khim for explaining the situation with _GLIBCXX_FULLY_DYNAMIC_STRING, with c ++ 11 lines and their implementation in libstdc ++.

After the directive is included in the code

#include

,
override

typedef __gnu_cxx::__versa_string xs;

and
builds with the -std = gnu ++ 11 option, link counting has been disabled, as you can see in the table:

Results for __gnu_cxx :: __ versa_string

operation	ctor	copy	assgn	dtor	lss	eql	cast
{	0	0	0	0	0	0	0
const xs s1;	1	0	1	1	0	0	0
xs s2 (s1);	1	0	1	1	0	0	0
xs s3 = s1;	1	0	1	1	0	0	0
}	0	0	0	0	0	0	0
operation	ctor	copy	assgn	dtor	lss	eql	cast
{	0	0	0	0	0	0	0
xs s1 ((X *) "1234567890");	12	0	eleven	12	0	eleven	0
xs s2 ((X *) "1234567890", 7);	1	0	8	1	0	0	0
xs s3 (10, 'a');	2	4	eleven	6	0	0	0
xs s4(s1.begin(), s1.begin()+7);	1	0	8	1	0	0	0
xs s5(s2);	1	0	8	1	0	0	0
s5[0]='a';	0	0	1	0	0	0	0
xs s6(s1);	1	0	11	1	0	0	0
s6[0]='a';	0	0	1	0	0	0	0
xs s7(s1,1,5);	1	0	6	1	0	0	0
xs s8=[]()->xs{xs a((X*)«lol»);return a;}();	7	0	6	7	0	6	0
xs s9((X*)«1234567890»);	12	0	11	12	0	11	0
X n = s9[0];	0	1	0	0	0	0	0
int a = s1.size();	0	0	0	0	0	0	0
s1.resize(100);	2	4	101	6	0	0	0
s2.reserve(100);	0	0	8	0	0	0	0
s1.erase();	1	0	1	1	0	0	0
s1.insert(0,s2);	1	0	8	1	0	0	0
s1.insert(0,s2,2,4);	1	0	5	1	0	0	0
s3 = s1;	0	0	0	0	0	0	0
s3 == s1;	0	0	0	0	2	0	0
s2 != s1;	0	0	0	0	1	0	0
s1 > s2;	0	0	0	0	2	0	0
equal(s1.begin(),s1.end(),s2.begin());	0	0	0	0	0	1	0
mismatch(s1.begin(),s1.end(),s2.begin());	0	0	0	0	0	1	0
copy(s1.begin(),s1.end(),out_it);	0	0	0	0	0	0	11
std::sort(s1.begin(),s1.end());	0	10	24	10	29	0	0
std::swap(s1,s2);	0	0	32	0	0	0	0
s3 = s1 + s2;	3	0	38	3	0	0	0
s1 + s2;	3	0	22	3	0	0	0
s3 += s1;	1	0	8	1	0	0	0
s3.substr(10);	1	0	16	1	0	0	0
s4 = s3.substr(10);	17	0	64	17	0	0	0
s4 = s3.find(s2);	2	3	2	5	26	8	0
}	0	0	0	1	0	0	0

Upd2: According to the standard, basic_string can only operate with POD types. It turned out that X is not a POD type. (it is a “standard layout class”, but not a “trivially copyable class” because it has explicit constructors, destructor and assignment operators)
Therefore, the behavior of the basic_string class is undefined, and the whole post is a lie and a provocation. :)
Which is confirmed by the inability to compile examples using standard libraries other than gnu libstdc ++. (clang 3.2 c libc ++ 1.0 and msvc from vs2012)

Tags:

A few details about std :: string

Structure

Reference counting

Tracer

PS.

Also popular now: