Passing smart pointers by const reference: a dissection
Smart pointers are often passed to other functions by const reference. The C++ experts Andrei Alexandrescu, Scott Meyers, and Herb Sutter discussed this issue at the C++ and Beyond 2011 conference (see [04:34] On shared_ptr performance and correctness).
In fact, a smart pointer passed by const reference already lives somewhere in the current scope of the calling code. If it is stored in a class member, it may happen that this member gets reset to null. But that is not a problem of passing by reference; it is a problem of architecture and ownership policy.
This post, however, is not about correctness. Here we look at the performance we can gain by switching to const references. At first glance, it may seem that the only benefit is the absence of atomic increments/decrements of the reference counter in the copy constructor and destructor. Let's write some code and take a closer look at what happens under the hood.
Translation of the article: blog.linderdaum.com/2014/07/03/smart-pointers-passed-by-const-reference
For starters, a few helper functions:
#include <cstddef>
#include <cstdio>
#include <ctime>

const size_t NUM_CALLS = 10000000;

double GetSeconds()
{
	return ( double )clock() / CLOCKS_PER_SEC;
}

void PrintElapsedTime( double ElapsedTime )
{
	// report seconds per million calls
	printf( "%f s/Mcalls\n", float( ElapsedTime / double( NUM_CALLS / 1000000 ) ) );
}
An intrusive reference counter:
#include <atomic>

class iIntrusiveCounter
{
public:
	iIntrusiveCounter(): FRefCounter( 0 ) {}
	virtual ~iIntrusiveCounter() {}
	void IncRefCount() { FRefCounter++; }
	void DecRefCount() { if ( --FRefCounter == 0 ) { delete this; } }
private:
	std::atomic<int> FRefCounter;
};
An ad hoc smart pointer:
template <class T> class clPtr
{
public:
	clPtr(): FObject( 0 ) {}
	clPtr( const clPtr& Ptr ): FObject( Ptr.FObject ) { FObject->IncRefCount(); }
	clPtr( T* const Object ): FObject( Object ) { FObject->IncRefCount(); }
	~clPtr() { FObject->DecRefCount(); }
	clPtr& operator = ( const clPtr& Ptr )
	{
		T* Temp = FObject;
		FObject = Ptr.FObject;
		Ptr.FObject->IncRefCount();
		Temp->DecRefCount();
		return *this;
	}
	inline T* operator -> () const { return FObject; }
private:
	T* FObject;
};
So far, everything is quite simple, right?
Let's declare a simple class whose instance we will pass to a function, first by value and then by const reference:
class clTestObject: public iIntrusiveCounter
{
public:
	clTestObject(): FPayload( 32167 ) {}
	// do something useful
	void Do()
	{
		FPayload++;
	}
private:
	int FPayload;
};
Now we can write the benchmark code itself:
void ProcessByValue( clPtr<clTestObject> O ) { O->Do(); }
void ProcessByConstRef( const clPtr<clTestObject>& O ) { O->Do(); }

int main()
{
	clPtr<clTestObject> Obj = new clTestObject;

	for ( size_t j = 0; j != 3; j++ )
	{
		double StartTime = GetSeconds();
		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByValue( Obj ); }
		PrintElapsedTime( GetSeconds() - StartTime );
	}

	for ( size_t j = 0; j != 3; j++ )
	{
		double StartTime = GetSeconds();
		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByConstRef( Obj ); }
		PrintElapsedTime( GetSeconds() - StartTime );
	}

	return 0;
}
Let's compile and see what happens. First, an unoptimized build (I am using gcc.EXE (GCC) 4.10.0 20140420 (experimental)):
gcc -O0 main.cpp -lstdc++ -std=c++11
The speed is 0.375 s/Mcalls for the by-value version versus 0.124 s/Mcalls for the const-reference version: a convincing 3x difference in the debug build. Not bad. Let's look at the assembly listing. The by-value version:
L25:
leal -60(%ebp), %eax
leal -64(%ebp), %edx
movl %edx, (%esp)
movl %eax, %ecx
call __ZN5clPtrI12clTestObjectEC1ERKS1_ // call the copy constructor
subl $4, %esp
leal -60(%ebp), %eax
movl %eax, (%esp)
call __Z14ProcessByValue5clPtrI12clTestObjectE
leal -60(%ebp), %eax
movl %eax, %ecx
call __ZN5clPtrI12clTestObjectED1Ev // call the destructor
addl $1, -32(%ebp)
L24:
cmpl $10000000, -32(%ebp)
jne L25
The const-reference version. Note how much cleaner everything is, even in the debug build:
L29:
leal -64(%ebp), %eax
movl %eax, (%esp)
call __Z17ProcessByConstRefRK5clPtrI12clTestObjectE // just a single call
addl $1, -40(%ebp)
L28:
cmpl $10000000, -40(%ebp)
jne L29
All the calls are in place, and all we saved were two rather expensive atomic operations. But debug builds are not what we care about, right? Let's turn on optimizations and see what happens:
gcc -O3 main.cpp -lstdc++ -std=c++11
The by-value version now completes in 0.168 s per million calls. The execution time of the const-reference version literally dropped to zero. This is not a mistake: no matter how many iterations we run, the execution time of this simple test stays zero. Let's look at the assembly to check whether we made an error somewhere. Here is the optimized by-value version:
L25:
call _clock
movl %eax, 36(%esp)
fildl 36(%esp)
movl $10000000, 36(%esp)
fdivs LC0
fstpl 24(%esp)
.p2align 4,,10
L24:
movl 32(%esp), %eax
lock addl $1, (%eax) // the inlined IncRefCount()...
movl 40(%esp), %ecx
addl $1, 8(%ecx) // ProcessByValue() and Do() compiled down to 2 instructions
lock subl $1, (%eax) // and this is DecRefCount(). Impressive.
jne L23
movl (%ecx), %eax
call *4(%eax)
L23:
subl $1, 36(%esp)
jne L24
call _clock
Well, what can the pass-by-reference version possibly be doing that runs so fast we cannot even measure it? Here it is:
call _clock
movl %eax, 36(%esp)
movl 40(%esp), %eax
addl $10000000, 8(%eax) // the precomputed final result; no loops, nothing
call _clock
movl %eax, 32(%esp)
movl $20, 4(%esp)
fildl 32(%esp)
movl $LC2, (%esp)
movl $1, 48(%esp)
flds LC0
fdivr %st, %st(1)
fildl 36(%esp)
fdivp %st, %st(1)
fsubrp %st, %st(1)
fstpl 8(%esp)
call _printf
Wow! This listing is the entire benchmark. The absence of atomic operations allowed the optimizer to get inside this code and collapse the loop into a single precomputed value. Of course, this example is trivial. Still, it clearly demonstrates the two benefits of passing smart pointers by const reference, which make it not a premature optimization but a serious means of improving performance:
1) removing the atomic operations is a big win in itself;
2) removing the atomic operations lets the optimizer clean up the surrounding code.
The full source code is here.
Your results may vary depending on your compiler :)
P.S. Herb Sutter has a very detailed essay on this topic, covering the language side of passing smart pointers by reference in C++ in depth.