Passing smart pointers by const reference: a dissection

    Smart pointers are often passed to other functions by const reference. The C++ experts Andrei Alexandrescu, Scott Meyers, and Herb Sutter discussed this issue at the C++ and Beyond 2011 conference (see [04:34] On shared_ptr performance and correctness).

    Indeed, a smart pointer passed by const reference already lives somewhere in the scope of the calling code. If it is stored in a class member, it may happen that this member is reset to null while the callee still holds the reference. But this is not a problem of passing by reference; it is a problem of architecture and ownership policy.

    This post, however, is not about correctness. Here we look at the performance we can gain by switching to const references. At first glance, the only benefit may seem to be the absence of atomic increments/decrements of the reference counter in the copy constructor and destructor. Let's write some code and take a closer look at what happens under the hood.



    Article translation: blog.linderdaum.com/2014/07/03/smart-pointers-passed-by-const-reference


    For starters, a few helper functions:

    #include <cstdio>   // printf
    #include <ctime>    // clock, CLOCKS_PER_SEC
    
    const size_t NUM_CALLS = 10000000;
    double GetSeconds()
    {
    	return ( double )clock() / CLOCKS_PER_SEC;
    }
    void PrintElapsedTime( double ElapsedTime )
    {
    	// report seconds per million calls, matching the format string
    	printf( "%f s/Mcalls\n", float( ElapsedTime / double( NUM_CALLS / 1000000 ) ) );
    }
    


    An intrusive reference counter:

    #include <atomic>   // std::atomic
    
    class iIntrusiveCounter
    {
    public:
    	iIntrusiveCounter(): FRefCounter( 0 ) {}
    	virtual ~iIntrusiveCounter() {}
    	void    IncRefCount() { FRefCounter++; }
    	void    DecRefCount() { if ( --FRefCounter == 0 ) { delete this; } }
    private:
    	std::atomic<int> FRefCounter;
    };
    


    An ad hoc smart pointer:

    template <typename T> class clPtr
    {
    public:
    	clPtr(): FObject( 0 ) {}
    	// null checks make a default-constructed clPtr safe to copy and destroy
    	clPtr( const clPtr& Ptr ): FObject( Ptr.FObject ) { if ( FObject ) { FObject->IncRefCount(); } }
    	clPtr( T* const Object ): FObject( Object ) { if ( FObject ) { FObject->IncRefCount(); } }
    	~clPtr() { if ( FObject ) { FObject->DecRefCount(); } }
    	clPtr& operator = ( const clPtr& Ptr )
    	{
    		T* Temp = FObject;
    		FObject = Ptr.FObject;
    		if ( FObject ) { FObject->IncRefCount(); }
    		if ( Temp ) { Temp->DecRefCount(); }
    		return *this;
    	}
    	inline T* operator -> () const { return FObject; }
    private:
    	T*    FObject;
    };
    


    So far, everything is quite simple, right?
    We declare a simple class whose instance we will pass to a function, first by value and then by const reference:

    class clTestObject: public iIntrusiveCounter
    {
    public:
    	clTestObject():FPayload(32167) {}
    	// do something useful
    	void Do()
    	{
    		FPayload++;
    	}
    private:
    	int FPayload;
    };
    


    Now we can write the benchmark itself:

    void ProcessByValue( clPtr<clTestObject> O ) { O->Do(); }
    void ProcessByConstRef( const clPtr<clTestObject>& O ) { O->Do(); }
    int main()
    {
    	clPtr<clTestObject> Obj = new clTestObject;
    	for ( size_t j = 0; j != 3; j++ )
    	{
    		double StartTime = GetSeconds();
    		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByValue( Obj ); }
    		PrintElapsedTime( GetSeconds() - StartTime );
    	}
    	for ( size_t j = 0; j != 3; j++ )
    	{
    		double StartTime = GetSeconds();
    		for ( size_t i = 0; i != NUM_CALLS; i++ ) { ProcessByConstRef( Obj ); }
    		PrintElapsedTime( GetSeconds() - StartTime );
    	}
    	return 0;
    }
    


    Let's compile it and see what happens. First, a non-optimized build (I am using gcc.EXE (GCC) 4.10.0 20140420 (experimental)):

    gcc -O0 main.cpp -lstdc++ -std=c++11
    


    The running time is 0.375 s/Mcalls for the by-value version versus 0.124 s/Mcalls for the const-reference version. A convincing 3x difference in a debug build. Not bad. Let's look at the assembly listing. The by-value version:

    L25:
    	leal	-60(%ebp), %eax
    	leal	-64(%ebp), %edx
    	movl	%edx, (%esp)
    	movl	%eax, %ecx
    	call	__ZN5clPtrI12clTestObjectEC1ERKS1_		// call the copy constructor
    	subl	$4, %esp
    	leal	-60(%ebp), %eax
    	movl	%eax, (%esp)
    	call	__Z14ProcessByValue5clPtrI12clTestObjectE
    	leal	-60(%ebp), %eax
    	movl	%eax, %ecx
    	call	__ZN5clPtrI12clTestObjectED1Ev			// call the destructor
    	addl	$1, -32(%ebp)
    L24:
    	cmpl	$10000000, -32(%ebp)
    	jne	L25
    


    The const-reference version. Note how much cleaner everything is, even in a debug build:

    L29:
    	leal	-64(%ebp), %eax
    	movl	%eax, (%esp)
    	call	__Z17ProcessByConstRefRK5clPtrI12clTestObjectE	// just a single call
    	addl	$1, -40(%ebp)
    L28:
    	cmpl	$10000000, -40(%ebp)
    	jne	L29
    


    All the calls are in place, and all we saved is two rather expensive atomic operations. But debug builds are not what we care about, right? Let's turn on optimizations and see what happens:

    gcc -O3 main.cpp -lstdc++ -std=c++11
    


    The by-value version now runs at 0.168 s/Mcalls. The execution time of the const-reference version literally dropped to zero. This is not a mistake: no matter how many iterations we run, the execution time of this simple test is zero. Let's look at the assembly to check whether we made a mistake somewhere. Here is the optimized by-value version:

    L25:
    	call	_clock
    	movl	%eax, 36(%esp)
    	fildl	36(%esp)
    	movl	$10000000, 36(%esp)
    	fdivs	LC0
    	fstpl	24(%esp)
    	.p2align 4,,10
    L24:
    	movl	32(%esp), %eax
    	lock addl	$1, (%eax)		// inlined IncRefCount()...
    	movl	40(%esp), %ecx
    	addl	$1, 8(%ecx)		// ProcessByValue() and Do() compiled into 2 instructions
    	lock subl	$1, (%eax)		// and this is DecRefCount(). Impressive.
    	jne	L23
    	movl	(%ecx), %eax
    	call	*4(%eax)
    L23:
    	subl	$1, 36(%esp)
    	jne	L24
    	call	_clock
    


    And what can the const-reference version possibly be doing to run so fast that we cannot even measure it? Here it is:

    	call	_clock
    	movl	%eax, 36(%esp)
    	movl	40(%esp), %eax
    	addl	$10000000, 8(%eax)		// the precomputed final result; no loops, nothing
    	call	_clock
    	movl	%eax, 32(%esp)
    	movl	$20, 4(%esp)
    	fildl	32(%esp)
    	movl	$LC2, (%esp)
    	movl	$1, 48(%esp)
    	flds	LC0
    	fdivr	%st, %st(1)
    	fildl	36(%esp)
    	fdivp	%st, %st(1)
    	fsubrp	%st, %st(1)
    	fstpl	8(%esp)
    	call	_printf
    


    Wow! This listing is the entire benchmark. The absence of atomic operations allowed the optimizer to see through this code and collapse the loop into a single precomputed value. Of course, this example is trivial. Nevertheless, it clearly demonstrates the two benefits of passing smart pointers by const reference, which make it not a premature optimization but a serious means of improving performance:

    1) removing atomic operations is a big win in itself
    2) removing atomic operations lets the optimizer clean up the surrounding code

    Full source here .

    Results may vary on your compiler :)

    P.S. Herb Sutter has a very detailed essay on this topic, which covers in depth the language side of passing smart pointers by reference in C++.
