C vs Go: loops and simple math

When I got tired of C programming, I, like many others, became interested in the Go language. It is strongly typed and compiled, and therefore reasonably fast. And I wanted to find out how much effort the Go authors had put into optimizing loops and simple arithmetic.

First, let's look at how things stand with C.

We write a simple piece of code:

#include <stdint.h>
#include <stdio.h>

int main()
{
        uint64_t i;
        uint64_t j = 0;
        for (i = 10000000; i > 0; i--)
        {
                j ^= i;
        }
        printf("%lu\n", j);
        return 0;
}

Compile with -O2 and disassemble:

564:   31 d2                   xor    %edx,%edx
566:   b8 80 96 98 00          mov    $0x989680,%eax
56b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
570:   48 31 c2                xor    %rax,%rdx
573:   48 83 e8 01             sub    $0x1,%rax
577:   75 f7                   jne    570 <main+0x10>

We get the execution time:

real 0m0.023s
user 0m0.019s
sys 0m0.004s

It would seem there is nowhere left to speed up, but we have a modern processor, and for operations like this it has fast SSE registers. We try the gcc options -mfpmath=sse -msse4.2 and get the same result.
Add -O3 and hooray:

 57a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
 580:   83 c0 01                add    $0x1,%eax
 583:   66 0f ef c8             pxor   %xmm0,%xmm1
 587:   66 0f d4 c2             paddq  %xmm2,%xmm0
 58b:   3d 40 4b 4c 00          cmp    $0x4c4b40,%eax
 590:   75 ee                   jne    580 <main+0x20>

We can see that SSE2 instructions and SSE (xmm) registers are used, and we get roughly a threefold performance boost:

real 0m0,006s
user 0m0,006s
sys 0m0,000s

Now the same in Go:

package main

import "fmt"

func main() {
        i := 0
        j := 0
        for i = 10000000; i > 0; i-- {
                j ^= i
        }
        fmt.Println(j)
}

0x000000000048211a <+42>:    lea    -0x1(%rax),%rdx
0x000000000048211e <+46>:    xor    %rax,%rcx
0x0000000000482121 <+49>:    mov    %rdx,%rax
0x0000000000482124 <+52>:    test   %rax,%rax
0x0000000000482127 <+55>:    ja     0x48211a <main.main+42>


Go timings:
standard Go compiler:
real 0m0,021s
user 0m0,018s
sys 0m0,004s

gccgo:
real 0m0,058s
user 0m0,036s
sys 0m0,014s

The performance is the same as C at -O2. I also tried gccgo: the result is the same, but its binary runs longer than the one built by the standard Go compiler (1.10.4). Apparently, because the standard compiler's runtime handles thread startup well (in my case, 5 additional threads were created on 4 cores), the application finishes faster.
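
The standard Go compiler does not apply the vectorizing transformation that gcc performs at -O3, where two 64-bit values are processed per iteration in one 128-bit xmm register. To illustrate the idea behind that transformation, here is a minimal sketch that unrolls the loop by hand into two independent XOR accumulators; it assumes the iteration count is even, and there is no guarantee the Go compiler maps it onto SSE registers:

package main

import "fmt"

func main() {
        var j0, j1 uint64
        // Two independent accumulators mimic the two 64-bit lanes of an
        // xmm register in the gcc -O3 code; the iteration count
        // (10000000) is even, so every value from 1 to 10000000 is covered.
        for i := uint64(10000000); i >= 2; i -= 2 {
                j0 ^= i
                j1 ^= i - 1
        }
        fmt.Println(j0 ^ j1)
}

Since XOR is associative and commutative, combining the two accumulators at the end gives the same result as the original single-accumulator loop.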

Conclusion

Still, I managed to get the standard Go compiler to use SSE instructions for the loop by slipping it a float, which it handles natively in SSE registers.

// +build amd64

package main

import "fmt"

func main() {
        var i float64 = 0
        var j float64 = 0
        for i = 10000000; i > 0; i-- {
                j += i
        }
        fmt.Println(j)
}


0x0000000000484bbe <+46>: movsd 0x4252a(%rip),%xmm3 # 0x4c70f0 <$f64.3ff0000000000000>
0x0000000000484bc6 <+54>: movups %xmm0,%xmm4
0x0000000000484bc9 <+57>: subsd %xmm3,%xmm0
0x0000000000484bcd <+61>: addsd %xmm4,%xmm1
0x0000000000484bd1 <+65>: xorps %xmm2,%xmm2
0x0000000000484bd4 <+68>: ucomisd %xmm2,%xmm0
0x0000000000484bd8 <+72>: ja 0x484bbe <main.main+46>
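
As a side note on the thread observation above: the Go runtime sizes its scheduler by the number of logical CPUs it sees. A minimal sketch for printing those values, using runtime.NumCPU and the current GOMAXPROCS setting (this reports scheduler parameters, not the exact number of OS threads created at startup):

package main

import (
        "fmt"
        "runtime"
)

func main() {
        // Logical CPUs visible to the runtime and the current GOMAXPROCS
        // value (calling GOMAXPROCS with 0 only reads the setting).
        fmt.Println("NumCPU:", runtime.NumCPU())
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}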
