C vs Go: loops and simple math
When I grew tired of C programming, I, like many, became interested in the Go language: it is strongly typed and compiled, and therefore reasonably fast. And then I wanted to find out how much trouble the Go creators had gone to in optimizing loops and arithmetic.
First, let's look at how things stand with C.
We write this simple code:
#include <stdint.h>
#include <stdio.h>

int main() {
    uint64_t i;
    uint64_t j = 0;
    for (i = 10000000; i > 0; i--)
    {
        j ^= i;
    }
    printf("%lu\n", j);
    return 0;
}
Compile with -O2 and disassemble:
564: 31 d2 xor %edx,%edx
566: b8 80 96 98 00 mov $0x989680,%eax
56b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
570: 48 31 c2 xor %rax,%rdx
573: 48 83 e8 01 sub $0x1,%rax
577: 75 f7 jne 570 <main+0x10>
We get the execution time:
real 0m0.023s
user 0m0.019s
sys 0m0.004s
It would seem there is nowhere left to accelerate, but we have a modern processor with fast SSE registers for exactly this kind of operation. We try the options gcc -mfpmath=sse -msse4.2: the same result.
Add -O3 and, hooray:
57a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
580: 83 c0 01 add $0x1,%eax
583: 66 0f ef c8 pxor %xmm0,%xmm1
587: 66 0f d4 c2 paddq %xmm2,%xmm0
58b: 3d 40 4b 4c 00 cmp $0x4c4b40,%eax
590: 75 ee jne 580 <main+0x20>
You can see that SSE2 instructions and XMM registers are now in use, and we get a threefold performance boost:
real 0m0.006s
user 0m0.006s
sys 0m0.000s
Now the same in Go:
package main

import "fmt"

func main() {
	i := 0
	j := 0
	for i = 10000000; i > 0; i-- {
		j ^= i
	}
	fmt.Println(j)
}
0x000000000048211a <+42>: lea -0x1(%rax),%rdx
0x000000000048211e <+46>: xor %rax,%rcx
0x0000000000482121 <+49>: mov %rdx,%rax
0x0000000000482124 <+52>: test %rax,%rax
0x0000000000482127 <+55>: ja 0x48211a <main.main+42>
Go timings:
regular go:
real 0m0.021s
user 0m0.018s
sys 0m0.004s
gccgo:
real 0m0.058s
user 0m0.036s
sys 0m0.014s
Performance is the same as C at -O2. I also tried gccgo: it produces the same result, but its binary runs longer than the one from the regular Go (1.10.4) compiler. Apparently the regular compiler optimizes thread startup well (in my case, 5 additional threads were created on 4 cores), so the application runs faster.
Conclusion
I did manage to get the standard Go compiler to use SSE instructions for the loop, by slipping it a type that is native to SSE: a float.
// +build amd64

package main

import "fmt"

func main() {
	var i float64 = 0
	var j float64 = 0
	for i = 10000000; i > 0; i-- {
		j += i
	}
	fmt.Println(j)
}
0x0000000000484bbe <+46>: movsd 0x4252a(%rip),%xmm3 # 0x4c70f0 <$f64.3ff0000000000000>
0x0000000000484bc6 <+54>: movups %xmm0,%xmm4
0x0000000000484bc9 <+57>: subsd %xmm3,%xmm0
0x0000000000484bcd <+61>: addsd %xmm4,%xmm1
0x0000000000484bd1 <+65>: xorps %xmm2,%xmm2
0x0000000000484bd4 <+68>: ucomisd %xmm2,%xmm0
0x0000000000484bd8 <+72>: ja 0x484bbe <main.main+46>