Defective Embedding of Functions in Go
Is the performance below equivalent in performance?
// (A). Вызов HasPrefix будет встроен.return strings.HasPrefix(s, "#")
// (B). Ручное встраивание тела HasPrefix.returnlen(s) >= len("#") && s[:len("#")] == "#"
The answer is no .
For details and explanations I ask under cat.
Good day, before you open the topic, I would like to introduce myself.
My name is Iskander and I send commits to the golang / go repository from time to time .

I used to do this on behalf of the Intel Go team , but our paths diverged and now I am an independent contributor. Recently I work in vk in the infrastructure team.
In my free time I make different tools for Go, such as go-critic and go-consistent . I also draw gophers .
Measure it!
Immediately proceed to the comparison and outline the benchmark:
package benchmark
import (
"strings""testing"
)
var s = "#string"funcBenchmarkHasPrefixCall(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = strings.HasPrefix(s, "#")
_ = strings.HasPrefix(s, "x")
}
}
funcBenchmarkHasPrefixInlined(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = len(s) >= len("#") && s[:len("#")] == "#"
_ = len(s) >= len("x") && s[:len("x")] == "x"
}
}
Instead of recommending benchstat to you , I will show you benchrun .
With the help of one command, we can run both benchmarks and get a comparison:
go-benchrun HasPrefixCall HasPrefixInlined -v -count=10 .
Benchstat results:
name old time/op new time/op delta
HasPrefixCall-8 9.15ns ± 1% 0.36ns ± 3% -96.09% (p=0.000 n=10+9)
The version with manual embedding is much faster than the code that was obtained by embedding the function body with the compiler. Let's try to figure out why this is happening.
strings.HasPrefix
Recall the implementation strings.HasPrefix
:
// HasPrefix tests whether the string s begins with prefix.funcHasPrefix(s, prefix string)bool {
returnlen(s) >= len(prefix) && s[0:len(prefix)] == prefix
}
The function is HasPrefix
built in by the compiler.
You can check this as follows:
go build -gcflags='-m=2' strings 2>&1 | grep 'can inline HasPrefix'
To call strings.HasPrefix
from a variant, (A)
we get the following machine code:
MOVQ (TLS), CX
CMPQ SP, 16(CX)
JLS more_stack
fn_body:
SUBQ $40, SP
MOVQ BP, 32(SP)
LEAQ 32(SP), BP
XCHGL AX, AX
MOVQ s+56(SP), AX
CMPQ AX, $1
JGE compare_strings
XORL AX, AX
MOVB AL, ~ret1+64(SP)
MOVQ 32(SP), BP
ADDQ $40, SP
return:
RET
compare_strings:
MOVQ s+48(SP), AX
MOVQ AX, (SP)
LEAQ go.string."#"(SB), AX
MOVQ AX, 8(SP)
MOVQ $1, 16(SP)
CALL runtime.memequal(SB)
MOVBLZX 24(SP), AX
JMP return
more_stack:
CALL runtime.morestack_noctxt(SB)
JMP fn_body
Do not pay attention to what the code looks like noodles.
What you should pay attention to:
strings.HasPrefix
was really inserted, no call.- To compare strings called
runtime.memequal
.
But what then is generated for the manually built-in variant, the code from the example (B)
?
MOVQ s+16(SP), AX
CMPQ AX, $1
JLT different_length
MOVQ s+8(SP), AX
CMPB (AX), $35 // 35 - код символа "#"
SETEQ AL
return:
MOVB AL, "".~ret1+24(SP)
RET
different_length:
XORL AX, AX
JMP 22
And here the compiler does not generate a call runtime.memequal
and a single character is compared directly. Ideally, he should have done the same for the first option.
We observe the weak side of the Go optimizer, and we will analyze it.
Optimize constant expressions
The reason a call strings.HasPrefix(s, "#")
can be optimized is because the prefix argument is constant. We know its length and content. It makes no sense to call runtime.memequal
for short lines, faster to make a comparison of the characters "in place", avoiding an extra call.
As you know, compilers usually have at least two parts: the compiler frontend and the compiler backend . The first works with a higher-level view, the second is closer to the machine and the intermediate view will look like a stream of instructions. Already several versions in Go use the SSA representation for optimizations in the backend part of the compiler.
Collapsing constants, such as {10*2 => 20}
, is implemented in the backend. In general, most of the operations associated with lowering the computational cost of expressions are in this part of the compiler. But there are exceptions.
One exception is the optimization of constant string comparisons. When the compiler sees a string comparison (or substrings) in which one or both operands are constants, a more efficient code is generated than a call runtime.memequal
.
You can view the source code for this in cmd / compile / internal / gc / walk.go: 3362 .
Embedding of functions occurs before running these optimizations, but also in the frontend part of the compiler.
It would seem that all the same does not allow this optimization to work in our case?
How Go builds function calls
This is how embedding will occur:
// Вот как выглядел вызов:return strings.HasPrefix(s, "#")
// Вот сигнатура:funcHasPrefix(s, prefix string)bool
// А вот результат встраивания:
_s, _prefix := s, "#"
returnlen(s) >= len(prefix) && s[:len(prefix)] == prefix
When embedding functions, the compiler assigns arguments to temporary variables, which breaks the optimization, since the algorithm in walk.go does not see the constants, but the arguments with variables. This is the problem.
By the way, optimizations from backend, which are available to SSA, do not interfere. But there are other problems there, for example, the inability to restore high-level language constructs to effectively match them (work to eliminate this shortcoming has been “planned” for several years).
Another example: escape analysis
Imagine a function that is important to allocate a temporary buffer on the stack:
funcbusinessLogic()error {
buf := make([]byte, 0, 16)
// buf используется только локально// для временного хранения результатов.returnnil
}
Since it buf
does not "run away", the compiler will be able to allocate these 16 bytes on the stack, without allocating it to the heap. Again, all thanks to the constant value when calling make
. To allocate memory on the stack, it is important for us to know the required size, which will be part of the frame allocated for the function call.
Suppose in the future we wanted to allocate temporary buffers of different sizes and encapsulate some logic in the methods. We introduced a new abstraction and decided to use the new type everywhere tmpBuf
. The design function is extremely simple:
funcnewTmpBuf(sizeHint int)tmpBuf {
return tmpBuf{buf: make([]byte, 0, sizeHint)}
}
Adapt the original example:
func businessLogic() error {
- buf := make([]byte, 0, 16)+ buf := newTmpBuf(16)
// buf используется только локально
// для временного хранения результатов.
return nil
}
The constructor will be built in, but the allocation will now always be on the heap, for the same reason passing arguments through temporary variables. Escape analysis will see make([]byte, 0, _sizeHint)
that it does not fall under its pattern of recognition of optimized calls make
.
If we had “everything like people’s”, there would be no problem, after embedding the constructor, newTmpBuf
it would be clear that the size is still known at the compilation stage.
This saddens almost more than the situation with string comparison.
Horizons Go 1.13
The situation can be corrected quite easily and I have already sent the first part of the solution .

If you think that the problem described in the article really needs a solution, please put a thumbs up in the appropriate issue .
My position is that embedding the code with your hands just because it runs faster in the current version of Go is wrong. It is necessary to correct this flaw in the optimizer, at least to the level that the examples described above work without unexpected performance regressions.
If everything goes according to plan, this optimization will go into the release of Go 1.13.
Thanks for attention.
Addition: proposed solution
This section is for the bravest, those who are not yet tired of reading.
So, we have several places that work worse when using variables instead of their values directly. The proposed solution is to introduce a new function in the frontend part of the compiler, which allows you to get the last bound value by name. After that, in each optimization that expects a constant value, do not give up when a variable is detected, but receive this previously saved state.
The signature of our new feature might look like this:
funcgetConstValue(n *Node) *Node
The definition Node
can be viewed in the file syntax.go .
Each variable definition is Node
tagged ONAME
. Inside Node.Name.Defn
for most of these variables there is an initial value.
If Node
already literal, nothing needs to be done and we simply return n
. If this is ONAME
(variable), then you can try to extract from n.Name.Defn
the same initializing value.
But what about the modifications between the declaration and the reading of the variable for which we are calling getConstValue
? If we confine ourselves to read-only variables, then there is no problem. In the frontend go, there are already special node flags that mark similar names. If the variable is modified, getConstValue
it will not return an initialization value.
Programmers, as a rule, do not modify the input arguments of numeric and string types, and this makes it possible to cover a fairly large number of cases with this primitive algorithm.
Now we are ready to consider the implementation of:
funcgetConstValue(n *Node) *Node {
// Мы раскрываем только ONAME у которых доступен definition.if n.Op != ONAME || n.Name.Defn == nil {
return n
}
// Проверка на то, что инициализирующее значение не изменялось.// Заметим, что это очень консервативно, но нашу задачу по// исправлению проблемы встраивания функций и escape analysis'а решает.
maybeModified := n.Assigned() || n.Name.Defn.Assigned() || n.Addrtaken()
if maybeModified {
return n
}
// OAS - Node типа присваивания.// n.Name.Defn.Left - это LHS.// n.Name.Defn.Right - это RHS.// consttype(v) возвращает константный тип инициализирующего выражения.// Если это CTxxx, то переданное выражение не является константным.if n.Name.Defn.Op == OAS {
v := n.Name.Defn.Right
if v != nil && consttype(v) != CTxxx {
return v
}
}
return n
}
Something like this changes the code, which depends on the constants:
- i := indexconst(r)+ i := indexconst(getConstValue(r))
Great, and it even works:
n := 10
xs := make([]int, n) // Теперь не убегает в кучу!
Before this change, escape analysis could not get through the n
value 10
, which made it an assumption about the need to be placed xs
on the heap.
The code above is syntactically similar to the situation observed when embedding. n
may be a temporary variable that is added when the argument is passed.
Unfortunately, there are nuances.
We solved the problem for local variables entered through OAS , but Go initializes the variables for the built-in functions through OAS2 . Because of this, we will need a second change, which expands the function getConstValue
and slightly modifies the inliner's code, because, among other things, it OAS2
does not have a suitable field Defn
.
That was bad news. Good news: a channel appeared in the Russian-language slak#gocontributing
where you can share your ideas and plans, ask questions, and discuss everything related to participation in the development of Go.