Defective Embedding of Functions in Go

    Is the performance below equivalent in performance?

    // (A). Вызов HasPrefix будет встроен.return strings.HasPrefix(s, "#")
    // (B). Ручное встраивание тела HasPrefix.returnlen(s) >= len("#") && s[:len("#")] == "#"

    The answer is no .

    For details and explanations I ask under cat.

    Good day, before you open the topic, I would like to introduce myself.
    My name is Iskander and I send commits to the golang / go repository from time to time .


    I used to do this on behalf of the Intel Go team , but our paths diverged and now I am an independent contributor. Recently I work in vk in the infrastructure team.

    In my free time I make different tools for Go, such as go-critic and go-consistent . I also draw gophers .

    Measure it!

    Immediately proceed to the comparison and outline the benchmark:

    package benchmark
    import (
    var s = "#string"funcBenchmarkHasPrefixCall(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = strings.HasPrefix(s, "#")
            _ = strings.HasPrefix(s, "x")
    funcBenchmarkHasPrefixInlined(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = len(s) >= len("#") && s[:len("#")] == "#"
            _ = len(s) >= len("x") && s[:len("x")] == "x"

    Instead of recommending benchstat to you , I will show you benchrun .

    With the help of one command, we can run both benchmarks and get a comparison:

    go-benchrun HasPrefixCall HasPrefixInlined -v -count=10 .
      Benchstat results:
    name             old time/op  new time/op  delta
    HasPrefixCall-8  9.15ns ± 1%  0.36ns ± 3%  -96.09%  (p=0.000 n=10+9)

    The version with manual embedding is much faster than the code that was obtained by embedding the function body with the compiler. Let's try to figure out why this is happening.


    Recall the implementation strings.HasPrefix:

    // HasPrefix tests whether the string s begins with prefix.funcHasPrefix(s, prefix string)bool {
        returnlen(s) >= len(prefix) && s[0:len(prefix)] == prefix

    The function is HasPrefixbuilt in by the compiler.
    You can check this as follows:

    go build -gcflags='-m=2' strings 2>&1 | grep 'can inline HasPrefix'

    To call strings.HasPrefixfrom a variant, (A)we get the following machine code:

        MOVQ    (TLS), CX
        CMPQ    SP, 16(CX)
        JLS more_stack
        SUBQ    $40, SP
        MOVQ    BP, 32(SP)
        LEAQ    32(SP), BP
        XCHGL   AX, AX
        MOVQ    s+56(SP), AX
        CMPQ    AX, $1
        JGE compare_strings
        XORL    AX, AX
        MOVB    AL, ~ret1+64(SP)
        MOVQ    32(SP), BP
        ADDQ    $40, SP
        MOVQ    s+48(SP), AX
        MOVQ    AX, (SP)
        LEAQ    go.string."#"(SB), AX
        MOVQ    AX, 8(SP)
        MOVQ    $1, 16(SP)
        CALL    runtime.memequal(SB)
        MOVBLZX 24(SP), AX
        JMP return
        CALL    runtime.morestack_noctxt(SB)
        JMP fn_body

    Do not pay attention to what the code looks like noodles.

    What you should pay attention to:

    • strings.HasPrefix was really inserted, no call.
    • To compare strings called runtime.memequal.

    But what then is generated for the manually built-in variant, the code from the example (B)?

        MOVQ    s+16(SP), AX
        CMPQ    AX, $1
        JLT different_length
        MOVQ    s+8(SP), AX
        CMPB    (AX), $35 // 35 - код символа "#"
        SETEQ   AL
        MOVB    AL, "".~ret1+24(SP)
        XORL    AX, AX
        JMP 22

    And here the compiler does not generate a call runtime.memequaland a single character is compared directly. Ideally, he should have done the same for the first option.

    We observe the weak side of the Go optimizer, and we will analyze it.

    Optimize constant expressions

    The reason a call strings.HasPrefix(s, "#")can be optimized is because the prefix argument is constant. We know its length and content. It makes no sense to call runtime.memequalfor short lines, faster to make a comparison of the characters "in place", avoiding an extra call.

    As you know, compilers usually have at least two parts: the compiler frontend and the compiler backend . The first works with a higher-level view, the second is closer to the machine and the intermediate view will look like a stream of instructions. Already several versions in Go use the SSA representation for optimizations in the backend part of the compiler.

    Collapsing constants, such as {10*2 => 20}, is implemented in the backend. In general, most of the operations associated with lowering the computational cost of expressions are in this part of the compiler. But there are exceptions.

    One exception is the optimization of constant string comparisons. When the compiler sees a string comparison (or substrings) in which one or both operands are constants, a more efficient code is generated than a call runtime.memequal.

    You can view the source code for this in cmd / compile / internal / gc / walk.go: 3362 .

    Embedding of functions occurs before running these optimizations, but also in the frontend part of the compiler.

    It would seem that all the same does not allow this optimization to work in our case?

    How Go builds function calls

    This is how embedding will occur:

    // Вот как выглядел вызов:return strings.HasPrefix(s, "#")
    // Вот сигнатура:funcHasPrefix(s, prefix string)bool
    // А вот результат встраивания:
    _s, _prefix := s, "#"
    returnlen(s) >= len(prefix) && s[:len(prefix)] == prefix

    When embedding functions, the compiler assigns arguments to temporary variables, which breaks the optimization, since the algorithm in walk.go does not see the constants, but the arguments with variables. This is the problem.

    By the way, optimizations from backend, which are available to SSA, do not interfere. But there are other problems there, for example, the inability to restore high-level language constructs to effectively match them (work to eliminate this shortcoming has been “planned” for several years).

    Another example: escape analysis

    Imagine a function that is important to allocate a temporary buffer on the stack:

    funcbusinessLogic()error {
        buf := make([]byte, 0, 16)
        // buf используется только локально// для временного хранения результатов.returnnil

    Since it bufdoes not "run away", the compiler will be able to allocate these 16 bytes on the stack, without allocating it to the heap. Again, all thanks to the constant value when calling make. To allocate memory on the stack, it is important for us to know the required size, which will be part of the frame allocated for the function call.

    Suppose in the future we wanted to allocate temporary buffers of different sizes and encapsulate some logic in the methods. We introduced a new abstraction and decided to use the new type everywhere tmpBuf. The design function is extremely simple:

    funcnewTmpBuf(sizeHint int)tmpBuf {
        return tmpBuf{buf: make([]byte, 0, sizeHint)}

    Adapt the original example:

    func businessLogic() error {
    -   buf := make([]byte, 0, 16)+   buf := newTmpBuf(16)
        // buf используется только локально
        // для временного хранения результатов.
        return nil

    The constructor will be built in, but the allocation will now always be on the heap, for the same reason passing arguments through temporary variables. Escape analysis will see make([]byte, 0, _sizeHint)that it does not fall under its pattern of recognition of optimized calls make.

    If we had “everything like people’s”, there would be no problem, after embedding the constructor, newTmpBufit would be clear that the size is still known at the compilation stage.

    This saddens almost more than the situation with string comparison.

    Horizons Go 1.13

    The situation can be corrected quite easily and I have already sent the first part of the solution .


    If you think that the problem described in the article really needs a solution, please put a thumbs up in the appropriate issue .

    My position is that embedding the code with your hands just because it runs faster in the current version of Go is wrong. It is necessary to correct this flaw in the optimizer, at least to the level that the examples described above work without unexpected performance regressions.

    If everything goes according to plan, this optimization will go into the release of Go 1.13.

    Thanks for attention.

    Addition: proposed solution

    This section is for the bravest, those who are not yet tired of reading.

    So, we have several places that work worse when using variables instead of their values ​​directly. The proposed solution is to introduce a new function in the frontend part of the compiler, which allows you to get the last bound value by name. After that, in each optimization that expects a constant value, do not give up when a variable is detected, but receive this previously saved state.

    The signature of our new feature might look like this:

    funcgetConstValue(n *Node) *Node

    The definition Nodecan be viewed in the file syntax.go .

    Each variable definition is Nodetagged ONAME. Inside Node.Name.Defnfor most of these variables there is an initial value.

    If Nodealready literal, nothing needs to be done and we simply return n. If this is ONAME(variable), then you can try to extract from n.Name.Defnthe same initializing value.

    But what about the modifications between the declaration and the reading of the variable for which we are calling getConstValue? If we confine ourselves to read-only variables, then there is no problem. In the frontend go, there are already special node flags that mark similar names. If the variable is modified, getConstValueit will not return an initialization value.

    Programmers, as a rule, do not modify the input arguments of numeric and string types, and this makes it possible to cover a fairly large number of cases with this primitive algorithm.

    Now we are ready to consider the implementation of:

    funcgetConstValue(n *Node) *Node {
        // Мы раскрываем только ONAME у которых доступен definition.if n.Op != ONAME || n.Name.Defn == nil {
            return n
        // Проверка на то, что инициализирующее значение не изменялось.// Заметим, что это очень консервативно, но нашу задачу по// исправлению проблемы встраивания функций и escape analysis'а решает.
        maybeModified := n.Assigned() || n.Name.Defn.Assigned() || n.Addrtaken()
        if maybeModified {
            return n
        // OAS - Node типа присваивания.// n.Name.Defn.Left - это LHS.// n.Name.Defn.Right - это RHS.// consttype(v) возвращает константный тип инициализирующего выражения.// Если это CTxxx, то переданное выражение не является константным.if n.Name.Defn.Op == OAS {
            v := n.Name.Defn.Right
            if v != nil && consttype(v) != CTxxx {
                return v
        return n    

    Something like this changes the code, which depends on the constants:

    - i := indexconst(r)+ i := indexconst(getConstValue(r))

    Great, and it even works:

    n := 10
    xs := make([]int, n) // Теперь не убегает в кучу!

    Before this change, escape analysis could not get through the nvalue 10, which made it an assumption about the need to be placed xson the heap.

    The code above is syntactically similar to the situation observed when embedding. nmay be a temporary variable that is added when the argument is passed.

    Unfortunately, there are nuances.

    We solved the problem for local variables entered through OAS , but Go initializes the variables for the built-in functions through OAS2 . Because of this, we will need a second change, which expands the function getConstValueand slightly modifies the inliner's code, because, among other things, it OAS2does not have a suitable field Defn.

    That was bad news. Good news: a channel appeared in the Russian-language slak#gocontributing where you can share your ideas and plans, ask questions, and discuss everything related to participation in the development of Go.

    Also popular now: