Speeding up string concatenation in Go by hand


    Today we will speed up the concatenation of short strings in Go by 30%. And to do this we will not need to modify Go itself: everything will be implemented as a third-party library.


    Under the cut you will find:


    • A comparison of +, strings.Builder, and our own implementation
    • Details of how strings are arranged internally in Go
    • A little bit of assembler

    This article can also be considered a pretext to discuss CL123256: runtime, cmd/compile: specialize concatstring2. Ideas for improving that change list are welcome.


    The results, up front


    The comparison was made against the go tip (master) version of the compiler. You can get similar results on versions starting roughly from Go 1.5. The last significant change to concatstrings was CL3120: cmd/gc: allocate buffers for non-escaped strings on stack.


    BenchmarkConcat2Operator-8      20000000            83.8 ns/op
    BenchmarkConcat2Builder-8       20000000            70.9 ns/op
    BenchmarkConcat2-8              20000000            62.1 ns/op
    BenchmarkConcat3Operator-8      20000000            104  ns/op
    BenchmarkConcat3Builder-8       20000000            89.9 ns/op
    BenchmarkConcat3-8              20000000            82.1 ns/op

    ConcatOperator uses +.
    ConcatBuilder uses strings.Builder with proper pre-allocation.
    Concat uses the function we will implement in the course of this article (a sketch of these benchmarks is shown below).
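
    The benchmark code itself is not reproduced in this article; here is a minimal sketch of what it might look like. The package name, the sink variable, and the input strings are assumptions for illustration, not the package's actual benchmark code; concat2 is the function we develop later.

    package concat // hypothetical: benchmarks live next to concat2, in a _test.go file

    import (
        "strings"
        "testing"
    )

    var (
        benchSink string // written to so the compiler cannot discard the result
        x         = "foo"
        y         = "bar"
    )

    func BenchmarkConcat2Operator(b *testing.B) {
        for i := 0; i < b.N; i++ {
            benchSink = x + y
        }
    }

    func BenchmarkConcat2Builder(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var builder strings.Builder
            builder.Grow(len(x) + len(y))
            builder.WriteString(x)
            builder.WriteString(y)
            benchSink = builder.String()
        }
    }

    func BenchmarkConcat2(b *testing.B) {
        for i := 0; i < b.N; i++ {
            benchSink = concat2(x, y) // the function implemented later in the article
        }
    }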


    Comparison via benchstat:


    name       old time/op  new time/op  delta
    Concat2-8  84.2ns ± 1%  62.7ns ± 2%  -25.49%  (p=0.000 n=9+10)
    Concat3-8   103ns ± 3%    83ns ± 4%  -19.83%  (p=0.000 n=10+9)

    The assembler implementation for GOARCH=AMD64 is a bit faster still and has an additional optimization that the built-in + operator also performs, but more on that below:


    name       old time/op  new time/op  delta
    Concat2-8  84.2ns ± 1%  57.1ns ± 3%  -32.20%  (p=0.000 n=9+9)

    The assembler function will be taken as the 100% performance baseline (relative to the other implementations under consideration); the ratings quoted below are the assembler time divided by the given implementation's time, for example 57.1 ns / 84.2 ns ≈ 67.8% for the + operator.


    Results for longer strings can be seen in the README.md. The longer the strings, the less pronounced the difference between implementations becomes.

    Naive concatenation


    The simplest solution is to use the + operator.


    The semantics of this operator are as follows: take two strings and return a result string containing the concatenation of both. There is no guarantee that a new string will be returned. For example, if an empty string is concatenated with any other string, the runtime may return the non-empty argument, avoiding the allocation of new memory and the copying of data into it.


    But, as can be seen from the results at the beginning of the article, this is the slowest way.


    func concat2operator(x, y string) string {
        return x + y
    }

    Performance rating: 67.8%.

    strings.Builder


    Not so long ago, Go gained a new type: strings.Builder. It is an analogue of bytes.Buffer, but calling its String() method does not re-allocate memory or copy the data.


    Unlike bytes.Buffer, the builder has no small-buffer optimization and therefore no pre-allocated memory for the string. If you do not use the Grow method, performance will be worse than with bytes.Buffer. Several regressions in Go 1.11 were caused by exactly this behavior (see CL113235).


    In our code, for the purity of the experiment, we will avoid this mistake.


    func concat2builder(x, y string) string {
        var builder strings.Builder
        builder.Grow(len(x) + len(y)) // Only this line allocates memory
        builder.WriteString(x)
        builder.WriteString(y)
        return builder.String()
    }

    Performance rating: 80.5% (+12.7).

    Code Generation for Concatenation


    If we look at what code the compiler generates for the + operator, we will see calls to concatstring2, concatstring3, and so on (up to and including concatstring5).


    func concat2codegen(x, y string) string { return x + y }
    // => CALL  runtime.concatstring2(SB)

    func concat3codegen(x, y, z string) string { return x + y + z }
    // => CALL  runtime.concatstring3(SB)

    Let's look into runtime/string.go itself:


    func concatstring2(buf *tmpBuf, a [2]string) string {
        return concatstrings(buf, a[:])
    }

    func concatstring3(buf *tmpBuf, a [3]string) string {
        return concatstrings(buf, a[:])
    }

    So it remains to study the concatstrings function.
    The full listing is available under the spoiler below; here is a high-level description:


    1. The buf parameter may be nil. The compiler allocates this buffer when the string does not escape the scope in which it is defined. If the string lives longer than its frame, the buffer will always be nil (which happens often). However, when this buffer is available, allocation can be avoided, provided the result fits into it (its size is 32 bytes).
    2. If all strings except one are empty, the function returns that one string. However, strings that are allocated on the stack and escape their frame bypass this optimization, so that the caller never receives already-freed memory.
    3. Otherwise, all strings are copied into newly allocated memory.

    Full listing of the concatstrings function
    // concatstrings implements a Go string concatenation x+y+z+...
    // The operands are passed in the slice a.
    // If buf != nil, the compiler has determined that the result does not
    // escape the calling function, so the string data can be stored in buf
    // if small enough.
    func concatstrings(buf *tmpBuf, a []string) string {
        idx := 0
        l := 0
        count := 0
        for i, x := range a {
            n := len(x)
            if n == 0 {
                continue
            }
            if l+n < l {
                throw("string concatenation too long")
            }
            l += n
            count++
            idx = i
        }
        if count == 0 {
            return ""
        }

        // If there is just one string and either it is not on the stack
        // or our result does not escape the calling frame (buf != nil),
        // then we can return that string directly.
        if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
            return a[idx]
        }
        s, b := rawstringtmp(buf, l)
        for _, x := range a {
            copy(b, x)
            b = b[len(x):]
        }
        return s
    }

    Here we can immediately see several places that could be optimized for a particular case:


    • buf is nil most of the time. When the compiler could not prove that it is safe to place the string on the stack, passing the extra parameter and checking it against nil inside the function is pure overhead.
    • For the particular case len(a) == 2 we do not need a loop, and the computations can be simplified. And this is the most common kind of concatenation (a sketch of such a specialization follows this list).
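
    To make the idea concrete, here is a rough pure-Go sketch of the shape a specialized two-operand path inside the runtime could take. This is only an illustration, not the code from CL123256; tmpBuf, stringDataOnStack, rawstringtmp, and throw are the runtime internals already shown in the listing above.

    // Sketch only: a hypothetical two-operand specialization of concatstrings.
    func concatstring2specialized(buf *tmpBuf, x, y string) string {
        l := len(x) + len(y)
        if l < len(x) {
            throw("string concatenation too long")
        }
        if l == 0 {
            return ""
        }
        // Single-operand shortcut, as in concatstrings, but without the loop.
        if len(x) == 0 && (buf != nil || !stringDataOnStack(y)) {
            return y
        }
        if len(y) == 0 && (buf != nil || !stringDataOnStack(x)) {
            return x
        }
        s, b := rawstringtmp(buf, l)
        copy(b, x)
        copy(b[len(x):], y)
        return s
    }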

    Statistics on the use of concatenation

    While running ./make.bash (which builds the Go compiler and the standard library) we see 445 concatenations with two operands:


    • 398 of the results escape; in this case our specialization makes sense.
    • 47 of the results do not leave their frame.

    In total, 89% of two-operand concatenations fall under the optimization.


    For the go tool we have:


    • 501 calls to concatstring2
    • 194 calls to concatstring3
    • 55 calls to concatstring4

    Version for all architectures


    To implement the specialization, we need to know how strings are represented in Go. Binary compatibility is important to us, and unsafe.Pointer can be replaced with *byte at no cost.


    type stringStruct struct {
        str *byte
        len int
    }
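
    With this layout in hand, an existing string header can be viewed through stringStruct without copying; a minimal sketch follows. The helper name is ours; the runtime has its own stringStructOf for the same purpose, which appears in the rawstring listing below.

    // Sketch: reinterpret a *string as *stringStruct without copying the data.
    func asStringStruct(sp *string) *stringStruct {
        return (*stringStruct)(unsafe.Pointer(sp))
    }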

    The second important conclusion we can draw from the runtime: strings begin their life mutable. A chunk of memory referenced through a []byte is allocated, the contents of the new string are written into it, and only then is the []byte thrown away, while the memory it referred to is stored into a stringStruct.


    For those who want more details, it is suggested to explore the rawstringtmp and rawstring functions.


    runtime.rawstring
    // rawstring allocates storage for a new string. The returned
    // string and byte slice both refer to the same storage.
    // The storage is not zeroed. Callers should use
    // b to set the string contents and then drop b.
    func rawstring(size int) (s string, b []byte) {
        p := mallocgc(uintptr(size), nil, false)
        stringStructOf(&s).str = p
        stringStructOf(&s).len = size
        *(*slice)(unsafe.Pointer(&b)) = slice{p, size, size}
        return
    }

    We can do roughly the same thing by using the dark side of the unsafe package:


    func concat2(x, y string) string {
        length := len(x) + len(y)
        if length == 0 {
            return ""
        }
        b := make([]byte, length)
        copy(b, x)
        copy(b[len(x):], y)
        return goString(&b[0], length)
    }

    We allocate a []byte that we use to build the contents of the new string. Then we only need to finalize the string by casting it into the representation the runtime expects. The goString function is responsible for that:


    func goString(ptr *byte, length int) string {
        s := stringStruct{str: ptr, len: length}
        return *(*string)(unsafe.Pointer(&s))
    }
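
    As a quick sanity check of this implementation, a small table-driven test might look like this (the test name and cases are ours, not part of the package):

    func TestConcat2(t *testing.T) {
        cases := [][3]string{
            {"", "", ""},
            {"", "abc", "abc"},
            {"abc", "", "abc"},
            {"foo", "bar", "foobar"},
        }
        for _, c := range cases {
            if got := concat2(c[0], c[1]); got != c[2] {
                t.Errorf("concat2(%q, %q) = %q, want %q", c[0], c[1], got, c[2])
            }
        }
    }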

    Performance rating: 91.9% (+10.9).

    AMD64 version


    Unfortunately, the previous version of the function lacks the optimization for concatenation with an empty string, and we also perform a number of unnecessary computations: unable to allocate memory directly, we have to work through a byte slice.


    One of the interesting features of the Go assembler is that it lets you call, for example, non-exported runtime functions. We can call runtime·mallocgc from assembly code even though our code is not part of the runtime package. We will use this property.


    We can also check whether a string's data belongs to stack memory, which makes it safe to optimize returning one of the arguments as the result.


    Suppose the function is called as concat2("", "123"). x is an empty string, and if y is not allocated on the stack, we can return it as the result of the concatenation.


    //; Assume x and y have the stringStruct layout.
    //; CX - y.str.
    //; SI - y.len.
    maybe_return_y:
            //; Check whether the pointer points into the stack.
            MOVQ (TLS), AX //; *g
            CMPQ CX, (AX)
            JL return_y //; if y_str < g.stack.lo
            CMPQ CX, 8(AX)
            JGE return_y //; if y_str >= g.stack.hi
            JMP concatenate //; y is on the stack, a new allocation is needed
    return_y:
            MOVQ CX, ret+32(FP) //; stringStruct.str
            MOVQ SI, ret+40(FP) //; stringStruct.len
            RET

    MOVQ (TLS), AX moves *g into the AX register. Reading at offset zero gives the g.stack.lo field, and g.stack.hi starts at byte 8 (on a 64-bit platform).


    type g struct {
        stack struct {
            lo uintptr // 0(AX)
            hi uintptr // 8(AX)
        }
        stackguard0 uintptr // 16(AX)
        stackguard1 uintptr // 24(AX)
        // ... other fields
    }

    The concatenate body allocates memory, copies both strings into it, and returns the new string.


    Full listing with comments
    #include "textflag.h"
    #include "funcdata.h"

    TEXT ·Strings(SB), 0, $48-48
            NO_LOCAL_POINTERS // A workaround to avoid an error.
            MOVQ x+0(FP), DX
            MOVQ x+8(FP), DI
            MOVQ y+16(FP), CX
            MOVQ y+24(FP), SI
            TESTQ DI, DI
            JZ maybe_return_y // x is an empty string, try to return y
            TESTQ SI, SI
            JZ maybe_return_x // y is an empty string, try to return x
    concatenate:
            LEAQ (DI)(SI*1), R8 // len(x) + len(y)
            // Allocate memory for the new string.
            MOVQ R8, 0(SP)
            MOVQ $0, 8(SP)
            MOVB $0, 16(SP)
            CALL runtime·mallocgc(SB)
            MOVQ 24(SP), AX // Pointer to the allocated memory
            MOVQ AX, newstr-8(SP)
            // Copy x.
            MOVQ x+0(FP), DX
            MOVQ x+8(FP), DI
            MOVQ AX, 0(SP)
            MOVQ DX, 8(SP)
            MOVQ DI, 16(SP)
            CALL runtime·memmove(SB)
            // Copy y at offset len(x).
            MOVQ x+8(FP), DI
            MOVQ y+16(FP), CX
            MOVQ y+24(FP), SI
            MOVQ newstr-8(SP), AX
            LEAQ (AX)(DI*1), BX
            MOVQ BX, 0(SP)
            MOVQ CX, 8(SP)
            MOVQ SI, 16(SP)
            CALL runtime·memmove(SB)
            // Return the new string.
            MOVQ newstr-8(SP), AX
            MOVQ x+8(FP), R8
            ADDQ y+24(FP), R8
            MOVQ AX, ret+32(FP)
            MOVQ R8, ret+40(FP)
            RET
    maybe_return_y:
            // Check whether the pointer points into the stack.
            MOVQ (TLS), AX // *g
            CMPQ CX, (AX)
            JL return_y // if y_ptr < stk.lo
            CMPQ CX, 8(AX)
            JGE return_y // if y_ptr >= stk.hi
            JMP concatenate // y is on the stack, a new allocation is needed
    return_y:
            MOVQ CX, ret+32(FP)
            MOVQ SI, ret+40(FP)
            RET
    maybe_return_x:
            // Check whether the pointer points into the stack.
            MOVQ (TLS), AX // *g
            CMPQ DX, (AX)
            JL return_x // if x_ptr < stk.lo
            CMPQ DX, 8(AX)
            JGE return_x // if x_ptr >= stk.hi
            JMP concatenate // x is on the stack, a new allocation is needed
    return_x:
            MOVQ DX, ret+32(FP)
            MOVQ DI, ret+40(FP)
            RET

    If you are curious about the nature of NO_LOCAL_POINTERS in this code, you can read Calling a Go function from asm ("fatal error: missing stackmap").


    Performance rating: 100% (+8.6).

    In conclusion


    All of the code is available as the concat package.
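
    Assuming the package is imported under the name concat and exposes the Strings function from the assembly listing above (the import path is omitted here; see the repository), usage looks as simple as this sketch:

    // Hypothetical usage; the import path of the concat package is omitted.
    s := concat.Strings("Hello, ", "world!")
    fmt.Println(s) // Hello, world!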


    Is the world ready for such a quick concatenation? Who knows.


    CL123256 was mentioned at the beginning of the article. It has several possible paths of development:


    1. A specialization of the variable-operand path for the case when the compiler does not allocate a temporary buffer. A smaller gain in each case, but it covers more kinds of concatenation and practically does not increase code size (either machine code or Go code).
    2. More specializations for special cases. A bigger gain, but more machine code, which may hurt the instruction cache.
    3. Tons of machine code for every special case, with specialized memmove routines, in the manner of glibc. Here the main question is whether it is worth it.

    The currently proposed version accelerates only the most frequent and simplest case: concatenation of a pair of strings (arity = 2).


    If Go does not accept this change, a comparable speedup can be achieved by implementing string operations as a third-party library. Less convenient, less beautiful and less elegant, but it works.

