Do-it-yourself string concatenation speedup in Go
Today we will speed up the concatenation of short strings in Go by about 30%. And for this we will not need to modify Go itself: everything will be implemented as a third-party library.
In this article you will find:
- A comparison of +, strings.Builder and our own concatenation function
- Details of the internal representation of strings in Go
- Just a little bit of assembly
This article can also be seen as a pretext to discuss CL123256: runtime, cmd/compile: specialize concatstring2. Ideas for improving that change list are welcome.
The results up front
The comparison was made against the go tip (master) version of the compiler. You can get similar results on versions going back to roughly Go 1.5. The last significant change to concatstrings was CL3120: cmd/gc: allocate buffers for non-escaped strings on stack.
BenchmarkConcat2Operator-8 20000000 83.8 ns/op
BenchmarkConcat2Builder-8 20000000 70.9 ns/op
BenchmarkConcat2-8 20000000 62.1 ns/op
BenchmarkConcat3Operator-8 20000000 104 ns/op
BenchmarkConcat3Builder-8 20000000 89.9 ns/op
BenchmarkConcat3-8 20000000 82.1 ns/op
ConcatOperator uses +. ConcatBuilder uses strings.Builder with proper preallocation. Concat uses the function we will implement in the course of this article.
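The benchmark bodies themselves are not reproduced in the article. Below is a minimal sketch of what they might look like, assuming the benchmarks live in the same package as the concat2operator, concat2builder and concat2 functions defined later; the input strings and the sink variable are placeholders of mine. Running two such benchmark runs with go test -bench and feeding them to benchstat produces tables like the ones below.

package concat

import "testing"

// benchSink prevents the compiler from optimizing the calls away.
var benchSink string

func BenchmarkConcat2Operator(b *testing.B) {
    x, y := "hello, ", "world"
    for i := 0; i < b.N; i++ {
        benchSink = concat2operator(x, y)
    }
}

func BenchmarkConcat2Builder(b *testing.B) {
    x, y := "hello, ", "world"
    for i := 0; i < b.N; i++ {
        benchSink = concat2builder(x, y)
    }
}

func BenchmarkConcat2(b *testing.B) {
    x, y := "hello, ", "world"
    for i := 0; i < b.N; i++ {
        benchSink = concat2(x, y)
    }
}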
Comparison via benchstat:
name       old time/op  new time/op  delta
Concat2-8  84.2ns ± 1%  62.7ns ± 2%  -25.49%  (p=0.000 n=9+10)
Concat3-8   103ns ± 3%    83ns ± 4%  -19.83%  (p=0.000 n=10+9)
The assembly implementation (GOARCH=AMD64) is a bit faster and has an additional optimization that the built-in + operator also has, but more on that below:
name       old time/op  new time/op  delta
Concat2-8  84.2ns ± 1%  57.1ns ± 3%  -32.20%  (p=0.000 n=9+9)
We will take the assembly function as the 100% performance baseline (relative to the other implementations considered here).
Results for longer strings can be found in the README.md. The longer the strings, the less pronounced the difference between the implementations.
Naive concatenation
The simplest solution is to use the + operator.
The semantics of this operator are as follows: take two strings and return a result string containing the concatenation of both. There is no guarantee that a new string will be returned. For example, if an empty string is concatenated with any other string, the runtime may return the non-empty argument, avoiding the allocation of new memory and the copying of data.
But, as can be seen from the results at the beginning of the article, this is the slowest way.
func concat2operator(x, y string) string {
    return x + y
}
Performance rating: 67.8%.
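To see the empty-string optimization mentioned above in action, one can compare the data pointers of the argument and of the result. This is an illustrative check of mine, not part of the article's code; it relies on the string header layout described later and, being an implementation detail, is not guaranteed by the language spec. On the gc toolchain it prints true.

package main

import (
    "fmt"
    "strings"
    "unsafe"
)

// dataPtr reads the first word of the string header, i.e. the pointer
// to the string's backing data (see stringStruct later in the article).
func dataPtr(s string) uintptr {
    return *(*uintptr)(unsafe.Pointer(&s))
}

func main() {
    y := strings.Repeat("x", 64) // heap-allocated string
    empty := y[:0]               // empty string that is not a compile-time constant
    z := empty + y               // concatenation with an empty string
    // If the runtime returned y itself, both headers point at the same data.
    fmt.Println(dataPtr(y) == dataPtr(z))
}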
strings.Builder
Not so long ago, Go gained a new type, strings.Builder. It is an analogue of bytes.Buffer, but calling its String() method does not reallocate memory and copy the data.
Unlike bytes.Buffer, the builder has no small-buffer optimization and, therefore, no preallocated memory for the string. If you do not use the Grow method, the performance will be worse than with bytes.Buffer. Several regressions in Go 1.11 were caused by exactly this behavior (see CL113235).
In our code, for the purity of the experiment, we will avoid this mistake.
func concat2builder(x, y string) string {
    var builder strings.Builder
    builder.Grow(len(x) + len(y)) // Only this line allocates memory
    builder.WriteString(x)
    builder.WriteString(y)
    return builder.String()
}
Performance rating: 80.5% (+12.7).
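The price of forgetting Grow is easy to see by counting allocations. A small illustrative snippet of mine (the exact numbers depend on the Go version and on append's growth policy):

package main

import (
    "fmt"
    "strings"
    "testing"
)

func main() {
    x, y := strings.Repeat("a", 20), strings.Repeat("b", 20)

    withGrow := testing.AllocsPerRun(1000, func() {
        var b strings.Builder
        b.Grow(len(x) + len(y)) // a single allocation, sized exactly for the result
        b.WriteString(x)
        b.WriteString(y)
        _ = b.String()
    })

    withoutGrow := testing.AllocsPerRun(1000, func() {
        var b strings.Builder
        b.WriteString(x) // the internal buffer grows on demand
        b.WriteString(y)
        _ = b.String()
    })

    fmt.Println(withGrow, withoutGrow) // typically 1 vs 2 allocations here
}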
Code Generation for Concatenation
If you look at the code the compiler generates for the + operator, you will see calls to the functions concatstring2, concatstring3 and so on (up to and including concatstring5).
func concat2codegen(x, y string) string { return x + y }
// => CALL runtime.concatstring2(SB)

func concat3codegen(x, y, z string) string { return x + y + z }
// => CALL runtime.concatstring3(SB)
Now let's look at runtime/string.go itself:
func concatstring2(buf *tmpBuf, a [2]string) string {
    return concatstrings(buf, a[:])
}

func concatstring3(buf *tmpBuf, a [3]string) string {
    return concatstrings(buf, a[:])
}
So it remains to study the concatstrings function.
The full listing is provided below; here is a high-level description:
- The buf parameter can be nil. This buffer is allocated by the compiler if the string does not escape the scope of its definition. If the string lives longer than its frame, this buffer will always be nil (which is the common case). However, when the buffer is available, an allocation can be avoided if the result fits into it (its size is 32 bytes).
- If all strings except one are empty, the function returns that string. However, strings allocated on the stack whose result leaves the frame skip this optimization, so that the caller does not end up with already freed memory.
- Otherwise, all strings are copied into newly allocated memory.
// concatstrings implements a Go string concatenation x+y+z+...
// The operands are passed in the slice a.
// If buf != nil, the compiler has determined that the result does not
// escape the calling function, so the string data can be stored in buf
// if small enough.
func concatstrings(buf *tmpBuf, a []string) string {
    idx := 0
    l := 0
    count := 0
    for i, x := range a {
        n := len(x)
        if n == 0 {
            continue
        }
        if l+n < l {
            throw("string concatenation too long")
        }
        l += n
        count++
        idx = i
    }
    if count == 0 {
        return ""
    }

    // If there is just one string and either it is not on the stack
    // or our result does not escape the calling frame (buf != nil),
    // then we can return that string directly.
    if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
        return a[idx]
    }
    s, b := rawstringtmp(buf, l)
    for _, x := range a {
        copy(b, x)
        b = b[len(x):]
    }
    return s
}
Here we can see several places at once that can be optimized for a particular case:
- buf is most often nil. When the compiler could not prove that it is safe to place the string on the stack, passing the extra parameter and checking it for nil inside the function is pure overhead.
- For the particular case of len(a) == 2 we do not need the loop, and the calculations can be simplified. And this is the most common kind of concatenation (a sketch of such a specialization follows this list).
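This is not the actual CL123256 code, only a rough sketch of what a two-operand specialization could look like if it were added next to the listing above in runtime/string.go; it reuses the runtime-internal helpers already shown (tmpBuf, stringDataOnStack, rawstringtmp, throw) and therefore cannot be compiled outside the runtime.

// A hypothetical two-operand specialization: no loop, and the
// empty-operand cases are handled up front.
func concatstring2fast(buf *tmpBuf, a [2]string) string {
    x, y := a[0], a[1]
    switch {
    case len(x) == 0 && len(y) == 0:
        return ""
    case len(x) == 0 && (buf != nil || !stringDataOnStack(y)):
        return y
    case len(y) == 0 && (buf != nil || !stringDataOnStack(x)):
        return x
    }
    l := len(x) + len(y)
    if l < len(x) {
        throw("string concatenation too long")
    }
    s, b := rawstringtmp(buf, l)
    copy(b, x)
    copy(b[len(x):], y)
    return s
}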
When running ./make.bash (building the Go compiler and the standard library) we see 445 concatenations with two operands:
- 398 results escape. In this case our specialization makes sense.
- 47 results do not leave their frame.
In total, 89% of two-argument concatenations fall under the optimization.
For the go utility we have:
- 501 calls to concatstring2
- 194 calls to concatstring3
- 55 calls to concatstring4
Version for all architectures
To implement the specialization, we need to know how strings are represented in Go. Binary compatibility is what matters to us, and unsafe.Pointer can be replaced with *byte without any sacrifices.
type stringStruct struct {
    str *byte
    len int
}
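A quick sanity check of mine (not from the article) that this layout really matches a live string header:

package main

import (
    "fmt"
    "unsafe"
)

type stringStruct struct {
    str *byte
    len int
}

func main() {
    s := "hello"
    hdr := (*stringStruct)(unsafe.Pointer(&s))
    // The size and both fields should line up with the real string header.
    fmt.Println(unsafe.Sizeof(s) == unsafe.Sizeof(stringStruct{}), hdr.len == len(s), *hdr.str == 'h')
}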
The second important conclusion we can draw from the runtime: strings begin their lives mutable. A chunk of memory referenced by a []byte is allocated and the contents of the new string are written into it; only then is the []byte thrown away, and the memory it refers to is stored in a stringStruct.
Those who want more details are encouraged to explore the rawstringtmp and rawstring functions.
// rawstring allocates storage for a new string. The returned
// string and byte slice both refer to the same storage.
// The storage is not zeroed. Callers should use
// b to set the string contents and then drop b.
func rawstring(size int) (s string, b []byte) {
    p := mallocgc(uintptr(size), nil, false)

    stringStructOf(&s).str = p
    stringStructOf(&s).len = size

    *(*slice)(unsafe.Pointer(&b)) = slice{p, size, size}

    return
}
We can do roughly the same thing using the dark side of the unsafe package:
func concat2(x, y string) string {
    length := len(x) + len(y)
    if length == 0 {
        return ""
    }
    b := make([]byte, length)
    copy(b, x)
    copy(b[len(x):], y)
    return goString(&b[0], length)
}
We allocate a []byte, which we use to build the contents of the new string. Then all that remains is to finalize the string by casting it to the representation expected by the runtime. The goString function is responsible for this:
func goString(ptr *byte, length int) string {
    s := stringStruct{str: ptr, len: length}
    return *(*string)(unsafe.Pointer(&s))
}
Performance rating: 91.9% (+10.9).
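Before moving on to assembly, it is worth making sure the unsafe version behaves exactly like the built-in operator. A minimal test sketch of mine, assuming it lives in the same package as concat2:

package concat

import "testing"

func TestConcat2(t *testing.T) {
    tests := [][2]string{
        {"", ""},
        {"", "y"},
        {"x", ""},
        {"hello, ", "world"},
    }
    for _, tt := range tests {
        want := tt[0] + tt[1]
        if got := concat2(tt[0], tt[1]); got != want {
            t.Errorf("concat2(%q, %q) = %q, want %q", tt[0], tt[1], got, want)
        }
    }
}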
AMD64 version
Unfortunately, the previous version of the function has no optimization for concatenation with an empty string, and we also perform a number of unnecessary operations: since we cannot allocate memory directly, we have to work through a byte slice.
One of the interesting features of Go assembly is that it lets you call, for example, non-exported runtime functions. We can call runtime·mallocgc from assembly code even though our code is not part of the runtime package. We will make use of this.
We can also check whether a string's data belongs to stack memory, which makes it safe to optimize returning one of the arguments as the result.
Suppose the function is called as concat2("", "123"). x is an empty string, and if y is not allocated on the stack, we can return it as the result of the concatenation.
//; Assume x and y are of type stringStruct.
//; CX - y.str.
//; SI - y.len.
maybe_return_y:
    //; Check whether the pointer lies within the stack.
    MOVQ (TLS), AX      //; *g
    CMPQ CX, (AX)
    JL return_y         //; if y_str < g.stack.lo
    CMPQ CX, 8(AX)
    JGE return_y        //; if y_str >= g.stack.hi
    JMP concatenate     //; y is on the stack, a new allocation is needed
return_y:
    MOVQ CX, ret+32(FP) //; stringStruct.str
    MOVQ SI, ret+40(FP) //; stringStruct.len
    RET
MOVQ (TLS), AX moves *g into the AX register. A read at zero offset gives the g.stack.lo field, while g.stack.hi starts at the 8th byte (on a 64-bit platform).
type g struct {
    stack struct {
        lo uintptr // 0(AX)
        hi uintptr // 8(AX)
    }
    stackguard0 uintptr // 16(AX)
    stackguard1 uintptr // 24(AX)
    // ... other fields
}
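In Go terms this is essentially what the runtime's stringDataOnStack helper, seen in the concatstrings listing, does. Quoted here roughly from runtime/string.go (getg and stringStructOf are runtime-internal, so treat this as a reference sketch rather than user-level code):

// stringDataOnStack reports whether the string's data is
// stored on the current goroutine's stack.
func stringDataOnStack(s string) bool {
    ptr := uintptr(unsafe.Pointer(stringStructOf(&s).str))
    stk := getg().stack
    return stk.lo <= ptr && ptr < stk.hi
}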
The concatenate body allocates memory, fills it with both strings, and returns the new string.
#include "textflag.h"#include "funcdata.h"
TEXT ·Strings(SB), 0, $48-48
NO_LOCAL_POINTERS // Костыль для избежания ошибки.
MOVQ x+0(FP), DX
MOVQ x+8(FP), DI
MOVQ y+16(FP), CX
MOVQ y+24(FP), SI
TESTQ DI, DI
JZ maybe_return_y // x - пустая строка, попробуем вернуть y
TESTQ SI, SI
JZ maybe_return_x // y - пустая строка, попробуем вернуть x
concatenate:
LEAQ (DI)(SI*1), R8 // len(x) + len(y)// Выделяем память для новой строки.
MOVQ R8, 0(SP)
MOVQ $0, 8(SP)
MOVB $0, 16(SP)
CALL runtime·mallocgc(SB)
MOVQ 24(SP), AX // Указатель на выделенную память
MOVQ AX, newstr-8(SP)
// Копируем x.
MOVQ x+0(FP), DX
MOVQ x+8(FP), DI
MOVQ AX, 0(SP)
MOVQ DX, 8(SP)
MOVQ DI, 16(SP)
CALL runtime·memmove(SB)
// Копируем y со смещения len(x).
MOVQ x+8(FP), DI
MOVQ y+16(FP), CX
MOVQ y+24(FP), SI
MOVQ newstr-8(SP), AX
LEAQ (AX)(DI*1), BX
MOVQ BX, 0(SP)
MOVQ CX, 8(SP)
MOVQ SI, 16(SP)
CALL runtime·memmove(SB)
// Возврат новой строки.
MOVQ newstr-8(SP), AX
MOVQ x+8(FP), R8
ADDQ y+24(FP), R8
MOVQ AX, ret+32(FP)
MOVQ R8, ret+40(FP)
RET
maybe_return_y:
// Проверка на вхождения указателя в стек.
MOVQ (TLS), AX // *gCMPQ CX, (AX)
JL return_y // если y_ptr < stk.loCMPQ CX, 8(AX)
JGE return_y // если y_ptr >= stk.hi
JMP concatenate // y на стеке, нужна новая аллокация
return_y:
MOVQ CX, ret+32(FP)
MOVQ SI, ret+40(FP)
RET
maybe_return_x:
// Проверка на вхождения указателя в стек.
MOVQ (TLS), AX // *gCMPQ DX, (AX)
JL return_x // если x_ptr < stk.loCMPQ DX, 8(AX)
JGE return_x // если x_ptr >= stk.hi
JMP concatenate // x на стеке, нужна новая аллокация
return_x:
MOVQ DX, ret+32(FP)
MOVQ DI, ret+40(FP)
RET
If you are curious about the nature of NO_LOCAL_POINTERS in this code, you can read Calling a Go function from asm ("fatal error: missing stackmap").
Performance rating: 100% (+8.6).
In conclusion
All the code is available as the concat package.
Is the world ready for such fast concatenation? Who knows.
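Using the library boils down to a single call. A usage sketch with a hypothetical import path (the real one is in the linked repository); the exported function is Strings, matching the TEXT ·Strings(SB) symbol above:

package main

import (
    "fmt"

    concat "github.com/example/concat" // hypothetical import path; see the linked repository
)

func main() {
    fmt.Println(concat.Strings("Hello, ", "World!")) // "Hello, World!"
}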
CL123256 was mentioned at the beginning of the article. It has several possible paths of development:
- Specializing the variant where the compiler does not allocate a temporary buffer. A smaller gain in each case, but it covers more kinds of concatenation and adds practically no code (neither machine code nor Go code).
- More specializations for particular cases. A bigger gain, but more machine code, which may hurt the instruction cache.
- Tons of machine code for every special case, plus a specialized memmove, the way it is done in glibc. Here the main question is whether it is worth it.
The currently proposed version speeds up only the most frequent and simplest case: concatenation of a pair of strings (arity 2).
If Go does not accept this change, a comparable speedup can be achieved by implementing string operations as a third-party library. Less convenient, beautiful and elegant, but it works.