Simple Methods for Optimizing Go Programs

Original author: Stephen Whitworth (translation)
I have always cared about performance. I don't know exactly why; slow services and programs simply annoy me. It seems I'm not alone.

In A/B tests, we tried delaying page rendering in 100-millisecond increments and found that even very small delays resulted in a substantial drop in revenue. - Greg Linden, Amazon.com

In my experience, poor performance manifests in one of two ways:

  • Operations that perform well at small scale become unviable as the number of users grows. These are usually O(N) or O(N²) operations. When the user base is small, everything works fine, and the product is rushed to market. As the user base grows, more and more unexpected pathological cases appear, and the service grinds to a halt.
  • Many individual sources of suboptimal work: "death by a thousand cuts."

For most of my career I have either practiced data science in Python or built services in Go, and in the latter I have far more optimization experience. Go is usually not the bottleneck in the services I write: database-backed services are often I/O-bound. However, in the machine-learning batch pipelines I have built, the program is often CPU-bound. When Go is using too much CPU, there are various strategies.

This article explains some methods you can use to significantly improve performance without much effort. I deliberately ignore techniques that require significant effort or large changes to the program's structure.

Before you start


Before making any changes to the program, take the time to establish a suitable baseline for comparison. If you don't, you will wander in the dark, wondering whether your changes helped at all. First, write benchmarks and capture profiles for use with pprof. It's best to write the benchmarks in Go too: that makes it easier to use pprof and memory profiling. Also use benchcmp, a handy tool for comparing performance between benchmark runs.

If the code does not lend itself well to benchmarks, just start with something you can measure. You can profile code manually with runtime/pprof.
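
As a minimal sketch (the benchmarked function is just a stand-in for your own code), a Go benchmark looks like this:

// foo_test.go
package foo
import (
    "strconv"
    "testing"
)
// BenchmarkItoa measures strconv.Itoa; the testing framework chooses b.N
// so that the benchmark runs long enough to give stable numbers.
func BenchmarkItoa(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = strconv.Itoa(i)
    }
}

Run it with go test -bench=. -benchmem, and add -cpuprofile cpu.pprof to capture a profile for pprof.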

So, let's begin!

Use sync.Pool to reuse previously allocated objects


sync.Pool implements a free list. It lets you reuse previously allocated structures, amortizing an object's allocation over many uses and reducing the garbage collector's work. The API is very simple: implement a function that allocates a new instance of the object. It should return a pointer type.

var bufpool = sync.Pool{
    New: func() interface{} {
        buf := make([]byte, 512)
        return &buf
    },
}

After that, you can Get() objects from the pool and Put() them back when you are done.

// sync.Pool returns an interface{}: you must cast it to the underlying type
// before you use it.
b := *bufpool.Get().(*[]byte)
defer bufpool.Put(&b)
// Now, go do interesting things with your byte buffer.
buf := bytes.NewBuffer(b)

There are nuances. Prior to Go 1.13, the pool was cleared on every garbage collection, which can hurt the performance of programs that allocate heavily. As of 1.13, more objects survive across GCs.

!!! Before returning an object to the pool, be sure to reset its fields.

If you don't, you can get a dirty object from the pool that contains data from a previous use. This is a serious security risk!

type AuthenticationResponse struct {
    Token string
    UserID string
}
rsp := authPool.Get().(*AuthenticationResponse)
defer authPool.Put(rsp)
// If we don't hit this if statement, we might return data from other users! 
if blah {
    rsp.UserID = "user-1"
    rsp.Token = "super-secret"
}
return rsp

A safe way to guarantee zeroed memory every time is to reset explicitly:

// reset resets all fields of the AuthenticationResponse before pooling it.
func (a *AuthenticationResponse) reset() {
    a.Token = ""
    a.UserID = ""
}
rsp := authPool.Get().(*AuthenticationResponse)
defer func() {
    rsp.reset()
    authPool.Put(rsp)
}()

The only case where this isn't an issue is when you use exactly the memory you wrote to. For instance:

var (
    r io.Reader
    w io.Writer
)
// Obtain a buffer from the pool.
buf := *bufPool.Get().(*[]byte)
defer bufPool.Put(&buf)
// We only write to w exactly what we read from r, and no more.
nr, er := r.Read(buf)
if nr > 0 {
    if nw, ew := w.Write(buf[0:nr]); ew != nil || nw < nr {
        // handle the write error
    }
}
if er != nil {
    // handle the read error
}

Avoid structures containing pointers as map keys for large maps


Whew, that was verbose; sorry. Much has been said (including by my former colleague Phil Pearl) about Go's performance with large heaps. During garbage collection, the runtime scans objects containing pointers and chases those pointers. If you have a very large map[string]int, the GC has to check every string in it. This happens on every garbage collection, because strings contain pointers.

In this example, we write 10 million elements into a map[string]int and measure the duration of garbage collection. The map is allocated at package scope to guarantee heap allocation.

package main
import (
    "fmt"
    "runtime"
    "strconv"
    "time"
)
const (
    numElements = 10000000
)
var foo = map[string]int{}
func timeGC() {
    t := time.Now()
    runtime.GC()
    fmt.Printf("gc took: %s\n", time.Since(t))
}
func main() {
    for i := 0; i < numElements; i++ {
        foo[strconv.Itoa(i)] = i
    }
    for {
        timeGC()
        time.Sleep(1 * time.Second)
    }
}

Running the program, we will see the following:

 inthash → go install && inthash
gc took: 98.726321ms
gc took: 105.524633ms
gc took: 102.829451ms
gc took: 102.71908ms
gc took: 103.084104ms
gc took: 104.821989ms

That's quite a long time in computer land!

What can we do to optimize? Removing pointers wherever possible is a good idea, as it reduces the garbage collector's workload. Strings contain pointers, so let's implement this as a map[int]int.

package main
import (
    "fmt"
    "runtime"
    "time"
)
const (
    numElements = 10000000
)
var foo = map[int]int{}
func timeGC() {
    t := time.Now()
    runtime.GC()
    fmt.Printf("gc took: %s\n", time.Since(t))
}
func main() {
    for i := 0; i < numElements; i++ {
        foo[i] = i
    }
    for {
        timeGC()
        time.Sleep(1 * time.Second)
    }
}

Running the program again, we see:

 inthash → go install && inthash
gc took: 3.608993ms
gc took: 3.926913ms
gc took: 3.955706ms
gc took: 4.063795ms
gc took: 3.91519ms
gc took: 3.75226ms

Much better. Garbage collection is now roughly 26 times faster. To use this in production, you would need to hash the strings to integers before inserting them into the map.
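
A minimal sketch of that hashing approach, using the standard library's hash/fnv (the key contents here are illustrative):

package main
import (
    "fmt"
    "hash/fnv"
)
// hashKey converts a string key to a uint64, so the map stays pointer-free
// and the GC does not need to scan its keys.
func hashKey(s string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(s))
    return h.Sum64()
}
func main() {
    foo := map[uint64]int{}
    foo[hashKey("user-1")] = 1
    fmt.Println(foo[hashKey("user-1")])
}

Note the trade-offs: you can no longer recover the original string keys, and distinct strings can in principle collide, so this makes the most sense when lookups dominate.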

By the way, there are many other ways to keep the GC away. If you allocate gigantic arrays of pointer-free structs, ints, or bytes, the GC will not scan them, and you save GC time. Such techniques usually require substantial reworking of the program, so we won't go deeper into them today.

As with any optimization, your mileage may vary. See this thread of tweets from Damian Gryski for an interesting example where removing strings from a large map in favor of a smarter data structure actually increased memory consumption. In general, read everything he publishes.

Generate marshaling code to avoid runtime reflection


Marshaling and unmarshaling your structs to and from serialization formats such as JSON is a typical operation, especially when building microservices. For many microservices it is the only job. Functions like json.Marshal and json.Unmarshal rely on runtime reflection to serialize struct fields to bytes and back. This can be slow: reflection is not as performant as explicit code.

However, there are ways to optimize. The mechanics of JSON marshaling look roughly like this:

package json
// Marshal takes an object and returns its representation in JSON.
func Marshal(obj interface{}) ([]byte, error) {
    // Check if this object knows how to marshal itself to JSON
    // by satisfying the Marshaler interface.
    if m, is := obj.(json.Marshaler); is {
        return m.MarshalJSON()
    }
    // It doesn't know how to marshal itself. Do default reflection-based marshaling.
    return marshal(obj)
}

If we know how our structs marshal to JSON, we have a way to avoid runtime reflection. But we don't want to hand-write all the marshaling code, so what do we do? Let the computer generate it! Code generators like easyjson inspect a struct and generate highly optimized code that is fully compatible with the existing marshaling interfaces, such as json.Marshaler.

Download the package and run the following command against $file.go, the file containing the structs you want to generate code for.

easyjson -all $file.go

A file $file_easyjson.go will be generated. Since easyjson has implemented the json.Marshaler interface for you, these functions will be called by default instead of reflection. Congratulations: you have just made your JSON code about three times faster. There are many more tricks to squeeze out further performance.

I recommend this package because I have used it myself before, and successfully. But be careful. Please do not take this as an invitation to start aggressive debates with me about the fastest JSON packages.

Make sure to regenerate the marshaling code whenever the struct changes. If you forget, newly added fields will not be serialized, which leads to confusion! You can use go generate for this. To keep the generated code in sync with the structs, I like to put a generate.go at the root of the package that calls go generate for all files in the package: this helps when many files need code generated. The main tip: to guarantee the structs stay up to date, call go generate in CI and check that there is no diff against the checked-in code.
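
As a minimal sketch (the file and package names here are illustrative), such a generate.go could look like this:

// generate.go
package mypackage
//go:generate easyjson -all types.go
//go:generate easyjson -all handlers.go

Running go generate ./... then regenerates the marshaling code for every file listed.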

Use strings.Builder to build strings


In Go, strings are immutable: think of them as read-only slices of bytes. Every time you build a new string, you allocate memory and potentially create more work for the garbage collector.

Go 1.10 introduced strings.Builder as an efficient way to build strings. Internally, it writes to a byte buffer; only when String() is called is the string actually created. It relies on some unsafe trickery to return the underlying bytes as a string with zero allocation: see this blog post for a deeper look at how that works.

Compare the performance of the two approaches:

// main.go
package main
import "strings"
var strs = []string{
    "here's",
    "a",
    "some",
    "long",
    "list",
    "of",
    "strings",
    "for",
    "you",
}
func buildStrNaive() string {
    var s string
    for _, v := range strs {
        s += v
    }
    return s
}
func buildStrBuilder() string {
    b := strings.Builder{}
    // Grow the buffer to a decent length, so we don't have to continually
    // re-allocate.
    b.Grow(60)
    for _, v := range strs {
        b.WriteString(v)
    }
    return b.String()
}

// main_test.go
package main
import (
    "testing"
)
var str string
func BenchmarkStringBuildNaive(b *testing.B) {
    for i := 0; i < b.N; i++ {
        str = buildStrNaive()
    }
}
func BenchmarkStringBuildBuilder(b *testing.B) {
    for i := 0; i < b.N; i++ {
        str = buildStrBuilder()
    }
}

Here are the results on my Macbook Pro:

strbuild → go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: github.com/sjwhitworth/perfblog/strbuild
BenchmarkStringBuildNaive-8      5000000    255 ns/op    216 B/op    8 allocs/op
BenchmarkStringBuildBuilder-8   20000000   54.9 ns/op     64 B/op    1 allocs/op

As you can see, strings.Builder is 4.7 times faster, performs an eighth of the allocations, and uses about a third of the memory.

When performance matters, use strings.Builder. In general, I recommend using it everywhere, except for the most trivial cases of building strings.

Use strconv instead of fmt


fmt is one of the best-known packages in Go. You probably used it in your very first program to print “hello, world”. But when it comes to converting integers and floats to strings, it is not as efficient as its lower-level sibling strconv, which delivers a decent performance boost for very small changes to your code.

fmt mostly takes interface{} arguments. There are two drawbacks:

  • You lose type safety. That matters a lot to me.
  • It can increase the number of allocations required. Passing a non-pointer type as an interface{} usually results in a heap allocation. This blog post explains why.

The following program shows the performance difference:

// main.go
package main
import (
    "fmt"
    "strconv"
)
func strconvFmt(a string, b int) string {
    return a + ":" + strconv.Itoa(b)
}
func fmtFmt(a string, b int) string {
    return fmt.Sprintf("%s:%d", a, b)
}
func main() {}

// main_test.go
package main
import (
    "testing"
)
var (
    a    = "boo"
    blah = 42
    box  = ""
)
func BenchmarkStrconv(b *testing.B) {
    for i := 0; i < b.N; i++ {
        box = strconvFmt(a, blah)
    }
    a = box
}
func BenchmarkFmt(b *testing.B) {
    for i := 0; i < b.N; i++ {
        box = fmtFmt(a, blah)
    }
    a = box
}

Benchmarks on my Macbook Pro:

strfmt → go test -bench=. -benchmem
goos: darwin
goarch: amd64
pkg: github.com/sjwhitworth/perfblog/strfmt
BenchmarkStrconv-8   30000000   39.5 ns/op   32 B/op   1 allocs/op
BenchmarkFmt-8       10000000    143 ns/op   72 B/op   3 allocs/op

As you can see, the strconv version is 3.5 times faster, performs a third of the allocations, and uses half as much memory.

Allocate slice capacity with make to avoid reallocation


Before we get to the performance improvement, let's quickly recap how slices work in memory. A slice is a very useful construct in Go: it gives you a resizable array with the ability to take different views into the same underlying memory without reallocating. Under the hood, a slice consists of three elements:

type slice struct {
    // pointer to underlying data in the slice.
    data uintptr
    // the number of elements in the slice.
    len int
    // the number of elements that the slice can
    // grow to before a new underlying array
    // is allocated.
    cap int
}

What are these fields?

  • data: a pointer to the underlying data of the slice
  • len: the current number of elements in the slice
  • cap: the number of elements the slice can grow to before reallocating

Under the hood, slices are fixed-length arrays. When the capacity (cap) is reached, a new array of double the size is allocated, the memory is copied over from the old slice, and the old array is discarded.
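
To see the growth in action, here is a minimal sketch that prints len and cap on each append:

package main
import "fmt"
func main() {
    var s []int
    for i := 0; i < 8; i++ {
        s = append(s, i)
        // cap doubles whenever the backing array fills up.
        fmt.Println("len:", len(s), "cap:", cap(s))
    }
}

On a typical Go version this prints caps of 1, 2, 4, 8: each jump is a fresh allocation plus a copy.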

I often see code like the following, where a slice with zero capacity is allocated even though the capacity is known in advance:

var userIDs []string
for _, bar := range rsp.Users {
    userIDs = append(userIDs, bar.ID)
}

In this case, the slice starts with zero length (len) and zero capacity (cap). As we process the response, we append elements to the slice and keep hitting the capacity limit: a new backing array is allocated with double the cap, and the data is copied into it. If the response contains 8 elements, this leads to 5 allocations.

The following method is much more efficient:

userIDs := make([]string, 0, len(rsp.Users))
for _, bar := range rsp.Users {
    userIDs = append(userIDs, bar.ID)
}

Here we explicitly allocated the slice's capacity up front with make. We can now append to it without triggering further allocations and copies.

If you don't know how much memory to allocate because the capacity is dynamic or computed later in the program, measure the distribution of final slice sizes after the program has run. I usually take the 90th or 99th percentile and hardcode that value in the program. In cases where CPU is more precious to you than RAM, set this value higher than you think you will need.

This tip applies to maps as well: make(map[string]string, len(foo)) will allocate enough memory to avoid reallocation.

See this article on how slices actually work.

Use methods that allow you to pass byte slices


When using packages, look for methods that allow you to pass in a byte slice: these methods usually give you more control over allocation.

A good example is the contrast between time.Format and time.AppendFormat. The former returns a string; under the hood it allocates a new byte slice and calls time.AppendFormat on it. The latter takes a byte buffer, writes the formatted representation of the time, and returns the extended byte slice. This pattern appears elsewhere in the standard library: see strconv.AppendFloat or bytes.NewBuffer.

Why does this improve performance? Well, you can now pass in byte slices you obtained from sync.Pool instead of allocating a new buffer every time. Or you can grow the initial buffer to a size better suited to your program, reducing repeated copies of the slice.
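
For instance, here is a minimal sketch that reuses one buffer with time.AppendFormat (the buffer size is an arbitrary choice):

package main
import (
    "fmt"
    "time"
)
func main() {
    // One buffer, reused for every formatted timestamp, instead of a
    // fresh string allocation per time.Format call.
    buf := make([]byte, 0, 64)
    for i := 0; i < 3; i++ {
        buf = buf[:0] // reset length, keep capacity
        buf = time.Now().AppendFormat(buf, time.RFC3339)
        fmt.Printf("%s\n", buf)
    }
}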

Summary


You can apply all of these methods to your own codebase. Over time, you will build a mental model for reasoning about performance in Go programs, which will help your designs a great deal.

But apply them as the situation demands. This is advice, not gospel. Measure and verify everything with benchmarks.

And know when to stop. Improving performance is a gratifying exercise: the problems are interesting, and the results are immediately visible. However, the usefulness of an improvement depends heavily on the situation. If your service responds in 10 ms and the network round trip takes 90 ms, you probably shouldn't try to cut those 10 ms down to 5 ms: you will still be at 95 ms. Even if you optimize the service all the way down to 1 ms, the total is still 91 ms. There are probably bigger fish to fry.

Optimize wisely!
