On the pros and cons of Go

    In this article I want to share the experience gained from rewriting a project from Perl to Go. It will be more about the minuses than the pluses, because a lot has already been said about Go's virtues, while the pitfalls awaiting new developers are something you rarely learn about except through your own bruises. This post in no way aims to disparage Go, although, admittedly, there are some things I would have been glad not to write. It also covers a relatively small slice of the whole platform; in particular, there will be nothing about templates, regexps, packing/unpacking of data, and similar functionality often used in web programming.

    Since the post is not in the "I PR" hub, I will outline the project's features only briefly. It is a high-load web application that currently processes about 600M hits per day (peak load over 10k requests per second). About 80% of requests can be served from the cache, and the rest must be fully processed. The working data lives mainly in PostgreSQL, and partly in binary files with a flat structure (in effect an array, only backed by a file rather than memory). The Perl cluster consisted of eight 24-core machines with their performance margin practically exhausted; the Go cluster will consist of six machines with a confirmed reserve of more than threefold. Moreover, the bottleneck is not so much the CPU as the OS and the rest of the hardware and software stack: it is physically not easy to process 10k nontrivial requests per second on a single machine.

    Development speed

    My experience with Go before this rewrite was minimal. For more than a year I had been watching the language, had managed to read the specification from cover to cover, had studied useful material on the official website and beyond, and felt ready to roll up my sleeves and get to work. The initial estimate for the work was 3-6 weeks. The working beta was ready by the end of the 6th week, although toward the end I had already begun to think I would not make it. Ironing out bugs and optimizing performance took another whole month.

    At first it was especially hard, but over time I needed to consult the specification less and less, and the code came out cleaner. If at first functionality that I could code in Perl in an hour took a whole day in Go, that gap later narrowed significantly. Still, programming in Go takes noticeably longer than in Perl: you have to think through the structures, data types and interfaces you need, write all of it out in code, take care of initializing slices, maps and channels, write nil checks... In Perl all of this is much simpler: hashes serve as structures, you don't have to declare their fields in advance, and there is far more syntactic sugar for the programmer. Take sorting, for instance: in Go there is no way to just pass a closure for comparing the data; you have to write a separate function returning the length and, besides the index-comparison function, yet another function for swapping two elements of the array. And why? Because there are no generics, and it is easier for the sort function to call a specially declared Swap(i, j) than to figure out what was passed in and at what offsets the values should be exchanged.

    Besides sorting, I was also struck by the absence of Perl's for/while () {...} continue {...} construct (the continue block is executed even if the current iteration is cut short via the next operator). In Go you have to resort to the non-kosher goto, which in turn forces you to put all variable declarations before it, even for variables not used after the jump label:
    var cnt int
    for {
            goto NEXT
            a := int(0) // ./main.go:16: goto NEXT jumps over declaration of a at ./main.go:17
            cnt += a
    NEXT:
            cnt++
    }
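    With the declarations hoisted, though, the pattern compiles and really does emulate Perl's continue block; a sketch with arbitrary values:

```go
package main

import "fmt"

func main() {
	cnt := 0
	for i := 0; i < 3; i++ {
		if i == 1 {
			goto NEXT // acts like Perl's next, but still runs the "continue" part
		}
		cnt += 10 // main body of the iteration
	NEXT:
		cnt++ // the "continue" block: runs even when the body is skipped
	}
	fmt.Println(cnt) // 23
}
```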

    Likewise, the paradigm of unified syntax for pointers and non-pointers does not hold up: for structures the compiler lets us use the same syntax for both, but for a map we already have to dereference and use parentheses, even though the compiler could figure everything out itself:
    type T struct { cnt int }
    s := T{}
    p := new(T)
    s.cnt++
    p.cnt++ // same syntax for a value and a pointer to a struct
    m := make(map[int]T)
    mp := new(map[int]T)
    *mp = make(map[int]T)
    m[1] = T{}
    (*mp)[1] = T{}
    mp[1] = T{}  // ./main.go:13: invalid operation: mp[1] (type *map[int]T does not support indexing)

    Toward the end of the work I still had to spend time rewriting the part of the functionality implemented at the very beginning, due to a flawed initial architecture. The experience gained suggests new architectural approaches, but that experience first had to be earned :)

    By the way, the total amount of code in characters came out almost identical (though for indentation Perl used two spaces and Go one tab), while the Go version had 20% more lines. True, the functionality differs slightly: the Go version, for example, adds work with the GC, while the Perl figure also counts a separate library for caching SQL queries in an external file cache (accessed via mmap()). Overall the amount of code is nearly equal, with Perl still a bit more compact. On the other hand, Go has fewer brackets and semicolons, so the code looks more concise and is easier to read.

    All in all, Go code gets written quite quickly and accurately, much faster than, say, in C/C++, but for simple tasks without special performance requirements I will keep using Perl.


    Performance

    Let's face it, I have no particular complaints about Go's performance, but I expected more. The difference versus Perl (it depends a lot on the type of computation; in arithmetic, for example, Perl does not shine at all) is roughly 5-10x. I did not get the chance to try gccgo, since it does not build easily on FreeBSD, which is a pity. But the backend software has now ceased to be the bottleneck: CPU consumption is about 50% of one core, and as the load grows, Nginx, PostgreSQL and the OS will run into trouble first.

    While optimizing performance, the profiler showed that besides my own code, a substantial share of CPU is consumed by the runtime (and this is not only about the runtime package).
    Here is one example of top10 --cum:
    Total: 1945 samples
           0   0.0%   0.0%     1309  67.3% runtime.gosched0
           1   0.1%   0.1%     1152  59.2% bitbucket.org/mjl/scgi.func·002
           1   0.1%   0.1%     1151  59.2% bitbucket.org/mjl/scgi.serve
           0   0.0%   0.1%      953  49.0% net/http.HandlerFunc.ServeHTTP
           3   0.2%   0.3%      952  48.9% main.ProcessHttpRequest
           1   0.1%   0.3%      535  27.5% main.ProcessHttpRequestFromCache
           0   0.0%   0.3%      418  21.5% main.ProcessHttpRequestFromDb
          16   0.8%   1.1%      387  19.9% main.(*RequestRecord).SelectServerInDc
           0   0.0%   1.1%      367  18.9% System
           0   0.0%   1.1%      268  13.8% GC

    As you can see, only 49% of the consumed CPU goes to the handler actually processing the scgi request, while System + GC account for as much as 33%.

    And here is a plain top20 from the same profile:
    Total: 1945 samples
         179   9.2%   9.2%      186   9.6% syscall.Syscall
         117   6.0%  15.2%      117   6.0% runtime.MSpan_Sweep
         114   5.9%  21.1%      114   5.9% runtime.kevent
          93   4.8%  25.9%       96   4.9% runtime.cgocall
          93   4.8%  30.6%       93   4.8% runtime.sys_umtx_op
          67   3.4%  34.1%      152   7.8% runtime.mallocgc
          63   3.2%  37.3%       63   3.2% runtime.duffcopy
          56   2.9%  40.2%       99   5.1% hash_insert
          56   2.9%  43.1%       56   2.9% scanblock
          53   2.7%  45.8%       53   2.7% runtime.usleep
          39   2.0%  47.8%       39   2.0% markonly
          36   1.9%  49.7%       41   2.1% runtime.mapaccess2_fast32
          28   1.4%  51.1%       28   1.4% runtime.casp
          25   1.3%  52.4%       34   1.7% hash_init
          23   1.2%  53.6%       23   1.2% hash_next
          22   1.1%  54.7%       22   1.1% flushptrbuf
          22   1.1%  55.8%       22   1.1% runtime.xchg
          21   1.1%  56.9%       29   1.5% runtime.mapaccess1_fast32
          21   1.1%  58.0%       21   1.1% settype
          20   1.0%  59.0%       31   1.6% runtime.mapaccess1_faststr

    My own code's computations are simply lost against the background of the work the runtime has to do (which, to be fair, is how it should be; I have no heavy math).

    IMHO, there is still a huge reserve for optimizing the compiler and the libraries. For example, I saw no sign of inlining: all my mutex calls are plainly visible in the goroutine stack traces. Compiler optimization does not stand still (not long ago Dmitry Vyukov presented a significantly faster implementation of channels, for example), but dramatic shifts are rare so far. After switching from Go 1.2 to Go 1.3, for instance, I saw almost no difference in performance at all.

    During optimization I also had to abandon the math/rand package. The thing is, query processing frequently needed pseudo-random numbers tied to the data, and rand.Seed() consumed too much CPU (the profiler showed 13% of the total). Whoever needs it will google a method of generating pseudo-random numbers with a fast Seed(), but still: for cryptographic purposes there is crypto/rand, so math/rand did not have to go to such lengths with high-quality bit mixing during initialization.
    By the way, I ended up focusing on the following algorithm:
    func RandFloat64(seed uint64) float64 {
            seed ^= seed >> 12
            seed ^= seed << 25
            seed ^= seed >> 27
            return float64((seed*2685821657736338717)&0x7fffffffffffffff) / (1 << 63)
    }
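    A quick sanity check of this generator, wrapped in a self-contained sketch (the seeds are arbitrary); by construction the mask and division keep the result in [0, 1):

```go
package main

import "fmt"

// RandFloat64 is the xorshift-style generator from the article:
// the whole "Seed" is just the argument, so it is essentially free.
func RandFloat64(seed uint64) float64 {
	seed ^= seed >> 12
	seed ^= seed << 25
	seed ^= seed >> 27
	return float64((seed*2685821657736338717)&0x7fffffffffffffff) / (1 << 63)
}

func main() {
	for _, seed := range []uint64{1, 42, 1 << 40} {
		v := RandFloat64(seed)
		fmt.Println(v >= 0 && v < 1) // true
	}
}
```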

    It is very convenient that all computation happens in a single process; in Perl, separate worker processes were used and a shared cache had to be organized - partly through memcached, partly through files. In Go this is much simpler and more natural. But now, lacking an external cache, the cold-start problem arises, and here I had to tinker a bit. At first I tried to limit the number of simultaneous requests to the upstream on the nginx side (so that a hundred thousand goroutines would not start at once and bring everything down) via the module https://github.com/cfsego/nginx-limit-upstream, but it did not work very stably (once the connection pool got clogged, it somehow had trouble returning to normal mode even after the load dropped). In the end I patched the scgi module a little and added a limiter on the number of simultaneously executing requests: until some of the current requests finish, a new one will not be Accept()-ed:
    func ServeLimited(l net.Listener, handler http.Handler, limit int) error {
            if limit <= 0 {
                    return Serve(l, handler)
            }
            if l == nil {
                    var err error
                    l, err = net.FileListener(os.Stdin)
                    if err != nil {
                            return err
                    }
                    defer l.Close()
            }
            if handler == nil {
                    handler = http.DefaultServeMux
            }
            sem := make(chan struct{}, limit)
            for {
                    sem <- struct{}{}
                    rw, err := l.Accept()
                    if err != nil {
                            return err
                    }
                    go func(rw net.Conn) {
                            serve(rw, handler)
                            <-sem
                    }(rw)
            }
    }

    The scgi module itself was also chosen for performance reasons: net/http/fcgi was for some reason slower than plain net/http (and does not support persistent connections), while net/http additionally loads the OS with generating tcp packets and maintaining internal tcp connections (although technically it can be made to listen on a unix socket) - and since it was possible to get rid of that, why not? Using nginx as a frontend brings its own advantages: timeout control, logging, rerouting failed requests to other servers in the cluster - all with minimal extra server load. Another plus of this approach: netstat -Lan shows when the Accept queue on the scgi socket grows, which means something is overloaded and needs attention.

    Code Quality and Debugging

    The net/http/pprof package is a magical thing! It is something like Apache's server-status module, but for a Go daemon. By the way, I would not recommend enabling it in production if you use DefaultServeMux as your http handler, since the package then becomes available to everyone under /debug/pprof/. I have the opposite problem: to reach the package's functions over http at all, I have to run a separate mini-server on localhost:
    go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    Besides CPU and memory profiles, this module lets you view the list of all currently running goroutines together with the full chain of functions executing in them and their states: /debug/pprof/goroutine?debug=1 gives a list of distinct goroutines and their states, while /debug/pprof/goroutine?debug=2 lists all running goroutines, including duplicates (i.e. those in completely identical states). Here is an example of one of them:
    goroutine 85 [IO wait]:
    net.runtime_pollWait(0x800c71b38, 0x72, 0x0)
            /usr/local/go/src/pkg/runtime/netpoll.goc:146 +0x66
    net.(*pollDesc).Wait(0xc20848daa0, 0x72, 0x0, 0x0)
            /usr/local/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
    net.(*pollDesc).WaitRead(0xc20848daa0, 0x0, 0x0)
            /usr/local/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
    net.(*netFD).accept(0xc20848da40, 0x8df378, 0x0, 0x800c6c518, 0x23)
            /usr/local/go/src/pkg/net/fd_unix.go:409 +0x343
    net.(*UnixListener).AcceptUnix(0xc208273880, 0x8019acea8, 0x0, 0x0)
            /usr/local/go/src/pkg/net/unixsock_posix.go:293 +0x73
    net.(*UnixListener).Accept(0xc208273880, 0x0, 0x0, 0x0, 0x0)
            /usr/local/go/src/pkg/net/unixsock_posix.go:304 +0x4b
    bitbucket.org/mjl/scgi.ServeLimited(0x800c7ec58, 0xc208273880, 0x800c6c898, 0x8df178, 0x1f4, 0x0, 0x0)
            /home/user/go/src/bitbucket.org/mjl/scgi/scgi.go:177 +0x20d
            /home/user/repo/main.go:264 +0x90
    created by main.main
            /home/user/repo/main.go:265 +0x1f5c

    This helped me track down a locking bug (under certain conditions RUnlock() was called twice, which you must not do): in the stack dump I saw a whole crowd of blocked goroutines along with the line numbers where RUnlock() was called.

    The CPU profile is also decent; I recommend installing gv (ghostview) and studying the diagram of transitions between functions with their counters - it shows exactly what deserves attention and optimization.

    go vet is a useful utility, but for me its main benefit boiled down to warnings about mismatched format specifiers in the various printf() calls, something the compiler cannot detect. To obviously bad code such as
    if UintValue < 0 {
    vet does not react at all.

    The bulk of code verification is performed by the compiler. It consistently complains about unused variables and packages, but neither the compiler nor vet reacts to unused fields in structures (not even with a warning), although those also deserve attention.

    One should be careful with the := operator. I had a case where I needed to compute the difference between two uint values, correctly treating a negative difference as negative, and the code
      var a, b uint
      diff := a - b
    computes something other than what you expect - you need a conversion to a signed type (or to avoid unsigned types).
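    A self-contained illustration of the trap (the numbers are arbitrary):

```go
package main

import "fmt"

func main() {
	var a, b uint = 2, 3
	diff := a - b             // wraps around: a huge positive number
	signed := int(a) - int(b) // -1, as intended
	fmt.Println(diff > 1000000, signed) // true -1
}
```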

    It is also good practice to give identical data types used for different purposes distinct names. For example, like this:
    type ServerIdType uint32
    type CustomerIdType uint32
    var ServerId ServerIdType
    var CustomerId CustomerIdType
    Now the compiler will not let you simply assign the value of ServerId to the CustomerId variable (without a type conversion), despite both being uint32 inside. It guards against all sorts of typos, although type casting now has to be used frequently, especially when initializing variables.

    Packages, libraries, and a bunch of C

    An important role in Go's popularity was played by an effective (alas, not performance-wise; there are still problems there) mechanism for interacting with C libraries. By and large, a significant share of Go libraries are just wrappers over their C counterparts. For example, the github.com/abh/geoip and github.com/jbarham/gopgsqldriver packages link with -lGeoIP and -lpq respectively (in truth, I use the native Go PostgreSQL driver, github.com/lib/pq).

    As an example, consider the near-standard crypt() function from unistd.h. This function is available out of the box in many languages; in Nginx's Perl module, for instance, it can be used without loading extra modules, which is handy. But not in Go: here you have to bridge to C yourself. That is done in an elementary way (in the example, the salt is cut off the result):
    // #cgo LDFLAGS: -lcrypt
    // #include <unistd.h>
    // #include <stdlib.h>
    import "C"
    import (
            "sync"
            "unsafe"
    )
    var cryptMutex sync.Mutex
    func Crypt(str, salt string) string {
            cryptStr := C.CString(str)
            cryptSalt := C.CString(salt)
            defer C.free(unsafe.Pointer(cryptStr))
            defer C.free(unsafe.Pointer(cryptSalt))
            cryptMutex.Lock()
            key := C.GoString(C.crypt(cryptStr, cryptSalt))[len(salt):]
            cryptMutex.Unlock()
            return key
    }
    The lock is needed because crypt() returns the same char* pointing at internal state: the returned string must be copied before the next call overwrites it, i.e. the function is not thread-safe.

    database / sql

    For every Db handle in use, I recommend setting a maximum connection limit via SetMaxOpenConns() and some non-zero limit of idle connections via SetMaxIdleConns().
    The first avoids overloading the database and keeps it in its maximum-performance mode (past a certain number of simultaneous connections, database performance starts to fall; there is an optimal degree of concurrency), and the second removes the need to open a new connection on every query, which is especially important for PostgreSQL with its fork() model. Of course, for PostgreSQL you could also use pgpool or pgbouncer, but that is all extra overhead for the kernel shuttling data around, plus additional latency - so it is better to ensure connection persistence right at the application level.

    To avoid the overhead of parsing a query and building a plan each time, use prepared statements instead of direct queries. Keep in mind, though, that in some cases the query planner may not pick the most optimal plan, since the plan is built when the query is parsed (not when it is executed), and the planner does not always have enough data to know which index is preferable. By the way, placeholders for variables in the PostgreSQL Go driver are '$1', '$2', etc., instead of '?' as in Perl.

    sql.(*Rows).Scan() has one quirk: it does not understand renamed string types such as type DomainNameType string. You have to declare a temporary variable of type string, load the database value into it, and then assign with a type conversion. For some reason renamed numeric types do not have this problem.

    Channels and sync

    There is a somewhat mistaken belief that since Go has channels, one should use them and only them. That is not quite right: every task has its own tool. Channels are great for passing messages of various kinds, but for shared resources such as an sql cache it is perfectly legitimate to use mutexes. To go through channels for cache access, we would have to write a query manager, which limits cache throughput to one core, adds extra work for the goroutine scheduler, adds the overhead of copying data to and from the channel, and requires creating a temporary reply channel on every call. Channel-based code also often ends up many times more complex than mutex-based code (oddly enough). With mutexes, though, you must be extremely careful not to land in a deadlock.

    Go has a clever thing called struct{}, i.e. a completely empty structure with no fields. It occupies zero bytes, an array of such structures of any size occupies zero bytes, and a buffered channel of empty structures likewise occupies zero bytes (beyond its internal bookkeeping, of course). Such a buffered channel of empty structures is effectively a semaphore - the compiler even has a dedicated code path for it - so if you need a semaphore with Go syntax, use chan struct{}.

    The sync package is a bit of a letdown. For example, there are no spinlocks, useful as they are for being fast (although with a GC in play, spinlocks become a risky business). Moreover, as far as I can tell, mutex operations themselves are not inlined. Even more frustrating is the inability to upgrade an RWMutex lock: if you hold RLock and discover you need to make changes, kindly do RUnlock(), then Lock(), and then re-check whether the changes are still needed or some other goroutine has already done everything. There is also no non-blocking TryLock(), again for no clear reason - in some cases it is sorely needed. Here the language developers, with their "we know better how you should program", have IMHO gone too far.

    In some cases the sync/atomic package, with its atomic operations, helps avoid mutexes. For example, I often use the current uint32 timestamp in my code: I keep it in a global variable and simply store the current value into it atomically at the start of each request. A somewhat dirty approach, I know - a helper function would have been possible - but sometimes such sacrifices are made in the fight for performance: I can now use this variable in arithmetic expressions without any special restrictions.

    There is another good optimization for the case when some shared data is updated in only one place (say, periodically) and is otherwise used read-only. The point is that read operations then need no RLock()/RUnlock() (nor Lock()/Unlock() for updates): the update function can build the data in a new memory area and then atomically swap the pointer from the old data to the new. True, Go's atomic pointer store requires the unsafe.Pointer type, so you end up with a construction like:
    atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&Data)), unsafe.Pointer(&newData))
    After that you can use the data in any expression without worrying about locks. This matters especially in Go, since seemingly short locks can in reality last very long - all because of the GC.
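    Put together, the whole pattern looks roughly like this (the config type and field are invented for the sketch; note that modern Go offers sync/atomic.Value and atomic.Pointer[T] for the same job without unsafe):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

type config struct{ limit int }

// Data is updated in exactly one place and read everywhere else.
var Data = &config{limit: 1}

func update(limit int) {
	newData := &config{limit: limit} // build the new state aside...
	atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(&Data)),
		unsafe.Pointer(newData)) // ...then swap the pointer atomically
}

func load() *config {
	return (*config)(atomic.LoadPointer((*unsafe.Pointer)(unsafe.Pointer(&Data))))
}

func main() {
	update(42)
	fmt.Println(load().limit) // 42
}
```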

    GC (garbage collector)

    It caused me plenty of grief :(. Imagine the situation: you run a load test - everything is ok. You let in live traffic - everything is fine too. And then bam - everything goes bad or very, very bad: old requests hang, new ones keep arriving (several thousand per second), you have to restart the application, after which everything dies again because the cache has to be refilled - but at least it works, and after some time it returns to normal. I measured the execution time of each request-processing stage and saw that periodically the execution time of all stages jumps to three seconds or more, even stages that take no locks, touch no database or files, and do only local computations that usually fit in microseconds. It was not any external factor but the platform itself. More precisely, the garbage collector.

    It is good that Go exposes GC statistics via runtime/debug.ReadGCStats() - there is plenty there to be surprised by. I looked at the durations for which the application falls asleep while the GC runs (most recent first; on the least loaded server the order of magnitude held steady while the numbers themselves fluctuated slightly). Pausing all work for 2 seconds - seriously? I am afraid to even imagine what was happening on the busiest servers; I did not touch them, so as not to create extra downtime.

    The solution is to run GC() more often - for reliability, better to do it yourself from the program. You can even do it simply periodically; I got slightly fancier and made a request counter, plus a forced GC() run after major cleanups of stale data. As a result, GC() started running every ten to twenty seconds instead of once every few minutes, but each pass takes a stable ~0.1s - a completely different story! And the daemon's memory consumption dropped by 20 percent at the same time. There is an option to disable the garbage collector entirely, but that only suits short-lived programs, not daemons. The language developers ought to add a GC setting so that it never stops the application longer than a given limit and instead runs more often - that would save many users from problems under high load.


    Maps

    No one will dispute that maps (hashes, in Perl terms) are an extremely useful thing. But I have serious complaints about how the language designers implemented and exposed them. Roughly speaking, to work with a map the compiler uses functions along these lines:
    valueType, ok := map_fetch(keyType)
    map_store(keyType, valueType)
    And this imposes significant limitations. As long as maps hold basic types, all is well, but problems begin with maps of structures, or of types with pointer-receiver methods (methods that operate on a reference to the data rather than a copy). For example, we cannot write
    type T struct { cnt int }
    m := make(map[int]T)
    m[0] = T{}
    m[0].cnt++  // ./main.go:9: cannot assign to m[0].cnt
    since the compiler cannot take the address of the value m[0] in order to increment cnt at its offset.

    You can either make the map hold pointers to the structure
    m := make(map[int]*T)
    m[0] = new(T)
    or copy out and store back the entire structure
    m := make(map[int]T)
    tmp := m[0]
    tmp.cnt++
    m[0] = tmp
    The first option adds a lot of extra work for the garbage collector, and the second for the CPU (especially if the structure is rather large).
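    Both workarounds in runnable form:

```go
package main

import "fmt"

type T struct{ cnt int }

func main() {
	// Option 1: map of pointers - mutable in place, but every value
	// is a separate heap allocation for the GC to track.
	mp := map[int]*T{0: new(T)}
	mp[0].cnt++

	// Option 2: map of values - copy out, modify, copy back,
	// which costs CPU on every update for large structs.
	mv := map[int]T{0: {}}
	tmp := mv[0]
	tmp.cnt++
	mv[0] = tmp

	fmt.Println(mp[0].cnt, mv[0].cnt) // 1 1
}
```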

    In my opinion, the problem could be solved if, when working with a map, the compiler used, instead of map_store, a function like
    *valueType = map_allocate(keyType)
    with the added restriction that once a value has been inserted into the map, it never moves in memory.

    The map_allocate function would be used to obtain pointers not only to newly created elements but also to existing ones about to be modified. That pointer could be handed to the programmer, used to update the value, or used to call a pointer-receiver method - and as long as the value stays in place, everything works fine.

    Some will object that the ability to take references to values inside a map immediately breaks all the language's vaunted safety. But maps come with no safety guarantee anyway: they cannot be used from different goroutines without locking, or you risk corrupting internal data during element insertion. Besides, nobody says the programmer must be given the element's address without the unsafe package - in the example above, the compiler could take the address, increment the counter, and forget the address; it escapes nowhere and in principle cannot be affected by other operations.

    Problems can arise only if you delete an element and keep using a reference to the freed memory. But that is in the same league as using maps from multiple goroutines without locking: if the programmer insists on shooting himself in the foot, who can help him? And if the garbage collector could be adapted to this case, so that memory is not freed after deletion as long as the program holds a live reference to the deleted element, then everything is fine and there are no safety problems at all.


    Conclusions

    Alas, there is no perfection in this world. But it would be naive to expect a new language to be born ideal right away. Yes, Go has its drawbacks, but they are all, in principle, fixable - given the will. Meanwhile Go pushes the development of programming languages to the next level, adapting to the modern realities of multi-core architectures and offering fitting paradigms.

    I had not studied a new programming language in a very long time. At one point I picked up some C (enough to patch the FreeBSD kernel a little), Perl, and shell scripting (for everyday tasks). I had neither the time nor the desire to immerse myself in Python, Ruby or JS - those languages could offer me nothing fundamentally new, and I had no wish to switch ideology. Go, however, managed to substantially extend my toolset, and I am only glad of it. For all its shortcomings, I do not regret a single drop of the time spent studying it - it truly is worth it.
