Introduction
While doing Advent of Code last year, I played around a lot with Go's benchmarking tools to try out different optimizations. I began by trying to reduce the number of allocations, then proceeded to look for the right balance between concrete values and pointers. Finally, to squeeze the last bits of performance out of my code, I explored different compiler directives. In this post, I'll dive into the latter and share a few insights and lessons learned.
Directives
Directives give us a way to provide the compiler with special instructions. There are a handful of them, but the one I played around with the most was //go:nosplit. To understand how to use this directive, we'll begin with a primer on goroutine stacks.
Each goroutine starts out with a tiny stack. The constants that define the actual sizes can be found in the runtime package. They are unexported and could change in any future version of Go, but as of today, they are defined like this:
stackSystem = goos.IsWindows*512*goarch.PtrSize + goos.IsPlan9*512 + goos.IsIos*goarch.IsArm64*1024
stackMin = 2048
fixedStack0 = stackMin + stackSystem
fixedStack1 = fixedStack0 - 1
fixedStack2 = fixedStack1 | (fixedStack1 >> 1)
fixedStack3 = fixedStack2 | (fixedStack2 >> 2)
fixedStack4 = fixedStack3 | (fixedStack3 >> 4)
fixedStack5 = fixedStack4 | (fixedStack4 >> 8)
fixedStack6 = fixedStack5 | (fixedStack5 >> 16)
fixedStack = fixedStack6 + 1
stackNosplit = abi.StackNosplitBase * sys.StackGuardMultiplier
stackGuard = stackNosplit + stackSystem + abi.StackSmall
Here we can see that the minimum stack size Go will use is 2048 bytes, which is equivalent to 2 KB.
The steps from fixedStack0 to fixedStack6 use some clever bit manipulation to round up to the nearest power of 2. The pattern value | (value >> n) copies every set bit n positions to the right; applying it repeatedly with shifts of 1, 2, 4, 8, and 16 ensures that every bit below the highest set bit of the original number ends up set as well. For instance, if the starting value was 1011 in binary (11 in decimal), the series of operations would turn it into 1111 in binary (15 in decimal), which is one less than the next power of 2 (16 in decimal). Hence, after the series of bitwise OR operations, fixedStack6 will be one less than the next power of two. So defining fixedStack like this:
fixedStack = fixedStack6 + 1
gives us the next power of 2. Note that the cascade starts from fixedStack0 - 1 rather than fixedStack0 itself, which is what keeps a value that is already a power of 2 from being rounded up to the next one.
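To make the trick easier to see in isolation, here's a minimal sketch of the same rounding logic as a standalone function (roundUpToPowerOfTwo is my own name for it, not something that exists in the runtime):
package main

import "fmt"

// roundUpToPowerOfTwo mirrors the fixedStack calculation: subtract one,
// smear the highest set bit into every lower position, then add one.
func roundUpToPowerOfTwo(v uint32) uint32 {
	v--
	v |= v >> 1
	v |= v >> 2
	v |= v >> 4
	v |= v >> 8
	v |= v >> 16
	return v + 1
}

func main() {
	fmt.Println(roundUpToPowerOfTwo(11))   // 16
	fmt.Println(roundUpToPowerOfTwo(2048)) // 2048: already a power of 2, stays put
}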
I would assume this is also done for performance reasons, since power-of-2 sizes play nicely with binary arithmetic and with how memory is aligned and allocated.
The purpose of stackGuard is to define a threshold that sits a certain number of bytes above the bottom of the stack. This is what the runtime compares the stack pointer against to decide whether the stack needs to grow, and the margin below it ensures that there's still enough room left for operations that must run without growing the stack, such as the stack-growth code itself.
Being able to start with a small stack and grow it dynamically only when needed is one of the reasons goroutines are so cheap to create. However, to prevent stack overflows, the runtime has to make sure this chunk of memory is never exceeded. Therefore, at the start of most functions, the compiler inserts a small check that verifies there's enough stack space left for the function to execute. If there isn't, the runtime allocates a larger stack and copies the current stack's contents over before proceeding with the call.
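To convince yourself that this growth really happens transparently, here's a small sketch of my own (not something from the runtime or from my Advent of Code solutions) that recurses far deeper than the initial 2 KB stack could ever hold:
package main

import "fmt"

// grow recurses with a non-trivial frame at every level. Thousands of these
// frames can't possibly fit in the initial ~2 KB stack, so the runtime grows
// (and copies) the goroutine's stack behind the scenes as the recursion deepens.
func grow(n int) int {
	var buf [128]byte
	buf[0] = byte(n)
	if n == 0 {
		return int(buf[0])
	}
	return grow(n-1) + int(buf[0])
}

func main() {
	done := make(chan int)
	go func() {
		// roughly 10,000 frames of 128+ bytes each, on a goroutine that
		// started out with only a couple of kilobytes of stack
		done <- grow(10000)
	}()
	fmt.Println(<-done)
}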
Back to directives
Now, if we know that a function won't exceed the available stack space, we can use the //go:nosplit directive to tell the compiler to skip this check, which should, at least in theory, give us a small performance gain. Here's a small program I've written to try to illustrate this:
package nosplit
//go:nosplit
func funcWithNoSplit() int {
return 1
}
func LoopWithNoSplit() int {
sum := 0
for i := 0; i < 1e6; i++ {
sum += funcWithNoSplit()
sum += funcWithNoSplit()
sum += funcWithNoSplit()
sum += funcWithNoSplit()
sum += funcWithNoSplit()
}
return sum
}
func funcWithSplit() int {
return 1
}
func LoopWithSplit() int {
sum := 0
for i := 0; i < 1e6; i++ {
sum += funcWithSplit()
sum += funcWithSplit()
sum += funcWithSplit()
sum += funcWithSplit()
sum += funcWithSplit()
}
return sum
}
As you can see, we have the LoopWithNoSplit and LoopWithSplit functions. Each of them runs a loop of a million iterations, and in each iteration it calls another function 5 times. The only difference is that LoopWithNoSplit calls a function marked with the //go:nosplit directive, while LoopWithSplit calls an otherwise identical function without it.
I've also written two benchmark tests for these functions:
func BenchmarkSplit(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = nosplit.LoopWithSplit()
}
}
func BenchmarkNoSplit(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = nosplit.LoopWithNoSplit()
}
}
Now, let's run the benchmarks to see if the directive had any effect:
❯ go test -bench=.
goos: darwin
goarch: arm64
pkg: go-directives/nosplit
BenchmarkSplit-8 4180 286407 ns/op
BenchmarkNoSplit-8 3741 289680 ns/op
PASS
ok go-directives/nosplit 3.327s
Judging by the iteration counts, it looks like something changed by roughly 10%. But did performance really improve? Let's run the benchmark again:
❯ go test -bench=.
goos: darwin
goarch: arm64
pkg: playground/nosplit
BenchmarkSplit-8 4147 286728 ns/op
BenchmarkNoSplit-8 4123 286897 ns/op
PASS
ok playground/nosplit 3.507s
This time, there was barely any difference. The reason is that the compiler is clever enough to simply inline these tiny functions, and an inlined call has no prologue (and therefore no stack check to skip), so the nosplit directive has no effect.
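If you want to verify the inlining for yourself, the compiler can report its decisions: passing -m via -gcflags makes it print, for every function, whether it can be inlined and where calls get inlined. The package path below assumes the code lives in a nosplit directory, as in the benchmarks above:
❯ go build -gcflags=-m ./nosplit
Among the output you should see lines along the lines of "can inline funcWithNoSplit" and "inlining call to funcWithNoSplit", confirming that both helper functions get inlined into their loops.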
To really illustrate this, let's use another directive to prevent this optimization:
//go:noinline
//go:nosplit
func funcWithNoSplit() int {
return 1
}
and run the benchmark again:
❯ go test -bench=.
goos: darwin
goarch: arm64
pkg: playground/nosplit
BenchmarkSplit-8 3946 289233 ns/op
BenchmarkNoSplit-8 348 3444913 ns/op
PASS
ok playground/nosplit 2.816s
This time the gap is huge, but not in nosplit's favor: with inlining disabled, every one of the five million calls that LoopWithNoSplit makes pays real function-call overhead, while funcWithSplit still gets inlined away entirely. Whatever the skipped stack check might save is drowned out by the cost of the calls themselves.
Wrap up
I wasn't able to make my Advent of Code solutions any faster by using compiler directives either, but it was a fun experiment and I learned a lot. I hope you enjoyed this post!
The end
I usually tweet something when I've finished writing a new post. You can find me on Twitter.