Introduction
While doing Advent of Code last year, I played around a lot with Go's benchmarking tools to try out different optimizations. I began by trying to reduce the number of allocations, then proceeded to look for the right balance between concrete values and pointers. Finally, to squeeze the last bits of performance out of my code, I explored different compiler directives. In this post, I'll dive into the latter and share a few insights and lessons learned.
Directives
Directives give us a way to provide the compiler with special instructions. There are a handful of them, but the one I played around with the most was //go:nosplit. To understand how to use this directive, we'll begin with a primer on goroutine stacks.
Each goroutine starts out with a tiny stack. The constants that define the actual sizes can be found in the runtime package. They are unexported and could change in any future version of Go, but as of today, they are defined like this:
stackSystem = goos.IsWindows*512*goarch.PtrSize + goos.IsPlan9*512 + goos.IsIos*goarch.IsArm64*1024
stackMin = 2048
fixedStack0 = stackMin + stackSystem
fixedStack1 = fixedStack0 - 1
fixedStack2 = fixedStack1 | (fixedStack1 >> 1)
fixedStack3 = fixedStack2 | (fixedStack2 >> 2)
fixedStack4 = fixedStack3 | (fixedStack3 >> 4)
fixedStack5 = fixedStack4 | (fixedStack4 >> 8)
fixedStack6 = fixedStack5 | (fixedStack5 >> 16)
fixedStack = fixedStack6 + 1
stackNosplit = abi.StackNosplitBase * sys.StackGuardMultiplier
stackGuard = stackNosplit + stackSystem + abi.StackSmall
Here we can see that the minimum stack size Go will use is 2048 bytes, which is equivalent to 2 KB.
The steps from fixedStack0 to fixedStack6 use some clever bit manipulation to round up to the nearest power of 2. The pattern value | (value >> n) copies every set bit n positions to the right; applying it repeatedly with shifts of 1, 2, 4, 8, and 16 ensures that every bit below the highest set bit of the original number ends up set as well. For instance, if the starting value was 1011 in binary (11 in decimal), the series of operations would turn it into 1111 in binary (15 in decimal), which is one less than the next power of 2 (16 in decimal). Hence, after the series of bitwise OR operations, fixedStack6 will be one less than the next power of two. So defining fixedStack like this:
fixedStack = fixedStack6 + 1
gives us the next power of 2. Note that the cascade starts from fixedStack0 - 1 rather than fixedStack0 itself, which is what keeps a value that is already a power of 2 from being rounded up to the next one.
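To make the trick easier to see in isolation, here's a minimal sketch of the same rounding logic as a standalone function (roundUpToPowerOfTwo is my own name for it, not something that exists in the runtime):
package main

import "fmt"

// roundUpToPowerOfTwo mirrors the fixedStack calculation: subtract one,
// smear the highest set bit into every lower position, then add one.
func roundUpToPowerOfTwo(v uint32) uint32 {
	v--
	v |= v >> 1
	v |= v >> 2
	v |= v >> 4
	v |= v >> 8
	v |= v >> 16
	return v + 1
}

func main() {
	fmt.Println(roundUpToPowerOfTwo(11))   // 16
	fmt.Println(roundUpToPowerOfTwo(2048)) // 2048: already a power of 2, stays put
}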
I would assume this is also done for performance reasons, since power-of-2 sizes play nicely with binary arithmetic and with how memory is aligned and allocated.
The purpose of stackGuard is to define a threshold that sits a certain number of bytes above the bottom of the stack. This is what the runtime compares the stack pointer against to decide whether the stack needs to grow, and the margin below it ensures that there's still enough room left for operations that must run without growing the stack, such as the stack-growth code itself.
Being able to start with a small stack and grow it dynamically only when needed is one of the reasons goroutines are so cheap to create. However, to prevent stack overflows, the runtime has to make sure this chunk of memory is never exceeded. Therefore, at the start of most functions, the compiler inserts a small check that verifies there's enough stack space left for the function to execute. If there isn't, the runtime allocates a larger stack and copies the current stack's contents over before proceeding with the call.
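To convince yourself that this growth really happens transparently, here's a small sketch of my own (not something from the runtime or from my Advent of Code solutions) that recurses far deeper than the initial 2 KB stack could ever hold:
package main

import "fmt"

// grow recurses with a non-trivial frame at every level. Thousands of these
// frames can't possibly fit in the initial ~2 KB stack, so the runtime grows
// (and copies) the goroutine's stack behind the scenes as the recursion deepens.
func grow(n int) int {
	var buf [128]byte
	buf[0] = byte(n)
	if n == 0 {
		return int(buf[0])
	}
	return grow(n-1) + int(buf[0])
}

func main() {
	done := make(chan int)
	go func() {
		// roughly 10,000 frames of 128+ bytes each, on a goroutine that
		// started out with only a couple of kilobytes of stack
		done <- grow(10000)
	}()
	fmt.Println(<-done)
}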
Back to directives
Now, if we know that a function won't exceed the available stack space, we can use the //go:nosplit directive to tell the compiler to skip this check, which should, at least in theory, give us a small performance gain. Here's a small program I've written to try to illustrate this:
package nosplit
//go:nosplit
func funcWithNoSplit() int {
return 1
}
func LoopWithNoSplit() int {
sum := 0
for i := 0; i < 1e6; i++ {
sum += funcWithNoSplit()
sum += funcWithNoSplit()
sum += funcWithNoSplit()
sum += funcWithNoSplit()
sum += funcWithNoSplit()
}
return sum
}
func funcWithSplit() int {
return 1
}
func LoopWithSplit() int {
sum := 0
for i := 0; i < 1e6; i++ {
sum += funcWithSplit()
sum += funcWithSplit()
sum += funcWithSplit()
sum += funcWithSplit()
sum += funcWithSplit()
}
return sum
}
As you can see, we have the LoopWithNoSplit and LoopWithSplit functions. Each of them runs a loop of a million iterations, and in each iteration it calls another function 5 times. The only difference is that LoopWithNoSplit calls a function marked with the //go:nosplit directive, while LoopWithSplit calls an otherwise identical function without it.
I've also written two benchmark tests for these functions:
func BenchmarkSplit(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = nosplit.LoopWithSplit()
}
}
func BenchmarkNoSplit(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = nosplit.LoopWithNoSplit()
}
}
Now, let's run the benchmarks to see if the directive had any effect:
❯ go test -bench=.
goos: darwin
goarch: arm64
pkg: go-directives/nosplit
BenchmarkSplit-8 4180 286407 ns/op
BenchmarkNoSplit-8 3741 289680 ns/op
PASS
ok go-directives/nosplit 3.327s
Judging by the iteration counts, it looks like something changed by roughly 10%. But did performance really improve? Let's run the benchmark again:
❯ go test -bench=.
goos: darwin
goarch: arm64
pkg: playground/nosplit
BenchmarkSplit-8 4147 286728 ns/op
BenchmarkNoSplit-8 4123 286897 ns/op
PASS
ok playground/nosplit 3.507s
This time, there was barely any difference. The reason is that the compiler is clever enough to simply inline these tiny functions, and an inlined call has no prologue (and therefore no stack check to skip), so the nosplit directive has no effect.
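If you want to verify the inlining for yourself, the compiler can report its decisions: passing -m via -gcflags makes it print, for every function, whether it can be inlined and where calls get inlined. The package path below assumes the code lives in a nosplit directory, as in the benchmarks above:
❯ go build -gcflags=-m ./nosplit
Among the output you should see lines along the lines of "can inline funcWithNoSplit" and "inlining call to funcWithNoSplit", confirming that both helper functions get inlined into their loops.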
To really illustrate this, let's use another directive to prevent this optimization:
//go:noinline
//go:nosplit
func funcWithNoSplit() int {
return 1
}
and run the benchmark again:
❯ go test -bench=.
goos: darwin
goarch: arm64
pkg: playground/nosplit
BenchmarkSplit-8 3946 289233 ns/op
BenchmarkNoSplit-8 348 3444913 ns/op
PASS
ok playground/nosplit 2.816s
This time the gap is huge, but not in nosplit's favor: with inlining disabled, every one of the five million calls that LoopWithNoSplit makes pays real function-call overhead, while funcWithSplit still gets inlined away entirely. Whatever the skipped stack check might save is drowned out by the cost of the calls themselves.
Wrap up
I wasn't able to make my Advent of Code solutions any faster by using compiler directives either, but it was a fun experiment and I learned a lot. I hope you enjoyed this post!
The end
I usually tweet something when I've finished writing a new post. You can find me on Twitter.