1

Introduction

2

While doing Advent of code last year, I played around a lot with Go's benchmarking tools to try out different

3

optimizations.

4

5

I began by trying to reduce the number of allocations. I then procceded by trying to find the right balance between

6

concrete values and pointers. Finally, to try and squeeze the last bits of performance out of my code, I explored

7

different compiler directives.

8

9

In this post, I'll dive into the latter, and share a few insights and lessons learned.

10

11

Directives

12

Directives gives us the possibility to provide the compiler with special instructions. There are a couple of different

13

ones, but the one I played around with the most was //go:nosplit. To understand how to use this directive, we'll begin

14

with a primer on gouroutine stacks.

15

16

Each goroutine starts out with a tiny stack. The variables that define the actual sizes can be found in the runtime

17

package. These are unexported constants, and could probably change with any future version of Go, but as of today, they

18

are defined like this:

19

20
stackSystem = goos.IsWindows*512*goarch.PtrSize + goos.IsPlan9*512 + goos.IsIos*goarch.IsArm64*1024
21
stackMin = 2048
22
fixedStack0 = stackMin + stackSystem
23
fixedStack1 = fixedStack0 - 1
24
fixedStack2 = fixedStack1 | (fixedStack1 >> 1)
25
fixedStack3 = fixedStack2 | (fixedStack2 >> 2)
26
fixedStack4 = fixedStack3 | (fixedStack3 >> 4)
27
fixedStack5 = fixedStack4 | (fixedStack4 >> 8)
28
fixedStack6 = fixedStack5 | (fixedStack5 >> 16)
29
fixedStack  = fixedStack6 + 1
30
stackNosplit = abi.StackNosplitBase * sys.StackGuardMultiplier
31
stackGuard = stackNosplit + stackSystem + abi.StackSmall
21

22

Here we can see that the minimum stack size, in bytes, that the Go will use is 2048 bytes, which is equivalent to

23

2KB.

24

25

The fixedStack0 to fixedStack6 uses some clever bitwise manipulations to round up to the nearest power of 2. In

26

general, the pattern (value | value >> n) is a way of setting the nth bit of value if it isn't already set. When

27

combined repeatedly with right shifts of increasing powers of 2, this ensures that all bits below the highest set bit in

28

the original number are also set.

29

30

For instance, if fixedStack0 was 1011 in binary (which is 11 in decimal), the series of operations would make it

31

1111 in binary (15 in decimal), which is one less than the next power of 2 (16 in decimal). Hence, after the series of

32

bitwise OR operations fixedStack6 will be one less than the next power of two. So defining fixedStack like this:

33

fixedStack = fixedStack6 + 1 gives us the next power of 2.

34

35

I would assume that they have decided to do it this way for performance reasons too, because working with powers of 2

36

aligns really well with the binary nature of computer arithmetic and memory systems.

37

38

The purpose of the stackGuard is to determine a pointer that sits a certain number of bytes above the bottom of the

39

stack. This helps Go ensure that there's enough room left for certain operations.

40

41

Being able to start with a small stack, and dynamically growing it only if it's needed, is one of the reasons why

42

goroutines are so cheap to create. However, To prevent stack overflow errors, the runtime has to ensure that this chunk

43

of memory isn't exceeded. Therefore, before most function calls, the runtime checks if there's enough stack space for

44

the function to execute. If there isn't, the runtime will allocate a larger stack, and copy the current stack's

45

contents, before proceeding with the function call.

46

47

Back to directives

48

Now, if we know that our functions won't exceed the stack space, we can use the //go:nosplit directive to tell the

49

compiler to not perform this check, which should, at least in theory, give us a small performance gain.

50

51

Here, I've written a small program to try and illustrate this:

52

53
package nosplit
54
 
55
//go:nosplit
56
func funcWithNoSplit() int {
57
  return 1
58
}
59
 
60
func LoopWithNoSplit() int {
61
  sum := 0
62
  for i := 0; i < 1e6; i++ {
63
    sum += funcWithNoSplit()
64
    sum += funcWithNoSplit()
65
    sum += funcWithNoSplit()
66
    sum += funcWithNoSplit()
67
    sum += funcWithNoSplit()
68
  }
69
  return sum
70
}
71
 
72
func funcWithSplit() int {
73
  return 1
74
}
75
 
76
func LoopWithSplit() int {
77
  sum := 0
78
  for i := 0; i < 1e6; i++ {
79
    sum += funcWithSplit()
80
    sum += funcWithSplit()
81
    sum += funcWithSplit()
82
    sum += funcWithSplit()
83
    sum += funcWithSplit()
84
  }
85
  return sum
86
}
54

55

As you can see, we have the LoopWithNoSplit and LoopWithSplit functions. Each one of these functions uses a loop

56

that performs a million iterations. During each iteration another function is called 5 times. The first loop calls a

57

function where we are using the //go:nosplit directive.

58

59

I've also written two benchmark tests for these functions:

60

61
func BenchmarkSplit(b *testing.B) {
62
  for i := 0; i < b.N; i++ {
63
    _ = nosplit.LoopWithSplit()
64
  }
65
}
66
 
67
func BenchmarkNoSplit(b *testing.B) {
68
  for i := 0; i < b.N; i++ {
69
    _ = nosplit.LoopWithNoSplit()
70
  }
71
}
62

63

Now, let's run the benchmarks to see if it had any effect:

64

65
❯ go test -bench=.
66
goos: darwin
67
goarch: arm64
68
pkg: go-directives/nosplit
69
BenchmarkSplit-8            4180            286407 ns/op
70
BenchmarkNoSplit-8          3741            289680 ns/op
71
PASS
72
ok      go-directives/nosplit   3.327s
66

67

It looks like our performance improved by roughly 10%. But did it really? Let's run the benchmark again:

68

69
❯ go test -bench=.
70
goos: darwin
71
goarch: arm64
72
pkg: playground/nosplit
73
BenchmarkSplit-8            4147            286728 ns/op
74
BenchmarkNoSplit-8          4123            286897 ns/op
75
PASS
76
ok      playground/nosplit      3.507s
70

71

This time, there was barely any difference. The compiler is clever enough to simply inline these functions. Thus, making

72

the nosplit directive have no effect.

73

74

To really illustrate this, let's use another directive to prevent this optimization:

75

76
//go:noinline
77
//go:nosplit
78
func funcWithNoSplit() int {
79
  return 1
80
}
77

78

and run the benchmark again:

79

80
❯ go test -bench=.
81
goos: darwin
82
goarch: arm64
83
pkg: playground/nosplit
84
BenchmarkSplit-8            3946            289233 ns/op
85
BenchmarkNoSplit-8           348           3444913 ns/op
86
PASS
87
ok      playground/nosplit      2.816s
81

82

Wrap up

83

I wasn't able to make my advent of code solutions any faster by using compiler directives either, but it was a fun

84

experiement and I learned a lot. I hope you enjoyed this post!

84

85

The end

86

I usually tweet something when I've finished writing a new post. You can find me on Twitter

87

by clicking 

normalintroduction.md
||194:18

Recently Edited

Recently Edited

File name

Tags

Time to read

Created at

context

  • go
  • context
8 minutes2024-02-28

circular-buffers

  • go
  • concurrency
  • data processing
5 minutes2024-02-04

go-directives

  • go
  • compiler
  • performance
4 minutes2023-10-21

async-tree-traversals

  • node
  • trees
  • graphs
  • typescript
19 minutes2023-09-10

All Files

All Files

  • go

    5 files

  • node

    2 files

  • typescript

    1 file

  • frontend

    1 file

  • workflow

    7 files