A simple start
When you work in this industry for a while, if you're introspective, you can discover what type of programmer you are. You might be a programmer obsessed with performance, or one who is always hunting for clever tricks.
I am neither of those. While I do like to find new techniques or simpler ways to do something, I generally prefer to steal from others who are much smarter than I. I dislike tracking down performance problems in systems unless it is necessary. I tend to be interested in delivering a product that functions with reasonable performance characteristics. I do not analyze assembly output in order to add custom assembly code to take advantage of a specific Intel instruction that does xyz faster (but thank God for a few people I've met who do). I rarely think that is necessary.
So, for reasons unknown to me, I have ended up going down a rabbit hole of exploring some Go concepts in order to make something faster and use less memory. Let me explain.
Recently I have had a need for an unbounded queue. Buffered channels are my go-to for a standard queue. And they are nice when you are okay with either:
- blocking when the buffer is full
- dropping an update when the buffer is full
Neither of those situations was going to work in this case. In reality, the queue's use would have a bound, but I would not know what that bound was ahead of time. I could not drop an update, have an update out of order, or block for an update. However, I could deal with higher memory use while we built up to whatever the bound would be.
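To make those two options concrete, here is a minimal sketch with a plain buffered channel (an illustration, not code from the queue):

// sendBlocking waits until there is room in the buffer.
func sendBlocking(ch chan interface{}, item interface{}) {
	ch <- item
}

// sendOrDrop drops the update when the buffer is full.
func sendOrDrop(ch chan interface{}, item interface{}) bool {
	select {
	case ch <- item:
		return true
	default:
		return false // buffer full, the update is lost
	}
}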
What I wanted was something that:
- was FIFO for a single reader
- pushed new items fast
- read items fast
- grew when the buffer became full
- shrank when buffer space was not required
- was unbounded
- provided methods that popped if possible or returned !ok
- provided methods that blocked until they could pop
- did not eat up the CPU while one of those blocking methods was waiting
- used no goroutines
- solved a generic use case, so []interface{}
Note on being generic: Yeah, I hate []interface{}. Empty interface{} conversion is a time suck and I hate not being type-checked at compile time. However, I'm anti-generics with the proposals I've seen; they are just ugly and I don't want to spend six weeks explaining them to newbies. I'm fairly convinced that, with the brain power behind trying to come up with generics in Go, if there was a good answer it would have happened. They will either screw up compile time or make the code more difficult to understand. With that said, this code is probably ripe for using Genny or some other type of pre-compiled code generator to save you.
To test implementations, I eventually decided on an interface I wanted for testing:
type Buffer interface {
	// Push an item onto the buffer. If we have reached maximum
	// buffer size, return false.
	Push(item interface{}) (ok bool)

	// Force is like Push() except it will block until we can push onto the buffer.
	Force(item interface{})

	// Pop retrieves a value from the buffer. It returns false if the buffer
	// is empty.
	Pop() (val interface{}, ok bool)

	// Pull is like Pop() except it will block until it can retrieve a value.
	Pull() interface{}
}
This isn't the most Go-ish interface. Pull(), for example, could return a channel that keeps yielding values until another method like Close() is called. But I wanted to avoid spinning off goroutines. My final version might include these.
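For illustration, that more channel-oriented shape might look something like this hypothetical sketch (the names are mine, and feeding the channel would require a background goroutine, which is exactly what I wanted to avoid):

// Hypothetical alternative shape; not what I implemented.
type ChanBuffer interface {
	// Push adds an item to the buffer, returning false if it cannot.
	Push(item interface{}) (ok bool)
	// Stream returns a channel that yields values until Close() is called.
	Stream() <-chan interface{}
	// Close stops the stream and releases its goroutine.
	Close()
}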
But before I go any further, other than Attempt #1, everything I am doing here is wrong. This is a near-perfect example of over-optimizing. When writing an application you should concentrate on:
- Application structure
- Nice interfaces to data structures that can have the internals switched out
- Good data structures
Performance optimization should come after testing, tracing, etc... of bottlenecks that are unacceptable in your application. Good data structures lead to more optimized algorithms if you need them. And if you abstract your interfaces cleanly, you can always optimize behind the scenes.
For reference, you can find the various code that I show or don't show here:
Different versions of the FIFO queue:
http://github.com/johnsiilver/golib/queue/fifo/unbounded/experimental/
The final version:
http://github.com/johnsiilver/golib/queue/fifo/unbounded/
Attempt 1: Not Being Clever
My first crack at this was to avoid being clever at all and use Go in the most simple way possible.
This involved:
- using a slice to hold the queue
- using the slice's sliding window over the underlying array to dequeue entries
- using a mutex when enqueuing or dequeuing to the slice
- using a channel so that blocking calls could simply wait on a receive instead of needing custom sleep/spin routines
Here's what the first attempt looked like:
package unbounded

import "sync"

// Buffer provides a first in first out non-blocking (but not lock-free) Buffer.
type Buffer struct {
	mu     sync.Mutex
	nextCh chan interface{}
	data   []interface{}
}

// New is the constructor for Buffer.
func New() *Buffer {
	return &Buffer{nextCh: make(chan interface{}, 1)}
}
// Push adds an item to the Buffer.
func (q *Buffer) Push(item interface{}) {
	q.mu.Lock()
	q.data = append(q.data, item)
	q.shift()
	q.mu.Unlock()
}

// Force will push the item onto the buffer and blocks until the operation completes.
func (q *Buffer) Force(item interface{}) {
	q.Push(item)
}

// shift will move the next available entry from our queue into the channel to
// be returned to the user. Must be locked by a calling function.
func (q *Buffer) shift() {
	if len(q.data) > 0 {
		select {
		case q.nextCh <- q.data[0]:
			q.data = q.data[1:]
		default:
		}
	}
}
// Pop returns the next value in the buffer. It returns ok == false if there
// is no value in the buffer.
func (q *Buffer) Pop() (val interface{}, ok bool) {
	select {
	case item := <-q.nextCh:
		q.mu.Lock()
		q.shift()
		q.mu.Unlock()
		return item, true
	default:
		return nil, false
	}
}

// Pull pulls an item off the buffer. It blocks until a value is found.
func (q *Buffer) Pull() interface{} {
	item := <-q.nextCh
	q.mu.Lock()
	q.shift()
	q.mu.Unlock()
	return item
}
This one worked well enough. But it got me thinking about other ways to do this. I have no idea why. I should have just stopped. I had a good data structure for what I needed. I could optimize later if I needed to. But for some reason, I just decided I wanted to waste a day, which is very unlike me. Sorry, time to dive headfirst into the rabbit hole.
First, for whatever crazy reason, I dislike using Mutex. In recent times I've begun using atomic.Value and other atomic operations to protect constantly changing values. I've benchmarked these and found they provide significant gains if you hammer the variable with changes. But that got me thinking about just using atomic.CompareAndSwap* for my Mutex needs.
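As a quick illustration of the atomic.Value pattern I mean (a generic sketch, not code from this queue), a writer replaces a whole value and readers load a consistent snapshot without a Mutex:

package main

import (
	"fmt"
	"sync/atomic"
)

type config struct {
	limit int
}

func main() {
	var v atomic.Value
	v.Store(&config{limit: 10}) // the writer swaps in a whole new value

	cfg := v.Load().(*config) // readers get a consistent snapshot, no Mutex
	fmt.Println(cfg.limit)
}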
Second, there are a lot of array copies happening in this version. Array copies are very fast¹, but I don't like doing quite so many (this irked me, but it has no bearing on the efficiency of the method). Wouldn't some form of circular buffer be faster? Of course, it would need to be a circular buffer that didn't have a set size.
¹Note: I saw a talk in the last year from someone showing that array copies were faster than heap allocations and pointer swaps by some crazy multiple. I remember his evidence being fairly good, and for this reason I concentrated on either using channels or slices for the Buffer. However, there are two old sayings from Google that apply here:
- In God we trust, everyone else must bring the data
- Trust but verify
I'm sure these originate somewhere else, for anyone who is really nitpicky. But Google is the first place I heard them said.
Go wants you to be clever, but not too clever
So Go isn't the language for people who want to be clever with optimizations. C is the language for clever optimizations. No runtime to contend with, crazy pointer arithmetic and direct control.
Go really wants you to be clever with your data structures, APIs, etc... Make it easy for your users and maintainers. There is an entire philosophy there, and after you swim in its waters for a while you do find it refreshing. At first, it just feels cold.
With that said, when you start looking at the internals of low-level tasks, you will often find that the standard library has access to code your programs don't. So, for example, if you want to spin as efficiently as the Mutex implementation does, you can't. Mutex has access to runtime_doSpin(), which your program doesn't.
There are other memory barrier tricks that your code can't take advantage of unless you want to use Cgo, which for the purpose of this article is cheating. And I'm sure I can do a hack to gain access to these things, but that's not portable. One of Go's strengths is that until you start using unsafe or syscall, your code really is portable (while C just dreams it is actually portable).
It is important to understand this, because if you try to write a Mutex that is more optimized for your use case, you are not going to be as efficient. You can attempt to do some of this with runtime.Gosched(), but you are rolling that rock uphill. Apparently, I like rolling rocks.
Trying to be clever
My first attempt at being clever was to use channels as a circular buffer. The circular buffer would be protected by a simple atomic.CompareAndSwapInt32(). If I couldn't insert a value into the channel, I would simply create a new channel with double the capacity, copy values from the old channel to the new channel, insert the new value, and swap the old channel for the new one. Pretty much what append() does for slices.
My reasoning was:
- A lot of work has been done in channels to prevent the use of locks when the channel is not blocked
- select has some very nice properties that can be used
- atomic.CompareAndSwapInt32 is much faster than sync.Mutex
I knew that there was some risk:
- Mutex makes certain guarantees that a caller will not be starved. CompareAndSwapInt32() by itself does not.
- I needed to write my own sleeping code that tried to be a spinlock and eventually slept up to some maximum value. Otherwise, when idle these locking loops would eat the CPUs alive.
The sleep code took a while to get to a point that seemed to work in a manner that didn't kill performance nor choke my CPU for too long.
package spin

import (
	"runtime"
	"time"
)

// Sleeper allows sleeping for an increasing time period up to 1 second
// after yielding the goroutine 2^16 times.
// This is not thread-safe and should be thrown away once the loop that calls it
// is able to perform its function.
type Sleeper struct {
	loop uint16
	at   time.Duration
}

// Sleep at minimum allows another goroutine to be scheduled and after 2^16
// calls will begin to sleep from 1 nanosecond to 1 second, with each
// call raising the sleep time by a multiple of 10.
func (s *Sleeper) Sleep() {
	const maxSleep = 1 * time.Second

	if s.loop < 65535 {
		runtime.Gosched()
		s.loop++
		return
	}

	if s.at == 0 {
		s.at = 1 * time.Nanosecond
	}

	time.Sleep(s.at)

	if s.at < maxSleep {
		s.at = s.at * 10
		if s.at > maxSleep {
			s.at = maxSleep
		}
	}
}
The first custom version came out like this:
package custom_sleep_atomic // the post omitted the package clause; name assumed from the benchmark's import path

import (
	"runtime"
	"sync/atomic"
	"time"

	"github.com/johnsiilver/golib/queue/fifo/internal/spin"
)

// Unbounded indicates that the Queue should have no memory bounds.
const Unbounded = -1

const (
	unlocked = int32(0)
	locked   = int32(1)
)

// Buffer provides a FIFO circular buffer that can grow and shrink as required.
// Buffer must not be copied after creation (which means use a pointer if
// passing between functions).
type Buffer struct {
	// Max is the maximum size the Buffer can grow to. Use Unbounded if
	// you wish to grow the buffer to any size. By default, this will grow to 1k items.
	Max int

	lockInt    int32
	lastShrink time.Time
	data       chan interface{}
}
// Push pushes an item into the circular buffer. "ok" indicates if this happens.
func (c *Buffer) Push(item interface{}) (ok bool) {
	c.lock()
	defer c.unlock()

	select {
	case c.data <- item:
		return true
	default:
	}

	// The buffer was too small, grow the buffer and then insert.
	c.grow()

	select {
	case c.data <- item:
		return true
	default:
		return false
	}
}

// Force will push the item onto the buffer and blocks until the operation completes.
func (c *Buffer) Force(item interface{}) {
	sleeper := spin.Sleeper{}
	for {
		if c.Push(item) {
			return
		}
		sleeper.Sleep()
	}
}
// Pop returns the next value off the circular buffer. If the buffer is empty
// ok will be false.
func (c *Buffer) Pop() (value interface{}, ok bool) {
	c.lock()
	defer c.unlock()

	select {
	case v := <-c.data:
		c.shrink()
		return v, true
	default:
		return nil, false
	}
}

// Pull pulls an item off the circular buffer. It blocks until a value is found.
func (c *Buffer) Pull() interface{} {
	sleeper := spin.Sleeper{}
	for {
		v, ok := c.Pop()
		if !ok {
			sleeper.Sleep()
			continue
		}
		return v
	}
}
// grow will double the size of the internal buffer until we hit the max size
// if the buffer is currently full.
// Note: grow must be protected with .lock/.unlock.
func (c *Buffer) grow() {
	if c.Max == 0 {
		c.Max = 1000
	}

	if cap(c.data) == c.Max {
		return
	}

	if len(c.data) == cap(c.data) {
		if c.Max == Unbounded {
			ch := make(chan interface{}, cap(c.data)*2)
			c.copy(ch, c.data)
			c.data = ch
			return
		}

		if cap(c.data) < c.Max {
			size := cap(c.data) * 2
			if size == 0 {
				size = 8
			}
			if size > c.Max {
				size = c.Max
			}
			ch := make(chan interface{}, size)
			c.copy(ch, c.data)
			c.data = ch
			return
		}
	}
}
// shrink shrinks the size of the internal buffer if the length of the buffer
// is < 50% of the capacity. The reduction will be by 25% but will produce
// a buffer size of no less than 8 slots.
// Note: shrink must be protected with .lock/.unlock.
func (c *Buffer) shrink() {
	if cap(c.data) == 8 {
		return
	}

	// Only shrink at most once every 10 minutes.
	if time.Now().Sub(c.lastShrink) < 10*time.Minute {
		return
	}

	// If the current unused capacity is > 50% of the buffer, reduce it by 25%.
	if (cap(c.data) - len(c.data)) > (cap(c.data) / 2) {
		size := int(float64(cap(c.data)) * .75)
		if size < 8 {
			size = 8
		}
		ch := make(chan interface{}, size)
		c.copy(ch, c.data)
		c.data = ch
		c.lastShrink = time.Now()
	}
}
func (c *Buffer) copy(dst chan<- interface{}, src chan interface{}) {
	if src == nil { // the zero value Buffer starts with a nil channel
		return
	}

	if (cap(dst) - len(dst)) < (cap(src) - len(src)) {
		panic("internal error: Buffer.copy() cannot be called when dst is smaller than src")
	}

	close(src)
	for v := range src {
		dst <- v
	}
}
func (c *Buffer) lock() {
	for {
		if atomic.CompareAndSwapInt32(&c.lockInt, unlocked, locked) {
			return
		}
		runtime.Gosched()
	}
}

func (c *Buffer) unlock() {
	for {
		if atomic.CompareAndSwapInt32(&c.lockInt, locked, unlocked) {
			return
		}
		runtime.Gosched()
	}
}
This one had some nice properties. It had a useful zero value, which I tried to use with other implementations instead of a constructor (which was not always possible).
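In practice that means construction is just a struct literal. A quick usage sketch:

buf := &Buffer{} // defaults to growing up to 1k items; use &Buffer{Max: Unbounded} for no limit
buf.Force("hello")
fmt.Println(buf.Pull()) // prints "hello"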
It also only shrank at intervals, reducing the need for array copies.
The Truth is in the Pudding
Now I have two versions, but how does each stack up against the other? Oh, the fun of benchmarking!
bench.go
package bench

import (
	"sync"
	"testing"
)

type unbounded interface {
	Force(item interface{})
	Pull() interface{}
}

func singleRun(bench *testing.B, n func() unbounded, items, senders, receivers int) {
	for i := 0; i < bench.N; i++ {
		bench.StopTimer()

		b := n()
		sendCh := make(chan int, items)
		wg := sync.WaitGroup{}
		wg.Add(items)

		// Set up senders.
		for i := 0; i < senders; i++ {
			go func() {
				for v := range sendCh {
					b.Force(v)
				}
			}()
		}

		// Set up receivers.
		for i := 0; i < receivers; i++ {
			go func() {
				for {
					b.Pull()
					wg.Done()
				}
			}()
		}

		bench.StartTimer()

		// Send to Buffer (which the receivers will read from).
		go func() {
			for i := 0; i < items; i++ {
				sendCh <- i
			}
			close(sendCh)
		}()

		wg.Wait()
	}
}
bench_test.go
package bench
import (
	"fmt"
	"testing"

	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_locks"
	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/custom_sleep_atomic"
	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/sliding_slice_atomic"
	"github.com/johnsiilver/golib/queue/fifo/experimental/unbounded/sliding_slice_locks"
)

func BenchmarkUnboundedQueues(b *testing.B) {
	runs := []struct{ items, senders, receivers int }{
		{100000, 1, 1},
		{100000, 10, 1},
		{100000, 1, 10},
		{100000, 10, 10},
		{100000, 100, 10},
		{100000, 10, 100},
	}

	benchmarks := []struct {
		name    string
		newFunc func() unbounded
	}{
		{
			"sliding_slice_atomic",
			func() unbounded { return sliding_slice_atomic.New() },
		},
		{
			"sliding_slice_locks",
			func() unbounded { return sliding_slice_locks.New() },
		},
		{
			"custom_sleep_atomic",
			func() unbounded { return &custom_sleep_atomic.Buffer{} },
		},
		{
			"custom_locks",
			func() unbounded { return &custom_locks.Buffer{} },
		},
	}

	for _, run := range runs {
		for _, benchmark := range benchmarks {
			b.Run(
				benchmark.name+fmt.Sprintf("-items: %d, senders: %d, receivers: %d", run.items, run.senders, run.receivers),
				func(b *testing.B) {
					singleRun(b, benchmark.newFunc, run.items, run.senders, run.receivers)
				},
			)
		}
	}
}
This benchmark was set up to test a few scenarios:
- Single sender and receiver
- 10 senders and 1 receiver
- 1 sender and 10 receivers
- 10 senders and 10 receivers
- 100 senders and 10 receivers
- 10 senders and 100 receivers
This gives a certain static scaling of the number of senders and receivers to see how each scales for different scenarios. I want a good generic solution that handles each of these reasonably.
In this initial test, I also included versions of both that use atomic.CompareAndSwapInt32() and sync.Mutex. Here are the initial results:
Note: Bold is the fastest version.
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
sliding_slice_atomic | 100000 | 1 | 1 | 30 | 50688582 |
sliding_slice_locks | 100000 | 1 | 1 | 30 | 41097381 |
custom_sleep_atomic | 100000 | 1 | 1 | 20 | 60742172 |
custom_locks | 100000 | 1 | 1 | 30 | 61267323 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
sliding_slice_atomic | 100000 | 10 | 1 | 30 | 46452147 |
sliding_slice_locks | 100000 | 10 | 1 | 50 | 37041679 |
custom_sleep_atomic | 100000 | 10 | 1 | 10 | 106059804 |
custom_locks | 100000 | 10 | 1 | 20 | 116124583 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
sliding_slice_atomic | 100000 | 100 | 1 | 30 | 59664733 |
sliding_slice_locks | 100000 | 100 | 1 | 30 | 39991409 |
custom_sleep_atomic | 100000 | 100 | 1 | 1 | 1048544140 |
custom_locks | 100000 | 100 | 1 | 2 | 699103394 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
sliding_slice_atomic | 100000 | 10 | 10 | 30 | 50459215 |
sliding_slice_locks | 100000 | 10 | 10 | 30 | 42415787 |
custom_sleep_atomic | 100000 | 10 | 10 | 20 | 76158348 |
custom_locks | 100000 | 10 | 10 | 20 | 74131977 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
sliding_slice_atomic | 100000 | 10 | 100 | 20 | 82940141 |
sliding_slice_locks | 100000 | 10 | 100 | 30 | 45741994 |
custom_sleep_atomic | 100000 | 10 | 100 | 5 | 221296458 |
custom_locks | 100000 | 10 | 100 | 20 | 97246758 |
Well....huh...
First takeaways:
- atomic and custom sleeping are not beating standard Mutex
- The array copies are most likely beating the custom channel copies
- Using select and channels, while providing some advantages, is probably hurting us with a lot of extra locking and other magic
There was some profiling and such, but I decided to go on and test a bunch of other versions.
If At First You Don't Succeed, Fail Some More??
It would be a lie to say I'm covering all the steps I took here. It was a winding path of outright failure and tuning. There were other tests I wrote to prove that CPU usage would drop after being stuck either pushing or pulling, and other things I'm not going to cover. The rabbit hole seems to keep going.
In the end, I decided to benchmark against a few more methods:
- Versions of each that had no sleep. These would eat CPU, but I wanted to see if it made a difference when using atomic for locking
- A reference type that used a channel with a buffer of 100 and was not unbounded. Resizing to support unbounded is expensive, but how expensive?
- A queue that was more CS standard: records with pointers pointing to the next entry in the queue (I called this the heap version; a rough sketch follows after this list). According to the talk I saw, this should be much slower, but how much slower?
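The heap version's source lives in the repo rather than in this post, but the shape is roughly this (an illustrative sketch with my own naming, not the exact heap_lock code): a singly linked list of nodes guarded by a Mutex, with Force/Pull looping on the same Sleeper shown earlier. The reference type, for its part, is presumably just a thin wrapper where Force is a blocking channel send and Pull is a channel receive.

// Sketch of a pointer-based ("heap") FIFO queue; names and details are
// illustrative, not the exact heap_lock implementation.
package heap

import "sync"

type node struct {
	val  interface{}
	next *node
}

// Queue is an unbounded FIFO backed by a singly linked list.
type Queue struct {
	mu         sync.Mutex
	head, tail *node
}

// Push allocates a node for the item and appends it to the tail.
func (q *Queue) Push(item interface{}) {
	n := &node{val: item}
	q.mu.Lock()
	if q.tail == nil {
		q.head, q.tail = n, n
	} else {
		q.tail.next = n
		q.tail = n
	}
	q.mu.Unlock()
}

// Pop removes the head of the list. ok is false if the queue is empty.
func (q *Queue) Pop() (val interface{}, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.head == nil {
		return nil, false
	}
	n := q.head
	q.head = n.next
	if q.head == nil {
		q.tail = nil
	}
	return n.val, true
}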
Well, let's have a look:
Note: Bold is for the fastest, discounting the reference.
Note: The reference is in blue when it was the fastest.
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
reference | 100000 | 1 | 1 | 100 | 20660880 |
sliding_slice_atomic | 100000 | 1 | 1 | 30 | 49620908 |
sliding_slice_locks | 100000 | 1 | 1 | 30 | 38928647 |
custom_sleep_atomic | 100000 | 1 | 1 | 30 | 60764994 |
custom_nosleep | 100000 | 1 | 1 | 20 | 61262504 |
custom_locks | 100000 | 1 | 1 | 20 | 63590469 |
heap_lock | 100000 | 1 | 1 | 50 | 27832756 |
heap_atomic | 100000 | 1 | 1 | 100 | 25798467 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
reference | 100000 | 10 | 1 | 50 | 31719731 |
sliding_slice_atomic | 100000 | 10 | 1 | 20 | 99060012 |
sliding_slice_locks | 100000 | 10 | 1 | 30 | 78695899 |
custom_sleep_atomic | 100000 | 10 | 1 | 10 | 119628479 |
custom_nosleep | 100000 | 10 | 1 | 10 | 105837735 |
custom_locks | 100000 | 10 | 1 | 10 | 111282187 |
heap_lock | 100000 | 10 | 1 | 50 | 28885042 |
heap_atomic | 100000 | 10 | 1 | 50 | 28278157 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
reference | 100000 | 100 | 1 | 50 | 29424432 |
sliding_slice_atomic | 100000 | 100 | 1 | 30 | 45287610 |
sliding_slice_locks | 100000 | 100 | 1 | 30 | 49388002 |
custom_sleep_atomic | 100000 | 100 | 1 | 2 | 1014270608 |
custom_nosleep | 100000 | 100 | 1 | 1 | 1018922777 |
custom_locks | 100000 | 100 | 1 | 3 | 478910481 |
heap_lock | 100000 | 100 | 1 | 100 | 26944540 |
heap_atomic | 100000 | 100 | 1 | 50 | 28301942 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
reference | 100000 | 10 | 10 | 50 | 29539230 |
sliding_slice_atomic | 100000 | 10 | 10 | 30 | 47400717 |
sliding_slice_locks | 100000 | 10 | 10 | 30 | 44891524 |
custom_sleep_atomic | 100000 | 10 | 10 | 20 | 73116554 |
custom_nosleep | 100000 | 10 | 10 | 20 | 74409189 |
custom_locks | 100000 | 10 | 10 | 20 | 70609204 |
heap_lock | 100000 | 10 | 10 | 50 | 23445934 |
heap_atomic | 100000 | 10 | 10 | 50 | 26386795 |
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
reference | 100000 | 10 | 100 | 50 | 24754023 |
sliding_slice_atomic | 100000 | 10 | 100 | 20 | 58304944 |
sliding_slice_locks | 100000 | 10 | 100 | 50 | 46103498 |
custom_sleep_atomic | 100000 | 10 | 100 | 5 | 203424993 |
custom_nosleep | 100000 | 10 | 100 | 5 | 211169360 |
custom_locks | 100000 | 10 | 100 | 20 | 89729445 |
heap_lock | 100000 | 10 | 100 | 50 | 25450734 |
heap_atomic | 100000 | 10 | 100 | 100 | 34094147 |
Well....huh...., again!
It is not surprising the reference is faster in several runs. But what surprised me was the standard pointer-based queue. The heap_lock and heap_atomic versions took this test by storm.
heap_atomic and heap_lock being so close in most benchmarks leads me to believe that the custom.* versions are slower because there is a bunch of lock contention with the combination of channels/locks(or atomic)/select. I could go work on that, but I don't think that is worth my time.
Now, I'm not sure why the assessment from the talk I listened to was wrong. I tend to think that this kind of thing is my fault. It could be:
- I misheard what he was saying
- I'm optimizing in some way
- The compiler has changed
- The benchmark is broken
- He was just dead wrong
No matter which one it is, heap_lock is where I am going to concentrate my effort.
But why not heap_atomic? While it did win some benchmarks, simpler is better. There isn't enough of a difference to make me want to use atomic over Mutex, and Mutex makes guarantees about goroutine starvation that atomic does not.
What If I Made the Heap Version Circular?
The heap version uses a simple queue. It is memory efficient and apparently fast. But I'm still haunted by allocating extra memory when I don't need to.
What if instead of throwing away entries when dequeuing I just move a pointer in a circular buffer? That entry could be reused without the need for allocating a new entry. And if I run out of room I can always insert more entries at the back.
This will add some overhead in checking if things like the front and tail pointers are the same to know if we need to do an allocation.
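Roughly, the idea looks like this (an illustrative sketch with my own naming, not the exact heap_circular code): the nodes form a ring, Pop just advances the head pointer and leaves the node in place for reuse, and Push only allocates when every node in the ring is occupied.

// Sketch of a node-reusing circular queue; illustrative only.
package ring

import "sync"

type node struct {
	val  interface{}
	next *node
}

// Queue is a FIFO built on a ring of linked nodes. Popped nodes stay in the
// ring and are reused; new nodes are only spliced in when the ring is full.
type Queue struct {
	mu     sync.Mutex
	head   *node // next node to pop
	tail   *node // node holding the newest item
	length int   // items currently stored
	size   int   // nodes in the ring
}

// New returns a Queue whose ring starts with a single node.
func New() *Queue {
	n := &node{}
	n.next = n // a ring of one
	return &Queue{head: n, tail: n, size: 1}
}

// Push stores the item in the free node after tail, splicing in a new node
// only when every node in the ring is occupied.
func (q *Queue) Push(item interface{}) {
	q.mu.Lock()
	defer q.mu.Unlock()

	if q.length == q.size {
		// Ring is full: allocate a node and splice it in after tail.
		n := &node{val: item, next: q.tail.next}
		q.tail.next = n
		q.tail = n
		q.size++
	} else {
		// Reuse the free node after tail.
		q.tail = q.tail.next
		q.tail.val = item
	}
	q.length++
}

// Pop returns the oldest item. The node it occupied stays in the ring for reuse.
func (q *Queue) Pop() (val interface{}, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()

	if q.length == 0 {
		return nil, false
	}
	val = q.head.val
	q.head.val = nil // drop the reference so the value can be collected
	q.head = q.head.next
	q.length--
	return val, true
}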
Here's a first attempt at it with 100k items:
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
heap_lock | 100000 | 1 | 1 | 50 | 26110856 |
heap_circular | 100000 | 1 | 1 | 50 | 26823725 |
heap_lock | 100000 | 10 | 1 | 50 | 29731315 |
heap_circular | 100000 | 10 | 1 | 30 | 34592901 |
heap_lock | 100000 | 100 | 1 | 50 | 30892013 |
heap_circular | 100000 | 100 | 1 | 30 | 40746096 |
heap_lock | 100000 | 10 | 10 | 50 | 27873525 |
heap_circular | 100000 | 10 | 10 | 50 | 35315318 |
heap_lock | 100000 | 10 | 100 | 20 | 57328038 |
heap_circular | 100000 | 10 | 100 | 30 | 60764377 |
Well, it looks like the simple queue (heap_lock) is still winning. Not by a huge amount, but enough.
I wonder if we see better results with 1m items:
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
heap_lock | 1000000 | 1 | 1 | 5 | 275110923 |
heap_circular | 1000000 | 1 | 1 | 5 | 251509839 |
heap_lock | 1000000 | 10 | 1 | 5 | 316068383 |
heap_circular | 1000000 | 10 | 1 | 3 | 367007666 |
heap_lock | 1000000 | 100 | 1 | 3 | 489702046 |
heap_circular | 1000000 | 100 | 1 | 3 | 397814725 |
heap_lock | 1000000 | 10 | 10 | 3 | 418825312 |
heap_circular | 1000000 | 10 | 10 | 3 | 404838549 |
heap_lock | 1000000 | 10 | 100 | 2 | 579100509 |
heap_circular | 1000000 | 10 | 100 | 2 | 630555319 |
How about 10m items:
Version | #Items | #Senders | #Receivers | Iterations | ns/op |
---|---|---|---|---|---|
heap_lock | 10000000 | 1 | 1 | 1 | 2719649748 |
heap_circular | 10000000 | 1 | 1 | 1 | 2896301313 |
heap_lock | 10000000 | 10 | 1 | 1 | 3239424103 |
heap_circular | 10000000 | 10 | 1 | 1 | 3812422117 |
heap_lock | 10000000 | 100 | 1 | 1 | 3314603774 |
heap_circular | 10000000 | 100 | 1 | 1 | 4393173577 |
heap_lock | 10000000 | 10 | 10 | 1 | 4271595984 |
heap_circular | 10000000 | 10 | 10 | 1 | 4358516405 |
heap_lock | 10000000 | 10 | 100 | 1 | 6051491583 |
heap_circular | 10000000 | 10 | 100 | 1 | 5518405633 |
Now my growing circular buffer is probably overly complicated, which leads to some time suck. And it is close in time to heap_lock for the common cases (1:1/10:10/10:100 senders to receivers).
But looking at the trace tool, the allocations are just minor time sinks, and GC rarely kicks in on heap_lock because of them. That makes heap_lock the overall winner in my view. The code is easy to follow and fast.
I think heap_circular has merit, but with a slightly different interface and only if the data being stored is concrete, not interface{}. If the entries stored large data, such as []byte, we could use it as a combination FIFO queue and free-list. You could have a method that returns an empty value, pulling a dequeued entry if one is available or allocating a new one if not. That could then be used to enqueue without a large allocation. But that is a very specific use case, which I am not going to test. And of course, there is the very good argument of just having a free list and this queue separate.
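If I ever built it, the API might look something like this hypothetical sketch (the names are illustrative; nothing like this exists in the repo):

// Hypothetical combined FIFO queue and free-list for []byte payloads.
type ByteQueue interface {
	// Get returns a previously dequeued []byte for reuse, or a freshly
	// allocated one if no spare entries exist.
	Get() []byte
	// Push enqueues a buffer, typically one obtained from Get().
	Push(b []byte)
	// Pop dequeues the oldest buffer; its backing array remains available
	// to later Get() calls.
	Pop() (b []byte, ok bool)
}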
I am absolutely sure that many of these can be made faster with some trivial work (but I've wasted enough time with this thought experiment). And there were all kinds of things I didn't try (using []unsafe.Pointer and CompareAndSwapPointer() or a slice of slices or ...).
But now I am absolutely sure that:
- I should have used the one in Attempt #1
- I am not a computer scientist
If you read this far, I'm sorry to have dragged you down the rabbit hole with me. Misery occasionally loves company.