Sometimes I'm the poster child for premature optimization.

Normally, I write something and, if it's fast enough, I don't bother with benchmarks. I've written a lot of Go, so I generally end up with performant code that gets the job done without benchmarking. Most of what I write has a specific audience, so I can simply judge it by whether it holds up for those clients.

But certain work, such as public APIs or my own home projects involving networking, needs benchmarks to guide optimizations.

Right now I'm writing a lot of network multiplexing code (a home project), and I track connections via uint32s, each representing a particular connection to a client.

As packets come in, I need to read the descriptor and map it onto the right backend client. While I can have thousands of connections open at any given time, adding a new connection pales in comparison to the number of packet lookups I must do to route all of the multiplexed data correctly.

My two data structures

There are two basic paths I considered, both sketched below:

  • A map[uint32]conn that maps my identifier to a conn object, protected by a sync.Mutex
  • An atomic.Value that holds the map[uint32]conn for frictionless reads, but which must copy the entire map (under a mutex) whenever a value is added
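
To make the comparison concrete, here's a minimal sketch of both shapes, assuming a placeholder conn type (the names connMap and atomicConnMap are mine, not from the benchmark code):

```go
package connmap

import (
	"sync"
	"sync/atomic"
)

// conn stands in for whatever connection object the uint32 maps to.
type conn struct{}

// connMap is the mutex approach: every Get and Add takes the lock.
type connMap struct {
	mu sync.Mutex
	m  map[uint32]conn
}

func (c *connMap) Get(id uint32) (conn, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[id]
	return v, ok
}

func (c *connMap) Add(id uint32, v conn) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[id] = v
}

// atomicConnMap is the atomic.Value approach: reads are lock-free, but
// every write copies the entire map under a mutex before storing the
// new version, which is where the allocations come from.
type atomicConnMap struct {
	mu sync.Mutex   // serializes writers only
	v  atomic.Value // always holds a map[uint32]conn
}

func (c *atomicConnMap) Get(id uint32) (conn, bool) {
	m, _ := c.v.Load().(map[uint32]conn)
	v, ok := m[id]
	return v, ok
}

func (c *atomicConnMap) Add(id uint32, v conn) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old, _ := c.v.Load().(map[uint32]conn)
	next := make(map[uint32]conn, len(old)+1)
	for k, val := range old {
		next[k] = val
	}
	next[id] = v
	c.v.Store(next)
}
```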

I wanted to benchmark this in the case of 1,000/10,000/100,000 values.

Benchmark Code

Here's the code in the playground, but you'll need to run it on your own machine as the playground doesn't support benchmarking:

https://play.golang.org/p/pometRnZuR0
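
If you just want the general shape of such a benchmark without pulling down the linked code (which may be organized differently), a hedged sketch against the connMap type above could look like this:

```go
package connmap

import (
	"fmt"
	"testing"
)

// benchSizes mirrors the 1,000 / 10,000 / 100,000 map sizes used below.
var benchSizes = []int{1000, 10000, 100000}

// BenchmarkMutexGet times single-entry fetches against the mutex-guarded
// map from the earlier sketch; a benchmark for atomicConnMap looks the same.
func BenchmarkMutexGet(b *testing.B) {
	for _, size := range benchSizes {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			c := &connMap{m: make(map[uint32]conn, size)}
			for i := 0; i < size; i++ {
				c.Add(uint32(i), conn{})
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				c.Get(uint32(i % size))
			}
		})
	}
}
```

Drop it in a _test.go file and run `go test -bench=. -benchmem` to get the ns/op and allocs/op columns below.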

The Results

Using Atomic

Adding a single entry

| Map Size | Time per op | Allocs |
| -------- | ----------- | ------ |
| 1000 | 37136 ns/op | 7 allocs/op |
| 10000 | 358635 ns/op | 12 allocs/op |
| 100000 | 4347930 ns/op | 1677 allocs/op |

Fetching a single entry

| Map Size | Time per op |
| -------- | ----------- |
| 1000 | 7.54 ns/op |
| 10000 | 7.58 ns/op |
| 100000 | 7.39 ns/op |

Getting 10,000 items with an Add every 100 Gets

In some side tests, writes got progressively worse and really nosedived at 100,000 items.

Reads stay constant at about 130ns per read when the map is under a lot of read pressure (spinning off read/write goroutines as fast as possible).

Using Mutex

Adding a single entry

| Map Size | Time per op | Allocs |
| -------- | ----------- | ------ |
| 1000 | 25.9 ns/op | 0 allocs/op |
| 10000 | 22.4 ns/op | 0 allocs/op |
| 100000 | 25.7 ns/op | 0 allocs/op |

Fetching a single entry

| Map Size | Time per op |
| -------- | ----------- |
| 1000 | 19.5 ns/op |
| 10000 | 17.3 ns/op |
| 100000 | 17.5 ns/op |

Getting 10,000 items with an Add every 100 Gets

| Map Size | Time per op |
| -------- | ----------- |
| 1000 | 3687016 ns/op |
| 10000 | 3654590 ns/op |
| 100000 | 3755740 ns/op |

This shows that if we have a new addition/deletion every 100 requests and we do 10K requests, it takes around 3.7ms to do all 10K.

That works out to roughly 370ns per Get on average when under a lot of read pressure.
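
Note that each benchmark op here covers all 10,000 Gets, which is why the ns/op numbers are in the millions. A hedged sketch of what that mixed workload might look like, again against the connMap type from the earlier sketch rather than the exact linked code:

```go
package connmap

import "testing"

// BenchmarkMutexGetWithAdds models the mixed workload: 10,000 Gets with an
// Add every 100 Gets, so each reported ns/op covers the whole 10,000 Gets.
func BenchmarkMutexGetWithAdds(b *testing.B) {
	const size = 10000 // also run with 1000 and 100000
	c := &connMap{m: make(map[uint32]conn, size)}
	for i := 0; i < size; i++ {
		c.Add(uint32(i), conn{})
	}
	b.ResetTimer()
	next := uint32(size)
	for i := 0; i < b.N; i++ {
		for j := 0; j < 10000; j++ {
			c.Get(uint32(j % size))
			if j%100 == 0 {
				c.Add(next, conn{})
				next++
			}
		}
	}
}
```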

Conclusion

We all know that micro-benchmarks lie. There are lies, damn lies, and micro benchmarks.

There is a hidden cost when you have allocations happening on every call: you pay for them during garbage collection, and that cost doesn't show up in these numbers.

Running the model in my head, I don't think either one is great for my use case. Now, sub-millisecond times are probably fine, and if I had to choose I'd probably just use the mutex. But because I have a multiplexed connection on a single incoming io.Reader in my use case:

  • I don't want head-of-line blocking when adding a new connection
  • I would like reads to run at the atomic.Value's lock-free speed

In the end, I used a blended approach.

  • All reads go to an atomic.Value. If the value is not found there, they fall back to a standard map protected by a sync.RWMutex (which is significantly slower than a plain Mutex unless reads greatly outnumber writes)
  • All writes (for an Add() or Delete()) go to that map and then push an update function onto a buffered channel, which refreshes the atomic.Value

Basically, I went with a read cache based on the atomic.Value, falling back to the slower map when we don't hit.
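
A rough sketch of that blended design, reusing the conn type from earlier (the name cachedConnMap, the channel buffer size, and the full-copy refresh are my guesses, not the actual implementation):

```go
package connmap

import (
	"sync"
	"sync/atomic"
)

// cachedConnMap layers a lock-free read cache (atomic.Value) over an
// authoritative map guarded by a sync.RWMutex. Writes hit the authoritative
// map immediately and the cache is refreshed asynchronously.
type cachedConnMap struct {
	cache   atomic.Value // always holds a map[uint32]conn (the read cache)
	mu      sync.RWMutex
	m       map[uint32]conn // authoritative map
	updates chan func()     // buffered queue of cache refreshes
}

func newCachedConnMap() *cachedConnMap {
	c := &cachedConnMap{
		m:       make(map[uint32]conn),
		updates: make(chan func(), 128), // buffer size is a guess
	}
	c.cache.Store(map[uint32]conn{})
	// A single goroutine applies queued refreshes, serializing Store calls.
	go func() {
		for fn := range c.updates {
			fn()
		}
	}()
	return c
}

// Get hits the atomic cache first and only falls back to the
// RWMutex-protected map on a miss.
func (c *cachedConnMap) Get(id uint32) (conn, bool) {
	if m, ok := c.cache.Load().(map[uint32]conn); ok {
		if v, ok := m[id]; ok {
			return v, true
		}
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[id]
	return v, ok
}

// Add writes to the authoritative map, then queues a cache rebuild. If the
// channel buffer is full this blocks, which is the unknown cost mentioned
// below. Delete would follow the same pattern.
func (c *cachedConnMap) Add(id uint32, v conn) {
	c.mu.Lock()
	c.m[id] = v
	c.mu.Unlock()

	c.updates <- func() {
		c.mu.RLock()
		next := make(map[uint32]conn, len(c.m))
		for k, val := range c.m {
			next[k] = val
		}
		c.mu.RUnlock()
		c.cache.Store(next)
	}
}
```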

You will be disappointed to know I didn't benchmark this; I'm going to wait until I can do it in the real world. In a synthetic benchmark, I would have to guess how often a read misses the atomic.Value map, and I don't know how often writes would actually block once the update channel's buffer fills up.

If you're wondering whether I realized I had probably prematurely optimized: yep, I sure did.