Introduction
It might be the fever talking, but I have found a new use for the much maligned runtime.SetFinalizer().
Between urgent care runs through the holidays for my family members and trying to adapt songs about Christmas into songs about coughing (aka "Hard Candy Christmas" becomes "Hard Coughing Christmas"), this little problem popped into my head:
How do you use a *sync.Pool to recover gRPC protocol buffers?
Go program speed seems linked to some combination of the number of allocations and their size. Allocations create work for the GC, and most write-ups point to this as the biggest speed difference between non-GC languages and Go.
We can see that Java and C# programs often get close to Go's speed in common applications like web services, just not to its memory usage.
Many Go programmers who want to get close to C/C++/Rust speed spend a lot of time trying to control allocations: doing weird things with GOMAXPROCS, changing when GCs happen, or allocating large chunks of virtual memory to trick the GC (a sketch of that last trick is below).
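As an aside, that last trick is usually called a ballast. A minimal sketch of the pattern (not something this article depends on):

```go
package main

import "runtime"

func main() {
	// A 10 GiB ballast. Until written to, it is mostly just virtual
	// memory, but it inflates the heap size the GC sees, pushing the
	// next collection target far away so GCs run less often.
	ballast := make([]byte, 10<<30)

	// ... run your service here ...

	// Keep the ballast alive for the life of the program.
	runtime.KeepAlive(ballast)
}
```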
gRPC Problem
gRPC is my chosen platform for RPCs, though this problem affects most Go RPC services, or really any case where control of an object leaves code you control and never returns.
Repeated large allocations are bad for speed, both because of the time it takes to allocate the objects and because of the time the garbage collector must spend tracking them.
In the old days, experts would tell you to build circular buffers on top of channels so you could reuse expensive objects on demand. This lowered your large allocations, but the circular buffer wouldn't automatically adjust its size: it might hold more memory than you needed, or not enough and constantly create new objects. It needed some automation. A sketch of the pattern follows.
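For illustration, that channel-based free list pattern looked something like this (the names here are mine):

```go
// freeList reuses byte slices via a buffered channel. Get hands back a
// recycled buffer when one is available, otherwise it allocates a new
// one. Put offers a buffer back for reuse, dropping it when the list
// is already full.
type freeList struct {
	c chan []byte
}

func newFreeList(size int) *freeList {
	return &freeList{c: make(chan []byte, size)}
}

func (f *freeList) Get() []byte {
	select {
	case b := <-f.c:
		return b[:0] // Reused: zero length, capacity retained.
	default:
		return make([]byte, 0, 4096)
	}
}

func (f *freeList) Put(b []byte) {
	select {
	case f.c <- b:
	default: // Full: let the GC have this one.
	}
}
```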
The Go authors added a standard one called sync.Pool. This is a fast free list in which you can store heap objects for reuse.
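Basic usage looks something like this sketch:

```go
import "sync"

var bufPool = sync.Pool{
	// New is called when the pool has nothing to hand out.
	New: func() interface{} {
		return make([]byte, 0, 4096)
	},
}

func handle() {
	buf := bufPool.Get().([]byte)[:0] // Pull a buffer and truncate it.
	// ... fill and use buf ...
	bufPool.Put(buf) // Hand it back; the GC may still reclaim it later.
}
```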
But here's the rub:
gRPC and third-party libraries control when an object goes out of scope. In gRPC, this can make expensive slices un-poolable: when you create and return an output object, you cannot pool a contained []byte slice, because the gRPC framework takes the object and you no longer control its lifetime.
Deeper Look
This isn't actually just a gRPC problem; any time you must hand an object to a third-party package, you lose the ability to reuse it via pooling.
Here is a simple proto definition for a gRPC service with a method called Record().
```proto
service Recorder {
  rpc Record(Input) returns (Output) {}
}
```
The code below implements the interface.
```go
func (g *grpcService) Record(ctx context.Context, in *pb.Input) (*pb.Output, error) {
	out := &pb.Output{}
	...
	return out, nil
}
```
The first problem is that "in" is a complete loss. We can't reuse it because we cannot tell gRPC where to get its next input object.
The output object we create within the Record() function would be reusable, but it is returned to the gRPC service object that then has control of its lifetime.
gRPC is unfortunately caught in a bad position here: its authors cannot pool objects either, because they cannot control input/output object lifetimes.
Stubby, the Google internal version of gRPC, in the early days, had an interface that was similar to:
```go
func (g *grpcService) Record(in *pb.Input, out *pb.Output) error {
	...
}
```
In this model they could have pooled, but users would have had to copy any input/output objects that needed to live past the Record() call. My conjecture is that this is why they moved to a more standard function signature: most SWEs would forget that detail and end up with data races.
Autopool - a use for runtime.SetFinalizer()
If you've never used runtime.SetFinalizer(), good for you. People like to think of them as destructors, but in a GC language that makes no guarantees on object lifetime, this just leads to problems.
A finalizer is simply a function that is called when the object it is attached to is about to be garbage collected.
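Here is about the smallest demonstration I can write. Note that the runtime only promises to run the finalizer at some arbitrary point after the object becomes unreachable, and possibly never:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type resource struct {
	payload []byte
}

func main() {
	r := &resource{payload: make([]byte, 1<<20)}
	runtime.SetFinalizer(r, func(r *resource) {
		fmt.Println("resource is about to be collected")
	})

	r = nil      // Drop the last reference.
	runtime.GC() // Encourage a collection; the finalizer then runs on
	             // the runtime's finalizer goroutine, not inline.
	time.Sleep(100 * time.Millisecond)
}
```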
David Crawshaw has written well about finalizers being less than useful; his articles are worth seeking out if you want an expert's explanation of why they are usually a bad idea.
What if we could use a finalizer to reclaim an object we have lost track of into a sync.Pool for reuse?
Enter autopool. Let's put it into the service.
```go
type grpcService struct {
	...
	pool   *autopool.Pool
	rescID int
	...
}

func newGRPC() *grpcService {
	...
	serv := &grpcService{}

	// Create our pool object.
	p := autopool.New()

	// Add a pool that will serve this object type and get the ID of the
	// internal sync.Pool to pull from.
	serv.rescID = p.Add(reflect.TypeOf(&pb.Resource{}))
	serv.pool = p

	return serv
}
```
...
```go
// Record implements the gRPC service Record() call defined in the protocol buffer.
func (g *grpcService) Record(ctx context.Context, in *pb.Input) (*pb.Output, error) {
	// Create our standard Output struct.
	out := &pb.Output{...}

	// The Output.Resc field has a []byte, which we want to be able to reuse.
	// So we yank a Resource from our pool and reset the []byte to 0 length.
	// You may have to reset other fields as well.
	out.Resc = g.pool.Get(g.rescID).(*pb.Resource)
	out.Resc.Payload = out.Resc.Payload[:0]

	// Somewhere here we'd want to modify the payload.
	...

	return out, nil
}
```
What is happening here is that when we create our output object, we pull a sub-object containing a []byte from our Pool.
So why are we able to reclaim our Protocol Buffers here where we could not before? And where is that happening?
autopool wraps standard sync.Pool(s) for object types you define. When you pull one of these objects, it works exactly like a sync.Pool, except that we add a finalizer to the object, and that finalizer inserts the object back into the pool when the garbage collector tries to free it.
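The code is published (more on that at the end), but the core trick is small enough to sketch here. This is an illustrative single-type version of the idea, not the actual autopool API:

```go
import (
	"runtime"
	"sync"
)

// Pool is a sketch of a finalizer-backed pool. Get behaves like
// sync.Pool.Get, except that it attaches a finalizer to the object it
// hands out. When the GC later finds the object unreachable, the
// finalizer resurrects it by putting it back into the pool.
type Pool struct {
	pool sync.Pool
}

// New takes a constructor that must return a pointer type, such as
// func() interface{} { return &pb.Resource{} }; SetFinalizer requires it.
func New(newFn func() interface{}) *Pool {
	return &Pool{pool: sync.Pool{New: newFn}}
}

func (p *Pool) Get() interface{} {
	v := p.pool.Get()
	runtime.SetFinalizer(v, func(obj interface{}) {
		// Running a finalizer consumes it; the next Get attaches a
		// fresh one. Putting obj back here resurrects it into the pool.
		p.pool.Put(obj)
	})
	return v
}
```

One consequence of this design is that a finalized object survives at least one extra GC cycle before it is available for reuse, which is part of the (low) cost mentioned below.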
But you cannot guarantee when the pool will be added to?
That is correct, especially if you are trying to hack your GC with a lot of the tricks I see around the web. The GC runs at certain memory pressures, so autopool finalizers won't necessarily run when the object goes out of scope, or ever.
But on any service receiving a constant stream of requests, this should happen often enough to keep our pool filled.
The cost of adding the finalizer is fairly low.
Is this worth doing?
From what I can tell, if your service is getting enough requests to keep a sync.Pool from freeing the memory and you have payloads at around 100KiB or higher, you start to see non-trivial gains.
Why not do this with all messages?
I gave that a try to see if it offered any benefit. I could not detect benefits based on the number of allocations alone; the size mattered.
Why not finalize just the slices then, wouldn't that be safer?
You can only set a finalizer on an object created by new() or by taking the address of a composite literal. Reference types such as slices and maps don't count.
Since my initial problem was about gRPC and protocol buffers (and proto3 specifically), I could not wrap my buffer. Even if I did, that would not guarantee that all references to the underlying array would be clear when the wrapper went away.
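To make that concrete, here is a small sketch (wrapper is a name I made up):

```go
import "runtime"

type wrapper struct {
	buf []byte
}

func example() {
	buf := make([]byte, 1024)

	// Illegal: this panics at runtime, because a slice is not a
	// pointer to an allocated object.
	//   runtime.SetFinalizer(buf, func([]byte) {})

	// Legal: w is the address of a composite literal. But even when w's
	// finalizer runs, something else may still reference buf's backing
	// array, so this still doesn't make slice recovery safe.
	w := &wrapper{buf: buf}
	runtime.SetFinalizer(w, func(*wrapper) {})
	_ = w
}
```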
Keith Randall on go-nuts had a cool way of finalizing a slice's backing array (thanks Keith).
However, that method did not let me capture the slice itself, and it banked on a loophole that he was kind enough to point out is not spec compliant.
Garbage collection is a tricky beast; are you sure this won't cause problems?
Short answer: No
Longer answer:
Using this is like using the unsafe package: you had better be sure of what you are doing, and even then you might get bitten in the future.
When you manually call sync.Pool.Put(), you are asserting that the entire object is free for reuse; otherwise you get some nasty bugs. When an object is finalized instead, you have no idea whether a reference to an underlying slice is still held somewhere.
So this technique is not completely safe: you have to KNOW that the third-party code (like gRPC) holds no references to any of your slices or maps. When using this package, pin the version or vendor the code in your mod file to avoid nasty surprises from changes in the upstream code, something like the snippet below.
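For example, in go.mod (the version here is just an illustration):

```
require google.golang.org/grpc v1.26.0 // Pinned: we rely on grpc not retaining our slices.
```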
gRPC seems quite happy at the moment with taking my output object and keeping no references to any of my fields once it serializes the output.
But imagine a pirate voice here: "Thar be bugs out thar!"
Let's See Some Numbers
I thought you'd never ask:
We have two types of Benchmarks testing a gRPC service:
- Using the autopool
- Not using the autopool
My benchmark environment:
- MacBook Pro laptop, circa 2015
- Go 1.13
Note a few things:
- I don't have a benchmarking machine
- I'm not on Linux, which I am sure the Go compiler makes more optimized binaries for
- I could have done something wrong in my benchmarks. This is likely
- I could be drawing the wrong conclusions
Let's talk about what the server does:
- Receives a message
- Creates an output message. That output message has a []byte field.
- The []byte field is filled to some buffer size in 64-byte chunks.
- Sends the output message back, which drops it
gRPC Service Benchmark
Without Pool Summary:
Clients | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
100 | 1K | 100K | 1238696350 | 1703020680 | 16959429 | 0m1.965s | 0m8.653s | 0m1.095s |
100 | 10K | 100K | 3634045380 | 11715026968 | 17832296 | 0m4.472s | 0m17.188s | 0m7.466s |
100 | 50K | 10K | 1311663465 | 4924447952 | 1932345 | 0m2.072s | 0m4.658s | 0m2.731s |
100 | 50K | 100K | 15146798078 | 49332431456 | 19474287 | 0m16.331s | 0m40.992s | 0m27.367s |
100 | 100K | 10K | 2602280177 | 10144042400 | 2150441 | 0m3.793s | 0m7.413s | 0m5.290s |
100 | 100K | 100K | 33020455793 | 101572491024 | 21440598 | 0m34.581s | 1m14.539s | 0m57.384s |
100 | 3M | 10K | 185087993259 | 329795532600 | 9040208 | 3m7.449s | 4m16.773s | 7m6.174s |
With Pool Summary:
Clients | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
100 | 1K | 100K | 1228820620 | 1514841392 | 16515117 | 0m1.942s | 0m8.842s | 0m1.170s |
100 | 10K | 100K | 3373151710 | 7457301736 | 16614532 | 0m4.187s | 0m15.015s | 0m6.795s |
100 | 50K | 10K | 1208806099 | 3166273448 | 1764917 | 0m1.992s | 0m3.987s | 0m2.550s |
100 | 50K | 100K | 12690099859 | 31439344120 | 17791186 | 0m13.965s | 0m31.267s | 0m23.260s |
100 | 100K | 10K | 2118462469 | 5874870160 | 1890988 | 0m3.351s | 0m5.465s | 0m4.063s |
100 | 100K | 100K | 29157830020 | 59277607656 | 19369437 | 0m30.743s | 0m57.534s | 0m52.227s |
100 | 3M | 10K | 131743623587 | 178073871536 | 8470306 | 2m14.139s | 2m53.578s | 4m32.568s |
Conclusions
1K Slices
Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
No | 1K | 100K | 1238696350 | 1703020680 | 16959429 | 0m1.965s | 0m8.653s | 0m1.095s |
Yes | 1K | 100K | 1228820620 | 1514841392 | 16515117 | 0m1.942s | 0m8.842s | 0m1.170s |
9.87573ms decrease in op time, roughly 179MiB allocation savings, 444,312 reduction in allocs.
Virtually no real time saved and a slight increase in kernel time. I'd say that there isn't enough benefit here to warrant usage.
10K Slices
| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
| No | 10K | 100K | 3634045380 | 11715026968 | 17832296 | 0m4.472s | 0m17.188s | 0m7.466s |
| Yes | 10K | 100K | 3373151710 | 7457301736 | 16614532 | 0m4.187s | 0m15.015s | 0m6.795s |
261ms decrease in op time, 4.0 GiB in allocation savings, 1,217,764 reduction in allocs.
Still almost no real world savings in time, slight reduction in user space and kernel space time. Wouldn't get excited about using it here.
50K Slices
| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
| No | 50K | 10K | 1311663465 | 4924447952 | 1932345 | 0m2.072s | 0m4.658s | 0m2.731s |
| Yes | 50K | 10K | 1208806099 | 3166273448 | 1764917 | 0m1.992s | 0m3.987s | 0m2.550s |
103ms decrease in op time, 1.6 GiB in allocation savings, 167,428 reduction in allocs.
Again, nothing to write home about here. But if you look at the runs for 100K requests, we start to see several seconds of reduction in both real time and CPU time.
100K Slices
| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
| No | 100K | 10K | 2602280177 | 10144042400 | 2150441 | 0m3.793s | 0m7.413s | 0m5.290s |
| Yes | 100K | 10K | 2118462469 | 5874870160 | 1890988 | 0m3.351s | 0m5.465s | 0m4.063s |
484ms decrease in op time, 4 GiB in allocation savings, 259,453 reduction in allocs.
Here is where I think things get interesting. Real time doesn't really change that much, but we can see that we are spending less CPU time here. We gain a second on both User/Sys.
So if your byte slices average over 100K, this is where it might start to help.
3MiB Slices
| Has Pool | Buffer Size | Requests | ns/op | B/op | allocs/op | Real | User | Sys |
---|---|---|---|---|---|---|---|---|
| No | 3M | 10K | 185087993259 | 329795532600 | 9040208 | 3m7.449s | 4m16.773s | 7m6.174s |
| Yes | 3M | 10K | 131743623587 | 178073871536 | 8470306 | 2m14.139s | 2m53.578s | 4m32.568s |
53s decrease in op time, 141 GiB in allocation savings, 569,902 reduction in allocs.
At the far end here we can see some significant savings. We saved nearly a minute of runtime and several minutes of CPU time.
So in the MiB region of slice size, being able to recover these slices can provide some significant savings.
Final Conclusions
I think I've found a good use for finalizers that could really help speed up software where control of objects is lost to third party packages.
You might be asking why I merely think I've found a good use:
- It is possible a mistake was made or an assumption went into this that is incorrect.
- The conclusions may be wrong or attributed to another factor.
- This is not peer reviewed.
- This was not tested on the most popular platform, Linux. There could be optimizations there that make this moot.
I would also note that I'm not outright recommending this. There may be hidden gotchas I haven't thought of, and it is easy for packages outside your control to change how they treat your slices. In most use cases, you give up control when you pass an object.
The code is published. You are welcome to duplicate my findings or show where it is incorrect.
Until someone does, we won't know if this was just the fever making me delusional or whether I really have found something interesting.
Until then, cheers and happy holidays.
Note on some gotchas
- Protocol Buffers have a Reset() method. In proto3, Reset() simply sets the struct to a fresh zero value, which destroys the slices. This means that you cannot use Reset() with this technique; reset the fields you need by hand instead (see the sketch after this list).
- If you are thinking of linking to this code, realize it is in a development branch and is subject to change.
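To make the first gotcha concrete, here is the difference using the earlier Output/Resc fields:

```go
// Don't: Reset() zeroes the whole message, dropping the backing arrays.
out.Reset() // out.Resc is now nil; its Payload array becomes garbage.

// Do: truncate the fields you want to reuse by hand.
out.Resc.Payload = out.Resc.Payload[:0] // Length 0, capacity kept.
```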