Note: This is not an article saying protocol buffers are bad. Protocol buffers are a great serialization technology. Compact, easy to understand and readable/writable in several different languages. They easily beat JSON or other schema-less technologies when you know the structure you want to send/receive. But they aren't appropriate for every use case.

Note: This article can be applied to multiple languages, but the code examples and talking points will be Go-centric.

Protos make bad complex objects

The basic problem

The problem with protocol buffers is that they compile into language-native data objects, but not into structures that are optimal for use in that language.

If your software simply modifies a protocol buffer or records one, then using it directly in your code should not pose a problem.

But for more complex use cases, they are generally not a great fit.

A simple example

Protocol buffers support very basic types. But this certainly doesn't cover all the types provided by a language. A simple example is representing time.

Go has a great time library. Protos don't have such a representation, though they do provide a helper for recording timestamps. The most common way to support timestamps is via an int64 holding a Unix epoch value.

Take a simple proto:

syntax = "proto3";

message Time {
    int64 submit = 1;
    int64 access = 2;
}

This turns into Go code like:


type Time struct {
    Submit int64
    Access int64

    // Some hidden and XXX_ prefixed fields.
    ...
}
// Methods here for accessing these fields
...

You have lost the simplicity that the time library gives you. If you plumb the proto through your code, you must constantly pull the value into a native type to use it, then push it back into the proto. This gives code such as:

now := time.Now()

// Don't accept a submission more than 30 minutes old.
if now.Sub(time.Unix(proto.Submit, 0)) > 30*time.Minute {
    return fmt.Errorf("this request has expired")
}
// Record last access time.
proto.Access = time.Now().Unix()

It is certainly not impossible to use, but it is less readable, and this is a simplistic example with a type that has a built-in conversion function.

What is important to grasp here is that every time I want to use the time library's methods, I must first convert the proto's field into a time.Time.

Losing logical objects

While protos represent objects, they represent public data objects only. You lose access to private types, types with complex semantics (service RPC connections, file IO, ...), unsupported native types (like Go channels) and methods.

What you have is a native object, but one that is designed to allow data transfer between languages. It is not optimized for use in your language.

This costs you access to common design patterns in your language.

One of my favorite patterns is a validation pattern I use on nested data objects. The basic idea is to call a Validate() method on the top object which calls Validate() methods on sub-objects which do the same for lower tier objects.

This pattern is very effective and keeps validation code bound to the type it validates.

type Record struct {
	Basic      Basic
	Employment Employment
}

func (r Record) Validate() error {
	// This uses reflection to run Validate() on all fields.
	// It is not detailed here for brevity; see the sketch below.
	if err := validateStruct(r); err != nil {
		return err
	}
	return nil
}

type Basic struct {
	First string
	Last  string
}

func (b Basic) Validate() error {
	if b.First == "" {
		return fmt.Errorf("Basic.First must not be an empty string")
	}
	if b.Last == "" {
		return fmt.Errorf("Basic.Last must not be an empty string")
	}
	return nil
}

type Employment struct {
	ID         uint
	Department string
}

func (e Employment) Validate() error {
	if e.ID == 0 {
		return fmt.Errorf("Employment.ID must be > 0")
	}
	if e.Department == "" {
		return fmt.Errorf("Employment.Department must be set")
	}
	return nil
}

Note: Full code here

This simple code allows validation against Record in a single call.
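
For the curious, here is a minimal sketch of how validateStruct could work, using the reflect package. It assumes any field with a Validate() error method should be checked; handling for non-struct values is omitted:

func validateStruct(s interface{}) error {
	v := reflect.ValueOf(s)
	// Walk every field and call Validate() wherever the field provides it.
	for i := 0; i < v.NumField(); i++ {
		if val, ok := v.Field(i).Interface().(interface{ Validate() error }); ok {
			if err := val.Validate(); err != nil {
				return err
			}
		}
	}
	return nil
}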

You can validate your proto, but not with a similar method. You would need to do it as a function or series of functions that take the proto as an argument. You cannot use reflection to dig into the object hierarchy and call Validate(), because that method doesn't exist. Finally, you need explicit mappings for all sub-messages.
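
For contrast, a sketch of the function-based approach (assuming the generated package is named proto, matching the examples below):

func validateRecord(r *proto.Record) error {
	// Each sub-message must be checked for nil and called explicitly.
	if r.Basic == nil {
		return fmt.Errorf("Record.Basic must be set")
	}
	if err := validateBasic(r.Basic); err != nil {
		return err
	}
	return validateEmployment(r.Employment)
}

func validateBasic(b *proto.Basic) error {
	if b.First == "" {
		return fmt.Errorf("Basic.First must not be an empty string")
	}
	if b.Last == "" {
		return fmt.Errorf("Basic.Last must not be an empty string")
	}
	return nil
}
// validateEmployment follows the same shape.

Nothing ties these functions to the messages they validate, and nothing stops a new sub-message from silently going unvalidated.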

What if you want to just add a gRPC connection object? You can't, because you can't add a new field. In some dynamic languages like Python this is possible, but it is bad form to dynamically create new attributes on objects at runtime; it hides from developers what fields are available without deep code introspection.

Synchronization becomes another major hurdle. You can use a single mutex to guard the entire proto, but you cannot embed individual field mutexes. You lose the ability to utilize the faster synchronization types in the atomic library (or your language's equivalent) when it's appropriate. Channels, queue.Queue objects, ... are not available for use.
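
To illustrate, here is a hypothetical native type whose fields have no proto equivalent; none of these could be added to a generated struct:

// Job mixes data with fields a generated proto struct cannot hold.
type Job struct {
	mu      sync.Mutex  // guards access; protos cannot embed per-field locks
	access  time.Time   // guarded by mu; no native proto representation
	hits    uint64      // updated lock-free via the sync/atomic package
	updates chan string // Go channels are not representable in protos
}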

What gets lost as you begin to solve these problems is that your architecture is being dictated by the limitations of the protocol buffer's implementation. You can do these things, but not in an optimal way.

You are asking protocol buffers to do more than they were designed to do.

Solutions: wrappers, injection, or native type conversion

Wrappers

Wrappers, depending on the language, can be a solution. Think of this as the lazy developer's solution: you want to get the benefits of your language without doing much work.

Adding a wrapper allows direct use of the proto while gaining access to new fields and methods. Simply embed the proto message within a native type.

type Record struct {
	proto.Record
}

func (r Record) Validate() error {
	...
}

type Basic struct {
	proto.Basic
}

func (b Basic) Validate() error {
	if b.First == "" {
		return fmt.Errorf("Basic.First must not be an empty string")
	}
	if b.Last == "" {
		return fmt.Errorf("Basic.Last must not be an empty string")
	}
	return nil
}
...

Through composition, we have added methods around these fields.

But this method has a few negatives:

  • You lose access to native types like time.Time or time.Duration, because your timestamp is still an int64. You can fix this (see the sketch after this list), but it's not pretty.
  • Reflection methods that were used in Record.Validate() are much harder to write. You have to extract the sub-messages such as Basic and Employment into their own compositions when reflecting through the protos.
  • This adds more methods and fields than it appears. Protos contain public XXX_ fields, getters, ... that aren't as compact as your native code. This may or may not matter to your application.
  • You lose access to most object diffing packages. The proto libraries supply protocol buffer comparison functions, but not with native wrappers. You would need to write a custom compare function.
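
For example, patching the timestamp problem onto a wrapper looks roughly like this (a sketch reusing the Time proto from earlier; the generated package is again assumed to be named proto):

type Time struct {
	proto.Time
}

// SubmitTime converts the stored epoch value on every call.
func (t Time) SubmitTime() time.Time {
	return time.Unix(t.Submit, 0)
}

// SetAccess pushes a native time back into the proto field.
func (t *Time) SetAccess(at time.Time) {
	t.Access = at.Unix()
}

Every timestamp field needs its own pair of helpers, and callers must remember which fields have them.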

Injection

Label this as controversial.

In many proto languages it is possible to inject methods into your protocol buffer.

In Go, you would add a file with the same package name as the generated proto code, in the same directory where your .pb.go file is generated. Using this method, you can add new methods around your types.
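
For example, alongside a generated time.pb.go you might add a hand-written file like the following (the file name and package name here are assumptions):

// time_extras.go lives in the same directory and package as time.pb.go.
package proto

import "time"

// SubmitTime exposes the Submit field as a native time.Time.
func (t *Time) SubmitTime() time.Time {
	return time.Unix(t.Submit, 0)
}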

This works fairly well if you don't need to add new fields to the structs.

However, I don't recommend this. It is doubtful this is compatible with build systems like Bazel, which compile your protos automatically, and the methodology is brittle.

This also doesn't let you add fields based on native types. If you ever need one, you are out of luck. Future-proofing is something to strive for.

Native Types

With this method protocol buffers are utilized for what they are meant for: data serialization.

This method requires conversion to/from native objects. It allows customization of an object with native type representations, complex fields, and added methods. It also allows you to hide fields that are needed only for client/server communication.

This is the most work-intensive method. It requires writing near-duplicate native representations of the data fields, plus conversion methods to and from those types (int64 to time.Time or time.Duration, for example). Enumerators need conversion, etc...
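
As a sketch, the Time proto from earlier might get a native counterpart plus two conversion helpers (all names here are illustrative):

// Time is the native representation used throughout the program's logic.
type Time struct {
	Submit time.Time
	Access time.Time
}

// FromProto converts the wire representation into the native type.
func FromProto(p *proto.Time) Time {
	return Time{
		Submit: time.Unix(p.Submit, 0),
		Access: time.Unix(p.Access, 0),
	}
}

// ToProto converts back for serialization or RPC calls.
func (t Time) ToProto() *proto.Time {
	return &proto.Time{Submit: t.Submit.Unix(), Access: t.Access.Unix()}
}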

It would be great if someone provided a library that wrote the skeleton for this, but I haven't seen one (maybe a weekend project in my future).

This gives the ultimate in flexibility to use the language without restrictions put on protos by cross-language support.

It avoids all the weird naming conventions, extra fields and methods that are not required. In large projects, this can make a difference in how your program is structured.

And it provides separation between RPC calls/storage formats and the software's internal representations. This allows for variations between representations where it makes sense.

Protos make a bad configuration language

I'm only going to briefly touch on this and follow up with another article.

Protocol buffers do not make a good configuration language. This is another place where the versatility of protocol buffers gets us in trouble.

The string representation lacks:

  • good documentation; the string representation is really a debug tool
  • support for parsing a string representation that contains new fields into an older version of the protocol buffer
  • multi-line strings in a human-readable format

This representation is really for debug use only. YAML and TOML are both superior formats meant for human consumption.

But enough of this for now.

What to use protos for

Protos are great serialization objects. When you need to communicate with another service or store data on disk, there are few serializers that can match their language support and efficiency.

But utilizing protos within your software's logic can lead to architectural decisions that are not optimal on your language platform.

Like any article, this one certainly has its opinion. This does not mean it is the right methodology for every language or every use case.

When thinking of plumbing protocol buffers throughout your software, give some thought to how you would structure your software if you were using native language constructs, and how future versions might change your needs.

This might save you a lot of time and code complexity working around proto limitations.