Protocol Buffers

Protobuf is Google's interface description language and binary wire format. You write a schema in a .proto file, run a code generator, and get typed message classes in whatever languages you need. The bytes that go on the wire are compact and tagged, and the schema has a small set of rules for how it can evolve over time without breaking the clients you've already shipped. This page walks the whole path: the schema language, how a message turns into bytes, why those bytes are smaller and faster to parse than JSON, the rules that keep two versions of a schema compatible, and where Protobuf fits next to gRPC.

The problem Protobuf solves

Two services need to exchange a record — a charge, a user, an order. They are written in different languages, owned by different teams, and deployed on their own schedules. Each side has to agree on what fields exist, what type each field is, and how the whole thing is laid out as bytes so the other side can read it back. Get that agreement wrong and you get the worst class of bug: data that parses cleanly but means something different than the sender intended.

Text formats like JSON answer this by being self-describing. Every value carries its field name and its structure right there in the payload, so a parser can read a JSON document with no prior knowledge of its shape. That is convenient and it is why JSON won the public web. It is also the source of its cost: every message repeats every field name as text, numbers are written as decimal digits, and the parser has to scan characters and guess types as it goes. There is no contract that the sender and receiver share ahead of time, so nothing stops a field from quietly changing meaning.

Protobuf takes the opposite bet. The contract — the schema — lives outside the bytes, in a .proto file that both sides compile against. Because both ends already know the shape, the bytes on the wire can drop the field names entirely and carry only small numeric tags and tightly packed values. That single decision is what makes Protobuf compact, fast to parse, and strongly typed, and it is also the source of its main drawback: you cannot read the bytes without the schema. The rest of this page is the consequences of that trade.

A schema lives in a .proto file

syntax = "proto3";

package billing.v1;

message Charge {
  string id          = 1;
  int64  amount_cents = 2;
  string currency    = 3;
  Status status      = 4;
  repeated string tags = 5;
}

enum Status {
  STATUS_UNSPECIFIED = 0;
  STATUS_PENDING     = 1;
  STATUS_SETTLED     = 2;
  STATUS_FAILED      = 3;
}

Three things to notice. First: the numbers (1, 2, 3...) are field tags, not array indexes. They're what gets serialised on the wire — never the field names. Second: every enum starts with a zero-valued UNSPECIFIED; that's the proto3 default and it's load-bearing for the evolution rules below. Third: the package is versioned (billing.v1); that's how you ship breaking changes — by introducing billing.v2 alongside.

The body of a message is a list of fields, and each field has three parts: a type, a name, and a tag. The type comes from a fixed menu. There are scalar types for numbers, booleans, strings, and raw bytes; there are message types, which let one message hold another; and there are enums for closed sets of named values. The repeated keyword turns any field into a list of that type, and map<key, value> gives you an associative array. A message can nest other messages to any depth, so a whole API request with addresses, line items, and metadata is one tree of messages rooted at the top.

The scalar types are worth knowing because the choice has real cost consequences on the wire, which the next sections make concrete:

Proto type	Range / use	Wire cost
`int32` / `int64`	signed integers, expected non-negative or mixed	varint; cheap when small, 10 bytes when negative
`uint32` / `uint64`	unsigned integers	varint; cheap when small
`sint32` / `sint64`	signed integers that are often negative	ZigZag varint; cheap for small magnitudes either sign
`fixed64` / `fixed32`	large values, hashes, IDs near the type's ceiling	always 8 or 4 bytes, no size scan
`double` / `float`	floating point	8 or 4 bytes (IEEE 754)
`bool`	true / false	1-byte varint
`string` / `bytes`	UTF-8 text / arbitrary octets	length-delimited: a length plus the data

The pattern to take from that table: pick the type that matches how your data actually looks, not the one that looks safest. A field of mostly-small positive integers wants int32 or uint32. A field of small numbers that swing negative wants sint32 or sint64. A 64-bit random ID that is almost always near the top of the range wants fixed64, because a varint would spend nine or ten bytes on it while a fixed field spends eight every time. These choices are invisible in the generated code (your application sees a plain integer either way), but they show up directly in payload size at scale.

The wire format: tag, type, value

Every field on the wire is encoded as a key followed by a value. The key packs the field tag and a 3-bit wire type into a single varint:

key = (field_tag << 3) | wire_type
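As a minimal sketch, the key formula can be written out in Python. The names make_key and split_key are illustrative helpers, not part of any Protobuf library; the wire-type numbers are the ones the encoding uses (0 for VARINT, 1 for I64, 2 for LEN, 5 for I32).

```python
# Wire type numbers used by the Protobuf encoding.
VARINT, I64, LEN, I32 = 0, 1, 2, 5

def make_key(field_tag: int, wire_type: int) -> int:
    """Pack a field tag and a 3-bit wire type into one key value."""
    return (field_tag << 3) | wire_type

def split_key(key: int) -> tuple[int, int]:
    """Recover (field_tag, wire_type) from a key value."""
    return key >> 3, key & 0b111

print(hex(make_key(2, VARINT)))  # 0x10
print(hex(make_key(3, LEN)))     # 0x1a
print(split_key(0x1A))           # (3, 2)
```

The key is itself written to the wire as a varint, so these integers become one byte each as long as they stay below 128.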

There are five wire types you'll meet:

Wire type	Number	Used for
VARINT	0	int32, int64, uint32, uint64, sint32, sint64, bool, enum
I64	1	fixed64, sfixed64, double
LEN	2	string, bytes, embedded messages, packed repeated
SGROUP / EGROUP	3 / 4	deprecated; ignore
I32	5	fixed32, sfixed32, float

The length-delimited wire type (2) is the one to remember: a varint length followed by that many bytes. Strings, byte slices, sub-messages, and packed repeated fields all share this shape. It is also what makes a parser able to skip fields it does not understand: when it reads a key whose wire type is LEN but whose tag is unknown, it reads the length and jumps that many bytes forward without needing to know what the contents meant. That skip-the-unknown ability is the quiet foundation of every evolution rule later on the page.
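The skip behaviour can be sketched with a small hand-written reader. This is an illustration under stated assumptions, not how any official library is implemented: decode_known and read_varint are hypothetical names, values are returned raw (integers for VARINT, byte strings otherwise), and unknown fields are simply discarded.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:   # high bit clear: last byte of this varint
            return value, pos
        shift += 7

def decode_known(buf: bytes, known_tags: set[int]) -> dict[int, object]:
    """Return {tag: raw value} for known tags, stepping over all others."""
    fields: dict[int, object] = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0b111
        if wire_type == 0:                       # VARINT
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:                     # I64: always 8 bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:                     # LEN: length, then payload
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:                     # I32: always 4 bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if tag in known_tags:
            fields[tag] = value
        # Unknown tag: the value is dropped, but pos has already moved past it.
    return fields

# Tag 2 = 150, tag 9 = "note" (unknown to this reader), tag 3 = "USD".
new_message = bytes.fromhex("109601" "4a046e6f7465" "1a03555344")
print(decode_known(new_message, known_tags={2, 3}))  # {2: 150, 3: b'USD'}
```

The reader never needs to know what tag 9 means. The wire type alone tells it how many bytes to step over.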

There is no separator between fields and no terminator at the end of a message. A message is simply its fields concatenated, key then value, key then value, until the buffer runs out or the enclosing length is reached. The fields do not even have to appear in tag order, and a decoder is required to accept them in any order. Put together, the format is: read a key varint, split it into a tag and a wire type, use the wire type to know how to read the value, store the value under that tag, repeat. Here is one small message laid out byte by byte.

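Take a record with amount_cents = 150 and currency = "USD", using tags 2 and 3 from the Charge schema above. On the wire it is eight bytes: 10 96 01 for the first field (key 0x10, then 150 as a two-byte varint) and 1a 03 55 53 44 for the second (key 0x1a, length 3, then the three ASCII bytes). Each field is a key byte (tag and wire type) plus its value. The same record in compact JSON, {"amount_cents":150,"currency":"USD"}, is 37 bytes.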
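To check the layout by hand, the sketch below encodes the two-field record from the JSON example above and counts both forms. It assumes the tags from the Charge schema (amount_cents = 2, currency = 3); varint_field and len_field are illustrative helpers, not a library API.

```python
import json

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)   # more groups follow
        else:
            out.append(low7)          # last group, high bit clear
            return bytes(out)

def varint_field(tag: int, value: int) -> bytes:
    """Key with wire type 0 (VARINT), then the value."""
    return encode_varint((tag << 3) | 0) + encode_varint(value)

def len_field(tag: int, data: bytes) -> bytes:
    """Key with wire type 2 (LEN), then the length, then the payload."""
    return encode_varint((tag << 3) | 2) + encode_varint(len(data)) + data

message = varint_field(2, 150) + len_field(3, b"USD")
print(message.hex(" "))   # 10 96 01 1a 03 55 53 44
print(len(message))       # 8

# Compact JSON for the same record, with no spaces after separators.
as_json = json.dumps({"amount_cents": 150, "currency": "USD"}, separators=(",", ":"))
print(len(as_json))       # 37
```

Reading the hex left to right: 10 is the key for tag 2 as a varint, 96 01 is 150, 1a is the key for tag 3 as length-delimited, 03 is the length, and 55 53 44 is "USD".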

Varints — variable-length integers

A varint encodes an unsigned integer using a variable number of bytes, seven bits of payload each, with the high bit signalling "continue".

300 in binary:        100101100   (9 bits)
chunked LSB-first:    0101100 0000010
prefixed:             10101100 00000010
                      ^         ^
                      cont      stop

Walk the example above. The number 300 needs nine bits, more than fits in one byte once you reserve the high bit as a continuation flag. So the encoder slices the value into seven-bit groups, least-significant group first, and sets the high bit on every group except the last. A decoder reads bytes until it finds one whose high bit is clear, strips the flag bits, and reassembles the seven-bit payloads in reverse to recover the original integer.
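The same walk can be sketched in Python, keeping the intermediate seven-bit groups visible so they line up with the diagram. The function names are illustrative.

```python
def seven_bit_groups(n: int) -> list[int]:
    """Split a non-negative integer into 7-bit groups, least significant first."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            return groups

def encode_varint(n: int) -> bytes:
    """Set the continuation bit on every group except the last."""
    groups = seven_bit_groups(n)
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def decode_varint(data: bytes) -> int:
    """Reassemble the 7-bit payloads; stop at the first byte with a clear high bit."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if (byte & 0x80) == 0:
            return result
    raise ValueError("truncated varint")

print([format(g, "07b") for g in seven_bit_groups(300)])  # ['0101100', '0000010']
print([format(b, "08b") for b in encode_varint(300)])     # ['10101100', '00000010']
print(decode_varint(encode_varint(300)))                  # 300
```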

The varint for 300. One byte covers 0–127, two bytes cover up to 16,383, and the encoding grows one byte per seven extra bits.

Small numbers cost one byte (anything < 128). Numbers up to 16,383 cost two. Field tags 1–15 fit in a single-byte key, so the most-frequently-used fields cost one byte of overhead — that's the famous "make your hot fields tags 1–15" rule. The flip side is the worst case: because a 64-bit value can need ten seven-bit groups, a varint can be ten bytes long, two more than the eight a raw 64-bit value would take. That only happens for very large or, as the next section shows, negative numbers, which is exactly when you should reach for a fixed-width or ZigZag type instead.
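Those size thresholds follow from one line of arithmetic, sketched here. The helper names are illustrative; int64_varint_len models the fact that a plain int32 / int64 field writes a negative value as its 64-bit two's complement.

```python
def varint_len(n: int) -> int:
    """Bytes needed to encode an unsigned 64-bit integer as a varint."""
    if not 0 <= n < 1 << 64:
        raise ValueError("expected an unsigned 64-bit value")
    return max(1, (n.bit_length() + 6) // 7)   # ceil(bits / 7), at least 1

def key_len(field_tag: int, wire_type: int = 0) -> int:
    """Bytes needed for a field's key."""
    return varint_len((field_tag << 3) | wire_type)

def int64_varint_len(value: int) -> int:
    """Size of a plain int32/int64 value; negatives are sign-extended to 64 bits."""
    return varint_len(value & 0xFFFFFFFFFFFFFFFF)

for n in (127, 128, 16_383, 16_384, 2**64 - 1):
    print(n, varint_len(n))          # 1, 2, 2, 3 and 10 bytes respectively
print(key_len(15), key_len(16))      # 1 2
print(int64_varint_len(-1))          # 10
```

Tag 15 is the last one whose key stays under 128 for every wire type, which is where the 1–15 rule comes from.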

ZigZag — what sint32 is for

Varints are unsigned. Encoding a signed -1 as a varint produces a 10-byte worst case (because two's complement makes it look like 18 quintillion). The fix is ZigZag encoding, which maps signed integers to unsigned ones in alternating order:

Signed	ZigZag	Bytes
0	0	1
-1	1	1
1	2	1
-2	3	1
2147483647	4294967294	5
-2147483648	4294967295	5

You opt in by declaring sint32 / sint64 instead of int32 / int64. If a field holds mostly small negative numbers, the saving is large: -1 drops from ten bytes to one. If your numbers are always non-negative, prefer uint32. The mapping is reversible and cheap, a couple of bit operations on each end, so there is no parsing penalty for the saving; the only requirement is that both sides agree the field is ZigZag-encoded, which is exactly what declaring it sint32 in the shared schema does.
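The mapping in the table is a shift and an XOR. The sketch below is a 32-bit version written for Python's unbounded integers, so it masks the result to 32 bits explicitly; the function names are illustrative.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to unsigned: 0, -1, 1, -2 become 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def unzigzag32(z: int) -> int:
    """Inverse mapping: even values are non-negative, odd values are negative."""
    return (z >> 1) ^ -(z & 1)

def varint_len(n: int) -> int:
    """Bytes needed to encode a non-negative integer as a varint."""
    return max(1, (n.bit_length() + 6) // 7)

for n in (0, -1, 1, -2, 2147483647, -2147483648):
    z = zigzag32(n)
    print(n, z, varint_len(z))   # reproduces the Signed / ZigZag / Bytes rows
```

Small magnitudes of either sign land on small unsigned values, which is what keeps the varint short.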

Why it's smaller and faster than JSON

Two effects compound. The first is size. JSON pays for its self-describing nature on every single message: the field names are spelled out as text, every value is decimal digits or quoted characters, and braces, colons, commas, and whitespace add structural overhead. Protobuf drops all of it. A field name like amount_cents costs twelve bytes in JSON (fourteen with its quotes) and zero on the wire in Protobuf, where the field is identified by a one-byte key. An integer like 150 costs three text bytes in JSON and two varint bytes in Protobuf; anything below 128 costs one. Booleans cost four or five text bytes versus one. For a message of many small numeric fields the difference is routinely three to ten times, and the gap widens as field names get longer and values get smaller.

The second effect is parsing speed, and it matters at least as much. A JSON parser scans the payload one character at a time: it has to find the quotes around a key, match that key string against the target type, find the value, decide whether the value is a number or a string or a nested object, and convert decimal text into a machine integer. A Protobuf parser reads a key varint, learns the tag and wire type from a few bit operations, and from the schema it already knows exactly what field and what machine type that tag maps to. Numbers arrive in a form close to their in-memory representation, so there is little or no text-to-number conversion. There is no string matching, no type guessing, and no character scanning of structure. That turns deserialisation from a parsing problem into something closer to a memory copy.

Both effects come from the same root cause: moving the schema out of the bytes and into a file both sides compiled against. JSON keeps the schema in the payload, which buys you the freedom to read a document you've never seen and the cost of carrying that description forever. Protobuf keeps the schema out of the payload, which costs you the freedom to read unknown bytes and buys you compactness and speed. The JSON vs Protobuf simulator lets you paste a record and watch the same data shrink and the byte layout change as you flip between the two.

When JSON is the right call anyway. If the payload is read by humans, debugged by hand, or consumed by browsers and third parties who shouldn't need your schema, JSON's self-describing text is the feature, not the bug. Protobuf pays off on internal, high-volume, machine-to-machine traffic where both ends are yours and bytes and CPU are measured.

Schema evolution rules

This is where Protobuf earns its keep. Adhere to these and you can change schemas indefinitely without breaking deployed clients.

Two words to keep straight. Backward compatibility means new code can read data written by old code. Forward compatibility means old code can read data written by new code. In a real deployment you never upgrade every service at the same instant, so for a window of time both directions are live at once: a new server is receiving messages from clients that haven't been updated, and old servers are receiving messages from a new client that already has been. Protobuf is designed to hold both directions as long as you obey the rules, and the whole reason it can is the skip-the-unknown behaviour from the wire format — an old parser that meets a tag it doesn't recognise reads past it and carries on.

The tag number is the only thing that survives across versions. Add new tags freely; never recycle an old one.

Never reuse a field tag. Even after deleting a field, mark its tag reserved so no future schema change accidentally reuses it.
Adding a new field is safe. Old clients ignore the unknown tag; they get the field's default value if they ask for it.
Removing a field is safe if you mark its tag reserved. Old clients that send it still serialise; new servers ignore it.
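Renaming a field is safe for the binary format. Names aren't on the wire. (Renaming an enum value's name is also safe, but renumbering is not.) The JSON mapping and the text format do carry names, so a rename is visible to anything that uses those.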
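Changing a field's type is mostly unsafe. The exceptions are listed on protobuf.dev; they're things like int32 ↔ int64 or fixed32 ↔ sfixed32, where both the wire type and the way the value is encoded stay the same. Sharing a wire type is not enough on its own: int32 and sint32 are both VARINT, but ZigZag means each misreads the other's values.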
Changing required → optional or vice versa doesn't apply in proto3 — there is no required. (Proto2 had it. Don't.)

message Charge {
  reserved 6, 7, 9 to 11;
  reserved "old_field_name";

  string id          = 1;
  int64  amount_cents = 2;
  // ...
}
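Why reuse is dangerous can be sketched with three bytes. The example is hypothetical: it imagines a later schema in which tag 2 was deleted and then handed to an unrelated integer field called quantity. Neither that field nor the helper names come from the schema above.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

# Bytes an old client still sends: tag 2, VARINT, value 150 (amount_cents).
old_bytes = bytes.fromhex("109601")

# V1 matches the schema above. V2 is the hypothetical schema that reused tag 2.
V1_FIELDS = {2: "amount_cents"}
V2_FIELDS = {2: "quantity"}

def read_first_field(buf: bytes, schema: dict[int, str]) -> tuple[str, int]:
    """Read one VARINT field and label it using the reader's own schema."""
    key, pos = read_varint(buf, 0)
    tag, wire_type = key >> 3, key & 0b111
    if wire_type != 0:
        raise ValueError("this sketch only handles VARINT fields")
    value, _ = read_varint(buf, pos)
    return schema[tag], value

print(read_first_field(old_bytes, V1_FIELDS))  # ('amount_cents', 150)
print(read_first_field(old_bytes, V2_FIELDS))  # ('quantity', 150)
```

Nothing fails in the second read. The bytes parse cleanly and the value is simply attributed to the wrong field, which is the case a reserved declaration prevents.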

Default values and field presence

Proto3 gave every field a default value (zero, empty string, empty list) and, for scalar fields, removed the distinction between "field absent" and "field set to default". This works for dense data but loses information when 0 is a meaningful value and "unset" is meaningful too.

The 2020 field presence revival reintroduced optional in proto3. With it, the generated code distinguishes "not set" from "set to default":

message User {
  string name = 1;
  optional int32 age = 2;  // generated has `HasAge()` and `GetAge()`
}

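Use optional for any field where "unset" is meaningfully different from "default". On the wire the only added cost is that a field explicitly set to its default value is now written (two bytes for a small scalar with a tag below 16) instead of being omitted; the gain is being able to round-trip exactly what was set.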
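The difference between the two presence models shows up in what gets written. The sketch below hand-encodes the age field (tag 2) both ways. It models the documented behaviour rather than calling generated code, and it uses Python's None as a stand-in for "not set"; the function names are illustrative.

```python
from typing import Optional

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)

AGE_KEY = encode_varint((2 << 3) | 0)   # tag 2, wire type VARINT

def encode_age_implicit(age: int) -> bytes:
    """Plain proto3 scalar: the zero value is indistinguishable from unset, so it is not written."""
    if age == 0:
        return b""
    return AGE_KEY + encode_varint(age)

def encode_age_explicit(age: Optional[int]) -> bytes:
    """`optional` scalar: anything that was set is written, including 0."""
    if age is None:
        return b""
    return AGE_KEY + encode_varint(age)

print(len(encode_age_implicit(0)))      # 0  (reader cannot tell 0 from unset)
print(len(encode_age_explicit(0)))      # 2  (key byte plus a zero byte)
print(len(encode_age_explicit(None)))   # 0  (genuinely unset)
print(len(encode_age_implicit(30)), len(encode_age_explicit(30)))  # 2 2
```

For non-default values the two encodings are byte-identical. They differ only when a default value was set explicitly.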

Generated code and canonical encoding

protoc generates per-language stubs from a .proto file — structs/classes with getters, setters, serialisers, and parsers. The Go generator is protoc-gen-go; modern toolchains use buf (buf.build) for both schema linting and codegen orchestration.

Important: the binary serialisation is not canonical. Two implementations that serialise the same message can produce byte-different outputs (field order, packed vs unpacked repeated, etc.). If you need a stable hash of a message, for signing, caching, or content-addressing, hash an explicitly defined canonical form rather than whatever bytes a serialiser happens to emit (see protobuf.dev, canonical serialisation).
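A small sketch makes the problem concrete: two byte strings that differ only in field order decode to the same fields but hash differently. The stable_digest function is a hypothetical, application-defined canonical form (fields sorted by tag) used purely for illustration; it is not a Protobuf feature, and the decoder handles only the VARINT and LEN wire types.

```python
import hashlib

def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7

def decode(buf: bytes) -> dict[int, object]:
    """Decode VARINT and LEN fields only, which is enough for this example."""
    fields: dict[int, object] = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0b111
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[tag] = value
    return fields

def stable_digest(fields: dict[int, object]) -> str:
    """Hypothetical canonical form: hash the fields in ascending tag order."""
    canonical = repr(sorted(fields.items())).encode()
    return hashlib.sha256(canonical).hexdigest()

a = bytes.fromhex("1096011a03555344")   # tag 2 first, then tag 3
b = bytes.fromhex("1a03555344109601")   # same fields, opposite order

print(decode(a) == decode(b))                                    # True
print(hashlib.sha256(a).digest() == hashlib.sha256(b).digest())  # False
print(stable_digest(decode(a)) == stable_digest(decode(b)))      # True
```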

proto2 vs proto3

Protobuf has two language versions, declared by the syntax line at the top of the file. They share the same wire format — bytes written by one are readable by the other — but the schema language and the generated code differ in how they treat presence and defaults.

Proto2 had explicit field labels: every field was marked required, optional, or repeated. The generated code always tracked whether a field had been set, so you could tell "absent" from "set to zero". The trap was required. A required field can never be safely removed, and if a message ever arrives without it the parse fails outright, which means one team's schema decision can break another team's ability to read old data forever. The lesson stuck: do not use required.

Proto3 cleaned this up by removing labels. Every scalar field is implicitly optional, there is no required, and unset fields simply read back as their zero value — empty string, zero, false, empty list. That made the common case simpler and the wire smaller, because a field at its default value isn't serialised at all. The cost was the loss of presence tracking: a field set to 0 and a field never set looked identical. As noted above, the 2020 revival of optional in proto3 brought presence back for the fields that need it, so today proto3 covers both behaviours. New schemas should be proto3 unless you are extending a proto2 codebase.

Protobuf and gRPC

Protobuf describes data; it does not, by itself, describe calls between services. That second job is what gRPC adds, and the two are designed to fit together. In a .proto file you can declare a service with rpc methods, each taking one message type and returning another. The same protoc run that generates your message classes also generates the client stub and server skeleton for those methods, so a call to a remote service looks like a local function call that takes a typed request and hands back a typed response.

service Billing {
  rpc CreateCharge(CreateChargeRequest) returns (Charge);
  rpc ListCharges(ListChargesRequest)  returns (stream Charge);
}

gRPC carries those Protobuf messages over HTTP/2, which gives it multiplexed streams, header compression, and first-class support for streaming in either or both directions — the stream keyword above turns a response into a server-side stream of messages. The division of labour is clean: Protobuf owns the message schema and the bytes, gRPC owns the method definitions and the transport. You can use Protobuf without gRPC — to store records, to put messages on a queue, to serialise a cache entry — but if you are building RPC between services, the pair is the standard combination. The gRPC page covers the transport, streaming modes, and error model in full.

The trade-offs, stated plainly

Everything Protobuf is good at follows from moving the schema out of the bytes, and so does everything it is bad at. On the upside: messages are small, parsing is fast, the types are checked at compile time in every language you generate for, and the evolution rules let independent teams change schemas without coordinated deploys. For internal service traffic at volume, those are the properties that matter.

On the downside: the bytes are not human-readable, so you cannot eyeball a payload in a log or a packet capture without decoding it against the schema. You cannot parse a message at all if you don't have the right .proto, which makes Protobuf a poor fit for public APIs where you don't control the client. The required code-generation step adds a build dependency and a toolchain — protoc, plugins, and increasingly buf — that JSON does not need. And the very flexibility of the wire format, where unknown fields are silently skipped, means a careless schema change can corrupt data quietly rather than failing loudly, which is exactly why the linting tools below exist. None of these is a reason to avoid Protobuf; they are the bill that comes with the speed.

Common mistakes

Reusing a deleted field's tag. Old clients still send the old type under that tag, and the new server parses it as garbage. Always mark a deleted tag reserved.
Picking int32 for monotonic IDs. Postgres bigserial / Snowflake IDs overflow int32 silently. Use int64 / uint64 / fixed64 for IDs.
Using int32 for negative values. Every negative int32 takes 10 bytes on the wire; a small negative sint32 takes 1, so each value wastes 9 bytes.
Putting hot fields in tag > 15. Two-byte keys instead of one. A rare optimisation, but a real one for high-volume services.
Skipping buf lint / buf breaking. buf breaking compares a schema against its previous version and flags changes that violate the evolution rules before they reach code review. Buf does this for free.
