Protocol Buffers
Protobuf is Google's interface description language and binary wire format. You write a
schema in a .proto file, run a code generator, and get typed message classes in
whatever languages you need. The bytes that go on the wire are compact and tagged, and the
schema has a small set of rules for how it can evolve over time without breaking the clients
you've already shipped. This page walks the whole path: the schema language, how a message
turns into bytes, why those bytes are smaller and faster to parse than JSON, the rules that
keep two versions of a schema compatible, and where Protobuf fits next to gRPC.
The problem Protobuf solves
Two services need to exchange a record — a charge, a user, an order. They are written in different languages, owned by different teams, and deployed on their own schedules. Each side has to agree on what fields exist, what type each field is, and how the whole thing is laid out as bytes so the other side can read it back. Get that agreement wrong and you get the worst class of bug: data that parses cleanly but means something different than the sender intended.
Text formats like JSON answer this by being self-describing. Every value carries its field name and its structure right there in the payload, so a parser can read a JSON document with no prior knowledge of its shape. That is convenient and it is why JSON won the public web. It is also the source of its cost: every message repeats every field name as text, numbers are written as decimal digits, and the parser has to scan characters and guess types as it goes. There is no contract that the sender and receiver share ahead of time, so nothing stops a field from quietly changing meaning.
Protobuf takes the opposite bet. The contract — the schema — lives outside the bytes, in a
.proto file that both sides compile against. Because both ends already know the
shape, the bytes on the wire can drop the field names entirely and carry only small numeric
tags and tightly packed values. That single decision is what makes Protobuf compact, fast to
parse, and strongly typed, and it is also the source of its main drawback: you cannot read
the bytes without the schema. The rest of this page is the consequences of that trade.
A schema lives in a .proto file

    syntax = "proto3";

    package billing.v1;

    message Charge {
      string id = 1;
      int64 amount_cents = 2;
      string currency = 3;
      Status status = 4;
      repeated string tags = 5;
    }

    enum Status {
      STATUS_UNSPECIFIED = 0;
      STATUS_PENDING = 1;
      STATUS_SETTLED = 2;
      STATUS_FAILED = 3;
    }

Three things to notice. First: the numbers (1, 2, 3...) are field tags,
not array indexes. They're what gets serialised on the wire — never the field names.
Second: every enum starts with a zero-valued UNSPECIFIED; that's the proto3
default and it's load-bearing for the evolution rules below. Third: the package is
versioned (billing.v1); that's how you ship breaking changes — by introducing
billing.v2 alongside.
The body of a message is a list of fields, and each field has three parts: a type, a name,
and a tag. The type comes from a fixed menu. There are scalar types for numbers, booleans,
strings, and raw bytes; there are message types, which let one message hold another; and
there are enums for closed sets of named values. The repeated keyword turns any
field into a list of that type, and map<key, value> gives you an
associative array. A message can nest other messages to any depth, so a whole API request
with addresses, line items, and metadata is one tree of messages rooted at the top.
The scalar types are worth knowing because the choice has real cost consequences on the wire, which the next sections make concrete:
| Proto type | Range / use | Wire cost |
|---|---|---|
| int32 / int64 | signed integers, expected non-negative or mixed | varint; cheap when small, 10 bytes when negative |
| uint32 / uint64 | unsigned integers | varint; cheap when small |
| sint32 / sint64 | signed integers that are often negative | ZigZag varint; cheap for small magnitudes either sign |
| fixed64 / fixed32 | large values, hashes, IDs near the type's ceiling | always 8 or 4 bytes, no size scan |
| double / float | floating point | 8 or 4 bytes (IEEE 754) |
| bool | true / false | 1-byte varint |
| string / bytes | UTF-8 text / arbitrary octets | length-delimited: a length plus the data |
The pattern to take from that table: pick the type that matches how your data actually looks,
not the one that looks safest. A column of mostly-small positive integers wants
int32 or uint32. A column of small numbers that swing negative wants
sint64. A 64-bit random ID that is almost always near the top of the range wants
fixed64, because a varint would spend nine or ten bytes on it while a fixed field spends
eight every time. These choices are invisible in the generated code — your application sees a
plain integer either way — but they show up directly in payload size at scale.
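To make the table concrete, here is a minimal Python sketch that computes the payload size of one 64-bit integer under three of the types above. The helper names `varint_len` and `wire_size` are illustrative and not part of any Protobuf library, and the field key is deliberately left out of the count.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def varint_len(value: int) -> int:
    """Bytes needed to varint-encode an unsigned integer below 2**64."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def wire_size(value: int, proto_type: str) -> int:
    """Payload bytes (key excluded) for one 64-bit integer field."""
    if proto_type == "int64":
        # Negative values go out as their 64-bit two's complement pattern.
        return varint_len(value & MASK64)
    if proto_type == "sint64":
        # ZigZag moves the sign into the lowest bit before the varint step.
        return varint_len(((value << 1) ^ (value >> 63)) & MASK64)
    if proto_type == "fixed64":
        return 8
    raise ValueError(f"unsupported type in this sketch: {proto_type}")


for sample in (7, -7, 2**62):
    sizes = {t: wire_size(sample, t) for t in ("int64", "sint64", "fixed64")}
    print(sample, sizes)
```

Running it on data that resembles your own fields is a quick way to check a type choice before it is baked into a schema.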
The wire format: tag, type, value
Every field on the wire is encoded as a key followed by a value. The key packs the field tag and a 3-bit wire type into a single varint:

    key = (field_tag << 3) | wire_type

There are five wire types you'll meet:
| Wire type | Number | Used for |
|---|---|---|
| VARINT | 0 | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| I64 | 1 | fixed64, sfixed64, double |
| LEN | 2 | string, bytes, embedded messages, packed repeated |
| SGROUP / EGROUP | 3 / 4 | deprecated; ignore |
| I32 | 5 | fixed32, sfixed32, float |
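The key formula and the wire type numbers in the table can be tried out directly. This is a small sketch in Python; `make_key`, `split_key`, and `key_size` are hypothetical helper names chosen for illustration.

```python
WIRE_TYPE_NAMES = {0: "VARINT", 1: "I64", 2: "LEN", 3: "SGROUP", 4: "EGROUP", 5: "I32"}


def make_key(field_tag: int, wire_type: int) -> int:
    """Pack a field tag and a 3-bit wire type into one key value."""
    return (field_tag << 3) | wire_type


def split_key(key: int):
    """Recover (field_tag, wire_type) from a key value."""
    return key >> 3, key & 0x07


def key_size(field_tag: int, wire_type: int) -> int:
    """Bytes the key occupies once varint-encoded (7 payload bits per byte)."""
    key = make_key(field_tag, wire_type)
    return max(1, (key.bit_length() + 6) // 7)


for tag, wire_type in ((1, 0), (3, 2), (15, 2), (16, 0)):
    key = make_key(tag, wire_type)
    print(tag, WIRE_TYPE_NAMES[wire_type], hex(key), key_size(tag, wire_type))
```

Because three bits of the key are taken by the wire type, only four bits of tag fit in a one-byte key, which is why tag 16 is the first to need two bytes.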
The length-delimited wire type (2) is the one to remember: a varint length followed by that many bytes. Strings, byte slices, sub-messages, and packed repeated fields all share this shape. It is also the clearest case of how a parser skips fields it does not understand: when it reads a key whose wire type is LEN but whose tag is unknown, it reads the length and jumps that many bytes forward without needing to know what the contents meant. The other wire types are just as skippable, because a varint ends at the first byte with a clear high bit, and I64 and I32 values are always eight and four bytes. That skip-the-unknown ability is the quiet foundation of every evolution rule later on the page.
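The skipping behaviour can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how any real Protobuf runtime is structured: `read_varint`, `skip_value`, and `known_fields` are made-up names, group wire types are rejected, and there is no bounds checking.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def skip_value(buf: bytes, pos: int, wire_type: int) -> int:
    """Return the position just past one value, using only its wire type."""
    if wire_type == 0:                      # VARINT: ends at first clear high bit
        _, pos = read_varint(buf, pos)
        return pos
    if wire_type == 1:                      # I64: always 8 bytes
        return pos + 8
    if wire_type == 2:                      # LEN: length prefix, then that many bytes
        length, pos = read_varint(buf, pos)
        return pos + length
    if wire_type == 5:                      # I32: always 4 bytes
        return pos + 4
    raise ValueError(f"wire type {wire_type} not handled in this sketch")


def known_fields(buf: bytes, known_tags: set) -> dict:
    """Map each known tag to its raw value bytes (LEN values keep their prefix)."""
    found = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        end = skip_value(buf, pos, wire_type)
        if tag in known_tags:
            found[tag] = buf[pos:end]
        pos = end
    return found


# Tag 1 = "ch_1", tag 9 = "xyz" (unknown to this reader), tag 2 = 150.
payload = bytes.fromhex("0a0463685f31" "4a0378797a" "109601")
print(known_fields(payload, {1, 2}))
```

The reader never learns what tag 9 meant, yet it still lands exactly on the key of the next field.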
There is no separator between fields and no terminator at the end of a message. A message is simply its fields concatenated, key then value, key then value, until the buffer runs out or the enclosing length is reached. The fields do not even have to appear in tag order, and a decoder is required to accept them in any order. Put together, the format is: read a key varint, split it into a tag and a wire type, use the wire type to know how to read the value, store the value under that tag, repeat. Here is one small message laid out byte by byte.
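One way to see the layout is to build the bytes by hand. The sketch below encodes `amount_cents = 150` and `currency = "USD"` using tags 2 and 3 from the Charge schema earlier on the page; the helper functions are hypothetical, written only for this illustration, and handle non-negative integers only.

```python
def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)   # more bytes follow
        value >>= 7
    out.append(value)                        # last byte, high bit clear
    return bytes(out)


def encode_int_field(tag: int, value: int) -> bytes:
    """Key with wire type 0 (VARINT), then the value."""
    return encode_varint((tag << 3) | 0) + encode_varint(value)


def encode_string_field(tag: int, text: str) -> bytes:
    """Key with wire type 2 (LEN), then a length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((tag << 3) | 2) + encode_varint(len(data)) + data


message = encode_int_field(2, 150) + encode_string_field(3, "USD")
print(message.hex(" "))   # 10 96 01 1a 03 55 53 44

# 10        key: tag 2, wire type VARINT
# 96 01     value: 150 as a two-byte varint
# 1a        key: tag 3, wire type LEN
# 03        length: 3 bytes follow
# 55 53 44  the characters U, S, D
```

The whole message is eight bytes, with no field names, separators, or terminator anywhere in it.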
For comparison, the same two fields as compact JSON, {"amount_cents":150,"currency":"USD"}, take 37 bytes.
Varints: variable-length integers
A varint encodes an unsigned integer using a variable number of bytes, seven bits of payload each, with the high bit signalling "continue".

    300 in binary:      100101100  (9 bits)
    chunked LSB-first:  0101100 0000010
    prefixed:           10101100 00000010
                        ^        ^
                        cont     stop

Walk the example above. The number 300 needs nine bits, more than fits in one byte once you reserve the high bit as a continuation flag. So the encoder slices the value into seven-bit groups, least-significant group first, and sets the high bit on every group except the last. A decoder reads bytes until it finds one whose high bit is clear, strips the flag bits, and reassembles the seven-bit payloads in reverse to recover the original integer.
Small numbers cost one byte (anything < 128). Numbers up to 16,383 cost two. Field tags 1–15 fit in a single-byte key, so the most-frequently-used fields cost one byte of overhead — that's the famous "make your hot fields tags 1–15" rule. The flip side is the worst case: because a 64-bit value can need ten seven-bit groups, a varint can be ten bytes long, two more than the eight a raw 64-bit value would take. That only happens for very large or, as the next section shows, negative numbers, which is exactly when you should reach for a fixed-width or ZigZag type instead.
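The encode and decode steps described above fit in a few lines. This is a minimal Python sketch for unsigned values; the function names are illustrative and real libraries add bounds and overflow checks.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, least-significant 7-bit group first."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)   # set the continuation bit
        value >>= 7
    out.append(value)                        # final group, continuation bit clear
    return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode the varint at the start of data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if byte & 0x80 == 0:
            return result
    raise ValueError("truncated varint")


print(encode_varint(300).hex(" "))           # ac 02
print(decode_varint(bytes([0xAC, 0x02])))    # 300

for n in (127, 128, 16383, 16384, 2**64 - 1):
    print(n, len(encode_varint(n)))
```

The last loop shows the size steps from the paragraph above: one byte up to 127, two up to 16,383, and ten for the largest 64-bit value.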
ZigZag — what sint32 is for
Varints are unsigned. Encoding a signed -1 as a varint produces a 10-byte
worst case (because two's complement makes it look like 18 quintillion). The fix is
ZigZag encoding, which maps signed integers to unsigned ones in
alternating order:
| Signed | ZigZag | Bytes |
|---|---|---|
| 0 | 0 | 1 |
| -1 | 1 | 1 |
| 1 | 2 | 1 |
| -2 | 3 | 1 |
| 2147483647 | 4294967294 | 5 |
| -2147483648 | 4294967295 | 5 |
You opt in by declaring sint32 / sint64 instead of
int32 / int64. If you have a field that's mostly small negative numbers, each value
shrinks from ten bytes to one or two at no extra cost. If your numbers are always non-negative, prefer
uint32. The mapping is reversible and cheap — a couple of bit operations on each
end — so there is no parsing penalty for the saving; the only requirement is that both sides
agree the field is ZigZag-encoded, which is exactly what declaring it sint32 in
the shared schema does.
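The mapping in the table is a shift and an XOR in each direction. Here is a small Python sketch of it; the function names and the `bits` parameter are illustrative, and Python's arbitrary-precision integers stand in for the fixed-width arithmetic a real implementation would use.

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    # n >> (bits - 1) is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


for signed in (0, -1, 1, -2, 2147483647, -2147483648):
    encoded = zigzag_encode(signed)
    print(signed, encoded, zigzag_decode(encoded))
```

Small magnitudes of either sign end up as small unsigned numbers, so the varint step that follows stays short.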
Why it's smaller and faster than JSON
Two effects compound. The first is size. JSON pays for its self-describing nature on every
single message: the field names are spelled out as text, every value is decimal digits or
quoted characters, and braces, colons, commas, and whitespace add structural overhead.
Protobuf drops all of it. A field name like amount_cents costs fourteen bytes in
JSON (twelve characters plus two quotes) and is replaced by a one-byte key in Protobuf. An integer like 150 costs
three text bytes in JSON and two varint bytes in Protobuf (one for anything under 128). Booleans cost four or five text
bytes versus one. For a message of many small numeric fields the difference is routinely
three to ten times, and the gap widens as field names get longer and values get smaller.
The second effect is parsing speed, and it matters at least as much. A JSON parser scans the payload one character at a time: it has to find the quotes around a key, match that key string against the target type, find the value, decide whether the value is a number or a string or a nested object, and convert decimal text into a machine integer. A Protobuf parser reads a key varint, learns the tag and wire type from a few bit operations, and from the schema it already knows exactly what field and what machine type that tag maps to. Numbers arrive in a form close to their in-memory representation, so there is little or no text-to-number conversion. There is no string matching, no type guessing, and no character scanning of structure. That turns deserialisation from a parsing problem into something closer to a memory copy.
Both effects come from the same root cause: moving the schema out of the bytes and into a file both sides compiled against. JSON keeps the schema in the payload, which buys you the freedom to read a document you've never seen and the cost of carrying that description forever. Protobuf keeps the schema out of the payload, which costs you the freedom to read unknown bytes and buys you compactness and speed. The JSON vs Protobuf simulator lets you paste a record and watch the same data shrink and the byte layout change as you flip between the two.
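A rough size comparison can be done without any Protobuf library. The sketch below hand-encodes a record of three small fields and compares it with compact JSON; the field names, tags, and values are made up for illustration, and only varint-typed fields are handled.

```python
import json


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_varint_field(tag: int, value) -> bytes:
    """Key with wire type 0, then the value; booleans are sent as 0 or 1."""
    return encode_varint(tag << 3) + encode_varint(int(value))


# Hypothetical record: (tag, JSON field name, value).
FIELDS = [(1, "retry_count", 3), (2, "is_active", True), (3, "amount_cents", 150)]

as_json = json.dumps(
    {name: value for _, name, value in FIELDS}, separators=(",", ":")
).encode("utf-8")
as_proto = b"".join(encode_varint_field(tag, value) for tag, _, value in FIELDS)

print(len(as_json), as_json.decode("utf-8"))
print(len(as_proto), as_proto.hex(" "))
```

Almost all of the JSON bytes are field names and punctuation, which is the part the shared schema makes unnecessary.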
Schema evolution rules
This is where Protobuf earns its keep. Adhere to these and you can change schemas indefinitely without breaking deployed clients.
Two words to keep straight. Backward compatibility means new code can read data written by old code. Forward compatibility means old code can read data written by new code. In a real deployment you never upgrade every service at the same instant, so for a window of time both directions are live at once: a new server is receiving messages from clients that haven't been updated, and old servers are receiving messages from a new client that already has been. Protobuf is designed to hold both directions as long as you obey the rules, and the whole reason it can is the skip-the-unknown behaviour from the wire format — an old parser that meets a tag it doesn't recognise reads past it and carries on.
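Both directions can be demonstrated with a toy decoder. This is a sketch under loud assumptions: the `OLD` and `NEW` schemas, the `note` field at tag 6, and the `decode` helper are all hypothetical, only VARINT and LEN wire types are handled, and unknown fields are dropped here whereas real runtimes typically preserve them.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict) -> dict:
    """schema maps tag -> (field name, default value)."""
    message = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in schema:                    # unknown tags are read past and ignored
            message[schema[tag][0]] = value
    return message


OLD = {1: ("id", ""), 2: ("amount_cents", 0)}
NEW = {**OLD, 6: ("note", "")}               # new version adds one field

old_bytes = bytes.fromhex("0a0463685f31" "109601")       # id="ch_1", amount_cents=150
new_bytes = old_bytes + bytes.fromhex("32026869")        # plus note="hi" at tag 6

print(decode(new_bytes, OLD))   # forward compatibility: old reader, new data
print(decode(old_bytes, NEW))   # backward compatibility: new reader, old data
```

The old reader returns the two fields it knows and never notices tag 6; the new reader fills `note` with its default because the old writer never sent it.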
- Never reuse a field tag. Even after deleting a field, mark its tag reserved so no future schema change accidentally reuses it.
- Adding a new field is safe. Old clients skip the unknown tag; new code reading old data sees the field's default value.
- Removing a field is safe if you mark its tag reserved. Old clients that send it still serialise; new servers ignore it.
- Renaming a field is safe. Names aren't on the wire. (Renaming an enum value's name is also safe, but renumbering is not.)
- Changing a field's type is mostly unsafe. The exceptions are listed on protobuf.dev; they're things like int32 ↔ int64, fixed32 ↔ sfixed32, etc., where the wire type doesn't change.
- Changing required → optional or vice versa doesn't apply in proto3: there is no required. (Proto2 had it. Don't.)

    message Charge {
      reserved 6, 7, 9 to 11;
      reserved "old_field_name";

      string id = 1;
      int64 amount_cents = 2;
      // ...
    }

Default values and field presence
Proto3 made every field have a "default value" (zero, empty string, empty list) and removed the distinction between "field absent" and "field set to default". This works for dense data but loses information when "0" is meaningful and "unset" is too.
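The information loss is easy to see at the byte level. In this Python sketch, `encode_age` is a hypothetical encoder for a single int32 field at tag 2, with `None` standing for "never set"; the `explicit_presence` flag models a field whose presence is tracked, as opposed to the implicit behaviour just described.

```python
def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_age(age, explicit_presence: bool = False) -> bytes:
    """Encode one non-negative int32 field at tag 2; age=None means 'never set'."""
    if age is None:
        return b""
    if age == 0 and not explicit_presence:
        return b""                 # implicit presence: default values are not written
    return encode_varint((2 << 3) | 0) + encode_varint(age)


print(encode_age(None).hex(" "))                        # empty
print(encode_age(0).hex(" "))                           # empty as well
print(encode_age(0, explicit_presence=True).hex(" "))   # 10 00
print(encode_age(41).hex(" "))                          # 10 29
```

With implicit presence the first two calls produce identical bytes, so a reader has no way to tell "age is zero" from "age was never provided".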
The 2020 field presence revival reintroduced optional in
proto3. With it, the generated code distinguishes "not set" from "set to default":

    message User {
      string name = 1;
      optional int32 age = 2; // generated has `HasAge()` and `GetAge()`
    }

Use optional for any field where "unset" is meaningfully
different from "default". The cost is small: a field explicitly set to its default
is now written to the wire (a key plus a value) instead of being omitted. The gain
is being able to round-trip exactly what was serialised.
Generated code and canonical encoding
protoc generates per-language stubs from a .proto file —
structs/classes with getters, setters, serialisers, and parsers. The Go generator is
protoc-gen-go; modern toolchains use buf
(buf.build) for both schema linting and codegen orchestration.
Important: the binary serialisation is not canonical. Two implementations that re-serialise the same message can produce byte-different outputs (field order, packed vs unpacked repeated, etc.). If you need a stable hash of a message, whether for signing, caching, or content-addressing, use an explicitly defined canonical form rather than whatever bytes a serialiser happens to emit (see protobuf.dev on canonical serialisation).
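A short Python sketch shows why hashing raw bytes is fragile. The two byte strings below carry the same two fields in opposite order; `decode_fields` is an illustrative helper that handles only VARINT and LEN wire types.

```python
import hashlib


def read_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(buf: bytes) -> dict:
    """Return {tag: value}, with ints for VARINT fields and bytes for LEN fields."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        fields[tag] = value
    return fields


tag2_first = bytes.fromhex("109601" "1a03555344")   # amount_cents, then currency
tag3_first = bytes.fromhex("1a03555344" "109601")   # currency, then amount_cents

print(decode_fields(tag2_first) == decode_fields(tag3_first))
print(hashlib.sha256(tag2_first).hexdigest() == hashlib.sha256(tag3_first).hexdigest())
```

Both inputs are valid encodings of the same message, yet their digests differ, so a signature or cache key computed over one will not match the other.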
proto2 vs proto3
Protobuf has two language versions, declared by the syntax line at the top of the
file. They share the same wire format — bytes written by one are readable by the other — but
the schema language and the generated code differ in how they treat presence and defaults.
Proto2 had explicit field labels: every field was marked required,
optional, or repeated. The generated code always tracked whether a
field had been set, so you could tell "absent" from "set to zero". The trap was
required. A required field can never be safely removed, and if a message ever
arrives without it the parse fails outright, which means one team's schema decision can break
another team's ability to read old data forever. The lesson stuck: do not use
required.
Proto3 cleaned this up by removing labels. Every scalar field is implicitly optional, there is
no required, and unset fields simply read back as their zero value — empty string,
zero, false, empty list. That made the common case simpler and the wire smaller, because a
field at its default value isn't serialised at all. The cost was the loss of presence
tracking: a field set to 0 and a field never set looked identical. As noted above, the 2020
revival of optional in proto3 brought presence back for the fields that need it,
so today proto3 covers both behaviours. New schemas should be proto3 unless you are extending a
proto2 codebase.
Protobuf and gRPC
Protobuf describes data; it does not, by itself, describe calls between services. That second
job is what gRPC adds, and the two
are designed to fit together. In a .proto file you can declare a
service with rpc methods, each taking one message type and returning
another. The same protoc run that generates your message classes also generates
the client stub and server skeleton for those methods, so a call to a remote service looks
like a local function call that takes a typed request and hands back a typed response.

    service Billing {
      rpc CreateCharge(CreateChargeRequest) returns (Charge);
      rpc ListCharges(ListChargesRequest) returns (stream Charge);
    }

gRPC carries those Protobuf messages over HTTP/2, which gives it multiplexed streams,
header compression, and first-class support for streaming in either or both directions — the
stream keyword above turns a response into a server-side stream of messages. The
division of labour is clean: Protobuf owns the message schema and the bytes, gRPC owns the
method definitions and the transport. You can use Protobuf without gRPC — to store records,
to put messages on a queue, to serialise a cache entry — but if you are building RPC between
services, the pair is the standard combination. The gRPC page
covers the transport, streaming modes, and error model in full.
The trade-offs, stated plainly
Everything Protobuf is good at follows from moving the schema out of the bytes, and so does everything it is bad at. On the upside: messages are small, parsing is fast, the types are checked at compile time in every language you generate for, and the evolution rules let independent teams change schemas without coordinated deploys. For internal service traffic at volume, those are the properties that matter.
On the downside: the bytes are not human-readable, so you cannot eyeball a payload in a log or
a packet capture without decoding it against the schema. You cannot parse a message at all if
you don't have the right .proto, which makes Protobuf a poor fit for public APIs
where you don't control the client. The required code-generation step adds a build dependency
and a toolchain — protoc, plugins, and increasingly buf — that JSON
does not need. And the very flexibility of the wire format, where unknown fields are silently
skipped, means a careless schema change can corrupt data quietly rather than failing loudly,
which is exactly why the linting tools below exist. None of these is a reason to avoid
Protobuf; they are the bill that comes with the speed.
Common mistakes
- Reusing a deleted field's tag. Old clients still send the old type for that tag, and the new server parses garbage. Always mark deleted tags reserved.
- Picking int32 for monotonic IDs. Postgres bigserial / Snowflake IDs overflow int32 silently. Use int64 / uint64 / fixed64 for IDs.
- Using int32 for negative values. A negative int32 always takes 10 bytes on the wire, where sint32 needs 1 for small magnitudes.
- Putting hot fields in tag > 15. Two-byte keys instead of one. A rare optimisation, but a real one for high-volume services.
- Skipping buf lint / buf breaking. Linting and breaking-change detection catch evolution rule violations before code review. Buf does this for free.
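The first mistake in the list is the hardest to spot because nothing fails. In this Python sketch the schemas are hypothetical: version 1 has a string `currency` at tag 3, and version 2 deletes it and reuses tag 3 for a string `customer_email`. The `decode` helper is illustrative and handles only VARINT and LEN wire types.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, schema: dict) -> dict:
    """schema maps tag -> field name; unknown tags are skipped."""
    message = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if tag in schema:
            message[schema[tag]] = value
    return message


SCHEMA_V1 = {1: "id", 3: "currency"}
SCHEMA_V2 = {1: "id", 3: "customer_email"}   # tag 3 reused: the mistake

# Bytes written by an old client: id="ch_1", currency="USD".
from_old_client = bytes.fromhex("0a0463685f31" "1a03555344")

print(decode(from_old_client, SCHEMA_V1))
print(decode(from_old_client, SCHEMA_V2))
```

The second decode succeeds and reports a customer email of "USD": the wire types match, so the parse is clean and the data is wrong. Had tag 3 been reserved, the new field would have taken a fresh tag and the old value would simply have been skipped.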
Further reading
- protobuf.dev — Encoding — the canonical reference for the wire format.
- protobuf.dev — proto3 language guide — field types, defaults, evolution rules.
- Buf docs — modern toolchain for proto: linting, breaking-change detection, codegen, schema registry.
- protobuf C++ source — the reference parser/serialiser.
- Google AIPs — Google's API Improvement Proposals, which dictate proto schema conventions across their public APIs.