gRPC

gRPC takes Protobuf, runs it over HTTP/2, and generates typed client and server stubs in most major languages. It started as Google's internal "Stubby" and was open-sourced in 2015. If you have control of both sides of the wire and you want strict typing, fast serialisation, and built-in streaming, it's a sensible default. There are also workloads where it adds more friction than it removes — those are at the bottom of this page.


Three layers

  1. IDL. You write a .proto file describing services and messages. Same syntax as Protobuf — see the Protobuf deep dive.
  2. Codegen. protoc generates client and server stubs in Go, Java, Python, Rust, C++, Swift, Kotlin, Ruby, Node, Dart, and more.
  3. Runtime. The wire format is Protobuf-encoded binary frames carried over HTTP/2 streams. Each method call is one stream; messages are length-prefixed frames within that stream.

A complete service definition

// users.proto
syntax = "proto3";
package users.v1;

service Users {
  rpc Get(GetReq)            returns (User);            // unary
  rpc List(ListReq)          returns (stream User);     // server streaming
  rpc BulkCreate(stream User) returns (BulkResp);       // client streaming
  rpc Chat(stream Msg)       returns (stream Msg);      // bidirectional
}

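// ListReq, BulkResp, and Msg are placeholder definitions so the file compiles;
// their fields are illustrative.
message GetReq   { string id = 1; }
message ListReq  { int32 page_size = 1; }
message User     { string id = 1; string email = 2; }
message BulkResp { int32 created = 1; }
message Msg      { string text = 1; }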

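Run protoc --go_out=. --go-grpc_out=. users.proto and you have typed Go client and server stubs (the Go plugins also need a go_package option in the file or an M flag on the command line). For Python, run python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. users.proto; --python_out alone produces only the message classes. Same .proto, every language.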

Why HTTP/2

HTTP/1.1 handles one request at a time per connection (hence connection pools). HTTP/2 multiplexes many concurrent streams over a single TCP connection, with header compression (HPACK) and binary framing. gRPC needs all three:

  • Multiplexing lets a long-lived connection carry hundreds of in-flight calls without head-of-line blocking at the application layer.
  • Streams let you send a sequence of messages without re-establishing the connection. The four streaming modes are direct projections of HTTP/2 streams.
  • HPACK keeps per-call header overhead near zero. Headers like :method POST and content-type application/grpc compress to single-byte references after the first call.

HTTP/2 is also the reason gRPC doesn't run in browsers: browsers don't give JavaScript control over raw HTTP/2 frames or access to response trailers. gRPC-Web is the workaround: a proxy translates between gRPC-over-HTTP/2 and a modified framing that fetch() can read, with the trailers moved into the response body so it also works over HTTP/1.1. It strips client-streaming and bidi modes; only unary and server-streaming survive.

The four streaming modes

| Mode | Client | Server | Use for |
| --- | --- | --- | --- |
| Unary | 1 msg | 1 msg | Most calls: request/response. |
| Server stream | 1 msg | N msgs | Pagination, search results, log tailing, AI token streaming. |
| Client stream | N msgs | 1 msg | Bulk uploads, telemetry batches, voice → transcript. |
| Bidi stream | N msgs | N msgs | Chat, real-time games, collaborative editing, transcription. |

Both ends know the mode at codegen time, so the generated stubs expose distinct types. In Go, a unary stub returns (*User, error); a server-stream stub returns a stream handle that you read with Recv() until it reports io.EOF. Bidi stubs let you send and receive on the same handle concurrently.

Status codes — gRPC's own, not HTTP's

gRPC has its own 17-value status code table. They cover the things RPC actually cares about, in language richer than HTTP:

| Code | Meaning |
| --- | --- |
| OK (0) | Success. |
| CANCELLED (1) | Operation was cancelled, typically by the caller. |
| UNKNOWN (2) | Error that fits no other code, often an unhandled server exception. |
| INVALID_ARGUMENT (3) | Bad input, regardless of system state. |
| DEADLINE_EXCEEDED (4) | Caller's deadline elapsed. |
| NOT_FOUND (5) | Resource doesn't exist. |
| ALREADY_EXISTS (6) | Resource already exists. |
| PERMISSION_DENIED (7) | Authenticated, not authorized. |
| RESOURCE_EXHAUSTED (8) | Quota / rate limit hit. |
| FAILED_PRECONDITION (9) | System state forbids this operation now. |
| ABORTED (10) | Concurrency conflict; retry the whole transaction. |
| OUT_OF_RANGE (11) | Operation attempted past the valid range, such as reading past the end of a file. |
| UNIMPLEMENTED (12) | Method not implemented (yet). |
| INTERNAL (13) | Server bug or invariant broken. |
| UNAVAILABLE (14) | Service temporarily unreachable. Retryable. |
| DATA_LOSS (15) | Unrecoverable data corruption. |
| UNAUTHENTICATED (16) | No / invalid credentials. |

Map them to HTTP only at the gateway, never inside. The standard mapping (NOT_FOUND → 404, PERMISSION_DENIED → 403) is a one-way translation; gRPC clients should consume the gRPC code directly.
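A minimal sketch of that one-way translation at a gateway, assuming the mapping conventionally used by REST gateways in front of gRPC. The Code type and httpStatus function are illustrative names for this sketch, not a library API; a real gateway would use the codes package from its gRPC library.

```go
package main

import (
	"fmt"
	"net/http"
)

// Code is a stand-in for a gRPC status code (hypothetical type for this sketch).
type Code int

// The constants follow the numeric order of the gRPC status table (0 to 16).
const (
	OK Code = iota
	Canceled
	Unknown
	InvalidArgument
	DeadlineExceeded
	NotFound
	AlreadyExists
	PermissionDenied
	ResourceExhausted
	FailedPrecondition
	Aborted
	OutOfRange
	Unimplemented
	Internal
	Unavailable
	DataLoss
	Unauthenticated
)

// httpStatus translates a gRPC code into an HTTP status at the gateway edge.
// The translation is lossy: several gRPC codes collapse into one HTTP status.
func httpStatus(c Code) int {
	switch c {
	case OK:
		return http.StatusOK
	case InvalidArgument, FailedPrecondition, OutOfRange:
		return http.StatusBadRequest
	case Unauthenticated:
		return http.StatusUnauthorized
	case PermissionDenied:
		return http.StatusForbidden
	case NotFound:
		return http.StatusNotFound
	case AlreadyExists, Aborted:
		return http.StatusConflict
	case ResourceExhausted:
		return http.StatusTooManyRequests
	case Canceled:
		return 499 // non-standard "client closed request"
	case Unimplemented:
		return http.StatusNotImplemented
	case Unavailable:
		return http.StatusServiceUnavailable
	case DeadlineExceeded:
		return http.StatusGatewayTimeout
	default: // Unknown, Internal, DataLoss
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(httpStatus(NotFound), httpStatus(PermissionDenied))
}
```

Because three different gRPC codes land on 400 and two on 409, the HTTP status cannot be mapped back to the original code, which is why the translation belongs only at the edge.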

Deadlines, not timeouts

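A gRPC call can carry a deadline, an absolute point in time by which the caller wants an answer, rather than a per-hop timeout. If service A calls B with a 100ms deadline and B calls C, the deadline propagates: C only gets whatever budget is left. On the wire the remaining budget travels as a relative grpc-timeout header, and each hop turns it back into a deadline. In Go the propagation comes from passing the incoming context to outbound calls; some other languages need it wired up explicitly. This is one of the most under-appreciated features of gRPC, and a large part of why it holds up at very large scale.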

// Go sketch: what happens on a gRPC call made with a deadline
ctx, cancel := context.WithTimeout(parent, 100*time.Millisecond)
defer cancel()

// the remaining budget goes out as the grpc-timeout header;
// the next service turns it back into a deadline on its own
// context and passes that context to its outbound calls

resp, err := client.Get(ctx, &GetReq{Id: "42"})

// if 60ms elapsed before B calls C, C sees a 40ms deadline
// C cannot extend it. The deadline shrinks monotonically.

Without this, a slow downstream call holds the whole upstream chain. With it, every service stops working as soon as the user-facing call has given up. Tail-latency meltdowns become recoverable instead of fatal.
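The "cannot extend" property can be checked with nothing but Go's standard context package, which is what grpc-go builds on. This is a sketch under that assumption: callC and deadlineIsInherited are hypothetical names, and no RPC is actually made.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callC models service C. It asks for a generous one-second timeout,
// but a child context can never outlive its parent's deadline.
func callC(ctx context.Context) time.Time {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()
	deadline, _ := ctx.Deadline()
	return deadline
}

// deadlineIsInherited reports whether C's effective deadline is exactly
// the one set by the original caller, despite C asking for more time.
func deadlineIsInherited() bool {
	ctxA, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	want, _ := ctxA.Deadline()
	return callC(ctxA).Equal(want)
}

func main() {
	fmt.Println("C inherits A's deadline:", deadlineIsInherited())
}
```

A downstream service can shorten the budget with its own WithTimeout, but asking for more than the parent has left is silently capped.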

Interceptors — middleware for RPC

Cross-cutting concerns — auth, tracing, retries, metrics, panic recovery — go in interceptors, the gRPC equivalent of HTTP middleware. They wrap the call before/after the user code runs.

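import (
    "context"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/metadata"
    "google.golang.org/grpc/status"
)

// authInterceptor rejects a call before the handler runs if its token is bad.
// validate is an application-supplied helper (not shown).
func authInterceptor(
    ctx context.Context, req any,
    info *grpc.UnaryServerInfo, handler grpc.UnaryHandler,
) (any, error) {
  md, ok := metadata.FromIncomingContext(ctx)
  if !ok || !validate(md["authorization"]) {
    return nil, status.Error(codes.Unauthenticated, "bad token")
  }
  return handler(ctx, req)
}

// streamAuthInterceptor is the streaming counterpart (not shown).
server := grpc.NewServer(
  grpc.UnaryInterceptor(authInterceptor),
  grpc.StreamInterceptor(streamAuthInterceptor),
)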

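Production stacks chain four to six interceptors: panic recovery on the outside, then tracing, then metrics, then auth, then optional rate-limiting and logging. In grpc-go, grpc.UnaryInterceptor installs a single interceptor, so chains are built with grpc.ChainUnaryInterceptor (and grpc.ChainStreamInterceptor), where the first interceptor listed is the outermost. Each one is ~30 lines of code, runs on every call, and stays out of business logic.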
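To make the "outside" and "inside" ordering concrete, here is a minimal sketch of interceptor chaining written against simplified stand-in types. Handler, Interceptor, chain, and tag are hypothetical names for this example; real gRPC interceptors also receive a context and method info.

```go
package main

import (
	"fmt"
	"strings"
)

// Handler and Interceptor are simplified stand-ins for a unary handler
// and a unary server interceptor.
type Handler func(req string) (string, error)
type Interceptor func(req string, next Handler) (string, error)

// chain wraps h so that the first interceptor listed is the outermost one.
func chain(h Handler, interceptors ...Interceptor) Handler {
	for i := len(interceptors) - 1; i >= 0; i-- {
		ic, next := interceptors[i], h
		h = func(req string) (string, error) { return ic(req, next) }
	}
	return h
}

// tag returns an interceptor that records when it runs around the call.
func tag(name string, log *[]string) Interceptor {
	return func(req string, next Handler) (string, error) {
		*log = append(*log, name+":before")
		resp, err := next(req)
		*log = append(*log, name+":after")
		return resp, err
	}
}

// runChain executes one call through three interceptors and returns
// the recorded execution order as a single space-separated string.
func runChain() string {
	var log []string
	business := func(req string) (string, error) {
		log = append(log, "handler")
		return "ok", nil
	}
	h := chain(business, tag("recovery", &log), tag("tracing", &log), tag("auth", &log))
	h("req")
	return strings.Join(log, " ")
}

func main() {
	fmt.Println(runChain())
}
```

The outermost interceptor sees the request first and the response last, which is why panic recovery goes on the outside: it then covers every other interceptor as well as the handler.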

Where gRPC fits well

  • Internal microservices. Strict schemas, throughput commonly cited as ~5–10× that of JSON+REST (the real factor depends on payload shape and language), typed clients in every language. A single .proto file replaces ten hand-written client libraries.
  • Polyglot teams. The Go service, the Python data-science service, and the iOS client all consume stubs generated from the same .proto. No translation layer.
  • Streaming workloads. Real-time push, log/metric pipelines, IoT telemetry, mobile sync. The four streaming modes cover the design space.
  • Service meshes. gRPC plays well with Envoy, Linkerd, Istio. Built-in load balancing, retry budgets, and circuit-breaker integration.

Where it adds friction

  • Browsers. No raw HTTP/2 access, so you need gRPC-Web, which drops client-streaming and bidirectional modes and adds a proxy. Connect Web (Buf) is the modern alternative.
  • Public APIs. Customers can't curl it; tooling is weaker than REST; debugging requires the .proto file. Most public APIs that "support gRPC" actually ship a REST gateway and call gRPC from inside.
  • Debugging. Wireshark dumps look like binary noise without the .proto file. Reach for grpcurl with Reflection enabled on dev/staging.
  • Load balancing. Long-lived HTTP/2 connections trip up L4 balancers — once a connection sticks to a backend, all its streams stick too. Either use L7 LBs (Envoy, Linkerd) or do client-side round-robin.
  • Schema breakage. Easier to silently break clients than with REST, because field-number reuse mis-decodes silently. Use Buf's breaking-change linter as a CI gate.

gRPC vs REST in practice

A reasonable default: REST/JSON for public-facing edges, gRPC for everything server-to-server. Many shops front their gRPC services with a thin REST gateway (grpc-gateway generates one from the same .proto). You get binary efficiency inside and a familiar HTTP+JSON face outside. The .proto file is the source of truth either way.

The mistake is treating it as either/or. Most non-trivial systems run both: gRPC for internal calls between services you control on both ends, REST/JSON for the public edge where curl-ability and tooling matter more than per-call CPU.

Load balancing — the hidden problem

HTTP/2 multiplexes many requests over one TCP connection. That's a feature for latency and a problem for load balancing. An L4 (TCP-level) load balancer like AWS NLB sees one connection from each client and routes it to one backend; once pinned, every subsequent request rides that same connection, regardless of how loaded the chosen backend is. The result: skewed load, "noisy backend" tail latency, and that one pod you mysteriously can't drain.

Three strategies for fixing it, in increasing sophistication:

| Strategy | How it works | Trade-off |
| --- | --- | --- |
| L7 / proxy-side balancing | Run an HTTP/2-aware proxy (Envoy, Linkerd, NGINX with grpc_pass) in front of the backends. The proxy sees individual RPCs and routes each one. | Extra hop typically adds around 0.5-1 ms; the proxy is a new failure domain. |
| Client-side balancing | The client gets the list of backends from a name resolver (DNS, xDS, etcd) and opens a connection to each. Each RPC picks a backend via a policy such as round_robin (the default policy, pick_first, sends everything to a single backend). | Requires SDK support; client + server fan-out grows N×M; reconnect storms on backend churn. |
| Look-aside with xDS | gRPC clients talk to a control plane (xDS) that pushes endpoint + weight updates. The dataplane stays direct (no proxy in the request path) but routing decisions are centralised. | Most operational complexity; near-zero data-path latency overhead; the same control-plane protocol that Istio and Consul Connect use to program their proxies. |

What to use when. Internal traffic inside a service mesh: xDS with a sidecar proxy (or proxyless gRPC) is the default. Two services that just need to talk: client-side balancing, simple round-robin over a small set of pods. External traffic: terminate gRPC at the edge L7 proxy and load balance from there. The mistake is using L4 LB inside the cluster; everything else is a reasonable choice.
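The pinning effect is easy to see in a toy simulation. The sketch below counts how 900 RPCs from one client land on three backends, first when the backend is chosen once per connection (the L4 case) and then when it is chosen per RPC. pinned and roundRobin are hypothetical helpers; no networking is involved.

```go
package main

import "fmt"

// pinned models an L4 balancer: the backend is chosen once, when the
// connection is opened, and every later RPC rides that connection.
func pinned(backends, rpcs int) []int {
	counts := make([]int, backends)
	chosen := 0 // picked at connect time and never revisited
	for i := 0; i < rpcs; i++ {
		counts[chosen]++
	}
	return counts
}

// roundRobin models per-RPC balancing (an L7 proxy or a client-side
// round_robin policy): each call goes to the next backend in turn.
func roundRobin(backends, rpcs int) []int {
	counts := make([]int, backends)
	for i := 0; i < rpcs; i++ {
		counts[i%backends]++
	}
	return counts
}

func main() {
	fmt.Println("L4 pinned: ", pinned(3, 900))
	fmt.Println("per-RPC RR:", roundRobin(3, 900))
}
```

With many clients the pinned case averages out somewhat, but a handful of heavy clients is enough to reproduce the skew described above.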

Versioning Protobuf — what you can change, and when

Protobuf wire compatibility is famously good, but only if you follow specific rules. The wire format identifies fields only by number (plus a wire type), never by name; the field name and declared type at compile time are interpretation, not transport. So you can rename freely as far as the binary format is concerned (the JSON mapping and text format do use names); you cannot reuse tag numbers without corrupting historic readers.

| Change | Safe? | Notes |
| --- | --- | --- |
| Add a new optional field with a new tag number | Yes | Old code skips it; new code reading old data sees the default. |
| Rename a field (same tag, same type) | Yes | Wire format ignores names. Source code on both sides needs a rebuild. |
| Change optional → repeated on the same tag | No for numeric scalars | Repeated numerics are packed by default and use the LEN wire type, so old readers mis-parse them. For string, bytes, and message fields the wire type matches, and old readers keep only the last element (or merge, for messages). |
| Change int32 ↔ uint32 ↔ int64 | Yes (with caveats) | Same varint wire type. Sign extension on negative ints and truncation to 32 bits can surprise. |
| Change int32 → fixed32 | No | Different wire types (varint vs 32-bit fixed). |
| Remove a field | Maybe | Old code keeps writing it; new code ignores it. Reserve the tag number so it's never reused. |
| Add a new enum value | Yes for proto3 | Old readers see the numeric value if their generated enum doesn't recognise it. |
| Reorder fields in the .proto file | Yes | Wire format is keyed on tag number, not file order. |

The two patterns that keep services compatible across years: always add a reserved statement (for example reserved 7, 8;) when removing fields, so that protoc rejects any later reuse of those numbers in that message, and never reuse a tag number even for a "different but semantically equivalent" field; give it a new tag instead.

message User {
  string id = 1;
  string email = 2;
  // reserved field 3 — was "phone_number" before we moved phones to a sub-message
  reserved 3;
  reserved "phone_number";
  PhoneNumber phone = 4;
}

Compression, message size, and the network bill

gRPC compresses per message, not per connection, and only when you turn it on. gzip is the codec that ships with most implementations; deflate is available in some, and newer codecs such as zstd are usually added through a pluggable or third-party compressor, so check what your language's implementation supports. For binary protobuf the gains are smaller than for text (often 20-40% compression vs 70%+ on JSON), but for sustained streaming or for messages with large repeated string fields, it still earns its keep.

Three production gotchas worth knowing:

  • The 4 MB default max_receive_message_size. Most language implementations cap incoming messages at 4 MB. That's fine for typical RPC, but if you send a large batched response or a binary blob inline, the receiver will reject with RESOURCE_EXHAUSTED. Either bump the cap (grpc.MaxRecvMsgSize) or stream the payload.
  • HTTP/2 frame size limits. Each HTTP/2 frame is at most 16 KB by default (SETTINGS_MAX_FRAME_SIZE = 16,384), configurable up to 16 MB. Large messages get split into multiple frames within the same stream. This is transparent to the application but affects how head-of-line blocking can manifest within a single RPC.
  • Compression and observability. Tracing infrastructure (Jaeger, Honeycomb) often samples request/response bytes. With compression on, sampled bytes are still in the wire format — meaning you can't grep response bodies for debugging. Most teams disable compression on traced calls or sample at the proto-message level.
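For the "stream the payload" option in the first gotcha, the usual shape is to cut the blob into chunks well under the receive cap and send one chunk per stream message. A minimal sketch of the chunking step, assuming a 4 MiB cap and 1 MiB chunks; chunk is a hypothetical helper, and the actual stream Send call is left out.

```go
package main

import "fmt"

// chunk splits payload into consecutive pieces of at most max bytes.
// The pieces share memory with payload; nothing is copied.
func chunk(payload []byte, max int) [][]byte {
	if max <= 0 {
		return nil
	}
	var out [][]byte
	for len(payload) > max {
		out = append(out, payload[:max])
		payload = payload[max:]
	}
	if len(payload) > 0 {
		out = append(out, payload)
	}
	return out
}

func main() {
	const receiveCap = 4 << 20    // assumed 4 MiB receive limit
	blob := make([]byte, 10<<20)  // 10 MiB payload that would be rejected whole
	parts := chunk(blob, 1<<20)   // 1 MiB chunks leave headroom for other fields
	fmt.Println(len(parts), "chunks, each within cap:", len(parts[0]) <= receiveCap)
}
```

Each chunk would then go out as its own message on a client- or server-streaming RPC, with the receiver reassembling them in order.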

An RPC on the wire — HTTP/2 frames, in order

A single unary gRPC call decomposes into a small handful of HTTP/2 frames. Once you've seen the sequence, every "why is my gRPC call doing X" question becomes easier to reason about, because you can map the symptom back to a specific frame at a specific point in the exchange.

Client                                                            Server
  │                                                                  │
  │ ── HEADERS (stream 1, END_STREAM=0) ────────────────────────────▶│
  │      :method = POST                                              │
  │      :scheme = https                                             │
  │      :path   = /catalog.v1.CatalogService/GetItem                │
  │      :authority = catalog.svc:443                                │
  │      content-type = application/grpc+proto                       │
  │      te = trailers                                               │
  │      grpc-timeout = 3500m                                        │
  │      grpc-encoding = gzip                                        │
  │      authorization = Bearer eyJhbGciOi...                        │
  │                                                                  │
  │ ── DATA (stream 1, END_STREAM=1) ───────────────────────────────▶│
  │      [1-byte flags | 4-byte length] [protobuf-serialised request]│
  │                                                                  │
  │                                                                  │ ◀── HEADERS
  │                                                                  │       :status = 200
  │                                                                  │       content-type = application/grpc+proto
  │                                                                  │
  │                                                                  │ ◀── DATA
  │                                                                  │       [1-byte flags | 4-byte length] [protobuf-serialised response]
  │                                                                  │
  │                                                                  │ ◀── HEADERS (TRAILERS, END_STREAM=1)
  │                                                                  │       grpc-status = 0
  │                                                                  │       grpc-message = (omitted on success)

Five details worth internalising. First, the message envelope inside the DATA frames is compressed-flag (1 byte) + length (4 bytes, big-endian) + payload. The length lets the receiver find the end of each message even though TCP or HTTP/2 buffering may have split the payload across multiple frames. Second, grpc-timeout is the caller's remaining deadline budget, sent as a relative value; the receiving server turns it back into a deadline, and it reaches downstream RPCs when the handler passes its context along. Third, te: trailers is required by spec; some HTTP middleboxes strip it and break gRPC silently. Fourth, the gRPC status code rides on the trailers, not the response HEADERS, which is why a 200 HTTP response can still represent a gRPC error, and why intercepting HTTP/2 at a non-gRPC-aware proxy can drop the trailers and corrupt the result. Fifth, HTTP/2 stream identifiers are per-connection: client-initiated streams take odd numbers that only ever increase and are never reused, so a connection that runs through its ID space has to be replaced by a new one.
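The five-byte message envelope is simple enough to write by hand. The sketch below frames and unframes messages using only the standard library; frame and unframe are hypothetical helper names, and the payload is treated as opaque bytes rather than real Protobuf.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frame prepends the gRPC message prefix: a 1-byte compressed flag
// followed by the payload length as a 4-byte big-endian integer.
func frame(payload []byte, compressed bool) []byte {
	buf := make([]byte, 5+len(payload))
	if compressed {
		buf[0] = 1
	}
	binary.BigEndian.PutUint32(buf[1:5], uint32(len(payload)))
	copy(buf[5:], payload)
	return buf
}

// unframe reads one message from the front of buf and returns its payload,
// its compressed flag, and whatever bytes follow it.
func unframe(buf []byte) (payload []byte, compressed bool, rest []byte, err error) {
	if len(buf) < 5 {
		return nil, false, buf, errors.New("incomplete prefix")
	}
	n := int(binary.BigEndian.Uint32(buf[1:5]))
	if len(buf)-5 < n {
		return nil, false, buf, errors.New("incomplete message")
	}
	return buf[5 : 5+n], buf[0] == 1, buf[5+n:], nil
}

func main() {
	// Two messages back to back, as they would appear on a stream.
	wire := append(frame([]byte("first"), false), frame([]byte("second"), true)...)
	for len(wire) > 0 {
		payload, compressed, rest, err := unframe(wire)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%q compressed=%v\n", payload, compressed)
		wire = rest
	}
}
```

The "incomplete" errors are the interesting part: a receiver that gets them has to wait for more DATA frames, because message boundaries and HTTP/2 frame boundaries are independent.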

Why the trailer matters operationally. If a client receives a 200 HEADERS frame but the connection drops before the trailer arrives — for example because of a load-balancer connection timeout — the call ends in UNAVAILABLE with no body, not in success. This is a common source of "the server logged success but the client saw failure" mysteries. The fix is usually a longer LB idle timeout or moving to a gRPC-aware proxy that respects the trailer.

Authentication and credentials

gRPC separates channel credentials (the connection-level trust model, almost always TLS) from call credentials (the per-RPC authentication tokens, often a JWT or service-account assertion). The two layer: the channel sets up the secure transport; call credentials ride as metadata on each RPC. This is more granular than HTTP/REST defaults, which usually conflate the two into a single Authorization header on every request.

| Credential | Layer | What it provides | Where it's used |
| --- | --- | --- | --- |
| TLS (server auth) | Channel | Server identity, encrypted transport. Client verifies cert chain. | Default for all public traffic. |
| mTLS (mutual TLS) | Channel | Both server AND client present certs. Identity flows in both directions. | Service mesh defaults: Istio, Linkerd, Consul Connect. |
| ALTS (Application Layer Transport Security) | Channel | Google's mutual authentication and transport encryption system; identity via service account, no cert provisioning. | Google production. Available open-source but rare elsewhere. |
| JWT bearer token | Call | Per-RPC identity claim verified at the server. | User-facing APIs, machine-to-machine where mTLS would be heavy. |
| OAuth 2.0 access token | Call | Same shape as JWT but issued by a trusted IdP, often short-lived. | Public APIs, federated identity. |
| Google Compute Engine creds | Call | The instance's metadata-server token, auto-refreshed by the SDK. | GCP services calling other GCP services. |
| AWS SigV4 | Call | Per-request HMAC signature using the IAM role. | AWS-hosted gRPC services (less common; AWS leans REST). |

The recommended production posture is mTLS at the channel layer for service identity, plus a per-call JWT carrying the end-user identity. The mTLS cert says "this is the orders service" and the JWT says "running on behalf of user 42". A server-side interceptor extracts both, populates the request context, and the handler treats them as separate concerns. This split lets a misconfigured user token fail per-call without taking down the connection, and lets a rotating service identity be revoked without invalidating every in-flight user session.

// Go — composing channel + call credentials
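// cert, token, and addr are assumed to be loaded elsewhere.
creds := credentials.NewTLS(&tls.Config{Certificates: []tls.Certificate{cert}})
perRPC := oauth.TokenSource{
    TokenSource: oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token}),
}

// grpc.NewClient replaces the deprecated grpc.Dial in recent grpc-go releases.
conn, err := grpc.NewClient(addr,
    grpc.WithTransportCredentials(creds),
    grpc.WithPerRPCCredentials(perRPC),
)
if err != nil {
    log.Fatalf("create client: %v", err)
}
defer conn.Close()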

Retries and hedging — the service-config knob

gRPC has a built-in retry mechanism configured via the service config, a JSON document the client receives at name-resolution time (or hand-rolled into the client). Two strategies are first-class: retry (re-send a failed call after a backoff, possibly to a different backend) and hedging (send extra copies of the call without waiting for a failure, and return the first success). Most teams reach for retry naturally; hedging is the under-used one that wins tail latency for read-heavy workloads.

{
  "methodConfig": [{
    "name": [{ "service": "catalog.v1.CatalogService", "method": "GetItem" }],
    "retryPolicy": {
      "maxAttempts": 4,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE", "DEADLINE_EXCEEDED"]
    }
  }, {
    "name": [{ "service": "catalog.v1.CatalogService", "method": "Search" }],
    "hedgingPolicy": {
      "maxAttempts": 3,
      "hedgingDelay": "0.05s",
      "nonFatalStatusCodes": ["UNAVAILABLE"]
    }
  }]
}

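Retry storms are kept in check by retry throttling, an optional token bucket that you configure with a retryThrottling block in the service config (maxTokens and tokenRatio). The bucket starts full; each failed call removes one token and each successful call adds tokenRatio back. While the bucket is at or below half of maxTokens, calls fail through without retrying even if their status is in the retryable list. A setting such as 10 tokens replenished at 0.1 per successful RPC is generous enough for healthy systems, tight enough that a flapping backend can't snowball.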
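The token bucket is small enough to sketch in full. This version assumes the semantics in the gRPC retry design: one token removed per failed call, tokenRatio added per successful call, and retries allowed only while the bucket is above half of maxTokens. The throttle type and its methods are illustrative names, not the library's internals.

```go
package main

import "fmt"

// throttle is a sketch of the channel-level retry token bucket.
type throttle struct {
	tokens, maxTokens, tokenRatio float64
}

// newThrottle starts with a full bucket.
func newThrottle(maxTokens, tokenRatio float64) *throttle {
	return &throttle{tokens: maxTokens, maxTokens: maxTokens, tokenRatio: tokenRatio}
}

// record updates the bucket after an RPC finishes.
func (t *throttle) record(success bool) {
	if success {
		t.tokens += t.tokenRatio
		if t.tokens > t.maxTokens {
			t.tokens = t.maxTokens
		}
		return
	}
	t.tokens--
	if t.tokens < 0 {
		t.tokens = 0
	}
}

// allowRetry reports whether retries are currently permitted:
// only while the bucket is more than half full.
func (t *throttle) allowRetry() bool {
	return t.tokens > t.maxTokens/2
}

func main() {
	th := newThrottle(10, 0.1)
	for i := 1; i <= 6; i++ {
		th.record(false)
		fmt.Printf("after %d failures: tokens=%.1f retry allowed=%v\n", i, th.tokens, th.allowRetry())
	}
}
```

The asymmetry is deliberate: with these numbers it takes ten successes to earn back what one failure costs, so a backend that fails even a modest fraction of calls keeps retries switched off.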

Two rules to follow: only configure retries for methods you know to be idempotent (the service config does not check this for you), and be wary of retrying on status codes that might mean the server did the work but the network swallowed the response (anything other than UNAVAILABLE and RESOURCE_EXHAUSTED with the right details should be considered risky). The client library adds one safeguard of its own: once a call is committed, which happens as soon as response headers arrive or the outgoing messages outgrow the retry buffer, it will not be retried. Calls that provably never reached the server's application logic are retried transparently, independent of the retry policy.

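When hedging shines. A search endpoint with p99 latency at 200 ms and p50 at 30 ms. Without hedging, every user feels the p99 occasionally. With hedgingDelay: 50ms and maxAttempts: 2, the slow tail gets a second chance: the first call that returns wins. A request that stalls now usually completes in about the hedging delay plus one typical response time, so p99 can fall from 200 ms toward roughly 80 ms when slow responses are independent of each other. The cost is extra QPS to the backend pool, equal to the fraction of calls that outlive the hedging delay (often a few percent). This is the canonical "tail-tolerant" pattern from Google's The Tail at Scale paper.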
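The mechanics of hedging fit in a short goroutine-based sketch. This is not gRPC's implementation, only an illustration of the pattern under simplified assumptions: hedge, attemptFn, and slowThenFast are hypothetical names, and the "backend" is a sleep that is slow on the first attempt and fast on the second.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// attemptFn is a stand-in for one RPC attempt.
type attemptFn func(ctx context.Context, attempt int) (string, error)

// hedge starts one attempt immediately and another every delay (or as soon
// as an attempt fails), up to maxAttempts. The first success wins and the
// remaining attempts are cancelled.
func hedge(ctx context.Context, maxAttempts int, delay time.Duration, call attemptFn) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the losers once a winner returns

	type result struct {
		val string
		err error
	}
	results := make(chan result, maxAttempts) // buffered so losers never block
	launched := 0
	launch := func() {
		i := launched
		launched++
		go func() {
			v, err := call(ctx, i)
			results <- result{v, err}
		}()
	}

	launch()
	timer := time.NewTimer(delay)
	defer timer.Stop()

	var lastErr error
	for finished := 0; finished < maxAttempts; {
		select {
		case r := <-results:
			if r.err == nil {
				return r.val, nil
			}
			finished++
			lastErr = r.err
			if launched < maxAttempts {
				launch() // a failure triggers the next attempt right away
			}
		case <-timer.C:
			if launched < maxAttempts {
				launch()
				timer.Reset(delay)
			}
		}
	}
	return "", lastErr
}

// slowThenFast simulates a backend pool where the first attempt hits a
// slow replica (200 ms) and later attempts hit a fast one (10 ms).
func slowThenFast(ctx context.Context, attempt int) (string, error) {
	d := 200 * time.Millisecond
	if attempt > 0 {
		d = 10 * time.Millisecond
	}
	select {
	case <-time.After(d):
		return fmt.Sprintf("attempt-%d", attempt), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demo runs one hedged call and reports which attempt won and how long it took.
func demo() (string, time.Duration) {
	start := time.Now()
	winner, err := hedge(context.Background(), 2, 50*time.Millisecond, slowThenFast)
	if err != nil {
		return "", time.Since(start)
	}
	return winner, time.Since(start)
}

func main() {
	winner, elapsed := demo()
	fmt.Println(winner, "won after about", elapsed.Round(10*time.Millisecond))
}
```

In this simulation the second attempt starts at 50 ms and finishes about 10 ms later, so the caller waits roughly 60 ms instead of 200 ms; the slow first attempt is cancelled when the function returns.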

Connection lifecycle — keepalive and idle timeout

HTTP/2 connections are long-lived. That's by design — connection setup (TCP + TLS) is expensive, and reusing one connection for thousands of RPCs is the whole point. But a long-lived connection that nobody knows is dead — a NAT that silently dropped state, a load balancer that severed it without sending RST — is a slow source of "the first call after lunch always times out" production pain.

gRPC exposes a handful of knobs for managing this:

| Setting | Default | What it does |
| --- | --- | --- |
| KEEPALIVE_TIME_MS | infinite (off) | Send a PING frame after N ms without activity on the connection. If no ACK arrives within KEEPALIVE_TIMEOUT_MS, tear it down. |
| KEEPALIVE_TIMEOUT_MS | 20s | How long to wait for the PING ACK before assuming the peer is gone. |
| KEEPALIVE_PERMIT_WITHOUT_CALLS | false | Allow pings even when no RPCs are active. Required for long-idle clients. |
| MAX_CONNECTION_AGE_MS | infinite (off) | Server-side. Gracefully drain and close the connection after this age, which forces clients to re-resolve DNS and rebalance. |
| MAX_CONNECTION_IDLE_MS | infinite (off) | Server-side. Close a connection that's been idle this long. Frees resources on the server. |
| MAX_CONCURRENT_STREAMS | implementation-dependent (100 is a common limit) | HTTP/2 setting. Server tells client how many concurrent streams it accepts. Saturating this leads to client-side queuing. |

A sensible production posture: clients set KEEPALIVE_TIME_MS to 30-60 seconds and KEEPALIVE_PERMIT_WITHOUT_CALLS true. Servers set MAX_CONNECTION_AGE_MS to ~10 minutes (with a 30s grace) so autoscaling churn actually moves traffic; idle timeouts to something generous like 1 hour. Without MAX_CONNECTION_AGE_MS, scaling-out adds capacity but existing clients pin to the original pods forever — a common source of "I added pods but load didn't budge" surprise.

The "too many pings" trap. If a client sets keepalive to something aggressive like 5 seconds without telling the server, the server will eventually reply with a GOAWAY (ENHANCE_YOUR_CALM, debug data too_many_pings) and tear down the connection. Servers enforce a minimum interval between client pings (5 minutes by default in grpc-go and the C-core implementations) and tolerate only PING_STRIKES violations (default 2). Either set the client's keepalive interval and the server's enforcement policy together, or accept the defaults on both ends.

Observability — tracing, channelz, stats handlers

Three layers of visibility matter, and gRPC supports all of them as first-class extension points rather than bolt-ons.

  • Distributed tracing via OpenTelemetry. The otelgrpc interceptor (every major language has an equivalent) extracts the W3C traceparent header from incoming metadata, starts a child span, and injects the propagated context into outgoing calls. The result is one trace per cross-service RPC, with the gRPC method as the span name. This is the same shape as HTTP-tracing instrumentation, just plugged in via gRPC's interceptor seam.
  • Channelz. A built-in introspection service every gRPC server can expose. It enumerates live channels, sub-channels, sockets, and per-stream statistics: what you'd otherwise have to scrape from kernel netstat plus application logs. Useful for debugging which backends a load-balanced channel is actually using and how many streams are in flight per connection.
  • Stats handlers. A lower-level hook than interceptors. You see every begin/end of every RPC, every byte read and written, every header and trailer. Most service-mesh dataplanes plumb these into Prometheus counters via the language SDK; it's also the seam Stripe and Slack use to produce per-method latency histograms by status code.

Two metrics every gRPC service should expose: RPCs per second by status code (so you can alert on a sudden surge of UNAVAILABLE or DEADLINE_EXCEEDED) and p50/p99/p99.9 latency by method (so you can localize a regression to a single endpoint). Both fall out of the stats-handler hook with maybe 20 lines of code per language.

The two metrics that catch outages. Sustained >1% of calls returning RESOURCE_EXHAUSTED means the server is rate-limiting or the queue is full — page someone. Sustained >1% returning UNAVAILABLE means the client can't reach the server at all — page someone. Both are usually invisible at the HTTP layer because gRPC errors don't surface as HTTP non-2xx; you have to instrument the gRPC stats layer specifically.

Code generation and the tooling stack

Protobuf and gRPC use a code-generation model: the .proto files describe the service contract, a code generator emits language-specific stubs and message classes, and the application code consumes those generated types. The generation step is the part most teams initially under-invest in and then spend disproportionate time fixing.

| Tool | Role | Notes |
| --- | --- | --- |
| protoc | The reference code generator | Written in C++, ships separate plugins per language. The OG tool. Increasingly driven through higher-level tools rather than invoked by hand. |
| buf | Higher-level workflow tool | Replaces direct protoc invocations (it has its own compiler and runs the same plugins) and adds lint rules, breaking-change detection, and the BSR (Buf Schema Registry) for cross-team .proto distribution. The modern default. |
| protoc-gen-go / protoc-gen-go-grpc | Go code generators | The split between message generation and service generation is recent (~2020). Older codebases ship one combined plugin. |
| protoc-gen-grpc-java | Java generator | Produces enormous classes. Often paired with protoc-gen-grpc-kotlin for Kotlin idioms. |
| Tonic / Prost | Rust generator + runtime | The dominant Rust gRPC stack. Generates async-await idiomatic code; integrates with Tokio. |
| betterproto | Modern Python generator | Produces dataclasses, type hints, async-native. An alternative to the output of the official protobuf and grpcio-tools generators. |
| grpc-gateway | REST gateway from .proto | Reads HTTP annotations in the .proto, generates a reverse-proxy that turns REST calls into gRPC. The "single source of truth" pattern. |
| protoc-gen-validate (PGV) | Field-level validation | Reads validate.rules annotations, generates language-specific validators. Saves writing validation boilerplate. Its maintainers now point new projects to protovalidate as the successor. |
| connect-go / connect-web | Alternative gRPC + REST framework | Buf's stack. Same .proto, single binary speaks gRPC, gRPC-Web, and JSON over HTTP/1.1. Excellent for browser clients. |

A reasonable starting point in 2026: use buf for builds and lint, buf breaking in CI to catch wire-incompat changes before they merge, and language-specific generators behind it. If a public REST surface is required, layer either grpc-gateway or Connect on top. Resist the temptation to hand-write the gRPC client and server — the generator output is mechanical and the value is in keeping the .proto as the only thing humans edit.

Wire-incompatibility in CI. The single most important hygiene move on a Protobuf codebase: run buf breaking in CI with --against pointed at the previous release. It catches removed fields, changed types, reused tag numbers: all the things that look fine in code review but break clients already in the field. Many outages in the gRPC ecosystem start with a wire-incompat change that the team didn't realise was wire-incompat.

The Protobuf wire format, in 200 words

Protobuf encodes each field as (tag, wire-type, payload). The tag-and-wire-type combine into a single varint key; the wire type tells the parser how to read the payload that follows. Six wire types cover everything.

| Wire type | ID | Used for |
| --- | --- | --- |
| VARINT | 0 | int32, int64, uint32, uint64, sint32, sint64, bool, enum; 1-10 bytes, MSB continuation bit |
| I64 | 1 | fixed64, sfixed64, double; exactly 8 bytes little-endian |
| LEN | 2 | string, bytes, embedded messages, packed repeated; varint length followed by payload |
| SGROUP / EGROUP | 3, 4 | Deprecated proto2 groups. Don't use. |
| I32 | 5 | fixed32, sfixed32, float; exactly 4 bytes little-endian |

Two implications worth knowing. First, signed ints (int32, int64) encode negative numbers as 10-byte varints — twos-complement is sign-extended to 64 bits first. If your field is mostly negative, use sint32 (which zigzags before varint-encoding) to keep the wire size small. Second, the VARINT continuation bit means parsing has to walk the payload byte by byte; on hot paths, fixed-width fixed64 can be faster despite using more bytes, because the decoder can do a single 8-byte load.
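Both the key layout and the negative-number cost can be checked in a few lines. The sketch below leans on the fact that Go's encoding/binary unsigned varint uses the same base-128 encoding as Protobuf; key, int32Varint, and sint32Varint are hypothetical helper names, not part of any Protobuf library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// key builds a field key: the field number shifted left by three bits,
// OR-ed with the wire type, then varint-encoded.
func key(fieldNumber, wireType uint64) []byte {
	return binary.AppendUvarint(nil, fieldNumber<<3|wireType)
}

// int32Varint encodes a value the way an int32 field does: the value is
// sign-extended to 64 bits before varint encoding.
func int32Varint(v int32) []byte {
	return binary.AppendUvarint(nil, uint64(int64(v)))
}

// sint32Varint encodes a value the way a sint32 field does: zigzag first,
// so small negative numbers become small unsigned numbers.
func sint32Varint(v int32) []byte {
	zigzag := uint32(v<<1) ^ uint32(v>>31)
	return binary.AppendUvarint(nil, uint64(zigzag))
}

func main() {
	fmt.Printf("key(field 1, LEN)  = % x\n", key(1, 2))
	fmt.Printf("int32  300         = % x\n", int32Varint(300))
	fmt.Printf("int32  -1          = %d bytes\n", len(int32Varint(-1)))
	fmt.Printf("sint32 -1          = %d byte\n", len(sint32Varint(-1)))
}
```

The same -1 costs ten bytes as an int32 and one byte as a sint32, which is the whole argument for zigzag on fields that are frequently negative.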

Backpressure and flow control in streams

Streaming RPCs introduce a question unary calls never ask: what happens when the producer is faster than the consumer? gRPC inherits HTTP/2's per-stream flow control window — every receiver advertises a window size that the sender may not exceed without an explicit WINDOW_UPDATE frame. That gives you wire-level backpressure for free. The trap is that most language SDKs expose a higher-level Send/Recv API that hides the wire, and so an application can happily call stream.Send(...) a million times in a tight loop and discover only later that the data is buffered in the SDK rather than actually flowing.

| Layer | Mechanism | What you have to do |
| --- | --- | --- |
| HTTP/2 flow control | Per-stream WINDOW_UPDATE frames; default window 64 KB | Nothing, it is handled by the SDK. Bump INITIAL_WINDOW_SIZE on high-throughput streams to amortise window-update cost. |
| SDK send buffer | Per-language buffer between the application Send call and the actual write | Check whether your SDK blocks on backpressure or buffers unboundedly. grpc-go blocks; some others queue. |
| Application-level pacing | Producer paces itself based on observed throughput | For unbounded streams (CDC, telemetry), add an explicit acknowledgement message every N records and pause when behind. |
| Receive buffer | The consumer's stream.Recv() rate sets the WINDOW_UPDATE cadence | If the consumer is slow, the window stalls, the sender blocks. This is the wanted behaviour; don't try to "fix" it by buffering on the server. |

For server-streaming RPCs (one-to-many), the typical anti-pattern is a server that produces 100,000 records into a slow client and OOMs on the buffer. The fix is to write the producer as a loop that blocks on each Send — grpc-go's SendMsg blocks until the stream has window — and stop producing eagerly. For bidirectional streams, application-level acknowledgement messages (a "credits" pattern, à la TCP windows) give you explicit control: the consumer grants N credits, the producer sends N messages, waits for more credits, repeats. This is how Apache Pulsar and CockroachDB's KV stream work.
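The credits pattern can be sketched with two channels standing in for the two directions of a bidirectional stream. This is an illustration under simplified assumptions, not gRPC code: runCredits is a hypothetical function, the data channel plays the role of an over-generous SDK send buffer, and the in-flight counter exists only to show that the bound holds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runCredits pushes total messages from a producer to a consumer. The
// producer may only send while it holds credits; the consumer grants a
// fresh batch after it has processed the previous one. It returns how many
// messages arrived and the largest number ever outstanding at once.
func runCredits(total, batch int) (received, maxInFlight int) {
	data := make(chan int, total) // send direction, with a deliberately large buffer
	credits := make(chan int, 1)  // receive direction: credit grants from the consumer
	var inFlight atomic.Int64
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // producer
		defer wg.Done()
		defer close(data)
		sent := 0
		for sent < total {
			n := <-credits // block until the consumer grants more credits
			for i := 0; i < n && sent < total; i++ {
				if cur := int(inFlight.Add(1)); cur > maxInFlight {
					maxInFlight = cur
				}
				data <- sent
				sent++
			}
		}
	}()

	// consumer
	credits <- batch // initial grant
	sinceGrant := 0
	for range data {
		inFlight.Add(-1)
		received++
		sinceGrant++
		if sinceGrant == batch { // previous batch fully processed: grant another
			sinceGrant = 0
			credits <- batch
		}
	}
	wg.Wait()
	return received, maxInFlight
}

func main() {
	got, peak := runCredits(10, 3)
	fmt.Printf("received %d messages, never more than %d outstanding\n", got, peak)
}
```

Even though the data channel could hold every message at once, the producer never gets more than one batch ahead, because the limit is enforced by the consumer's grants rather than by buffer size.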

The 64 KB default that bites everyone once. HTTP/2's default initial window is 64 KB per stream. On a high-latency long-fat-pipe link, that means a single stream is capped at 64 KB / RTT throughput — roughly 6 MB/s over a 10 ms RTT. Bump InitialWindowSize (and InitialConnWindowSize) to something like 4 MB if you're streaming GBs over a WAN. It's two lines of SDK config and turns 6 MB/s into the actual line rate.
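The arithmetic behind that cap is one division: a sender can have at most one window of unacknowledged data per round trip. A small sketch, where maxStreamThroughput is a hypothetical helper and the figures are the ones used above (the HTTP/2 default window is 65,535 bytes).

```go
package main

import (
	"fmt"
	"time"
)

// maxStreamThroughput returns the upper bound, in bytes per second, for a
// single stream that can send at most one flow-control window per round trip.
func maxStreamThroughput(windowBytes int, rtt time.Duration) float64 {
	return float64(windowBytes) / rtt.Seconds()
}

func main() {
	rtt := 10 * time.Millisecond
	for _, window := range []int{65535, 4 << 20} {
		mbps := maxStreamThroughput(window, rtt) / 1e6
		fmt.Printf("window %8d bytes, RTT %v: at most %.1f MB/s\n", window, rtt, mbps)
	}
}
```

The bound scales inversely with RTT, so the same 64 KB window that is harmless inside a datacentre becomes the bottleneck on a cross-region link.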

Local development and testing

gRPC pays back its setup cost most when integrated into a fast feedback loop, but it has a longer ramp than REST because there is no curl-equivalent shipped with the OS. A handful of tools fill the gap.

| Tool | What it does | When to reach for it |
| --- | --- | --- |
| grpcurl | CLI client that uses reflection or a .proto file to make gRPC calls. The curl-equivalent. | Quick "does this RPC return what I expect" checks. Smoke tests in CI. |
| evans | Interactive REPL for gRPC services. Tab completion across methods and fields. | Exploring an unfamiliar service. Onboarding to a new codebase. |
| Postman / Insomnia | GUI clients with gRPC support since ~2022. | Cross-team API workshops; sharing collections with non-engineers. |
| buf curl | Buf's grpcurl-like with HTTP/JSON + gRPC + Connect support. | If your stack already uses Buf for builds. |
| gRPC reflection | Server-side feature that exposes the .proto schema over a special RPC. | Enable in dev/staging; disable or auth-gate in production (it's an attack-surface signal). |

For tests, prefer an in-process channel over a real TCP loopback where your SDK offers one. grpc-go's bufconn is a fake net.Listener backed by a buffer, and grpc-java ships a comparable in-process transport, so unit tests can spin up a real server, dial it, and exercise interceptors without binding a port. This makes tests parallelisable, deterministic, and fast (sub-millisecond per call).

// Go — in-process gRPC server for tests
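// catalogpb and fakeCatalog stand for your generated package and test double;
// t is the *testing.T of the surrounding test.
lis := bufconn.Listen(1024 * 1024) // 1 MiB in-memory buffer instead of a TCP port
srv := grpc.NewServer()
catalogpb.RegisterCatalogServiceServer(srv, &fakeCatalog{})
go func() {
    if err := srv.Serve(lis); err != nil {
        log.Printf("server exited: %v", err)
    }
}()
defer srv.Stop()

conn, err := grpc.DialContext(ctx, "bufnet",
    grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
        return lis.DialContext(ctx)
    }),
    grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
    t.Fatalf("dial bufnet: %v", err)
}
defer conn.Close()
client := catalogpb.NewCatalogServiceClient(conn)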

Two operational rules pay back disproportionately. First, every new service should ship with a grpcurl-callable health-check method (grpc.health.v1.Health/Check) — the load balancer and the deployment pipeline both want this. Second, write at least one end-to-end integration test that exercises the full client SDK, not just an in-process handler — there have been many production incidents caused by an interceptor that only fires on the real network path (auth, tracing, retries) misbehaving in ways the in-process test never saw.

When NOT to use gRPC

A short, opinionated list. gRPC is excellent for its core use case and routinely misapplied outside it.

  • Public-facing APIs consumed by third parties. Discoverability is bad, browser support is via grpc-web (which loses bidirectional streaming and requires a proxy), and curl-ability is essentially nil. Use REST/JSON or GraphQL at the edge.
  • Browser-to-server with significant client state. grpc-web works but the SDK is awkward and the streaming story is half-baked. Server-sent events or WebSockets are usually a better fit.
  • Mostly-static configuration / reference data. If 95% of your requests are GETs of slow-moving data, the HTTP cache machinery (Cache-Control, ETag, CDN) is too valuable to give up. REST + a real CDN is faster end-to-end than gRPC bypassing the cache.
  • When the team can't operate it. Debugging requires Envoy-aware tooling, observability needs proto-aware sampling, and language SDKs vary in maturity. If the org isn't ready for that surface area, JSON-over-HTTP is the lower total cost of ownership.
  • When the message domain is text-heavy. Protobuf isn't a great fit for messages dominated by free-form prose (chat content, search documents). The wire savings are smaller and you lose the JSON ergonomics that text-heavy clients want.
