
GraphQL

GraphQL is a query language for your API. Instead of designing endpoints around the data the server wants to expose, you publish a typed graph of everything that can be read or written, and you let the client ask for the exact shape it needs in one request. Facebook built it for their mobile apps and open-sourced it in 2015. It is a real win for teams shipping many clients against the same backend, and it is too much machinery for a small, simple API. This page walks the whole thing: the schema, the resolver tree, the N+1 problem and the fix, caching, pagination, errors, security, and federation — then the honest question of when it earns its keep.


The whole idea in one move

Start from the problem GraphQL was built to solve. A mobile app needs to draw a profile screen: a user's name and avatar, their last five posts, and the comment count on each post. With a plain REST API you hit /users/42, then /users/42/posts, then a comment-count endpoint for each post. Three or four round trips, and each response hands you a fixed payload that probably carries fields the screen will never render. Either you fetch too much (over-fetching) or you have to make several calls to assemble what one screen needs (under-fetching). On a phone with a flaky connection, every extra round trip is felt.

GraphQL inverts the relationship. The server publishes a typed graph of data, and the client sends one request describing the exact tree of fields it wants. The server returns a response whose shape mirrors the request, no more and no less. One round trip, no wasted bytes, and the client decides the shape rather than the server guessing it in advance. That single inversion is the source of every benefit GraphQL claims and most of the costs it carries.

[Figure: two panels. REST: three trips, extra fields (the client calls /users/42, /users/42/posts, and /posts/_/comments; each returns name, email, phone, address, … unused). GraphQL: one trip, exact shape (the client sends user(id:42){ name avatar posts(first:5){ title commentCount } } as one POST /graphql, and the response matches the query tree).]
REST spreads a screen across endpoints and hands back fixed payloads. One GraphQL query names the fields and gets them in a single round trip.

The three pieces that define GraphQL

  1. A typed schema, defined on the server. The server writes a schema in SDL (Schema Definition Language): the types, the fields on each type, and how they connect. Every query is validated against it before a single resolver runs, so a typo or a wrong type fails fast with a clear message rather than at the database.
  2. One endpoint, one verb. Almost everything is a single POST /graphql. The operation travels in the request body. There is no URI to design, no per-resource status-code convention, and — importantly — no caching keyed on the URL, because every request shares the same URL.
  3. Three operation types. A query reads, a mutation writes, and a subscription streams updates over time, usually on a WebSocket or server-sent events.
# Schema (SDL) — the server's contract
type Query {
  user(id: ID!): User
}

type User {
  id: ID!
  name: String!
  posts(first: Int = 10): [Post!]!
}

type Post {
  id: ID!
  title: String!
  author: User!
}

# Query (the client authors this)
query {
  user(id: "42") {
    name
    posts(first: 5) {
      title
    }
  }
}

Read the SDL closely, because the type system carries more weight than it looks. The ! means non-null: name: String! promises the field is never null, and [Post!]! promises a non-null list of non-null posts. Fields can take arguments, like posts(first: Int = 10), with defaults. The schema is the single source of truth that both sides agree on, and tooling reads it to generate typed client SDKs, autocomplete queries in an editor, and validate every request. The schema is the API documentation, and it cannot drift from the implementation the way a hand-written REST spec can.

Queries, mutations, subscriptions

The three operation types map onto the three things an API does: read state, change state, and watch state change. They are deliberately separated, and the separation matters in practice rather than just on paper.

A query is a read. The fields inside a query are resolved in parallel where the runtime can, because reads have no ordering dependency on each other. A mutation is a write, and the top-level fields of a single mutation are run in series, top to bottom, so that one write can depend on the previous one finishing. A subscription opens a long-lived channel and pushes a new payload to the client every time some event fires — a new message in a chat, a price change, a build finishing. Subscriptions need a transport that stays open, which is why they ride on WebSockets or SSE rather than the request-response POST that queries and mutations use.
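
The difference in execution order can be sketched in plain JavaScript. This is an illustration of the scheduling idea only, not how any particular GraphQL runtime is implemented; makeField, runQueryFields, and runMutationFields are hypothetical names invented for the example.

```javascript
// Sketch only: hypothetical helpers, not a real GraphQL runtime.
// Each "field" is an async function that logs when it starts and ends.
const makeField = (name, log) => async () => {
  log.push(`start ${name}`);
  await Promise.resolve(); // stand-in for I/O
  log.push(`end ${name}`);
  return name;
};

// Query: sibling fields have no ordering dependency, so start them all at once.
async function runQueryFields(fields) {
  const entries = await Promise.all(
    Object.entries(fields).map(async ([key, resolve]) => [key, await resolve()])
  );
  return Object.fromEntries(entries);
}

// Mutation: top-level fields run one after another, in document order.
async function runMutationFields(fields) {
  const result = {};
  for (const [key, resolve] of Object.entries(fields)) {
    result[key] = await resolve();
  }
  return result;
}

async function demo() {
  const queryLog = [];
  await runQueryFields({
    a: makeField('a', queryLog),
    b: makeField('b', queryLog),
  });

  const mutationLog = [];
  await runMutationFields({
    createUser: makeField('createUser', mutationLog),
    addToTeam: makeField('addToTeam', mutationLog),
  });

  return { queryLog, mutationLog };
}
```

In the query log both fields start before either finishes; in the mutation log addToTeam does not start until createUser has ended, which is what lets a later write depend on an earlier one.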

[Figure: three panels. query (read): fields A to D resolved in parallel, one response back. mutation (write): createUser, addToTeam, sendInvite, top-level fields run in series. subscription (stream): many pushes from server to client over one open channel.]
Three operation types, three execution shapes: parallel reads, serial writes, and a long-lived push channel.

One practical note about mutations: the response shape is up to you, and the good pattern is to return the objects you just changed so the client can update its cache in the same round trip. A mutation that returns only { ok: true } forces the client to re-query, which throws away one of GraphQL's advantages.

Resolvers and the resolver tree

A schema describes what data exists. Resolvers say how to fetch it. For every field in the schema you register a resolver: a function that, given the parent value and the field's arguments, returns that field's value. The runtime takes the incoming query, walks it as a tree, and calls one resolver per field per object. The result of a parent resolver becomes the first argument to its children's resolvers, so data flows down the tree as execution descends.

const resolvers = {
  Query: {
    user: (_, { id }, ctx) => ctx.db.users.find(id),
  },
  User: {
    posts: (user, { first }, ctx) =>
      ctx.db.posts.where({ authorId: user.id }).limit(first),
  },
  Post: {
    author: (post, _, ctx) => ctx.db.users.find(post.authorId),
    title:  (post) => post.title,   // a "trivial" resolver
  },
};

Every resolver receives four arguments: the parent (the value its parent resolver returned), the field args, a shared context object (the place to put the database handle, the authenticated user, request-scoped loaders), and an info object describing the rest of the query. Fields that simply read a property off the parent — like Post.title — usually need no explicit resolver at all; the default resolver just reads parent.title. You only write resolvers for the fields that do real work.
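
To make the tree walk concrete, here is a toy executor. It is a deliberately small sketch, not graphql-js: the schema map, the selection format, and execute() are hypothetical simplifications, everything is synchronous, and the fourth info argument is left out.

```javascript
// Sketch only: a toy executor, not graphql-js. The schema map, the
// selection format, and execute() are hypothetical simplifications.
const schema = {
  Query: { user: 'User' },
  User: { name: 'String', posts: 'Post' },
  Post: { title: 'String' },
};

const resolvers = {
  Query: {
    user: (_parent, { id }, ctx) => ctx.users.find((u) => u.id === id),
  },
  User: {
    posts: (user, { first }, ctx) =>
      ctx.posts.filter((p) => p.authorId === user.id).slice(0, first),
  },
  // No entry for User.name or Post.title: the default resolver handles them.
};

// Default resolver: read the property with the field's name off the parent.
const defaultResolver = (fieldName) => (parent) => parent[fieldName];

function execute(typeName, parent, selections, ctx) {
  const out = {};
  for (const { name, args = {}, fields } of selections) {
    const resolve = resolvers[typeName]?.[name] ?? defaultResolver(name);
    const value = resolve(parent, args, ctx); // parent flows down the tree
    const childType = schema[typeName][name];
    if (!fields || value == null) {
      out[name] = value ?? null; // leaf, or nothing to descend into
    } else if (Array.isArray(value)) {
      out[name] = value.map((v) => execute(childType, v, fields, ctx));
    } else {
      out[name] = execute(childType, value, fields, ctx);
    }
  }
  return out;
}

const ctx = {
  users: [{ id: '42', name: 'Ada' }],
  posts: [
    { id: 'p1', authorId: '42', title: 'First' },
    { id: 'p2', authorId: '42', title: 'Second' },
  ],
};

// Equivalent of: { user(id: "42") { name posts(first: 1) { title } } }
const result = execute('Query', {}, [
  {
    name: 'user',
    args: { id: '42' },
    fields: [
      { name: 'name' },
      { name: 'posts', args: { first: 1 }, fields: [{ name: 'title' }] },
    ],
  },
], ctx);
```

Only Query.user and User.posts have explicit resolvers here; name and title fall through to the default resolver, which is the behaviour described above.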

[Figure: resolver tree. Query.user(42) calls User.posts(first:5), whose five posts fan out into five Post.author resolvers, each hitting the database.]
The query tree becomes a tree of resolver calls. Each field on each object runs its own resolver, and the leaf resolvers fan out to the data store.

This model is the source of GraphQL's flexibility and its single most common performance trap. Because each field resolves independently, the runtime has no built-in idea that five Post.author calls could share one database query. Run the resolvers exactly as written and you get a separate fetch for each — which is the next section.

The N+1 problem, and DataLoader

Take the query that asks for a user's five posts and the author of each post. The runtime calls Query.user once, User.posts once to get five posts, and then Post.author five times, once per post. Each author call issues its own SELECT. That is one query for the user, one for the posts, and five for the authors: seven database round trips to render a single screen. The pattern is named for its shape — one query to fetch the parents, then N more, one per child — and it is the default behaviour of naive resolvers. It does not bite in development with three rows. It melts a database in production when a list page resolves a related field across a thousand items and turns into 1001 queries.

The fix is a per-request batching layer, the reference implementation of which is DataLoader. Instead of fetching one id at a time, a loader collects every id requested during the current tick of the event loop, fires a single batched query for all of them, and hands each resolver back its own row. It also caches within the request, so asking for the same id twice does one fetch.

import DataLoader from 'dataloader';

// Build a FRESH loader on every request, for example in the context factory.
const createUserLoader = (db) =>
  new DataLoader(async (ids) => {
    const rows = await db.users.whereIn('id', ids);
    // Must return rows in the SAME ORDER as the input ids,
    // with a null placeholder for any id that had no row.
    const byId = new Map(rows.map((r) => [r.id, r]));
    return ids.map((id) => byId.get(id) ?? null);
  });

// Per-request context: ctx = { db, userLoader: createUserLoader(db) }

// The resolver no longer queries directly:
const resolvers = {
  Post: {
    author: (post, _, ctx) => ctx.userLoader.load(post.authorId),
  },
};

// 5 calls to .load() in one tick → 1 batched SELECT … WHERE id IN (…)

[Figure: two panels. N+1: resolvers r1 to r5 each query the DB, five SELECTs. DataLoader: r1 to r5 go through a loader (batch + cache) that issues one SELECT … WHERE id IN (1,2,3,4,5).]
DataLoader sits between resolvers and the data store. It collapses the per-id fetches of one tick into a single batched query, then deals each resolver its own row back.
Why per request, never global. The loader is created fresh for each incoming request and discarded after. A long-lived global cache would leak data between users — request A's authorisation context might let it see a row that request B must not — and it would serve stale rows after a write. Per-request scope keeps the batching win without the correctness risk.

Two details trip people up. First, the batch function must return results in the same order as the ids it was given, with a placeholder for ids that had no row, because DataLoader matches outputs to inputs by position. Second, batching only happens for .load() calls made in the same tick of the event loop, which is exactly when the GraphQL runtime resolves the siblings of a list — so the batching lines up with the resolver tree by design, not by accident.
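
The batching mechanism itself is small enough to sketch by hand. The class below is a stand-in written for this page, not the dataloader package: TinyLoader is a hypothetical name, it schedules its flush with queueMicrotask (the real library's scheduling is more careful), and it omits options, cache clearing, and priming.

```javascript
// Sketch only: a hand-rolled stand-in for the batching idea, NOT the
// dataloader package. TinyLoader and its scheduling are simplified.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.cache = new Map(); // key -> promise, lives as long as the loader
    this.queue = [];        // loads waiting for the next flush
  }

  load(key) {
    if (this.cache.has(key)) return this.cache.get(key); // request-scoped cache
    const promise = new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      // The first load of a tick schedules one flush for everything queued.
      if (this.queue.length === 1) queueMicrotask(() => this.flush());
    });
    this.cache.set(key, promise);
    return promise;
  }

  async flush() {
    const batch = this.queue;
    this.queue = [];
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      // Results are matched to requests by position, hence the ordering rule.
      batch.forEach((item, i) => item.resolve(values[i]));
    } catch (err) {
      batch.forEach((item) => item.reject(err));
    }
  }
}

async function demo() {
  const batches = [];
  const loader = new TinyLoader(async (ids) => {
    batches.push(ids); // stands in for one SELECT ... WHERE id IN (...)
    return ids.map((id) => ({ id, name: `user-${id}` }));
  });
  // Five sibling resolvers ask for authors in the same tick; two ids repeat.
  const authors = await Promise.all(
    [1, 2, 3, 2, 1].map((id) => loader.load(id))
  );
  return { batches, authors };
}
```

Five load() calls produce a single batch containing the three distinct ids, and the repeated ids are served from the loader's own cache. Creating a new TinyLoader per request is what keeps that cache from outliving the request.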

Over-fetching, under-fetching, and the cost of flexibility

The flexibility that kills over- and under-fetching for the client moves the problem onto the server. In REST, an endpoint author knows precisely which queries their handler runs and can tune them. In GraphQL, the client composes the query, so the set of possible queries is large and the server must perform well across all of them. A field that is cheap on its own can be ruinous when a client selects it across a thousand parents. DataLoader handles the common fan-out, but deeply nested selections, expensive computed fields, and joins that a client did not realise it triggered still need attention. The work GraphQL saves the client is real, and so is the work it asks of the server team.

This is also why "just expose every field" is a trap. Each new field is a new query the server must be ready to answer well under load. Treat the schema as a product with a budget, not as a thin wrapper that dumps the database onto the wire.

Caching is the hard part

REST gets a great deal of caching almost for free because the URL is the cache key. A GET /users/42 can be cached by the browser, a CDN, and a reverse proxy with no extra thought, and an ETag or Cache-Control header tunes it. GraphQL gives that up. Every operation is a POST to the same /graphql URL with the query in the body, so the layers of HTTP caching that REST relies on cannot distinguish one request from another. Caching in GraphQL is something you build rather than something you get.

There are two answers, at two layers. On the client, normalised caches (Apollo Client, Relay, urql) store objects by a stable global id and reassemble query results from those objects, so a user fetched by one query updates everywhere it appears. That pays off, and it is why client libraries lean so hard on object identity. On the network, the move is persisted queries: register a query with the server once, refer to it by a hash forever after, and now the request is small and stable enough to cache.
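
The client-side idea can be shown with a toy normalised store. This is a sketch under heavy simplification, assuming every cacheable object carries __typename and id; normalise, denormalise, and the __ref marker are hypothetical names, and real clients such as Apollo Client, Relay, and urql do a great deal more (including handling cycles, which this does not).

```javascript
// Sketch only: a toy normalised cache. The function names and the
// __ref marker are hypothetical; cyclic data is not handled.
const store = new Map(); // "Type:id" -> flat object

const keyOf = (obj) => `${obj.__typename}:${obj.id}`;
const isEntity = (v) =>
  Boolean(v) && typeof v === 'object' && Boolean(v.__typename) && v.id != null;

// Walk a response, store each identifiable object once, and replace
// nested objects with a reference to their cache key.
function normalise(value) {
  if (Array.isArray(value)) return value.map(normalise);
  if (!value || typeof value !== 'object') return value;
  const flat = {};
  for (const [field, child] of Object.entries(value)) {
    flat[field] = normalise(child);
  }
  if (!isEntity(value)) return flat;
  const key = keyOf(value);
  store.set(key, { ...store.get(key), ...flat }); // merge with what is known
  return { __ref: key };
}

// Rebuild a result tree by following references back into the store.
function denormalise(value) {
  if (Array.isArray(value)) return value.map(denormalise);
  if (!value || typeof value !== 'object') return value;
  if (value.__ref) return denormalise(store.get(value.__ref));
  return Object.fromEntries(
    Object.entries(value).map(([field, child]) => [field, denormalise(child)])
  );
}

// One query returns a post with its author...
const postRef = normalise({
  __typename: 'Post',
  id: 'p1',
  title: 'Hello',
  author: { __typename: 'User', id: '42', name: 'Ada' },
});

// ...and a later, unrelated query returns the same user with a new name.
normalise({ __typename: 'User', id: '42', name: 'Ada Lovelace' });

const post = denormalise(postRef);
```

Because the user is stored once under User:42, the post read back from the store shows the updated name even though the post itself was never refetched.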

Persisted queries

A persisted query is a query the server already knows. Instead of sending the full query text on every request, the client sends a short hash of it. The server looks the hash up, runs the known query, and returns the result. This shrinks the request, which removes the upload cost of shipping large query strings on mobile. And because a hash is a short, stable key (one that can travel in a GET URL, for example), it also makes the request cacheable by ordinary HTTP caches and lets you allow-list exactly which queries production will accept.

Apollo's automatic persisted queries (APQ) make this self-bootstrapping. The client first sends only the hash. If the server has not seen it, it replies with a PersistedQueryNotFound error, the client retries with the full query plus the hash, and the server stores it. From then on every client sends only the hash. APQ on its own registers any query a client sends, so it saves bandwidth but restricts nothing. The security benefit comes from the stricter variant: register the queries ahead of time and reject any hash that is not on the list, so an attacker cannot run an arbitrary, expensive query that was never deployed.

[Figure: client/server sequence. The client sends { hash: ab12… }; the server replies PersistedQueryNotFound; the client resends { hash: ab12…, query: "…" } and the server stores it; later requests send { hash: ab12… } only and hit the cache.]
Automatic persisted queries bootstrap themselves: the full query crosses the wire once, after which the client refers to it by a stable, cacheable hash.
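
The handshake can be sketched in a few lines of Node. This models only the shape of the exchange, not Apollo's actual wire format: handleRequest, send, the in-memory Map, and the HashMismatch label are hypothetical, and the "server" is just a function call.

```javascript
// Sketch only: the shape of the handshake, not Apollo's wire format.
// handleRequest, send, and the in-memory Map are hypothetical.
const { createHash } = require('node:crypto');

const sha256 = (text) => createHash('sha256').update(text).digest('hex');

const known = new Map(); // hash -> query text (server side)

function handleRequest({ hash, query }) {
  if (query !== undefined) {
    // Full query supplied: check it matches the hash, then remember it.
    if (sha256(query) !== hash) return { error: 'HashMismatch' };
    known.set(hash, query);
  }
  const text = known.get(hash);
  if (text === undefined) return { error: 'PersistedQueryNotFound' };
  return { ran: text }; // a real server would execute the query here
}

// Client side: try the hash alone, fall back to hash + query once.
function send(query) {
  const hash = sha256(query);
  const first = handleRequest({ hash });
  if (first.error !== 'PersistedQueryNotFound') {
    return { response: first, attempts: 1 };
  }
  return { response: handleRequest({ hash, query }), attempts: 2 };
}

const query = '{ user(id: "42") { name } }';
const firstCall = send(query);  // hash unknown, so the full query is sent once
const secondCall = send(query); // the hash alone is enough from now on
```

To turn this into an allow-list, the known map would be filled at build or deploy time and the branch that stores client-supplied queries would be removed, so an unregistered hash is always rejected.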

Pagination — cursors and the Relay convention

Lists need pagination, and GraphQL has a strong house style for it borrowed from Relay. Rather than offset-based paging (page=3), which double-counts or skips rows when the underlying data changes between requests, the convention is cursor-based: each item carries an opaque cursor that encodes its position, and you ask for "the first N after this cursor." Cursors are stable under inserts and deletes, which is what you want for an infinite scroll that does not glitch.

type PostConnection {
  edges: [PostEdge!]!
  pageInfo: PageInfo!
}
type PostEdge {
  cursor: String!
  node: Post!
}
type PageInfo {
  hasNextPage: Boolean!
  endCursor: String
}

query {
  user(id: "42") {
    posts(first: 10, after: "cursorXYZ") {
      edges { cursor node { title } }
      pageInfo { hasNextPage endCursor }
    }
  }
}

The edges/node/cursor shape looks like ceremony, and it partly is, but it pays off. The cursor lives on the edge, not the node, because an item's position in a list is a property of the list, not of the item. pageInfo.hasNextPage tells the client when to stop, and endCursor is what it passes as after next time. The same pattern supports backward paging with last and before. You do not have to adopt the Relay spec, but most tooling and most teams expect it, so the path of least resistance is to use it.
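
A minimal sketch of how a server might build such a connection is below. It assumes cursors are base64-encoded ids and that the list is sorted by ascending id; both are choices made for this example, and paginate, encodeCursor, and decodeCursor are hypothetical helpers rather than part of any library.

```javascript
// Sketch only: cursors here are base64-encoded ids, an assumption of
// this example. paginate() and the cursor helpers are hypothetical.
const encodeCursor = (id) => Buffer.from(`post:${id}`).toString('base64');
const decodeCursor = (cursor) =>
  Buffer.from(cursor, 'base64').toString('utf8').replace(/^post:/, '');

// items must already be in a stable order (here: ascending id).
function paginate(items, { first, after }) {
  let start = 0;
  if (after) {
    const afterId = Number(decodeCursor(after));
    // Position is found by value, so inserts or deletes earlier in the
    // list do not shift the page the way a numeric offset would.
    start = items.findIndex((item) => item.id > afterId);
    if (start === -1) start = items.length;
  }
  const page = items.slice(start, start + first);
  const edges = page.map((node) => ({ cursor: encodeCursor(node.id), node }));
  return {
    edges,
    pageInfo: {
      hasNextPage: start + first < items.length,
      endCursor: edges.length ? edges[edges.length - 1].cursor : null,
    },
  };
}

const posts = [1, 2, 3, 4, 5].map((id) => ({ id, title: `Post ${id}` }));

const page1 = paginate(posts, { first: 2 });
posts.shift(); // post 1 is deleted between the two requests
const page2 = paginate(posts, { first: 2, after: page1.pageInfo.endCursor });
```

The second page contains posts 3 and 4 even though a row was deleted in between. An offset of 2 against the shortened list would have returned posts 4 and 5 and silently skipped post 3.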

Error handling — everything is 200

This one surprises people moving from REST. A GraphQL response is normally HTTP 200 even when resolvers fail. The body carries both a data field and an errors array, and a single response can hold both: partial data for the fields that resolved and error entries for the ones that did not. This is a consequence of the query shape: one request can touch many independent fields, and it would be wrong to fail the whole response because one leaf threw.

{
  "data": {
    "user": { "name": "Ada", "posts": null }
  },
  "errors": [
    {
      "message": "Forbidden",
      "path": ["user", "posts"],
      "extensions": { "code": "FORBIDDEN" }
    }
  ]
}

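Two things follow. First, your client must read the errors array, not just the status code; a happy 200 can still describe a failure. Second, your monitoring has to follow suit. Dashboards and alerts built on HTTP status codes see nothing but a wall of 200s while resolvers quietly fail, so you have to instrument GraphQL errors explicitly, usually by counting entries in the errors array and grouping by the extensions.code you attach to each. The path field tells you which part of the query tree blew up, which is invaluable when debugging a deep selection. One caveat on nullability: a failed field can only come back as null if the schema allows it. The response above assumes posts is declared nullable; when a non-null field such as posts: [Post!]! fails, the null propagates up to the nearest nullable parent, so the whole user object would be null instead.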
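
As a sketch of that instrumentation, the helper below tallies a response's errors by code. summariseErrors is a hypothetical helper and the UNKNOWN fallback label is an assumption of this example; in practice the counts would be sent to whatever metrics system you already use.

```javascript
// Sketch only: summariseErrors() is a hypothetical helper, and the
// "UNKNOWN" fallback label is an assumption, not part of any spec.
function summariseErrors(response) {
  const counts = {};
  for (const error of response.errors ?? []) {
    const code = error.extensions?.code ?? 'UNKNOWN';
    if (!counts[code]) counts[code] = { count: 0, paths: [] };
    counts[code].count += 1;
    counts[code].paths.push((error.path ?? []).join('.'));
  }
  return counts;
}

// The example response from above: HTTP 200, partial data, one error.
const response = {
  data: { user: { name: 'Ada', posts: null } },
  errors: [
    {
      message: 'Forbidden',
      path: ['user', 'posts'],
      extensions: { code: 'FORBIDDEN' },
    },
  ],
};

const summary = summariseErrors(response);

// Success is decided by the errors array, not by the status code.
const failed = (response.errors ?? []).length > 0;
```

Grouping by code keeps the metric's cardinality low, while the joined path (user.posts here) points at the part of the query tree that failed.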

Security — the schema is an attack surface

Letting clients compose their own queries is the feature, and it is also the risk. A few defences are close to mandatory before a GraphQL API faces the public internet.

  • Depth limiting. Because the graph has cycles (User → posts → author → posts → …), a client can author an arbitrarily deep query that forces enormous work from one small request. Cap the nesting depth so a query cannot nest without bound.
  • Complexity / cost analysis. Depth alone is not enough; a shallow query can still be expensive if it fans out widely. Assign a cost weight to each field, sum the cost of a query before running it, and reject anything over a budget. A list field's cost should scale with the first argument it is given.
  • Persisted-query allow-listing. Restrict production to a known set of queries by hash so an attacker cannot run an arbitrary, expensive one that was never deployed.
  • Introspection control. GraphQL servers can describe their own schema via introspection, which is wonderful for tooling and a gift to an attacker mapping your API. Many teams disable introspection in production while keeping it on in development.
  • Field-level authorisation. Authorisation belongs in resolvers (or a layer they call), not at the endpoint, because one request reads many fields with different access rules. The context object carries the authenticated principal that each resolver checks.
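
The first two defences can be sketched as checks that run on the parsed query before any resolver executes. The selection format, the field weights, and both limits below are hypothetical values chosen for illustration, and a real implementation would also apply the schema's default page size where this one assumes 1.

```javascript
// Sketch only: the selection format, the weights, and both limits are
// hypothetical values chosen for illustration.
const FIELD_COST = { author: 2 }; // fields not listed cost 1

// Depth: the longest chain of nested selections.
function depthOf(selections = []) {
  if (selections.length === 0) return 0;
  return 1 + Math.max(...selections.map((s) => depthOf(s.fields)));
}

// Cost: each field's weight, plus the cost of its children multiplied by
// the page size when the field is a list taking a `first` argument.
function costOf(selections = []) {
  return selections.reduce((sum, s) => {
    const weight = FIELD_COST[s.name] ?? 1;
    const multiplier = s.args?.first ?? 1;
    return sum + weight + multiplier * costOf(s.fields);
  }, 0);
}

function check(selections, { maxDepth, maxCost }) {
  const problems = [];
  const depth = depthOf(selections);
  const cost = costOf(selections);
  if (depth > maxDepth) problems.push(`depth ${depth} exceeds ${maxDepth}`);
  if (cost > maxCost) problems.push(`cost ${cost} exceeds ${maxCost}`);
  return problems;
}

// { user { posts(first: 100) { author { posts(first: 100) { title } } } } }
const abusive = [
  { name: 'user', fields: [
    { name: 'posts', args: { first: 100 }, fields: [
      { name: 'author', fields: [
        { name: 'posts', args: { first: 100 }, fields: [{ name: 'title' }] },
      ] },
    ] },
  ] },
];

// { user { name } }
const modest = [{ name: 'user', fields: [{ name: 'name' }] }];

const limits = { maxDepth: 4, maxCost: 1000 };
const abusiveProblems = check(abusive, limits);
const modestProblems = check(modest, limits);
```

The abusive query is only five levels deep, yet the two nested page sizes multiply, which is why a cost budget catches what a depth cap alone would let through at a slightly higher limit.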

The throughline is that GraphQL moves choices the server used to make to the client, so the server has to re-impose limits it used to get implicitly from a fixed set of endpoints.

Federation — one graph, many teams

A single schema owned by every backend team becomes a coordination bottleneck the moment more than one team has to ship to it. Federation splits one supergraph into several independently owned subgraphs, each living in its own service and deploying on its own schedule. A type can be defined in one subgraph and extended in another, so the users service owns User and the posts service adds a posts field to it without the two teams editing the same file.

# users subgraph — owns User
type User @key(fields: "id") {
  id: ID!
  name: String!
}

# posts subgraph — owns Post, extends User
type Post @key(fields: "id") {
  id: ID!
  title: String!
  author: User!          # references the User from the users subgraph
}

extend type User @key(fields: "id") {
  id: ID! @external
  posts: [Post!]!        # adds posts to a type it does not own
}

[Figure: client → gateway (plan + stitch) → users subgraph (User: id, name) and posts subgraph (Post + User.posts). One query in, fanned out to the owning subgraphs, one response back.]
Federation: the gateway reads the composed supergraph, decides which subgraph owns each field, calls each one, and stitches the pieces into a single response.

A gateway sits in front. It holds the composed supergraph, parses each incoming query, plans which subgraph owns each field, calls the subgraphs (often in parallel, resolving references by @key), and stitches the parts into one response. Each team owns its slice and ships independently; the gateway handles the cross-cuts. The cost is operational: you now run a gateway, you have to compose and check subgraph schemas so a change in one does not break the graph, and a slow subgraph can drag a federated query. Federation is worth it at the scale where schema ownership is a genuine organisational problem, and it is overhead below that.
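
The plan-and-stitch step can be illustrated with a toy model. This is not how any real gateway is implemented; real planners work from the composed supergraph and handle nesting, entity lookups, and failures. The ownership map, plan, stitch, and the sample subgraph responses are all hypothetical.

```javascript
// Sketch only: a toy query planner. The ownership map, plan(), and
// stitch() are hypothetical and ignore nesting and error handling.
const ownership = {
  'User.id': 'users',
  'User.name': 'users',
  'User.posts': 'posts',
  'Post.id': 'posts',
  'Post.title': 'posts',
};

// Plan: group the requested fields of one type by the subgraph that owns them.
function plan(typeName, fieldNames) {
  const bySubgraph = {};
  for (const field of fieldNames) {
    const subgraph = ownership[`${typeName}.${field}`];
    if (!subgraph) throw new Error(`No subgraph owns ${typeName}.${field}`);
    if (!bySubgraph[subgraph]) bySubgraph[subgraph] = [];
    bySubgraph[subgraph].push(field);
  }
  return bySubgraph;
}

// Stitch: merge each subgraph's partial object for the same key value.
const stitch = (parts) => Object.assign({}, ...parts);

// { user(id: "42") { name posts { title } } } touches two subgraphs.
const steps = plan('User', ['name', 'posts']);

// Hypothetical subgraph responses for the User whose key is id = "42".
const fromUsers = { id: '42', name: 'Ada' };
const fromPosts = { id: '42', posts: [{ title: 'Hello' }] };

const user = stitch([fromUsers, fromPosts]);
```

Both partial objects carry the same id, which plays the role of the @key: it is what lets the gateway know the two pieces describe the same User before merging them.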

Where GraphQL earns its keep

  • Mobile clients fetching deep object graphs. One round trip for exactly the fields a screen needs beats several trips to several REST endpoints, and that matters most on the slow, high-latency connections phones live on.
  • Many clients against one backend. iOS, Android, web, a TV app, internal dashboards — each picks its own slice of the graph without a backend change. The schema absorbs the variation that would otherwise become a sprawl of bespoke endpoints.
  • Broad public APIs. When you expose hundreds of related resources and want clients to combine them freely (GitHub, Shopify), publishing one typed graph beats writing and maintaining hundreds of REST endpoints.
  • A strong contract across the wire. The schema is enforced, and codegen turns it into typed client SDKs in many languages, so a renamed field is caught at build time rather than in production.

When to reach for something else

GraphQL's complexity is a tax. It is worth paying when the flexibility it buys is worth more than the resolver discipline, caching machinery, and security limits it demands. Often it is not, and a simpler protocol wins.

  • A small or simple API. A handful of resources read in predictable shapes does not need a schema runtime, DataLoader, and a caching strategy. REST over JSON is faster to ship, simpler to consume, and gets HTTP caching for free. The full trade is laid out in REST vs GraphQL.
  • A TypeScript client talking to a TypeScript server. When both ends share a language, tRPC gives you end-to-end type safety with no schema language, no codegen step, and almost no runtime, by inferring types straight from the server's functions. The comparison lives at GraphQL vs tRPC.
  • Service-to-service traffic you control. Between your own services, the flexible client-composed query buys little, and gRPC's strict contract, compact binary payloads, and native streaming usually win.
  • Caching as a load-bearing layer. If a CDN and HTTP caches carry your read traffic, GraphQL's single POST endpoint fights you, and you are rebuilding by hand what REST gave you for nothing.
The honest summary. GraphQL solves a real problem, many clients needing many shapes of related data, and it solves it well. It is not a default. Reach for it when the shape of the problem matches the shape of the tool, and reach for REST, tRPC, or gRPC when it does not.
