

Service Mesh Visualizer: one request at a time.

A service mesh moves networking concerns out of your code and into a sidecar proxy next to every pod. That sidecar wraps egress in mTLS, splits traffic between v1 and v2, enforces retries, and opens a circuit breaker when the backend stops answering, all invisible to the application. Pick a source and a destination below, send a request, watch the hops. Mark a pod as failing and see the breaker open.

[Interactive visualizer. Controls: source, destination, canary % to v2, mTLS, retries, backoff (ms), circuit-breaker threshold. Pod cards, each with an envoy sidecar: web, api, users-v1, users-v2, cache. Readouts: requests, retries, failures, ejections, p50 and p99 latency (ms), circuit-breaker state for web, api, users-v1, users-v2, cache and db, and a request log.]

What you're looking at

Each card is a pod, and the small "envoy" box clipped to it is that pod's sidecar proxy — the thing that actually carries the request. When you send one, the cards and sidecars light up in the order the bytes travel: source sidecar, the wire (dashed when mTLS is on), destination sidecar, then the service itself. The stat row tracks requests, retries, failures, ejections, and live p50/p99 latency; the grid under it shows each backend's circuit-breaker state.

Send a few requests to users and watch the canary split route most to v1 and a slice to v2. Then click a pod's "healthy" tag to mark it failing and keep sending. The first failures trigger retries with backoff, latency climbs, and once consecutive failures cross the circuit threshold the breaker flips to OPEN and the next requests fail fast without even reaching the backend. The surprise is how much of this — encryption, routing, retries, ejection — happens in the sidecar while the application code stays a plain localhost call.

What a service mesh actually is

A service mesh has two pieces. The data plane is a small L7 proxy injected into every pod as a sidecar container. All traffic in and out of the pod hits the proxy first. The control plane is a separate set of pods that hands configuration to the proxies — routing rules, mTLS certificates, retry policies, telemetry endpoints.

The application code never knows the proxy is there. It dials users.svc.cluster.local as usual; iptables rules in the pod's network namespace (or eBPF-based redirection in some newer setups) transparently redirect the outgoing connection to the sidecar's listener on localhost. The sidecar takes the connection, applies policy, opens an mTLS tunnel to the destination sidecar, and forwards the bytes. The proxy is Envoy for Istio and AWS App Mesh; Linkerd uses its own Rust proxy called linkerd2-proxy.

Istio vs Linkerd vs AWS App Mesh

	Istio	Linkerd 2.x	AWS App Mesh
Control plane	istiod (Go)	destination, identity, proxy-injector	AWS-managed
Data plane	Envoy (C++)	linkerd2-proxy (Rust)	Envoy (C++)
mTLS default	Auto mTLS between sidecars, but PERMISSIVE (plaintext still accepted); PeerAuthentication STRICT enforces	On by default for all meshed traffic	Off by default; per-mesh TLS config
Latency added per hop	2–5ms (Envoy)	~1ms (Rust proxy)	2–5ms (Envoy)
Memory per sidecar	~70–150 MB	~10–25 MB	~70–150 MB
Install complexity	High; many CRDs; istioctl	Low; `linkerd install` + check	Medium; AWS-only; needs IAM
Best for	Large clusters, advanced routing, ext-auth	Smaller teams, operability, latency-sensitive paths	Teams already deep in AWS, ECS or EKS

Pick by what you actually need. Linkerd is the boring-and-correct choice for most teams; Istio is the right choice when you need WASM filters, external authorisation, or VirtualService routing more flexible than what Linkerd's TrafficSplit gives you.

The sidecar pattern

Each pod gets its own proxy. The alternative — a shared mesh proxy per node, or a global mesh gateway — exists, but the sidecar wins on isolation. One pod's misconfigured retry policy can't take down its neighbour's traffic, because they have separate proxies. The downside is that you pay the proxy's memory and CPU cost N times for N pods.

What the sidecar buys you: uniform observability (every hop emits metrics in the same format), mTLS without application changes (the app keeps speaking HTTP/1 or HTTP/2 plaintext to localhost), out-of-band traffic policy (you change a CRD; the proxies re-resolve; no redeploy), and a clean place to attach external auth, rate limiting, and circuit breaking.

The "sidecarless" mesh has been a topic since Istio announced ambient mode in 2022, alongside Cilium's eBPF-based mesh. Ambient mode moves the L4 work into a node-level proxy (ztunnel) and the L7 work into waypoint proxies, typically deployed per namespace. Resource savings are real (one ztunnel per node instead of one Envoy per pod), but the operational model is still maturing.

Traffic shifting and canary

The mesh's declarative routing API is where the day-to-day value lives. In Istio, a VirtualService defines routes and weights; in Linkerd, an SMI TrafficSplit does the equivalent. You declare "90% of users.svc traffic goes to subset v1, 10% to v2," and the proxies pick up the change. No load balancer reconfig, no DNS change.

Subsets are selected by label. The DestinationRule in Istio binds labels like version=v1 to a subset name; the VirtualService then routes by subset. Canary is just a weight: start at 100/0 (v1/v2), ramp toward 0/100 over hours or days, and roll back if metrics regress. Most teams pair this with a progressive-delivery tool like Flagger or Argo Rollouts that watches Prometheus and bumps the weight automatically while the SLO is met.
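To make the weight mechanics concrete, here is a minimal Python sketch of a weighted subset pick and a fixed-step ramp. It is a simulation only, not how Envoy or linkerd2-proxy implement it; the function names pick_subset and ramp and the 25-point step are assumptions made for illustration.

```python
import random
from collections import Counter

def pick_subset(weights, rng):
    """Pick one subset name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def ramp(step_pct):
    """Yield v1/v2 weight pairs from 100/0 to 0/100 in fixed steps."""
    for v2 in range(0, 101, step_pct):
        yield {"v1": 100 - v2, "v2": v2}

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = Counter(pick_subset({"v1": 90, "v2": 10}, rng) for _ in range(10_000))
print(counts)            # roughly nine v1 picks for every v2 pick
print(list(ramp(25)))    # the weight pairs a rollout tool would step through
```

Each request is routed independently, so a 10% weight means about 10% of requests over time, not every tenth request.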

mTLS by default

Linkerd turns on mTLS for all meshed traffic the moment you inject sidecars. Istio's sidecars also use mTLS with each other automatically, but the default PERMISSIVE mode still accepts plaintext; setting PeerAuthentication to STRICT rejects it, and you have to set that yourself. Either way, a connection between two sidecars is TLS-wrapped with both ends presenting certificates.

Identity comes from SPIFFE. Each pod gets a SPIFFE ID like spiffe://cluster.local/ns/default/sa/users-sa, where the trailing path encodes namespace and service account. The sidecar fetches a short-lived X.509 certificate (24h or less) from the control plane's identity service; the certificate carries the SPIFFE ID in its SAN field. Rotation is automatic, typically every 24 hours or sooner.
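The ID layout is easy to see by pulling it apart. The sketch below parses the example ID with Python's standard URL parser; the WorkloadIdentity type and parse_spiffe_id function are hypothetical names, and the check only covers the ns/.../sa/... path shape shown above.

```python
from typing import NamedTuple
from urllib.parse import urlparse

class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str

def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split a Kubernetes-style SPIFFE ID into its three parts."""
    parsed = urlparse(spiffe_id)
    parts = parsed.path.strip("/").split("/")
    if (parsed.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "ns" or parts[2] != "sa"):
        raise ValueError(f"not a ns/sa SPIFFE ID: {spiffe_id!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])

identity = parse_spiffe_id("spiffe://cluster.local/ns/default/sa/users-sa")
print(identity.trust_domain, identity.namespace, identity.service_account)
```

An authorization policy that says "only the web service account may call users" is, in effect, a comparison against these fields on the peer certificate.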

What's encrypted: pod-to-pod traffic through sidecars. What isn't: ingress traffic from outside the cluster (client TLS is terminated at the ingress gateway, and the hop after it is only protected if the gateway itself is part of the mesh), and anything bypassing the sidecar (init containers running before the sidecar is up; hostNetwork: true pods).

Retry policy and timeouts

Retries belong in the proxy, not the application, because the proxy knows the network shape and the application's authors won't get retry-budget arithmetic right. Istio's VirtualService takes retries.attempts, retries.perTryTimeout, and retries.retryOn (5xx, gateway-error, reset, connect-failure). Linkerd's retry budget is configured on the destination service profile and defaults to 20% of base traffic.

The trap: naive retries multiply load on a slow backend. The fix is a retry budget: a cap on the ratio of retries to original requests, typically 10–20%. Linkerd computes this per service; Istio's retries are bounded only by attempt count, so you should pair them with a circuit breaker. Retry connection errors freely, since the request never reached the backend. Retry 503s and other 5xx responses only for idempotent operations, because a non-idempotent request may already have taken effect.
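A retry budget is small enough to sketch. The version below is a simplified illustration, not Linkerd's implementation: it counts over the whole run instead of a sliding time window, and the RetryBudget class and its method names are assumptions.

```python
class RetryBudget:
    """Allow a retry only while retries stay within a percentage of requests."""

    def __init__(self, ratio_pct=20):
        self.ratio_pct = ratio_pct
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self):
        """Return True and spend budget if one more retry fits under the cap."""
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.ratio_pct * self.requests:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio_pct=20)
granted = 0
for _ in range(100):          # 100 original requests, every one of them fails
    budget.record_request()
    if budget.try_retry():
        granted += 1
print(granted, "retries granted for", budget.requests, "failed requests")
```

Even when every request fails, the backend sees at most 20% extra load from retries, which is the point: a per-request attempt count alone would have let the extra load grow with the failure rate.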

Circuit breaking and outlier detection

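Envoy's outlier detection ejects a backend host from the load-balancer pool when it accumulates errors. Knobs: consecutive_5xx (5 by default), consecutive_gateway_failure, interval (10s between checks), base_ejection_time (30s), max_ejection_percent (10% cap so you don't eject the whole pool). Istio's DestinationRule exposes the same settings under outlierDetection with camelCase names such as consecutive5xxErrors and baseEjectionTime.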
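The interaction between the consecutive-error threshold and the ejection cap is the part people tend to miss, so here is a rough Python model of it. It is not Envoy's code: the OutlierDetector class is hypothetical, it ignores the check interval and the ejection timer, and it assumes at least one host may always be ejected.

```python
class OutlierDetector:
    """Eject hosts after consecutive 5xx responses, up to a pool-wide cap."""

    def __init__(self, hosts, consecutive_5xx=5, max_ejection_percent=10):
        self.streak = {h: 0 for h in hosts}
        self.ejected = set()
        self.threshold = consecutive_5xx
        # Assumption: allow at least one ejection, even in a small pool.
        self.max_ejected = max(1, len(hosts) * max_ejection_percent // 100)

    def record(self, host, status):
        """Update the host's failure streak; return True if it was just ejected."""
        if status < 500:
            self.streak[host] = 0
            return False
        self.streak[host] += 1
        if (self.streak[host] >= self.threshold
                and host not in self.ejected
                and len(self.ejected) < self.max_ejected):
            self.ejected.add(host)
            return True
        return False

    def healthy_hosts(self):
        return [h for h in self.streak if h not in self.ejected]

pool = OutlierDetector([f"users-{i}" for i in range(10)])
for _ in range(5):
    pool.record("users-0", 503)
for _ in range(5):
    pool.record("users-1", 503)   # over threshold, but the 10% cap is used up
print(sorted(pool.ejected), len(pool.healthy_hosts()))
```

With ten hosts and a 10% cap, the second failing host stays in rotation. That is deliberate: when many hosts fail at once the problem is usually upstream of them, and ejecting the whole pool would turn a partial outage into a total one.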

The breaker has three states. Closed: requests flow. Open: requests fail fast without touching the backend. Half-open: after the ejection timer, one probe request goes through; on success the breaker closes, on failure it opens again with a longer timeout (Envoy scales the ejection time with the number of times a host has been ejected). Linkerd doesn't have explicit outlier detection in the same shape; it relies on load-aware load balancing (EWMA) to steer traffic away from slow backends without binary ejection.
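The three states map onto a small state machine. The sketch below is a generic breaker under simple assumptions, not the Envoy or Istio implementation: the CircuitBreaker class is hypothetical, time is passed in explicitly so the example is deterministic, and the open timeout stays fixed instead of growing on repeated failures.

```python
class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, threshold=5, open_seconds=30.0):
        self.threshold = threshold
        self.open_seconds = open_seconds
        self.state = self.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.state == self.OPEN and now - self.opened_at >= self.open_seconds:
            self.state = self.HALF_OPEN   # timer expired: let one probe through
            return True
        return self.state == self.CLOSED  # half-open admits no further requests

    def record(self, ok, now):
        """Feed back the outcome of a request that was allowed."""
        if ok:
            self.state, self.failures = self.CLOSED, 0
            return
        self.failures += 1
        if self.state == self.HALF_OPEN or self.failures >= self.threshold:
            self.state, self.opened_at = self.OPEN, now

cb = CircuitBreaker(threshold=3, open_seconds=30.0)
for t in (0.0, 1.0, 2.0):              # three consecutive failures
    if cb.allow(t):
        cb.record(ok=False, now=t)
print(cb.state)                         # breaker has opened
print(cb.allow(10.0))                   # still inside the open window: fail fast
print(cb.allow(32.0), cb.state)         # window elapsed: one probe is admitted
```

Failing fast in the open state is what protects the caller: it gets an immediate error instead of waiting out a timeout against a backend that is already struggling.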

Observability layer

This is where many mesh adopters get the most value. The sidecar emits three classes of telemetry uniformly: metrics (Prometheus scrape on every proxy), distributed traces (B3 or W3C trace context propagated automatically), and access logs (JSON, configurable schema).

You get a unified "golden signals" dashboard for every service without instrumenting any of them. Linkerd's Viz extension and Istio's Kiali both render a live service graph driven by mesh telemetry. The cost is data volume — at 1000 RPS per pod with default config, expect several MB/s per pod of telemetry, and a Prometheus / Grafana setup that needs serious tuning to keep up.

The cost of a mesh

Per-pod overhead. An Envoy sidecar idles at 70–150 MB resident memory and a few millicores of CPU; under load it can climb to 200+ MB and 100m+ CPU. Linkerd's Rust proxy is meaningfully lighter, often 10–25 MB. Multiply by pod count. A 500-pod cluster pays meaningful overhead even on Linkerd.
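As a quick worked example of "multiply by pod count", the snippet below applies the per-sidecar ranges quoted above to the 500-pod cluster. The helper name fleet_memory_gb is an assumption, and the figures are idle memory only.

```python
def fleet_memory_gb(pods, per_sidecar_mb):
    """Total sidecar memory in GB (1 GB = 1000 MB) for a (low, high) MB range."""
    low, high = per_sidecar_mb
    return (pods * low / 1000, pods * high / 1000)

print("linkerd2-proxy:", fleet_memory_gb(500, (10, 25)))    # (5.0, 12.5)
print("envoy, idle:   ", fleet_memory_gb(500, (70, 150)))   # (35.0, 75.0)
```

So the same cluster pays roughly 5–12.5 GB of memory for Linkerd sidecars and 35–75 GB for idle Envoy sidecars, before any load-driven growth.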

Per-hop latency. Linkerd adds about 1ms p50, 3–5ms p99 per hop. Istio with Envoy adds 2–5ms p50, 8–15ms p99. App Mesh sits with Istio because it's the same Envoy data plane. For a 5-hop request, that's a real budget; on hot internal paths (cache lookups, auth checks) it can dominate.
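To put numbers on the 5-hop case, the sketch below multiplies the per-hop ranges quoted above by the hop count. The table and function names are illustrative assumptions. Summing medians is a reasonable first estimate; summing per-hop p99 values is a pessimistic bound, since a request rarely hits the tail on every hop.

```python
# Per-hop added latency in ms as (low, high), taken from the figures above.
PER_HOP_MS = {
    "linkerd": {"p50": (1, 1), "p99": (3, 5)},
    "istio":   {"p50": (2, 5), "p99": (8, 15)},
}

def added_latency_ms(mesh, hops, quantile):
    """Naive estimate: per-hop range multiplied by the number of hops."""
    low, high = PER_HOP_MS[mesh][quantile]
    return (hops * low, hops * high)

for mesh in PER_HOP_MS:
    print(mesh, "p50:", added_latency_ms(mesh, 5, "p50"),
          "p99 bound:", added_latency_ms(mesh, 5, "p99"))
```

For five hops that is about 5 ms at the median on Linkerd and 10–25 ms on Istio, which is already larger than a typical cache lookup.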

Control plane. istiod, the Linkerd control plane components, or App Mesh's hidden management plane — all consume CPU and memory proportional to the number of meshed services and the change rate of CRDs.

When not to use a service mesh

Single-service apps. If your "microservices" diagram is a monolith plus a Redis, you don't need a mesh. You need a config-management discipline.
Latency-sensitive trading systems. A 3–5ms tax per hop is fatal for sub-millisecond paths. Use direct gRPC with manual mTLS and skip the mesh.
Small teams without operations capacity. A mesh is a distributed system. It fails in interesting ways. If you're a team of five running a SaaS, the mesh's failure modes can eat more time than the problems it solves.
Kubernetes-less environments. Service mesh is achievable outside Kubernetes (Consul Connect, or extending a Linkerd mesh to VMs), but the experience is rough. If you're not on k8s, look at alternatives first.
When mTLS is the only goal. Cilium's transparent encryption, WireGuard at the node level, or a simple TLS-terminating sidecar proxy may give you the encryption story without the operational weight.

Production gotchas

Startup ordering. If the application container starts before the sidecar is ready, outbound calls during init fail because iptables redirects to a not-yet-listening port. Istio added holdApplicationUntilProxyStarts; Linkerd uses a similar mechanism. Most teams hit this on the first job/cron deploy.
Ingress vs mesh boundary. The mesh does mTLS pod-to-pod, but ingress traffic is TLS-terminated at the gateway. Anything between client and gateway is the responsibility of your edge (ALB, NLB, CloudFront), and anything after the gateway is mesh territory. Don't conflate the two.
Observability data volume. Default Istio telemetry emits Prometheus metrics labelled by source and destination workload, service, version, and response code, and cardinality multiplies further if you add custom dimensions such as host or path. Cardinality explosion brings Prometheus to its knees in production clusters. Trim labels via the Telemetry API early.
DNS interception breaking the host. Istio's DNS proxying feature (sometimes called smart DNS) can capture DNS resolution in the sidecar and answer queries for service VIPs itself. When this misbehaves (and it has, historically), pods get bizarre DNS errors for hosts that work fine from the node.
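External services without ServiceEntry. Calls to a non-mesh external host (e.g., a public API) still pass through the sidecar, but the mesh has no registry entry for them, so only the default outbound policy applies: the traffic is either passed through without routing rules, retries, or useful telemetry, or blocked outright when the outbound policy is REGISTRY_ONLY. Add a ServiceEntry for every external dependency or you'll get surprising routing behaviour.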
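Resource limits on the sidecar. Default Istio sidecar resource requests are low (100m CPU, 128 MB memory), and the limits you end up with may not leave much headroom. Under burst load the sidecar can be throttled or OOM-killed before the app is. Tune the sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory annotations (and their proxyCPULimit and proxyMemoryLimit counterparts) early.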
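The cardinality gotcha above is easiest to appreciate as arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The label names and counts below are made-up illustrations, not Istio defaults, and request_path stands in for any high-cardinality custom dimension.

```python
from math import prod

def series_count(label_values):
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(label_values.values())

# Hypothetical label cardinalities for a mid-sized cluster.
full = {
    "source_workload": 50,
    "destination_workload": 50,
    "response_code": 8,
    "request_path": 200,   # assumed custom dimension
}
trimmed = {k: v for k, v in full.items() if k != "request_path"}

print(series_count(full))      # 4000000
print(series_count(trimmed))   # 20000
```

Dropping a single high-cardinality label cuts the bound by a factor of 200 in this example, which is why trimming labels early pays off more than scaling Prometheus later.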

An anatomy of one request

Tracing a single request through the mesh, step by step:

# app code in web pod calls users.default.svc.cluster.local
curl http://users.default.svc.cluster.local/users/42

# 1. DNS resolves to ClusterIP 10.96.45.7
# 2. iptables in the web pod's network namespace redirects to 127.0.0.1:15001
#    (the sidecar's outbound port)
# 3. envoy(web) matches the destination against its config:
#      - cluster: users.default.svc.cluster.local
#      - TLS context: SPIFFE-based mTLS
#      - load balancer: ROUND_ROBIN over endpoints
#      - subset selection: 10% v2, 90% v1 per VirtualService
# 4. envoy(web) opens mTLS to envoy(users-v1) at 10.244.2.18:15006
# 5. envoy(users-v1) terminates mTLS, forwards plaintext to 127.0.0.1:8080
# 6. the users app responds 200
# 7. response flows back the same path, sidecars emit metrics and trace span

Two proxies per request. Two policy evaluations per request. Two sets of metrics emitted per request. The mesh is doing real work on every hop.
