Sockets
A socket is the operating system's abstraction for network I/O. To the kernel it is a data
structure that owns two buffers and some protocol state; to your program it is a file
descriptor you can read and write like any other. The same handful of system
calls (socket, bind, listen, accept, connect, send, recv, close) has been
the API since 4.2BSD in 1983, and every higher-level layer (Go’s net package,
Node’s net module, your HTTP framework, a load balancer) calls down to it. This
page works through what the socket is, the exact syscall sequence on each side, where data sits
between your code and the wire, and why one blocking call per connection eventually pushes you
toward an event loop.
What a socket actually is
A socket is a kernel data structure that the program refers to by a small integer — a file descriptor. It holds a few pieces of state: which protocol family it speaks (IPv4, IPv6, Unix), which type it is (stream, datagram, raw), the local address it’s bound to, the remote address it’s connected to, and two buffers — one for data going out, one for data coming in.
In Unix everything is a file. A socket is a file descriptor that, instead of being
backed by bytes on disk, is backed by bytes the kernel reads from or writes to the
network. read and write work on a socket FD just like they
do on a regular file FD; send and recv are the same thing
with extra flags.
A TCP socket is identified by its five-tuple:
(protocol, local IP, local port, remote IP, remote port). The kernel uses
this to decide which socket each incoming packet belongs to. Two TCP sockets cannot
share a five-tuple; everything else is fair game.
The two buffers are the part that surprises people, so it is worth being precise. When
your code calls send, the bytes do not go onto the wire. They are copied into
the socket's send buffer inside the kernel, and send returns. The kernel's
TCP code drains that buffer onto the network in its own time, according to the window the
receiver advertised and what congestion control allows. On the other side, packets the NIC
receives are reassembled into the socket's receive buffer, and recv copies
out of that buffer into your memory. Your program and the network are decoupled by these
two queues; that decoupling is what makes a socket feel like a file instead of a wire.
Figure: send fills one buffer, recv drains the other; TCP moves bytes between the buffers and the wire on its own schedule.
This is also why a socket sits squarely in the operating system's I/O machinery rather than off to the side. The same blocking and readiness concepts that apply to other file descriptors apply here too; the OS I/O internals page covers the file-descriptor table, the difference between an open file description and a descriptor, and how the kernel decides a descriptor is "ready" to read or write. A socket is one more kind of descriptor that plugs into all of it.
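The decoupling described above can be observed directly. The following is a minimal sketch, assuming a Unix-like system; it uses `socket.socketpair()` as a stand-in for a TCP connection so it needs no network, and the variable names are illustrative only.

```python
import socket

# A connected pair of stream sockets. This is an assumption made for
# convenience: socketpair() stands in for a real TCP connection.
writer, reader = socket.socketpair()

# Sizes of the two kernel buffers (the values are OS-dependent).
send_buffer_size = writer.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
recv_buffer_size = reader.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

# send() returns once the bytes are copied into the kernel. Nobody has
# called recv() yet, so the data is sitting in a kernel buffer.
accepted = writer.send(b"hello")

# Later, recv() copies the bytes out of the kernel into our memory.
data = reader.recv(16)

writer.close()
reader.close()
```

The point of the sketch is the ordering: `send` completes before any `recv` has been issued, which only works because the kernel holds the bytes in between.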
Socket types
You pick two things when you call socket(): the address family and the
socket type.
| Address family | For |
|---|---|
| AF_INET | IPv4 |
| AF_INET6 | IPv6 (and dual-stack IPv4 with mapping) |
| AF_UNIX | Local IPC over a filesystem path |
| AF_PACKET | Raw Ethernet frames (Linux); needs root |

| Socket type | Maps to | Semantics |
|---|---|---|
| SOCK_STREAM | TCP | Reliable, ordered, byte-stream |
| SOCK_DGRAM | UDP | Best-effort, message-oriented |
| SOCK_RAW | raw IP | You write the IP header yourself; needs root |
| SOCK_SEQPACKET | SCTP, Unix | Reliable, message-boundaries preserved |
Almost every program you’ll write uses AF_INET or
AF_INET6 with SOCK_STREAM (TCP) or SOCK_DGRAM
(UDP). The other combinations exist for narrower jobs — packet capture, custom
protocols, IPC.
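As a small illustration of the two choices, this sketch creates the common combinations through Python's `socket` module and reads the family and type back from each object. It assumes a Unix-like system (for `AF_UNIX`) and Python 3.7 or later; the variable names are made up for the example.

```python
import socket

tcp_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # TCP over IPv4
udp_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)    # UDP over IPv4
unix_socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)  # local IPC

# Each socket object remembers the family and type it was created with.
kinds = [(s.family.name, s.type.name)
         for s in (tcp_socket, udp_socket, unix_socket)]

for s in (tcp_socket, udp_socket, unix_socket):
    s.close()
```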
The server lifecycle
Five system calls, in this order:
int fd = socket(AF_INET, SOCK_STREAM, 0);            // 1. create
struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port   = htons(8080),
    .sin_addr   = { INADDR_ANY },                    // 0.0.0.0 = all interfaces
};
bind(fd, (struct sockaddr*)&addr, sizeof addr);      // 2. claim address
listen(fd, 128);                                     // 3. mark as accepting,
                                                     //    backlog = 128
while (1) {
    int client = accept(fd, NULL, NULL);             // 4. dequeue a connection
    handle(client);                                  // 5. read/write on it
    close(client);
}
Step by step:
- socket allocates the kernel data structure and gives you back an fd. The socket is "unbound": it has no address.
- bind attaches a local address. 0.0.0.0 means all interfaces; 127.0.0.1 binds to localhost only. Port 0 lets the kernel pick a free port, which is handy for tests.
- listen tells the kernel "incoming connections welcome", with a backlog of how many completed connections to queue. A backlog of 128 is fine for most workloads; high-throughput servers raise it.
- accept blocks until a connection completes the TCP handshake, then returns a new fd for that connection. The original fd keeps accepting on its own.
- close on the per-connection fd starts the TCP close sequence. close on the listening fd stops accepting new connections.
The client lifecycle
Even simpler. Three calls:
int fd = socket(AF_INET, SOCK_STREAM, 0);                 // 1. create
struct sockaddr_in remote = {
    .sin_family = AF_INET,
    .sin_port   = htons(80),                              // plain HTTP
    .sin_addr   = { inet_addr("93.184.216.34") },         // struct in_addr wraps s_addr
};
connect(fd, (struct sockaddr*)&remote, sizeof remote);    // 2. handshake
send(fd, "GET / HTTP/1.0\r\n\r\n", 18, 0);                // 3. write
char buf[4096];
ssize_t n = recv(fd, buf, sizeof buf, 0);                 //    read
close(fd);
connect is the active half of the TCP three-way handshake. By the time it
returns, the client side is ESTABLISHED and the path is open; the server side reaches
ESTABLISHED as soon as the client's final ACK arrives. If the
server isn’t listening or the network is broken, you get ECONNREFUSED,
EHOSTUNREACH, or ETIMEDOUT depending on the failure mode.
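The refused case is easy to reproduce locally. This is a sketch under stated assumptions: it needs a working loopback interface, and it relies on the common behaviour that a port which is bound but not listening answers a connection attempt with a reset. The names `placeholder` and `outcome` are illustrative.

```python
import errno
import socket

# Reserve a loopback port that nothing listens on: bound, but no listen().
# Keeping it bound means no other socket can take the port meanwhile.
placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
placeholder.bind(("127.0.0.1", 0))
closed_port = placeholder.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(5)  # upper bound so the sketch cannot hang
try:
    client.connect(("127.0.0.1", closed_port))
    outcome = "connected"
except OSError as exc:
    # Map the numeric errno back to its symbolic name.
    outcome = errno.errorcode.get(exc.errno, type(exc).__name__)
finally:
    client.close()
    placeholder.close()
```

Python surfaces the same errno values the C API returns, wrapped in `OSError` subclasses such as `ConnectionRefusedError`.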
The two sequences, side by side
It helps to see both sides as one picture. The server sets up a listening socket once and
then loops on accept. The client makes one socket and calls
connect. The three-way handshake happens between connect on the
client and accept on the server; only after it completes does either side have a
connected socket to read and write.
Figure: the three-way handshake runs between the client's connect and the server's accept. The listening socket never carries data; the connected socket accept hands back does.
Listening sockets vs connected sockets
This is the distinction that makes the rest of socket programming click, and it is the one
beginners most often blur. There are two kinds of TCP socket, and they do different jobs. The
socket you create, bind, and listen on is a listening
socket. It never carries application data. Its only job is to be the rendezvous
point: it has a local address but no remote peer, so its identity is just
(protocol, local IP, local port) with the remote half left open. Every fresh
connection that arrives on that port is matched to this one socket.
Each call to accept hands you a different socket: a connected
socket, with a full identity. Its 4-tuple is
(local IP, local port, remote IP, remote port) — the same local port as the
listener, plus the specific client's address and port. That 4-tuple is what the kernel uses
to demultiplex arriving packets: when a segment shows up, the kernel looks for an established
socket whose 4-tuple matches all four fields, and only if none matches does it fall back to a
listening socket with that local port. This is why thousands of clients can all connect to
your server on port 443 at once. They share the local port, but each connection has a
distinct remote IP-and-port, so each gets its own connected socket.
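The shared local port and distinct remote endpoints can be checked from user space. A minimal sketch, assuming a working loopback interface; the listener binds port 0 so the kernel picks a free port, and all names are illustrative.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick
listener.listen(8)
listener.settimeout(5)            # keep accept() from hanging the sketch
host, port = listener.getsockname()

# Two clients connect to the same listening port.
clients = [socket.create_connection((host, port), timeout=5) for _ in range(2)]
connected = [listener.accept()[0] for _ in clients]

# Every connected socket shares the listener's local port...
local_ports = {conn.getsockname()[1] for conn in connected}
# ...but each has its own remote (IP, port), hence its own 4-tuple.
remote_addrs = {conn.getpeername() for conn in connected}
# accept() returned new descriptors; the listener keeps its own.
descriptors = {listener.fileno()} | {conn.fileno() for conn in connected}

for sock in clients + connected + [listener]:
    sock.close()
```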
The listening socket also explains the backlog argument to
listen. The kernel keeps two queues behind a listening socket: a SYN queue of
half-open connections still in the handshake, and an accept queue of fully established
connections waiting for your code to call accept. The backlog sizes the accept
queue. If your program is slow to accept and that queue fills, new completed connections are
dropped, and the client sees a connection that hangs or resets. A backlog of 128 is fine for
most servers; one that does almost no work per connection and accepts in a tight loop can go
higher. The number is a buffer against bursts, not a connection limit.
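One way to see the accept queue at work is to connect several clients before the server calls `accept` at all. The sketch below assumes a working loopback interface and a backlog comfortably larger than the number of clients; it does not try to overflow the queue, since that behaviour varies by OS.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
listener.settimeout(5)
address = listener.getsockname()

# connect() succeeds for each client although accept() has not run yet:
# the kernel finishes the handshake and parks the connection in the queue.
clients = [socket.create_connection(address, timeout=5) for _ in range(3)]
clients[0].sendall(b"early")      # data can even arrive before accept()
early_peer = clients[0].getsockname()

# Now drain the accept queue, indexing each connection by its peer address.
by_peer = {}
for _ in clients:
    conn, peer = listener.accept()
    by_peer[peer] = conn

first_bytes = by_peer[early_peer].recv(16)

for sock in clients + list(by_peer.values()) + [listener]:
    sock.close()
```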
A first server in Python
The same lifecycle in Python looks almost identical. The standard library wraps the kernel calls in a friendlier shape, but the order is the same.
# echo server: read one chunk, send it back
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 8080))
s.listen(128)
while True:
    conn, addr = s.accept()
    print(f"connection from {addr}")
    data = conn.recv(4096)
    conn.sendall(data)  # sendall keeps sending until every byte is queued
    conn.close()
Run it. From another terminal: nc localhost 8080, type something, hit
enter. You’ve just exchanged bytes through the TCP/IP stack.
send and recv don't return what you expect
This is the most common beginner bug. send(fd, buf, n) doesn’t
send n bytes — it sends up to n bytes and returns how many
it actually sent. recv(fd, buf, n) reads up to n bytes
and returns how many actually arrived. The kernel buffers, the application buffers,
the network is bursty, so partial returns are normal.
You always loop:
# send all bytes
sent = 0
while sent < len(buf):
    n = sock.send(buf[sent:])
    if n == 0:
        raise BrokenPipeError()
    sent += n

# recv exactly N bytes
def recv_n(sock, n):
    chunks = []
    received = 0
    while received < n:
        chunk = sock.recv(n - received)
        if not chunk:
            raise ConnectionError("peer closed early")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)
Python’s sock.sendall(buf) wraps the send loop. There’s no
equivalent recvall because TCP is a stream: there’s no inherent
"end of message". Your protocol on top has to define one (length prefix, delimiter,
fixed-size frame, length + checksum + payload, etc.).
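As one possible way to define message boundaries, here is a sketch of length-prefix framing. The 4-byte big-endian header and the helper names (`send_message`, `recv_message`, `recv_exactly`) are assumptions for illustration, not a standard wire format; the demo runs over `socket.socketpair()` so it needs no network.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed format: 4-byte big-endian length

def recv_exactly(sock, n):
    """Loop on recv until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def send_message(sock, payload):
    """Write the length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_message(sock):
    """Read the header, then exactly as many bytes as it announces."""
    (length,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, length)

left, right = socket.socketpair()
send_message(left, b"first")
send_message(left, b"second message")
received = [recv_message(right), recv_message(right)]
left.close()
right.close()
```

Both messages may sit back to back in the receive buffer; the header is what lets the reader split them correctly regardless of how `recv` chunks the stream.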
UDP is different: one sendto = one datagram on the wire =
one recvfrom on the other side. There’s no "partial" UDP recv:
either you get the whole datagram or you don’t. The buffer you pass had better
be big enough; anything past it is silently truncated.
Blocking vs non-blocking
By default, recv, accept, and connect block
the calling thread until something happens. recv blocks until the receive
buffer has bytes or the peer closes; accept blocks until a connection sits in
the accept queue; connect blocks until the handshake finishes or fails. The
thread is parked the whole time, doing nothing, holding its stack.
With one connection per thread that is fine. The trouble is the cost of a thread. Each one carries a stack measured in megabytes and a scheduling slot, so a few thousand idle connections turn into gigabytes of stacks and a scheduler thrashing through context switches. This is the old "C10k problem": handling ten thousand simultaneous connections is not a bandwidth problem, it is a problem of how you wait. A model where waiting costs a whole thread does not get there.
Non-blocking sockets change what "wait" means. Mark a socket non-blocking and
recv stops parking the thread: if there is nothing to read it returns
immediately with EAGAIN (or EWOULDBLOCK). Now one thread can poke
many sockets without getting stuck on any one of them. But spinning over every socket asking
"anything yet?" wastes the CPU you just saved, so you need the kernel to tell you which
sockets are ready. That is exactly what select, poll,
epoll, and kqueue do: hand them a set of descriptors, and they
block once until at least one is ready, then return the ready ones. One thread, parked in one
call, waking only when there is real work. That pattern — a single thread blocked in a
readiness call, dispatching to handlers as descriptors become ready — is an
event loop, and it is why non-blocking
sockets and the event loop are really the same idea seen from two angles.
Three ways to handle many connections from one thread:
- Non-blocking + select / poll / epoll. Mark each socket non-blocking with fcntl(fd, F_SETFL, O_NONBLOCK). recv now returns immediately, with EAGAIN if there’s nothing to read. Use epoll_wait (Linux), kqueue (BSD/macOS), or poll (everywhere) to wait on many sockets at once. This is what every high-throughput server uses underneath.
- io_uring. The newer Linux mechanism. Submit hundreds of operations through a shared-memory ring; the kernel completes them and notifies you in batches. Higher throughput than epoll for I/O-heavy workloads, but more complex.
- Async runtimes. Tokio (Rust), asyncio (Python), Go’s runtime, Node.js: all sit on top of one of the two above and present a coroutine / goroutine / promise abstraction. You write code as if it’s blocking; the runtime turns each await point into an epoll registration.
All three sit on the same foundation. The async runtime you use does not invent a new way to
talk to the network; it owns an event loop over non-blocking sockets and hides the
registration and dispatch behind await, a goroutine, or a callback. When you
understand the blocking call and the non-blocking-plus-readiness alternative, you understand
what every one of these runtimes is doing underneath.
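The non-blocking-plus-readiness pattern can be sketched with Python's standard `selectors` module, which picks epoll, kqueue, or poll depending on the platform. This is a single-iteration illustration rather than a full event loop; it uses `socket.socketpair()` so it runs without a network, and the variable names are illustrative.

```python
import selectors
import socket

left, right = socket.socketpair()
right.setblocking(False)          # recv will no longer park the thread

# With nothing to read, a non-blocking recv fails immediately (EAGAIN).
try:
    right.recv(16)
    empty_read = "returned data"
except BlockingIOError:
    empty_read = "EAGAIN"

# Ask the kernel to report when the socket becomes readable.
selector = selectors.DefaultSelector()
selector.register(right, selectors.EVENT_READ)

ready_before = selector.select(timeout=0)   # nothing is ready yet
left.send(b"ping")
ready_after = selector.select(timeout=1)    # wakes because data arrived

# Dispatch: read from every descriptor reported ready.
received = [key.fileobj.recv(16) for key, _events in ready_after]

selector.unregister(right)
selector.close()
left.close()
right.close()
```

A real loop would wrap the `select` call and the dispatch in `while True` and register many sockets; the shape stays the same.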
Stream sockets vs datagram sockets
The socket type you pass to socket() decides the shape of everything above it. A
SOCK_STREAM socket gives you TCP: a reliable, ordered, two-way byte stream. There
are no message boundaries on the wire, so bytes you send arrive in order, but the
receiver's recv calls split them however the kernel happens to have buffered
them. Three sends of 100 bytes can show up as one recv of 300, or six of 50.
This is why a stream protocol on top of TCP always has to define its own framing: a length
prefix, a delimiter, or a fixed-size record. The reliability and ordering are handled for you
by the machinery in the TCP deep dive —
sequence numbers, acknowledgements, retransmission, and flow control.
A SOCK_DGRAM socket gives you UDP: best-effort, message-oriented, no connection.
One sendto produces exactly one datagram on the wire, and one
recvfrom returns exactly one datagram, with its boundaries intact. There is no
ordering guarantee, no retransmission, and no flow control; a datagram can be lost,
duplicated, or arrive out of order, and the kernel will not tell you. In exchange you get no
handshake, no per-connection state, and the lowest possible latency. Anything that needs
reliability on top of UDP has to build it itself, which is the path the
UDP page follows up to QUIC.
| Property | SOCK_STREAM (TCP) | SOCK_DGRAM (UDP) |
|---|---|---|
| Connection | Handshake, kept state | None; address each datagram |
| Boundaries | None, a byte stream | Preserved, one send is one recv |
| Delivery | Reliable, ordered | Best-effort, may reorder or drop |
| Read call | recv returns partial reads | recvfrom returns a whole datagram |
| Cost | Per-connection state, slower start | Stateless, lowest latency |
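The boundary difference in the table can be seen side by side. This sketch assumes a Unix-like system with a working loopback interface: the datagram half uses real UDP sockets on 127.0.0.1, while the stream half uses `socket.socketpair()` as a stand-in for TCP. Loopback delivery is orderly in practice, which the sketch relies on; real networks give no such promise for UDP.

```python
import socket

# --- Datagram side: every sendto is one recvfrom ---
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5)
target = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for payload in (b"one", b"two", b"0123456789"):
    sender.sendto(payload, target)

first = receiver.recvfrom(1024)[0]
second = receiver.recvfrom(1024)[0]
# A buffer smaller than the datagram: the excess bytes are discarded.
truncated = receiver.recvfrom(4)[0]

# --- Stream side: two sends become one undifferentiated byte stream ---
writer, reader = socket.socketpair()
writer.send(b"one")
writer.send(b"two")
stream_bytes = b""
while len(stream_bytes) < 6:      # how recv chunks this is not guaranteed
    stream_bytes += reader.recv(1024)

for sock in (sender, receiver, writer, reader):
    sock.close()
```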
Socket options that matter
| Option | Use for |
|---|---|
| SO_REUSEADDR | Bind to a port even if a previous socket is still in TIME_WAIT. Set this on every server before bind. |
| SO_REUSEPORT | Multiple sockets on the same port; the kernel load-balances incoming connections across them. The right way to scale a single-threaded server across cores. |
| TCP_NODELAY | Disable Nagle’s algorithm. Set this on RPC-style traffic where every send is a logical message. |
| SO_KEEPALIVE | Send periodic probes on idle connections to detect a peer that vanished without RST. Default interval is 2 hours; usually too long. |
| TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT | The actual knobs. Set these to something like 30/10/3 for a 60-second detection window. |
| SO_RCVBUF / SO_SNDBUF | Per-socket buffer sizes. Linux auto-tunes; only override on very high-bandwidth-delay paths. |
| SO_LINGER | What close does with un-sent data. Default is "send what you can in the background"; rarely change this. |
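Setting a few of these from Python might look like the sketch below. The 30/10/3 keepalive values come from the table (30 seconds idle plus 3 probes 10 seconds apart gives the 60-second window); the three `TCP_KEEP*` names are not exposed on every platform, so the sketch skips any that are missing. Variable names are illustrative.

```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Boolean options: any non-zero value switches them on.
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
server.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Keepalive timing knobs; availability differs between platforms.
keepalive_knobs = {"TCP_KEEPIDLE": 30, "TCP_KEEPINTVL": 10, "TCP_KEEPCNT": 3}
applied = {}
for name, value in keepalive_knobs.items():
    option = getattr(socket, name, None)
    if option is not None:
        server.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = server.getsockopt(socket.IPPROTO_TCP, option)

# Read the boolean options back to confirm they took effect.
reuse_enabled = server.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
keepalive_enabled = server.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
nodelay_enabled = server.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

server.close()
```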
Address structures, briefly
C’s sockets API takes struct sockaddr* — an opaque pointer to a
family-specific struct. sockaddr_in for IPv4, sockaddr_in6
for IPv6, sockaddr_un for Unix domain sockets. Each starts with a
sa_family field so the kernel can tell which one you actually passed.
struct sockaddr_in {                // IPv4
    sa_family_t    sin_family;      // AF_INET
    in_port_t      sin_port;        // network byte order
    struct in_addr sin_addr;        // 32-bit IPv4 address
};
struct sockaddr_in6 {               // IPv6
    sa_family_t     sin6_family;    // AF_INET6
    in_port_t       sin6_port;      // network byte order
    uint32_t        sin6_flowinfo;
    struct in6_addr sin6_addr;      // 128-bit IPv6 address
    uint32_t        sin6_scope_id;
};
sin_port is in "network byte order": big-endian on the wire. Your machine
might be little-endian, so you have to convert. htons(8080) = "host to
network short", takes 8080 in your CPU’s order and returns it big-endian. The
next deep dive,
bytes on the wire, covers
this in full.
High-level languages hide this entirely. Python’s
socket.bind(("0.0.0.0", 8080)) takes the port as an integer in host
order and converts internally. C makes you do it explicitly.
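To make the conversion concrete, this sketch packs the port 8080 and the IPv4 address used earlier into the bytes that would appear in a `sockaddr_in`. It uses only the standard `struct` and `socket` modules; the variable names are illustrative.

```python
import socket
import struct

port = 8080                                   # 0x1F90

# "!" means network (big-endian) order, "H" an unsigned 16-bit value.
wire_port = struct.pack("!H", port)
decoded_port = struct.unpack("!H", wire_port)[0]

# htons/ntohs perform the same swap on integers; round-tripping is a no-op.
round_trip = socket.ntohs(socket.htons(port))

# An IPv4 address is four bytes, most significant octet first.
wire_address = socket.inet_aton("93.184.216.34")
```

On a little-endian machine `socket.htons(8080)` returns a different integer than 8080; on a big-endian one it returns 8080 unchanged, which is why portable code always calls it rather than guessing.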
The address family you choose decides which of these structures the kernel
expects and which protocol stack the socket plugs into. AF_INET and
AF_INET6 reach the IP network; AF_UNIX stays on the local machine and
addresses by a filesystem path, which skips the whole IP stack and is the fastest way for two
processes on one host to talk. The family is fixed at creation and cannot change, so a
dual-stack server that wants both IPv4 and IPv6 either opens two sockets or opens one
AF_INET6 socket and accepts IPv4 clients through address mapping.
Ports are the 16-bit number that lets one IP address host many services at
once. The kernel reserves the low range (below 1024 on Unix) for privileged processes, which
is why binding port 80 needs root or a capability. The high range, roughly 49152 and up, is
the ephemeral range the kernel draws from when a client calls connect
without binding a port first; it picks a free local port automatically so the connection has a
complete 4-tuple. That automatic allocation is finite, which is the deeper reason a process
that opens tens of thousands of short-lived outbound connections can run out of ports long
before it runs out of memory or bandwidth.
What happens when you close()
close(fd) does two things: it decrements the kernel’s reference count
on the socket, and if that hits zero, the kernel starts the TCP close sequence (sends
a FIN, transitions to FIN_WAIT_1, etc.). The function returns immediately;
the actual close happens in the background.
Two related calls are worth knowing:
- shutdown(fd, SHUT_WR) closes the writing half but keeps the reading half open. Useful when you want to say "I’m done sending, but you keep going"; the canonical example is a client sending an HTTP request and then half-closing while it waits for the response.
- shutdown(fd, SHUT_RD) closes the reading half. Rarely useful; just close.
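The half-close pattern can be sketched in a few lines. This example runs both ends in one process over `socket.socketpair()`, an assumption made so it needs no network; the payloads and variable names are illustrative.

```python
import socket

client, server = socket.socketpair()

# Client: send the request, then close only the writing half.
client.sendall(b"request body")
client.shutdown(socket.SHUT_WR)   # "I'm done sending"; reading stays open

# Server: read until the empty read that marks the end of the stream.
request = b""
while True:
    chunk = server.recv(4096)
    if not chunk:                 # peer half-closed: no more data will come
        break
    request += chunk

# The server can still answer, because the other direction is open.
server.sendall(b"response")
server.close()

# Client: read the response, then see end-of-stream from the server.
response = client.recv(4096)
end_of_stream = client.recv(4096)
client.close()
```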
After close, the kernel keeps the socket on the side that closed first in TIME_WAIT for
~60 seconds (2 × MSL) so it can absorb stragglers. This is normal, and lots of TIME_WAITs
on a busy server is fine. Sockets stuck in CLOSE_WAIT are an application bug; see the
TCP deep dive.
States and errors you will actually hit
A connected socket moves through a fixed set of TCP states, and a handful of them surface as real symptoms in production. Two are worth understanding because they cause the bugs people most often misread.
TIME_WAIT is the state the side that closes first sits in for roughly 60 seconds (twice the maximum segment lifetime) after the connection ends. It is not a leak. Its job is to absorb any stragglers still in flight and to make sure a delayed packet from this connection cannot be mistaken for part of a new connection that reuses the same 4-tuple. A busy server that initiates closes will have thousands of sockets in TIME_WAIT at any moment, and that is healthy. The cost only matters if you exhaust ephemeral ports by opening huge numbers of short-lived outbound connections, which is an argument for connection pooling rather than for tuning the timer.
EADDRINUSE ("Address already in use") is the error new server programmers
meet first. You stop your server and restart it, and bind fails. The reason is
that your old listening port is tied up by connections still in TIME_WAIT, and by default the
kernel will not let a new socket bind a port with lingering state. The fix is to set
SO_REUSEADDR before bind on every server, which tells the kernel a
TIME_WAIT remnant is not a reason to refuse the bind. It is safe and standard; treat it as a
default, not a workaround.
CLOSE_WAIT is the one that signals a real bug. It means the peer sent a FIN
and the kernel is waiting for your program to call close, which it has
not done. A pile of sockets stuck in CLOSE_WAIT almost always means a code path that reads
until the peer closes but then forgets to close its own end, slowly leaking descriptors until
accept starts failing.
| Error / state | What it means | What to do |
|---|---|---|
| EADDRINUSE | Port held by a TIME_WAIT remnant on restart | Set SO_REUSEADDR before bind |
| ECONNREFUSED | Nothing is listening on that port; the peer sent RST | Check the server is up and on the right port |
| ETIMEDOUT | connect got no response at all | Firewall, dropped packets, or wrong host |
| EAGAIN / EWOULDBLOCK | Non-blocking socket has nothing ready right now | Wait for readiness via epoll/poll, then retry |
| EPIPE / ECONNRESET | Peer closed or reset; you wrote anyway | Handle the closed peer; do not keep sending |
| EMFILE | Out of file descriptors for this process | Close FDs you leak; raise ulimit -n |
| TIME_WAIT | Normal post-close wait on the closer | Nothing; expected and healthy |
| CLOSE_WAIT | Peer closed; your code has not | Find the path that forgets to close |
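Errors in this table reach Python as `OSError` with the matching `errno`. The sketch below provokes EADDRINUSE in the simplest reproducible way, by binding a second socket to an address a live listener already holds. That is a different trigger from the TIME_WAIT restart case described above, chosen only because it is deterministic; it assumes a working loopback interface.

```python
import errno
import socket

first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
address = first.getsockname()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(address)          # same IP and port as the live listener
    bind_result = "ok"
except OSError as exc:
    # Translate the numeric errno into its symbolic name.
    bind_result = errno.errorcode.get(exc.errno, str(exc.errno))
finally:
    second.close()
    first.close()
```

The same `except OSError` shape, with a lookup on `exc.errno`, works for the other rows of the table.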
How higher-level networking is built on this
Almost nothing you write day to day calls accept directly, and that is the point.
The eight-call API is small and stable enough that every layer above it is just a more
convenient face on the same syscalls. An HTTP server is a loop around accept that
reads a request off the connected socket, parses it, and writes a response back; a database
driver is a client socket plus a wire protocol and a connection pool; a load balancer is a
program holding two connected sockets and copying bytes between them. TLS is a layer that sits
on the byte stream, encrypting what you send and decrypting what you
recv, with the socket none the wiser.
This is why learning the socket API pays off out of proportion to its size. When a request hangs, when a server will not restart, when throughput stalls under load, the explanation is usually one layer down at the socket: a full accept queue, a peer in CLOSE_WAIT, a send buffer that is not draining, a descriptor leak. The frameworks change; the eight calls and the kernel behaviour behind them do not.
Common mistakes
- Forgetting SO_REUSEADDR on a server. Restart the program; bind fails with "Address already in use" because the previous socket is in TIME_WAIT. Set SO_REUSEADDR before bind and it goes away.
- Treating TCP recv as message-oriented. "I sent 100 bytes; my recv returned 60." That’s normal. Loop until you have the full message; define message boundaries in your protocol.
- Not setting a timeout on connect. A blocking connect waits for whatever the OS’s default TCP timeout is, typically around two minutes. Use non-blocking + poll with a custom timeout if you need a tighter bound.
- Closing without checking send’s return. If send returned a partial count and you don’t loop, you sent a truncated message. The peer’s parser then gets confused and you spend an hour wondering why.
- Leaking file descriptors. Every accepted connection is an FD. Forget to close on an error path and you eventually hit ulimit -n (default 1024 on most systems) and accept starts failing.
Tools — finding out what your sockets are doing
| Tool | Use for |
|---|---|
| ss -tnp | Live socket state, owning process. The first thing to run. |
| lsof -i | Open network connections, with the file descriptor and process. |
| netstat -an | The classic. ss is faster, but netstat is everywhere. |
| strace -e trace=network | Watch a process’s socket syscalls in real time. Indispensable for "why does this program do that". |
| nc (netcat) | The "swiss army knife of TCP". Open arbitrary connections; useful for testing your server. |
| socat | Like netcat, but supports many more transports: Unix domain sockets, TLS, TUN devices. |
| bpftrace -e 'k:tcp_connect { ... }' | Trace specific kernel functions. The networking section of BPF Performance Tools is the right reference. |
Further reading
- Beej’s Guide to Network Programming — free, friendly, the definitive introduction. If you’ve never written a socket program, this is where to start.
- socket(2), connect(2), accept(2) — the man pages. The most authoritative source for corner cases.
- socket(7) — the protocol-independent overview, including all the SO_* options.
- tcp(7) — TCP-specific options, including TCP_NODELAY, TCP_KEEPIDLE, and the rest.
- Kerrisk — The Linux Programming Interface, ch. 56-61 — the canonical reference. Every Linux sockets corner case is in there.
- Marek Majkowski — Bind before connect — a deep dive on the production gotcha around source-port allocation.