Sockets

A socket is the operating system's abstraction for network I/O. To the kernel it is a data structure that owns two buffers and some protocol state; to your program it is a file descriptor you can read and write like any other. The same handful of system calls — socket, bind, listen, accept, connect, send, recv, close — has been the API since 1983 BSD, and every higher-level layer (Go’s net package, Node’s net module, your HTTP framework, a load balancer) calls down to it. This page works through what the socket is, the exact syscall sequence on each side, where data sits between your code and the wire, and why one blocking call per connection eventually pushes you toward an event loop.


What a socket actually is

A socket is a kernel data structure that the program refers to by a small integer — a file descriptor. It holds a few pieces of state: which protocol family it speaks (IPv4, IPv6, Unix), which type it is (stream, datagram, raw), the local address it’s bound to, the remote address it’s connected to, and two buffers — one for data going out, one for data coming in.

In Unix everything is a file. A socket is a file descriptor that, instead of being backed by bytes on disk, is backed by bytes the kernel reads from or writes to the network. read and write work on a socket FD just like they do on a regular file FD; send and recv are the same thing with extra flags.

The five-tuple. Every TCP connection is uniquely identified by (protocol, local IP, local port, remote IP, remote port). The kernel uses this to decide which socket each incoming packet belongs to. Two TCP sockets cannot share a five-tuple; everything else is fair game.

The two buffers are the part that surprises people, so it is worth being precise. When your code calls send, the bytes do not go onto the wire. They are copied into the socket's send buffer inside the kernel, and send returns. The kernel's TCP code drains that buffer onto the network in its own time, according to the window the receiver advertised and what congestion control allows. On the other side, packets the NIC receives are reassembled into the socket's receive buffer, and recv copies out of that buffer into your memory. Your program and the network are decoupled by these two queues; that decoupling is what makes a socket feel like a file instead of a wire.

[Figure: application code in your process calls send() and recv() against the kernel's send buffer (SO_SNDBUF) and recv buffer (SO_RCVBUF). The kernel drains the send buffer to the wire through the NIC, and arriving packets fill the recv buffer. Your code talks to buffers; the kernel talks to the wire.]
The two kernel buffers decouple your code from the network. send fills one, recv drains the other; TCP moves bytes between the buffers and the wire on its own schedule.
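
To make the two buffers concrete, here is a minimal Python sketch that asks the kernel how large it made them for a fresh TCP socket. It assumes a Unix-like system; the numbers printed are whatever your kernel chose and will differ between machines.

```python
import socket

# Create a TCP socket and ask the kernel how large its two buffers are.
# The values vary by OS and configuration; nothing here changes them.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
send_buf_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
recv_buf_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()

print(f"send buffer (SO_SNDBUF): {send_buf_size} bytes")
print(f"recv buffer (SO_RCVBUF): {recv_buf_size} bytes")
```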

This is also why a socket sits squarely in the operating system's I/O machinery rather than off to the side. The same descriptor, blocking, and readiness concepts that apply to other file descriptors apply here too (the page cache is the one piece that does not carry over, since it backs disk files rather than sockets); the OS I/O internals page covers the file-descriptor table, the difference between an open file description and a descriptor, and how the kernel decides a descriptor is "ready" to read or write. A socket is one more kind of descriptor that plugs into all of it.

Socket types

You pick two things when you call socket(): the address family and the socket type.

| Address family | For |
|---|---|
| AF_INET | IPv4 |
| AF_INET6 | IPv6 (and dual-stack IPv4 with mapping) |
| AF_UNIX | Local IPC over a filesystem path |
| AF_PACKET | Raw Ethernet frames (Linux); needs root |

| Socket type | Maps to | Semantics |
|---|---|---|
| SOCK_STREAM | TCP | Reliable, ordered, byte-stream |
| SOCK_DGRAM | UDP | Best-effort, message-oriented |
| SOCK_RAW | raw IP | You write the IP header yourself; needs root |
| SOCK_SEQPACKET | SCTP, Unix | Reliable, message-boundaries preserved |

Almost every program you’ll write uses AF_INET or AF_INET6 with SOCK_STREAM (TCP) or SOCK_DGRAM (UDP). The other combinations exist for narrower jobs — packet capture, custom protocols, IPC.

The server lifecycle

Five system calls, in this order:

int fd = socket(AF_INET, SOCK_STREAM, 0);    //  1. create

struct sockaddr_in addr = {
    .sin_family = AF_INET,
    .sin_port   = htons(8080),
    .sin_addr   = { INADDR_ANY },             // 0.0.0.0 = all interfaces
};
bind(fd, (struct sockaddr*)&addr, sizeof addr); //  2. claim address

listen(fd, 128);                              //  3. mark as accepting,
                                              //     backlog = 128

while (1) {
    int client = accept(fd, NULL, NULL);      //  4. dequeue a connection
    handle(client);                           //  5. read/write on it
    close(client);
}

Step by step:

  • socket allocates the kernel data structure and gives you back a fd. The socket is "unbound" — it has no address.
  • bind attaches a local address. 0.0.0.0 means all interfaces; 127.0.0.1 binds to localhost only. Port 0 lets the kernel pick a free port, which is handy for tests.
  • listen tells the kernel "incoming connections welcome", with a backlog of how many completed connections to queue. A value of 128 is fine for most workloads; high-throughput servers raise it, and on Linux the kernel silently caps it at net.core.somaxconn.
  • accept blocks until a connection completes the TCP handshake, then returns a new fd for that connection. The original fd keeps accepting on its own.
  • close on the per-connection fd starts the TCP close sequence. close on the listening fd stops accepting new connections.
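
The bind and listen steps above, including the port 0 trick, can be sketched in Python as follows. This is a small illustration on the loopback interface, not a full server; getsockname() is the call that reveals which port the kernel picked.

```python
import socket

# Bind to port 0 on loopback so the kernel picks a free port (handy in tests).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(128)

# getsockname() reports the address the kernel actually assigned.
host, port = listener.getsockname()
print(f"listening on {host}:{port}")
listener.close()
```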

The client lifecycle

Even simpler. Three calls:

int fd = socket(AF_INET, SOCK_STREAM, 0);    //  1. create

struct sockaddr_in remote = {
    .sin_family = AF_INET,
    .sin_port   = htons(80),                 // plain HTTP; port 443 would need TLS
    .sin_addr   = { inet_addr("93.184.216.34") },
};
connect(fd, (struct sockaddr*)&remote, sizeof remote); //  2. handshake

send(fd, "GET / HTTP/1.0\r\n\r\n", 18, 0);   //  3. write (18 bytes)
char buf[4096];
ssize_t n = recv(fd, buf, sizeof buf, 0);    //     read
close(fd);

connect is the active half of the TCP three-way handshake. When it returns successfully, the client's side of the connection is ESTABLISHED and the final ACK is on its way; the server's side becomes ESTABLISHED when that ACK arrives. If the server isn’t listening or the network is broken, you get ECONNREFUSED, EHOSTUNREACH, or ETIMEDOUT depending on the failure mode.
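
One of these failure modes is easy to reproduce locally. The sketch below, which assumes Linux-like behaviour on the loopback interface, reserves a port with bind but never calls listen, then tries to connect to it. The variable names are illustrative only.

```python
import errno
import socket

# Reserve a loopback port but never call listen() on it, so nothing accepts.
placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
placeholder.bind(("127.0.0.1", 0))
port = placeholder.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(2.0)  # bound the wait in case the platform drops instead of refusing
try:
    client.connect(("127.0.0.1", port))
    result = "connected"
except OSError as exc:
    # On Linux the kernel answers the SYN with RST, which surfaces as ECONNREFUSED.
    result = errno.errorcode.get(exc.errno, str(exc))
finally:
    client.close()
    placeholder.close()

print(result)
```

In Python the same error also arrives as the ConnectionRefusedError subclass of OSError, which is usually the more readable thing to catch.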

The two sequences, side by side

It helps to see both sides as one picture. The server sets up a listening socket once and then loops on accept. The client makes one socket and calls connect. The three-way handshake happens between connect on the client and accept on the server; only after it completes does either side have a connected socket to read and write.

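[Figure: sequence diagram. Server: socket(), bind(), listen(), accept() (blocks). Client: socket(), connect(). The client sends SYN, the server replies SYN + ACK, the client sends ACK. Then accept() returns the connection fd, connect() returns, and send / recv flow both ways.]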
The handshake sits between the client's connect and the server's accept. The listening socket never carries data; the connected socket accept hands back does.

Listening sockets vs connected sockets

This is the distinction that makes the rest of socket programming click, and it is the one beginners most often blur. There are two kinds of TCP socket, and they do different jobs. The socket you create, bind, and listen on is a listening socket. It never carries application data. Its only job is to be the rendezvous point: it has a local address but no remote peer, so its identity is just (protocol, local IP, local port) with the remote half left open. Every fresh connection that arrives on that port is matched to this one socket.

Each call to accept hands you a different socket: a connected socket, with a full identity. Its 4-tuple is (local IP, local port, remote IP, remote port) — the same local port as the listener, plus the specific client's address and port. That 4-tuple is what the kernel uses to demultiplex arriving packets: when a segment shows up, the kernel looks for an established socket whose 4-tuple matches all four fields, and only if none matches does it fall back to a listening socket with that local port. This is why thousands of clients can all connect to your server on port 443 at once. They share the local port, but each connection has a distinct remote IP-and-port, so each gets its own connected socket.

[Figure: one listening socket on *:8080 (no peer) and three connected sockets: 10.0.0.5:8080 ↔ 203.0.113.7:51001, 10.0.0.5:8080 ↔ 198.51.100.2:44210, and 10.0.0.5:8080 ↔ 203.0.113.7:51002. Same local port, one connected socket per client.]
One listening socket, many connected sockets. The first and third share a remote IP but differ in remote port, so their 4-tuples are still distinct.

The listening socket also explains the backlog argument to listen. The kernel keeps two queues behind a listening socket: a SYN queue of half-open connections still in the handshake, and an accept queue of fully established connections waiting for your code to call accept. The backlog sizes the accept queue. If your program is slow to accept and that queue fills, new completed connections are dropped, and the client sees a connection that hangs or resets. A backlog of 128 is fine for most servers; one that does almost no work per connection and accepts in a tight loop can go higher. The number is a buffer against bursts, not a connection limit.
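
Both ideas in this section, the accept queue and the distinct 4-tuples, show up in a short loopback experiment. In this Python sketch three clients connect before the server ever calls accept; the backlog of 8 and the client count are arbitrary choices for the demo.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(8)
server_addr = listener.getsockname()

# Connect three clients before the server calls accept(). The kernel
# completes each handshake and parks the connection in the accept queue.
clients = [socket.create_connection(server_addr, timeout=2.0) for _ in range(3)]

# Drain the accept queue. Each accept() returns a distinct connected socket.
listener.settimeout(2.0)
four_tuples = []
for _ in clients:
    conn, peer = listener.accept()
    four_tuples.append((conn.getsockname(), peer))
    conn.close()

for local, remote in four_tuples:
    print(f"{local[0]}:{local[1]} <-> {remote[0]}:{remote[1]}")

for c in clients:
    c.close()
listener.close()
```

Every printed line shares the listener's local port; only the remote port differs, which is what keeps the three connections apart.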

A first server in Python

The same lifecycle in Python looks almost identical. The standard library wraps the kernel calls in a friendlier shape, but the order is the same.

# echo server: read one chunk (up to 4096 bytes), send it back
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("0.0.0.0", 8080))
s.listen(128)

while True:
    conn, addr = s.accept()
    print(f"connection from {addr}")
    data = conn.recv(4096)
    conn.sendall(data)          # plain send() may write only part of data
    conn.close()

Run it. From another terminal: nc localhost 8080, type something, hit enter. You’ve just exchanged bytes through the TCP/IP stack.

send and recv don't return what you expect

This is the most common beginner bug. send(fd, buf, n) doesn’t send n bytes — it sends up to n bytes and returns how many it actually sent. recv(fd, buf, n) reads up to n bytes and returns how many actually arrived. The kernel buffers, the application buffers, the network is bursty, so partial returns are normal.

You always loop:

# send all bytes
sent = 0
while sent < len(buf):
    n = sock.send(buf[sent:])
    if n == 0:
        raise BrokenPipeError()
    sent += n

# recv exactly N bytes
def recv_n(sock, n):
    chunks = []
    received = 0
    while received < n:
        chunk = sock.recv(n - received)
        if not chunk:
            raise ConnectionError("peer closed early")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)

Python’s sock.sendall(buf) wraps the send loop. There’s no equivalent recvall because TCP is a stream — there’s no inherent "end of message". Your protocol on top has to define one (length prefix, delimiter, fixed-size frame, length + checksum + payload, etc.).
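
As one possible way to define message boundaries, here is a sketch of length-prefix framing. The frame layout (a 4-byte big-endian length followed by the payload) and the helper names send_msg and recv_msg are assumptions made for this example, not a standard; recv_n is the helper from above, repeated so the block runs on its own.

```python
import socket
import struct

def recv_n(sock, n):
    """Read exactly n bytes or raise if the peer closes first."""
    chunks = []
    received = 0
    while received < n:
        chunk = sock.recv(n - received)
        if not chunk:
            raise ConnectionError("peer closed early")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)

def send_msg(sock, payload):
    """Write one frame: 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock):
    """Read one frame written by send_msg."""
    (length,) = struct.unpack("!I", recv_n(sock, 4))
    return recv_n(sock, length)

# socketpair() gives two connected stream sockets inside one process.
a, b = socket.socketpair()
send_msg(a, b"hello")
send_msg(a, b"")            # empty messages survive framing too
send_msg(a, b"world!")
messages = [recv_msg(b), recv_msg(b), recv_msg(b)]
a.close()
b.close()
print(messages)
```

The receiver gets three messages back regardless of how the kernel happened to split or merge the underlying bytes.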

UDP is different. One sendto = one packet on the wire = one recvfrom on the other side. There’s no "partial" UDP recv — either you get the whole datagram or you don’t. The buffer you pass had better be big enough; anything past it is silently truncated.
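
The datagram behaviour can be seen on the loopback interface with the sketch below. It assumes a Unix-like system (where an undersized buffer truncates silently) and relies on loopback delivering both datagrams in order, which UDP does not guarantee on a real network.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Two sendto calls produce two datagrams; boundaries are preserved.
sender.sendto(b"first", receiver.getsockname())
sender.sendto(b"second-datagram", receiver.getsockname())

whole, _addr = receiver.recvfrom(2048)    # buffer big enough: whole datagram
truncated, _addr = receiver.recvfrom(6)   # buffer too small: the tail is discarded

sender.close()
receiver.close()
print(whole, truncated)
```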

Blocking vs non-blocking

By default, recv, accept, and connect block the calling thread until something happens. recv blocks until the receive buffer has bytes or the peer closes; accept blocks until a connection sits in the accept queue; connect blocks until the handshake finishes or fails. The thread is parked the whole time, doing nothing, holding its stack.

With one connection per thread that is fine. The trouble is the cost of a thread. Each one carries a stack measured in megabytes and a scheduling slot, so a few thousand idle connections turn into gigabytes of stacks and a scheduler thrashing through context switches. This is the old "C10k problem": handling ten thousand simultaneous connections is not a bandwidth problem, it is a problem of how you wait. A model where waiting costs a whole thread does not get there.

Non-blocking sockets change what "wait" means. Mark a socket non-blocking and recv stops parking the thread: if there is nothing to read it returns immediately with EAGAIN (or EWOULDBLOCK). Now one thread can poke many sockets without getting stuck on any one of them. But spinning over every socket asking "anything yet?" wastes the CPU you just saved, so you need the kernel to tell you which sockets are ready. That is exactly what select, poll, epoll, and kqueue do: hand them a set of descriptors, and they block once until at least one is ready, then return the ready ones. One thread, parked in one call, waking only when there is real work. That pattern — a single thread blocked in a readiness call, dispatching to handlers as descriptors become ready — is an event loop, and it is why non-blocking sockets and the event loop are really the same idea seen from two angles.
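
A minimal sketch of that sequence in Python, using a socketpair so it runs in one process: a non-blocking recv with nothing to read fails immediately, and a readiness call then reports when a retry will succeed. select is used here only because it is the most portable of the readiness calls.

```python
import select
import socket

a, b = socket.socketpair()
a.setblocking(False)        # on Unix this sets O_NONBLOCK on the descriptor

# Nothing has been sent yet, so a non-blocking recv fails immediately
# instead of parking the thread. Python surfaces EAGAIN as BlockingIOError.
try:
    a.recv(4096)
    first_attempt = "got data"
except BlockingIOError:
    first_attempt = "would block"

# Ask the kernel to wake us when `a` is readable, then retry.
b.sendall(b"ping")
readable, _, _ = select.select([a], [], [], 2.0)
data = a.recv(4096) if a in readable else b""

a.close()
b.close()
print(first_attempt, data)
```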

Three ways to handle many connections from one thread:

  • Non-blocking + select / poll / epoll. Mark each socket non-blocking with fcntl(fd, F_SETFL, O_NONBLOCK). recv now returns immediately, with EAGAIN if there’s nothing to read. Use epoll_wait (Linux), kqueue (BSD/macOS), or poll (everywhere) to wait on many sockets at once. This is what every high-throughput server uses underneath.
  • io_uring. The newer Linux mechanism. Submit hundreds of operations through a shared-memory ring; the kernel completes them and notifies you in batches. It can deliver higher throughput than epoll for I/O-heavy workloads, at the cost of more complexity.
  • Async runtimes. Tokio (Rust), asyncio (Python), Go’s runtime, Node.js — all sit on top of one of the two above and present a coroutine / goroutine / promise abstraction. You write code as if it’s blocking; the runtime turns each await point into an epoll registration.

All three sit on the same foundation. The async runtime you use does not invent a new way to talk to the network; it owns an event loop over non-blocking sockets and hides the registration and dispatch behind await, a goroutine, or a callback. When you understand the blocking call and the non-blocking-plus-readiness alternative, you understand what every one of these runtimes is doing underneath.
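
To show how small the core of an event loop is, here is a sketch built on Python's selectors module, which picks epoll, kqueue, or poll depending on the platform. The function name run_echo_loop, the fixed number of rounds, and the use of sendall on a non-blocking socket are simplifications for the demo; a real loop would run until shutdown and would queue unsent bytes and wait for write readiness.

```python
import selectors
import socket

def run_echo_loop(listener, rounds, timeout=0.1):
    """Drive a tiny single-threaded echo loop for a fixed number of rounds."""
    sel = selectors.DefaultSelector()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, data="listener")
    try:
        for _ in range(rounds):
            # One blocking call; it returns only the descriptors that are ready.
            for key, _events in sel.select(timeout):
                if key.data == "listener":
                    conn, _addr = key.fileobj.accept()
                    conn.setblocking(False)
                    sel.register(conn, selectors.EVENT_READ, data="conn")
                else:
                    conn = key.fileobj
                    data = conn.recv(4096)
                    if data:
                        conn.sendall(data)   # sketch only: fine for tiny payloads
                    else:
                        sel.unregister(conn)
                        conn.close()
    finally:
        # Close any connections still registered, then the selector itself.
        for key in list(sel.get_map().values()):
            if key.data == "conn":
                sel.unregister(key.fileobj)
                key.fileobj.close()
        sel.close()

# Demo: three clients, one thread serving all of them.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
clients = [socket.create_connection(listener.getsockname(), timeout=2.0)
           for _ in range(3)]
for i, c in enumerate(clients):
    c.sendall(f"hello {i}".encode())

run_echo_loop(listener, rounds=8)

replies = [c.recv(4096) for c in clients]
for c in clients:
    c.close()
listener.close()
print(replies)
```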

Stream sockets vs datagram sockets

The socket type you pass to socket() decides the shape of everything above it. A SOCK_STREAM socket gives you TCP: a reliable, ordered, two-way byte stream. There are no message boundaries on the wire, so bytes you send arrive in order, but the receiver's recv calls split them however the kernel happens to have buffered them. Three sends of 100 bytes can show up as one recv of 300, or six of 50. This is why a stream protocol on top of TCP always has to define its own framing: a length prefix, a delimiter, or a fixed-size record. The reliability and ordering are handled for you by the machinery in the TCP deep dive — sequence numbers, acknowledgements, retransmission, and flow control.

A SOCK_DGRAM socket gives you UDP: best-effort, message-oriented, no connection. One sendto produces exactly one datagram on the wire, and one recvfrom returns exactly one datagram, with its boundaries intact. There is no ordering guarantee, no retransmission, and no flow control; a datagram can be lost, duplicated, or arrive out of order, and the kernel will not tell you. In exchange you get no handshake, no per-connection state, and the lowest possible latency. Anything that needs reliability on top of UDP has to build it itself, which is the path the UDP page follows up to QUIC.

| Property | SOCK_STREAM (TCP) | SOCK_DGRAM (UDP) |
|---|---|---|
| Connection | Handshake, kept state | None; address each datagram |
| Boundaries | None, a byte stream | Preserved, one send is one recv |
| Delivery | Reliable, ordered | Best-effort, may reorder or drop |
| Read call | recv returns partial reads | recvfrom returns a whole datagram |
| Cost | Per-connection state, slower start | Stateless, lowest latency |

Socket options that matter

| Option | Use for |
|---|---|
| SO_REUSEADDR | Bind to a port even if a previous socket is still in TIME_WAIT. Set this on every server before bind. |
| SO_REUSEPORT | Multiple sockets on the same port; the kernel load-balances incoming connections across them. The right way to scale a single-threaded server across cores. |
| TCP_NODELAY | Disable Nagle’s algorithm. Set this on RPC-style traffic where every send is a logical message. |
| SO_KEEPALIVE | Send periodic probes on idle connections to detect a peer that vanished without RST. Default interval is 2 hours; usually too long. |
| TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT | The actual knobs. Set these to something like 30/10/3 for a 60-second detection window. |
| SO_RCVBUF / SO_SNDBUF | Per-socket buffer sizes. Linux auto-tunes; only override on very high-bandwidth-delay paths. |
| SO_LINGER | What close does with un-sent data. Default is "send what you can in the background"; rarely change this. |
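
Setting a few of these from Python looks like the sketch below. The three keepalive constants exist on Linux but not on every platform, so the code guards them with hasattr; the 30/10/3 values are the ones suggested in the table.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Safe default for servers: allow rebinding over TIME_WAIT remnants.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Disable Nagle's algorithm for request/response style traffic.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Turn on keepalive probes, then tighten the timers where the platform
# exposes them: 30 s idle, 10 s between probes, 3 probes before giving up.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
keepalive_knobs = {"TCP_KEEPIDLE": 30, "TCP_KEEPINTVL": 10, "TCP_KEEPCNT": 3}
for name, value in keepalive_knobs.items():
    if hasattr(socket, name):
        sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)

# Read two of the options back to confirm they took effect.
nodelay_on = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(nodelay_on, keepalive_on)
```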

Address structures, briefly

C’s sockets API takes struct sockaddr* — an opaque pointer to a family-specific struct. sockaddr_in for IPv4, sockaddr_in6 for IPv6, sockaddr_un for Unix domain sockets. Each starts with a sa_family field so the kernel can tell which one you actually passed.

struct sockaddr_in {            // IPv4
    sa_family_t    sin_family;  // AF_INET
    in_port_t      sin_port;    // network byte order
    struct in_addr sin_addr;    // 32-bit IPv4 address
};

struct sockaddr_in6 {           // IPv6
    sa_family_t     sin6_family;  // AF_INET6
    in_port_t       sin6_port;    // network byte order
    uint32_t        sin6_flowinfo;
    struct in6_addr sin6_addr;    // 128-bit IPv6 address
    uint32_t        sin6_scope_id;
};

sin_port is "network byte order" — big-endian on the wire. Your machine might be little-endian, so you have to convert. htons(8080) = "host to network short", takes 8080 in your CPU’s order and returns it big-endian. The next deep dive, bytes on the wire, covers this in full.
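
The conversion is easy to inspect from Python, which exposes both the struct module and the C-style helpers. This short sketch packs port 8080 (0x1F90) the way it appears on the wire.

```python
import socket
import struct

port = 8080                                # 0x1F90

# In struct, "!" means network (big-endian) byte order and "H" is 16 bits.
wire_bytes = struct.pack("!H", port)
print(wire_bytes.hex())                    # 1f90

# htons and ntohs are exposed too; they are inverses on any host.
round_trip = socket.ntohs(socket.htons(port))
print(round_trip)                          # 8080
```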

High-level languages hide this entirely. Python’s socket.bind(("0.0.0.0", 8080)) takes the port as an integer in host order and converts internally. C makes you do it explicitly, and so does Rust when you go through the raw libc bindings rather than std::net.

The address family you choose decides which of these structures the kernel expects and which protocol stack the socket plugs into. AF_INET and AF_INET6 reach the IP network; AF_UNIX stays on the local machine and addresses by a filesystem path, which skips the whole IP stack and is the fastest way for two processes on one host to talk. The family is fixed at creation and cannot change, so a dual-stack server that wants both IPv4 and IPv6 either opens two sockets or opens one AF_INET6 socket and accepts IPv4 clients through address mapping.

Ports are the 16-bit number that lets one IP address host many services at once. The kernel reserves the low range (below 1024 on Unix) for privileged processes, which is why binding port 80 needs root or a capability. The ephemeral range (49152 to 65535 in the IANA assignment, 32768 to 60999 by default on Linux) is what the kernel draws from when a client calls connect without binding a port first; it picks a free local port automatically so the connection has a complete 4-tuple. That automatic allocation is finite, which is the deeper reason a process that opens tens of thousands of short-lived outbound connections can run out of ports long before it runs out of memory or bandwidth.

What happens when you close()

close(fd) does two things: it decrements the kernel’s reference count on the socket, and if that hits zero, the kernel starts the TCP close sequence (sends a FIN, transitions to FIN_WAIT_1, etc.). The function returns immediately; the actual close happens in the background.

Two related calls are worth knowing:

  • shutdown(fd, SHUT_WR) closes the writing half but keeps the reading half open. Useful when you want to say "I’m done sending, but you keep going" — the canonical example is a client sending an HTTP request and then half-closing while it waits for the response.
  • shutdown(fd, SHUT_RD) closes the reading half. Rarely useful; just close.
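
The half-close pattern can be sketched in Python with a socketpair standing in for a real client and server. The payloads are placeholders; the point is that the server sees end-of-stream after SHUT_WR while the client can still read the reply.

```python
import socket

client, server = socket.socketpair()

# Client: send the whole request, then half-close the write side.
client.sendall(b"request body")
client.shutdown(socket.SHUT_WR)

# Server: read until recv returns b"" (the peer's end-of-stream), then reply.
request = b""
while True:
    chunk = server.recv(4096)
    if not chunk:
        break
    request += chunk
server.sendall(b"response to " + request)
server.close()

# Client: the read side is still open, so the response arrives.
response = b""
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    response += chunk
client.close()
print(response)
```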

After close, the side that closed first keeps the socket in TIME_WAIT for roughly 60 seconds (nominally 2 × MSL) so it can absorb stragglers. This is normal, and lots of TIME_WAITs on a busy server is fine. Sockets stuck in CLOSE_WAIT are an application bug; see the TCP deep dive.

States and errors you will actually hit

A connected socket moves through a fixed set of TCP states, and a handful of them surface as real symptoms in production. Two states and one error are worth understanding because they cause the bugs people most often misread.

TIME_WAIT is the state the side that closes first sits in for roughly 60 seconds (twice the maximum segment lifetime) after the connection ends. It is not a leak. Its job is to absorb any stragglers still in flight and to make sure a delayed packet from this connection cannot be mistaken for part of a new connection that reuses the same 4-tuple. A busy server that initiates closes will have thousands of sockets in TIME_WAIT at any moment, and that is healthy. The cost only matters if you exhaust ephemeral ports by opening huge numbers of short-lived outbound connections, which is an argument for connection pooling rather than for tuning the timer.

EADDRINUSE ("Address already in use") is the error new server programmers meet first. You stop your server and restart it, and bind fails. The reason is that your old listening port is tied up by connections still in TIME_WAIT, and by default the kernel will not let a new socket bind a port with lingering state. The fix is to set SO_REUSEADDR before bind on every server, which tells the kernel a TIME_WAIT remnant is not a reason to refuse the bind. It is safe and standard; treat it as a default, not a workaround.

CLOSE_WAIT is the one that signals a real bug. It means the peer sent a FIN and the kernel is waiting for your program to call close, which it has not done. A pile of sockets stuck in CLOSE_WAIT almost always means a code path that reads until the peer closes but then forgets to close its own end, slowly leaking descriptors until accept starts failing.

| Error / state | What it means | What to do |
|---|---|---|
| EADDRINUSE | Port held by a TIME_WAIT remnant on restart | Set SO_REUSEADDR before bind |
| ECONNREFUSED | Nothing is listening on that port; the peer sent RST | Check the server is up and on the right port |
| ETIMEDOUT | connect got no response at all | Firewall, dropped packets, or wrong host |
| EAGAIN / EWOULDBLOCK | Non-blocking socket has nothing ready right now | Wait for readiness via epoll/poll, then retry |
| EPIPE / ECONNRESET | Peer closed or reset; you wrote anyway | Handle the closed peer; do not keep sending |
| EMFILE | Out of file descriptors for this process | Close FDs you leak; raise ulimit -n |
| TIME_WAIT | Normal post-close wait on the closer | Nothing; expected and healthy |
| CLOSE_WAIT | Peer closed; your code has not | Find the path that forgets to close |
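
To see what EADDRINUSE looks like from code, the sketch below triggers it by the simplest route: a second bind to an address that a live listener already holds. This is not the TIME_WAIT restart case described above, and SO_REUSEADDR would not override a live listener; the example only shows how the error surfaces. It assumes a Unix-like system.

```python
import errno
import socket

first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
addr = first.getsockname()

# A second socket cannot bind the same address while a live listener holds it.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(addr)
    outcome = "bound"
except OSError as exc:
    outcome = errno.errorcode.get(exc.errno, str(exc))
finally:
    second.close()
    first.close()

print(outcome)
```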

How higher-level networking is built on this

Almost nothing you write day to day calls accept directly, and that is the point. The eight-call API is small and stable enough that every layer above it is just a more convenient face on the same syscalls. An HTTP server is a loop around accept that reads a request off the connected socket, parses it, and writes a response back; a database driver is a client socket plus a wire protocol and a connection pool; a load balancer is a program holding two connected sockets and copying bytes between them. TLS is a layer that sits on the byte stream, encrypting what you send and decrypting what you recv, with the socket none the wiser.

This is why learning the socket API pays off out of proportion to its size. When a request hangs, when a server will not restart, when throughput stalls under load, the explanation is usually one layer down at the socket: a full accept queue, a peer in CLOSE_WAIT, a send buffer that is not draining, a descriptor leak. The frameworks change; the eight calls and the kernel behaviour behind them do not.

Common mistakes

  • Forgetting SO_REUSEADDR on a server. Restart the program; bind fails with "Address already in use" because the previous socket is in TIME_WAIT. Set SO_REUSEADDR before bind and it goes away.
  • Treating TCP recv as message-oriented. "I sent 100 bytes; my recv returned 60." That’s normal. Loop until you have the full message; define message boundaries in your protocol.
  • Relying on the default connect timeout. A blocking connect waits for whatever the OS’s default TCP timeout is, typically around two minutes. Use non-blocking + poll with a custom timeout if you need a tighter bound.
  • Closing without checking send’s return. If send returned a partial count and you don’t loop, you sent a truncated message. The peer’s parser then gets confused and you spend an hour wondering why.
  • Leaking file descriptors. Every accepted connection is an FD. Forget to close on an error path and you eventually hit ulimit -n (default 1024 on most systems) and accept starts failing.
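
For the connect-timeout mistake, Python offers a shortcut that avoids hand-rolling non-blocking connect plus poll. The sketch below wraps socket.create_connection; the helper name connect_with_timeout and the 3-second bound are illustrative choices, and the demo connects to a local listener so it does not depend on the network.

```python
import socket

def connect_with_timeout(host, port, seconds):
    """Return a connected socket, or raise OSError if connect takes too long."""
    # create_connection applies the timeout to the connect attempt itself.
    sock = socket.create_connection((host, port), timeout=seconds)
    sock.settimeout(None)   # back to blocking mode for normal reads and writes
    return sock

# Demo against a local listener.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
host, port = listener.getsockname()

conn = connect_with_timeout(host, port, seconds=3.0)
connected_to = conn.getpeername()
conn.close()
listener.close()
print(connected_to)
```

Against an unreachable host the same call would raise a timeout error (a subclass of OSError) after roughly the chosen bound instead of the OS default.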

Tools — finding out what your sockets are doing

| Tool | Use for |
|---|---|
| ss -tnp | Live socket state, owning process. The first thing to run. |
| lsof -i | Open network connections, with the file descriptor and process. |
| netstat -an | The classic. ss is faster, but netstat is everywhere. |
| strace -e trace=network | Watch a process’s socket syscalls in real time. Indispensable for "why does this program do that". |
| nc (netcat) | The "swiss army knife of TCP". Open arbitrary connections; useful for testing your server. |
| socat | Like netcat, but supports many more transports, including Unix domain sockets, TLS, and TUN devices. |
| bpftrace -e 'k:tcp_connect { ... }' | Trace specific kernel functions. The networking section of BPF Performance Tools is the right reference. |
