What is cosine similarity?

Cosine similarity measures the angle between two vectors, ignoring their magnitude. It's the dot product of the two vectors divided by the product of their lengths, ranging from -1 (opposite) to 1 (same direction). For normalised embeddings (unit length) cosine similarity equals dot product, which is faster to compute. Most embedding pipelines normalise to length 1 and use dot product directly.
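As a minimal sketch of the definition above (the three vectors are made up for illustration), cosine similarity can be computed with nothing but the standard library. The example also checks that a plain dot product gives the same number once both vectors are scaled to unit length.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalise(v):
    """Scale v to unit length (L2 normalisation)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical vectors: b points the same way as a but is three times longer,
# c points in exactly the opposite direction.
a = [1.0, 2.0, 2.0]
b = [3.0, 6.0, 6.0]
c = [-1.0, -2.0, -2.0]

same_direction = cosine_similarity(a, b)  # magnitude is ignored
opposite = cosine_similarity(a, c)

# After normalisation, a plain dot product gives the same value as cosine.
ua, ub = normalise(a), normalise(b)
dot_of_units = sum(x * y for x, y in zip(ua, ub))
```

Normalising once at indexing time moves the two square roots out of the query path, which is why pipelines that store unit vectors can use the cheaper dot product.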

What is approximate nearest neighbour (ANN) search?

ANN search trades exact accuracy for speed. An exact nearest-neighbour query over a million 768-dimensional float32 vectors has to scan about 3 GB of data; an HNSW or IVF index answers in milliseconds with 95-99% recall. The standard production stack: FAISS, ScaNN, hnswlib, or a vector database (Pinecone, Weaviate, pgvector, Qdrant) that wraps these libraries or implements the same index types.


Vector Embeddings Simulator: things, embedded.

Each item is a vector. "Similar" = high cosine. The substrate of every modern recommendation system, semantic search, and LLM retrieval.

Query: kitten

Query vector: [ 0.65, 0.25, 0.15, 0.70 ] (animal · baby · soft)

Ranked by cosine similarity

Item	Tags	Cosine
cat	animal · pet · soft	0.992
puppy	animal · baby · loud	0.990
dog	animal · pet · loud	0.978
banana	fruit	0.862
apple	fruit	0.854
lion	animal · wild	0.809
bicycle	vehicle · pedal	0.559
car	vehicle · road	0.359
truck	vehicle · road · heavy	0.354

What you're looking at

Each of the ten items is a short vector — here four hand-set numbers standing in for the hundreds or thousands of dimensions a real model emits. Pick a query and its raw vector appears, followed by every other item ranked by cosine similarity: the angle between the two vectors, ignoring length. The score sits on the right, tinted green when it is very close, plum when it is a near neighbour, grey when it is far, with a bar echoing the same value. Nothing here looks at spelling — only at direction in the space.

Start with kitten. The top of the list is cat, puppy, dog (the other pets and baby animals), while bicycle, car and truck sink to the bottom. The fruits land in between and even edge out lion, a reminder that four hand-set numbers are a crude space. Switch the query to car and the ranking reshuffles: truck and bicycle rise, the animals fall away. What should surprise you about a real model is that items cluster by meaning even though nothing in training labels which words are animals or vehicles; closeness is geometry, and that single idea is what powers semantic search, recommendations, and retrieval.

What is a vector embedding?

Teaching a computer the meaning of "similar".

A vector embedding is a learned dense numerical representation of a piece of content (a word, sentence, image) in a high-dimensional space, where geometric similarity reflects semantic similarity. word2vec (Mikolov et al., 2013) made the concept practical; OpenAI's text-embedding-3, Cohere's embed-v3, and BGE are among today's widely used models. Embeddings power semantic search, recommendations, and the retrieval half of most RAG pipelines.

Suppose you are building search for a help centre. A user types how do I cancel my subscription?. Your existing keyword index is a list of every word in every article and a count of where it appears. It does great when the user's words match the article's words. It collapses when they don't. The article that explains cancellation is titled Stopping recurring billing. The keyword index sees no overlap. The right answer is on page one of your knowledge base, and your search returns nothing useful.

You could try to fix this by hand. Add a synonym list: cancel = stop = end = terminate. Add another for nouns: subscription = membership = recurring billing. Repeat for every concept your users might phrase differently from your authors. Within a week you have a thousand-line synonym table that nobody can audit, that drifts whenever the product copy changes, and that still doesn't catch the next phrasing you didn't anticipate. Synonym lists are a tax on every change to either side.

What you actually want is a function that converts each piece of text — a query, an article, a paragraph, a chat message — into a list of numbers. Sentences that mean the same thing should produce nearly identical lists; unrelated sentences should produce wildly different ones. If that function exists, you can search by similarity rather than by overlap: encode every article once at indexing time, encode the query at search time, and rank by how close the query's numbers are to each article's. No synonym list. No drift. The function does the matching itself.

This is what an embedding is. A neural network — trained on enough text to have absorbed the structure of language — reads your input and emits a fixed-length list of typically 384 to 3072 floating-point numbers. Each number captures one of the latent dimensions the model has learned. Run two passages through the same model and the resulting vectors sit close together in this high-dimensional space if the passages are about the same topic, and far apart if they are not. The simulator above shows the principle on a tiny ten-item corpus: pick a query, watch how cosine similarity ranks the rest. Real systems do exactly this on millions or billions of vectors.

The most-cited demonstration of why this works is the famous word2vec analogy: take the vector for king, subtract the vector for man, add the vector for woman, and the result lands closest to the vector for queen. The model was never told that royalty has a gender axis. It learned the structure from millions of co-occurring sentences. Modern embedding models — OpenAI's text-embedding-3, Cohere's embed-v3, the open-source BGE and E5 families — extend the same idea from single words to whole paragraphs and from English to a hundred-plus languages, but the geometric trick at the bottom is unchanged.
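The analogy arithmetic can be sketched on a toy vocabulary. The two-dimensional vectors below are hand-set and purely hypothetical (axis 0 stands in for "royalty", axis 1 for "gender"); a trained model learns such directions on its own, spread across hundreds of dimensions.

```python
import math

# Hypothetical 2-D vectors: axis 0 is "royalty", axis 1 is "gender".
VECTORS = {
    "king":     [0.9,  0.9],
    "queen":    [0.9, -0.9],
    "man":      [0.1,  0.9],
    "woman":    [0.1, -0.9],
    "prince":   [0.7,  0.8],
    "princess": [0.7, -0.8],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates follows the usual evaluation convention; without it the nearest vector is often one of the inputs.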

How embeddings are produced — tokenise, encode, pool, normalise

Tokenise, encode, pool, normalise.

An embedding is a learned function that maps a piece of text — or an image, or audio — to a fixed-length vector of floats. The vectors are arranged so that similar inputs land near each other and dissimilar inputs land far apart. The trick that makes everything else work is that “similar” is something the model learned, not something we declared. The training objective — contrastive loss on positive/negative pairs, masked-language-model objectives, instruction-tuning on retrieval data — shapes what nearness means in the resulting space.

Modern text embeddings range from 256 to 4096 dimensions. The pipeline is: tokenise (Byte-Pair Encoding, WordPiece, or SentencePiece) → run through a transformer encoder → pool the final layer (mean pooling, CLS-token pooling, or attention-weighted pooling) → L2-normalise. The output is one row of floats. The pooling step is more important than it looks: E5 and Sentence-BERT typically use mean pooling; BGE takes the [CLS] token; OpenAI does not publish how its models pool. Switching pooling strategies on the same backbone can change retrieval accuracy by several MTEB points.
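The last two stages of that pipeline are simple enough to sketch directly. The token vectors and attention mask below are invented stand-ins for an encoder's final-layer output (a real encoder would emit hundreds of dimensions per token); the sketch shows mean pooling that skips padding, CLS pooling for comparison, and the final L2 normalisation.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dims = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dims)]

def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] position) as the sentence vector."""
    return list(token_vectors[0])

def l2_normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical encoder output: four token positions, three dimensions each.
# The last position is padding and must not influence the mean.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
    [9.0, 9.0, 9.0],  # padding
]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
embedding = l2_normalise(pooled)
cls_embedding = l2_normalise(cls_pool(token_vectors))
```

The two pooled vectors differ, which is the reason a model should be queried with the same pooling it was trained with.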

Concrete numbers for orientation. Embedding one paragraph through OpenAI's text-embedding-3-small takes about 50 ms over the public API and costs roughly $0.00002 per 1K tokens, so even at a full 1,000 tokens per passage you get fifty thousand passages to a dollar. The result is 1536 floats; at four bytes each that is six kilobytes per vector. A million paragraphs is six gigabytes of vectors plus an index that fits comfortably in RAM on a 16-GB machine. A billion paragraphs is six terabytes; that's the territory where you stop using a laptop and start using a vector database.
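The storage figures follow from simple arithmetic, sketched here under the assumption of float32 storage (four bytes per value) and decimal units. The helper names are illustrative.

```python
def vector_bytes(dims, bytes_per_value=4):
    """Raw size of one embedding; float32 uses 4 bytes per dimension."""
    return dims * bytes_per_value

def corpus_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw size of n_vectors embeddings, before any index overhead."""
    return n_vectors * vector_bytes(dims, bytes_per_value)

per_vector = vector_bytes(1536)                # 1536 * 4 bytes
million = corpus_bytes(1_000_000, 1536)        # about 6 GB
billion = corpus_bytes(1_000_000_000, 1536)    # about 6 TB

print(per_vector, million / 1e9, billion / 1e12)
```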

Why dimensions matter

Higher dimensions hold more information but cost more storage, compute per query, and memory. Matryoshka Representation Learning (Kusupati et al, NeurIPS 2022) trains models so the first k dimensions are still meaningful — you can truncate from 3072 to 256 with marginal accuracy loss. OpenAI's text-embedding-3 uses this. Pick small for cost, large for accuracy.
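Truncating a Matryoshka-style embedding amounts to keeping the leading dimensions and normalising again. The sketch below uses a made-up four-dimensional vector; it only preserves quality when the model was trained so that its first k dimensions are meaningful, which is an assumption about the model and not something the code can check.

```python
import math

def truncate_embedding(vec, k):
    """Keep the first k dimensions and rescale the result to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Hypothetical unit-length embedding from a Matryoshka-trained model.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
```

The renormalisation matters: without it the truncated vectors no longer have length 1, and dot product stops being equivalent to cosine.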

Origins — from word2vec to text-embedding-3

From word2vec to text-embedding-3.

Representing words as dense real-valued vectors goes back to the 1980s: Hinton's 1986 distributed-representation work came first, and Deerwester, Dumais, Furnas, Landauer, and Harshman's 1990 Latent Semantic Indexing gave low-dimensional projections of word-by-document matrices. The modern story, though, starts in 2013. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean published Efficient Estimation of Word Representations in Vector Space at the ICLR 2013 workshop (the “word2vec” paper), followed, with Ilya Sutskever joining as co-author, by Distributed Representations of Words and Phrases and their Compositionality at NeurIPS 2013. The astonishment was that such simple objectives (CBOW or skip-gram with negative sampling) produced vectors where vec(king) − vec(man) + vec(woman) ≈ vec(queen).

Stanford's GloVe followed a year later (Pennington, Socher, Manning, EMNLP 2014), training on global co-occurrence statistics rather than local windows. Facebook AI Research's fastText (Bojanowski, Grave, Joulin, Mikolov, TACL 2017) added subword n-grams, finally giving useful representations for out-of-vocabulary words and morphologically rich languages. By 2017 every NLP system started by loading a 300-dimensional pre-trained word vector table.

The big shift was contextual embeddings. ELMo (Peters et al., NAACL 2018) used a bidirectional LSTM so the same word got different vectors in different sentences. BERT (Devlin, Chang, Lee, Toutanova, NAACL 2019) replaced the LSTM with a transformer encoder trained on masked-language-model and next-sentence-prediction objectives; its 768-dimensional [CLS] token became the default sentence representation overnight. GPT-3 (Brown et al., NeurIPS 2020) scaled a decoder-only transformer to 175B parameters; its headline result was few-shot learning rather than embeddings, but GPT-family models later served as the starting point for OpenAI's first embedding endpoints.

Sentence-BERT (Reimers and Gurevych, EMNLP 2019) was the practical breakthrough that made BERT-class embeddings usable for retrieval at scale: a Siamese-network fine-tune on natural-language inference data so that cosine similarity in the embedding space approximated semantic similarity. The sentence-transformers library that grew out of this work ships all-MiniLM-L6-v2 (384 dims, roughly 23M parameters), still a common default lightweight embedding for budget-constrained pipelines.

Commercial embedding APIs followed. OpenAI's text-embedding-ada-002 launched in December 2022 at 1536 dimensions and $0.0004 per 1K tokens; text-embedding-3-large followed in January 2024 at 3072 dimensions, $0.00013 per 1K tokens, and Matryoshka-truncatable down to 256 dims. Cohere, Voyage, Jina AI, and Mistral all sell competitive APIs; open-source options (BGE from BAAI, E5 from Microsoft, Nomic Embed) are within a few MTEB points of the proprietary leaders. Measured from ada-002's launch price to the $0.00002 per 1K tokens of text-embedding-3-small, the cost of embedding a token fell about 20× in just over a year.

Similarity metrics — cosine, dot product, Euclidean

Three ways to measure near.

Once vectors are L2-normalised — that is, scaled so each one has length 1 — cosine similarity, dot product, and squared Euclidean distance all give the same ranking. The proof is one line of algebra: for unit vectors a and b, ||a − b||² = 2 − 2(a·b). Minimising Euclidean distance is the same as maximising dot product, which is the same as maximising cosine. Pick whichever metric your index supports natively, and your search runs at full hardware speed.
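The identity is easy to check numerically. In this sketch the query and the three document vectors are invented; after normalisation the squared Euclidean distance equals 2 − 2(a·b) for every pair, so sorting by ascending distance and sorting by descending dot product give the same order.

```python
import math

def normalise(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 2-D vectors, normalised to unit length.
query = normalise([0.9, 0.3])
docs = {
    "a": normalise([1.0, 0.0]),
    "b": normalise([0.6, 0.8]),
    "c": normalise([-1.0, 0.0]),
}

# For unit vectors: ||a - b||^2 = 2 - 2 (a . b)
identity_gaps = [
    abs(squared_euclidean(query, vec) - (2 - 2 * dot(query, vec)))
    for vec in docs.values()
]

rank_by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
rank_by_distance = sorted(docs, key=lambda d: squared_euclidean(query, docs[d]))
```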

Metric	Range	Best for	Notes
Cosine	[−1, 1] (1 = same direction)	Default for text. Magnitude-invariant.	Equal to dot product on normalised vectors.
Dot product	(−∞, ∞)	Recommendation systems with magnitude as signal.	Cheaper than cosine if pre-normalised.
Euclidean (L2)	[0, ∞) (0 = identical)	Image embeddings, raw features.	Same ranking as cosine when normalised.

In high dimensions, distance loses some of the intuition it has in two or three. The curse of dimensionality, in the form analysed by Beyer et al.'s When is “Nearest Neighbor” Meaningful? (ICDT 1999), says that for many distributions in high dimension the ratio of the farthest to the nearest neighbour distance approaches 1; everything is roughly equidistant. Embeddings escape this because they're not random points in a hypersphere; they live on a much lower-dimensional manifold inside the ambient 1536- or 3072-dimensional space, and the metric structure on that manifold is meaningful.
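The concentration effect shows up in a few lines of simulation. This sketch draws uniformly random points (an assumption chosen to show the effect at its worst; it is not a model of real embeddings) and compares the farthest-to-nearest distance ratio in 2 and in 512 dimensions.

```python
import math
import random

def farthest_to_nearest_ratio(dims, n_points=200, seed=0):
    """Ratio of the largest to the smallest distance from a random query
    to n_points random points, all drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dims)]
        distances.append(math.dist(query, point))
    return max(distances) / min(distances)

ratio_low = farthest_to_nearest_ratio(2)
ratio_high = farthest_to_nearest_ratio(512)
print(f"2 dims: {ratio_low:.1f}   512 dims: {ratio_high:.2f}")
```

In two dimensions the nearest point is many times closer than the farthest; in 512 dimensions the two distances are within a small factor of each other.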

A subtlety: many text-embedding APIs, OpenAI's among them, return already-normalised vectors. If yours doesn't, normalise once at ingest time; the index can then run plain dot product (one fused multiply-add per dimension, which AVX-512 hardware such as Sapphire Rapids processes 16 floats at a time).

ANN search — approximate but fast (HNSW, IVF, ScaNN)

Approximate, but fast.

Exact nearest-neighbour search is O(n·d). At one million vectors of 1536 dimensions, that's 1.5 billion floating-point multiplies per query. SIMD plus a tight loop gives you maybe 50 ms per query on a modern CPU; for billion-scale corpora you'd need hundreds of cores per query to hit conversational latency. Approximate Nearest Neighbour (ANN) indexes trade a small accuracy cost for 100×–1000× speed-up by structuring the search so most of the corpus is never compared against the query.
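For reference, the exact O(n·d) baseline is only a few lines. This sketch assumes unit-length vectors (so dot product ranks like cosine) and uses a tiny invented corpus; production code would run the same loop over a contiguous float array with SIMD.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, corpus, k):
    """Score every vector in the corpus and return the k best (score, id) pairs.

    corpus maps a document id to a unit-length vector, so the dot product
    is the cosine similarity.
    """
    scored = ((dot(query, vec), doc_id) for doc_id, vec in corpus.items())
    return heapq.nlargest(k, scored)

# Hypothetical unit vectors.
corpus = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}
top2 = exact_top_k([1.0, 0.0], corpus, k=2)
```

Every ANN index discussed here is an attempt to return nearly the same top-k while touching only a small fraction of the corpus.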

HNSW · Hierarchical Navigable Small World.
Yury Malkov and Dmitry Yashunin's algorithm (TPAMI 2018, arXiv 1603.09320). A multi-layer graph where each node points to a small set of neighbours; queries enter at the top sparse layer and descend greedily. State-of-the-art recall; higher memory cost than IVF, because the graph links and, usually, the full-precision vectors stay in RAM. Used by pgvector, Qdrant, Weaviate, Milvus, Chroma, Lance.
IVF · Inverted File Index.
k-means clusters the corpus into nlist partitions; the query probes the nprobe nearest. Lower memory than HNSW, slightly worse recall, easy to combine with quantisation. The classical FAISS recipe, used by Milvus and Lance and many in-house Meta systems.
ScaNN · Scalable Nearest Neighbours.
Google's 2020 algorithm (Guo et al, ICML 2020) using anisotropic vector quantisation tuned for inner-product search. Powers parts of Google Search and Vertex Matching Engine; widely cited but less commonly self-hosted than HNSW.
PQ / SQ · Product / Scalar Quantisation.
Jégou, Douze, Schmid (TPAMI 2011) introduced product quantisation for large-scale nearest-neighbour search. Compress each 1536×4-byte vector to ~64 bytes by quantising sub-vectors against learned codebooks. 16–96× memory reduction, depending on code size; small recall loss. Almost always combined with HNSW or IVF.
FLAT · Brute force.
No index — compare against every vector. Perfect recall. Fine up to ~100k vectors with SIMD. The right baseline before adding complexity, and the right answer when accuracy matters more than latency.
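Of the structures in this list, IVF is the easiest to sketch. The example below is a toy: the two centroids are fixed by hand (a real index would learn nlist centroids with k-means), the vectors are invented, and distances are plain Euclidean. It shows how nprobe controls which part of the corpus is scanned at all.

```python
import math

def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    candidates = [vec_id for i in order[:nprobe] for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidates[:k]

# Hypothetical data: two well-separated clusters and hand-picked centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {
    "a": [0.0, 1.0],
    "b": [1.0, 0.0],
    "c": [9.0, 10.0],
    "d": [10.0, 9.0],
}
lists = build_ivf(vectors, centroids)

query = [0.2, 0.1]
narrow = ivf_search(query, vectors, centroids, lists, k=3, nprobe=1)
wide = ivf_search(query, vectors, centroids, lists, k=3, nprobe=2)
```

With nprobe=1 the search never looks at the second partition, so it returns only two results even though three were requested; raising nprobe trades speed for recall.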

FAISS — Facebook AI Similarity Search (Johnson, Douze, Jégou, IEEE Trans. Big Data 2021) — is the reference open-source library, supporting all of the above plus GPU-accelerated variants. NMSLIB (Boytsov & Naidan) was the original HNSW implementation. The vector-database market that grew up around these algorithms includes Pinecone (founded 2019), Weaviate (2019), Milvus (2019), Qdrant (2021), Chroma (2022), pgvector (2021), Vespa (open-sourced from Yahoo in 2017), and Lance (2023). Each picks a different point on the recall-vs-latency-vs-cost frontier.

HNSW's memory bill deserves attention. In the reference implementation the graph stores up to 2M neighbours per node at layer 0 and M at the much sparser higher layers; with M = 16 and 4-byte neighbour ids, the index adds roughly 130 bytes of links per vector on top of the 6 KB of 1536-dim float32 vector data. That is only about 2% overhead, but the links and the vectors they point to must stay in RAM for fast traversal. A common rule of thumb, used for example in Qdrant's capacity-planning guidance, is to size the host with 1.5–2× the raw vector data in RAM. With pgvector, Postgres' shared buffer cache should comfortably hold the hot graph, and the OS page cache will hold the rest.

A pragmatic stack

Under 100k vectors: FLAT in your existing database (pgvector with no index). 100k–10M: HNSW + scalar quantisation in pgvector or Qdrant. 10M–1B: IVF + PQ, or a quantised graph index, in a dedicated vector store (Milvus, Vespa, Lance). Above 1B: sharding, on-disk indexes (DiskANN, SPANN), and dedicated infrastructure.

-- pgvector — Postgres extension for ANN
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
  id        bigserial PRIMARY KEY,
  doc_id    bigint NOT NULL,
  chunk     text   NOT NULL,
  embedding vector(1536) NOT NULL  -- text-embedding-3-small
);

-- HNSW index, cosine distance.
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query: ten nearest chunks to a fresh embedding.
SELECT id, doc_id, chunk
FROM   doc_chunks
ORDER  BY embedding <=> $1   -- <=> is cosine distance
LIMIT  10;

Embeddings + RAG — the thing every LLM app actually does

The thing every LLM app actually does.

Retrieval-Augmented Generation — RAG — is the architectural pattern that connects vector search to large language models. The term was coined by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS 2020). The idea: rather than relying on a model's parametric memory, embed a corpus of source documents into a vector store, embed the query at inference time, retrieve the top-k nearest chunks, and condition the generation on them.

Pure dense retrieval misses exact-match queries. Search for “S3 bucket-name-with-dashes-2024” and the embedding of that string is a vague soup of “S3” and “bucket”. A traditional keyword index using BM25 (Robertson and Walker, SIGIR 1994) returns the right document instantly. Pure BM25 misses semantic relevance: “how to revoke a token” doesn't lexically match documents titled “OAuth refresh rotation”. Production RAG systems run both and fuse the results.

Strategy	How	When it wins
RRF · Reciprocal Rank Fusion	Score = Σ 1/(k + rank). k=60 typical.	Default. Score-free; works without calibration.
Convex sum	Score = α·dense + (1−α)·BM25	When you can calibrate scores. α ≈ 0.6 typical.
Re-rank	Top-100 from each → cross-encoder rescoring	Highest quality. 10–50× cost; budget for it.
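Reciprocal Rank Fusion from the first row of the table fits in a few lines. This is a minimal sketch with invented document ids; it takes any number of ranked lists (best first) and needs no score calibration because only ranks are used.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from a dense retriever and a BM25 index.
dense_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents found by both retrievers (d1 and d3 here) rise above documents found by only one, which is the behaviour that makes RRF a reasonable default.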

Anthropic's Contextual Retrieval (engineering post, September 2024) reported that combining dense retrieval, BM25, and re-ranking, together with the additional trick of prepending an LLM-generated context summary to each chunk before embedding, reduces retrieval failures by roughly 67% compared with dense retrieval alone. The cross-encoder, typically a fine-tuned BERT or T5 that scores each query-document pair jointly, produces a more accurate relevance score than either dense or BM25 in isolation. The catch is cost: every candidate needs its own forward pass together with the query, and nothing can be precomputed at index time, so you only run it on the top 50–100 candidates from the cheaper retrieval stages.

The retrieve-then-generate recipe of Lewis et al. has been refined into many variants. Fusion-in-Decoder (Izacard and Grave, EACL 2021) feeds retrieved passages independently to a T5 encoder and lets the decoder cross-attend across them. Atlas (Izacard et al., JMLR 2023) jointly trains the retriever and the reader. REALM (Guu et al., ICML 2020) embeds the retriever directly in the language-model pretraining loop. Production systems mostly stick to the simpler retrieve-then-prompt pattern with off-the-shelf embeddings, because it composes cleanly with whatever LLM you deploy and doesn't require joint training.

What breaks in real corpora — drift, multilingual, domain

What breaks in real corpora.

Five issues account for most production embedding incidents. Naming them up front is cheaper than rediscovering them through a postmortem.

CHUNKING IS HALF THE WORK
Embeddings work best on coherent chunks of about 200–500 tokens. Too small and you lose context; too big and pooling dilutes the signal. Sliding-window with 50-token overlap is the safe default. Recursive splitting on semantic boundaries (LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter) wins for technical docs. Late chunking — embed the whole document with a long-context model, then segment the resulting token-level vectors — is the 2024 alternative.
EMBEDDING MODEL DRIFT
Switching from text-embedding-ada-002 to text-embedding-3-small means your vectors live in a different geometry. There is no projection that preserves rankings across models; you must re-embed the entire corpus. Plan for full reindex when changing the model. Budget the API spend; for a 100M-chunk corpus on a 1536-dim model, the bill is real.
METADATA FILTERING
“Find docs about X, but only from team Y” requires either post-filtering the ANN results or applying the filter before or during the vector search. Post-filtering is simple, but when the filter is selective it can leave too few results, so you may need to oversample (fetch the top 1000, then filter down to the top 10). Pre-filtering always returns enough matches, but on a graph index a very selective filter can hurt recall or force a fallback to a brute-force scan over the matching rows. Qdrant, Weaviate, and Milvus support payload filters natively; pgvector handles this through standard SQL WHERE clauses combined with the vector operator.
EVALUATION IS NON-OPTIONAL
Build a labelled query/doc test set early. Track recall@k and Mean Reciprocal Rank (MRR) on every change. Hugging Face's MTEB (Muennighoff et al, 2022) and the BEIR benchmark (Thakur et al, NeurIPS 2021) are good starting points; a small in-house set tuned to your domain is essential. Recall@10 of 0.85 is a usable baseline for most retrieval pipelines.
RECALL VS LATENCY TRADEOFFS
HNSW exposes M (graph degree), ef_construction (candidate-list size while building), and ef_search (beam width at query time). Higher values give better recall at higher cost. IVF exposes nlist (number of partitions) and nprobe (partitions scanned per query). Tune by sweeping the parameter space against your eval set; the right point depends on whether your latency budget is 5 ms or 50 ms.
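The two metrics named under evaluation are short enough to keep in the test suite itself. This sketch assumes each test query comes with a ranked list of retrieved ids and a set of ids labelled relevant; the function names and the sample data are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1 / rank of the first relevant result over all queries.

    runs is a list of (retrieved, relevant) pairs; a query with no relevant
    result in its list contributes 0.
    """
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Hypothetical labelled test set: three queries.
runs = [
    (["a", "b", "c"], {"b"}),        # first relevant hit at rank 2
    (["x", "y", "z"], {"x", "z"}),   # first relevant hit at rank 1
    (["p", "q"], {"r"}),             # relevant document never retrieved
]

mrr = mean_reciprocal_rank(runs)
recall = recall_at_k(["x", "y", "z"], {"x", "z"}, k=2)
```

Running both numbers on every index or model change turns a parameter sweep into a comparison of two columns, with no need to eyeball results.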

Performance — what the curves actually look like

What the curves actually look like.

Concrete numbers calibrate intuition. ANN-Benchmarks (Aumüller, Bernhardsson, Faithfull, Information Systems 2020), maintained at ann-benchmarks.com, runs the major ANN libraries against standard datasets such as SIFT-1M (1 million 128-dim image descriptors), GloVe-100 (1.2M 100-dim word vectors), and GloVe-200. Billion-scale sets such as Deep1B (1B 96-dim image embeddings) and the Microsoft SPACEV variants are the territory of the companion Big-ANN-Benchmarks effort.

Workload	Index	Recall@10	QPS
SIFT-1M, single core	FLAT	1.000	~120
SIFT-1M, single core	HNSW (M=16)	0.99	~12000
SIFT-1M, single core	IVF-PQ (16x)	0.93	~25000
Deep1B, GPU FAISS	IVF-PQ + GPU	0.85	~80000
Production OpenAI 1536-dim, 10M docs	pgvector HNSW	0.97	~2000

Storage cost matters at scale. A 1536-dimensional float32 vector is 6 KB; a million such vectors is 6 GB; a billion is 6 TB, before any index overhead. That overhead depends on dimension: HNSW links take a fixed number of bytes per vector, so they add a few percent at 1536 dimensions and proportionally far more for short vectors, while IVF adds only centroids and list ids. Scalar quantisation to int8 cuts storage 4× with negligible recall loss; product quantisation to 64-byte codes cuts it 96× with a small recall hit. The trade is recall against memory cost, and memory cost against latency. A million 1536-dim vectors fits comfortably in RAM on a laptop; a billion needs either disk-resident indexing (DiskANN, Microsoft, NeurIPS 2019) or sharding across nodes.
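Scalar quantisation to int8 can be sketched as follows. This version uses one symmetric scale per vector, which is only one of several possible schemes (libraries may instead learn a range per dimension or per dataset); the input vector is invented.

```python
def quantise_int8(vec):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    largest = max(abs(x) for x in vec)
    scale = largest / 127.0 if largest > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale

def dequantise(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [code * scale for code in codes]

# Hypothetical embedding fragment.
vec = [0.5, -0.25, 0.125, 0.0]
codes, scale = quantise_int8(vec)
restored = dequantise(codes, scale)

float32_bytes = 4 * len(vec)   # 4 bytes per dimension
int8_bytes = 1 * len(codes)    # 1 byte per dimension (plus one scale per vector)
max_error = max(abs(x - y) for x, y in zip(vec, restored))
```

The reconstruction error per dimension is at most half a quantisation step, which is small relative to the differences between neighbouring vectors and is why recall usually barely moves.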

Recent algorithmic progress is visible in the leaderboards. RaBitQ (Gao & Long, SIGMOD 2024) comes with a theoretical error bound that product quantisation lacks, and reports better accuracy at the same memory budget. Microsoft's SPANN (NeurIPS 2021) handles billion-scale on a single machine by keeping a memory-resident index of cluster centroids and on-disk posting lists. The frontier moves; the underlying primitives (graphs, partitions, quantisation) remain.

When NOT to use embeddings — exact match, structured search

The retrieval problems vectors don't solve.

Vector search is the right tool for “find documents semantically similar to this query”. It is the wrong tool for several adjacent problems that look superficially similar.

Exact-match retrieval — finding all documents containing a specific phrase, code identifier, error message, or product SKU — belongs to inverted indexes. Lucene (1999), Elasticsearch (2010), OpenSearch, Tantivy, and Postgres GIN indexes all do this faster and more accurately than dense retrieval ever will. The sweet spot for vectors is conceptual matching, not literal matching.

Structured queries — “customers in Germany who purchased between March and June” — belong to relational databases. Embedding the query and ranking by cosine returns plausible-looking but mostly wrong results, because the embedding has no notion of the structural constraints. Use SQL; use vectors only on the unstructured fields.

Numerical similarity — finding rows with similar values in a few specific columns — belongs to standard nearest-neighbour search on those columns directly, not on a learned embedding of them. Embedding models trained on natural language are bad at numbers; they confuse 7 with 8 more often than you'd expect.

Reasoning queries such as “which document explains how to fix this stack trace?” benefit from retrieval but are rarely solved by retrieval alone. The pattern that works is retrieve-then-reason: a fast vector retrieval pulls 20–100 candidate chunks, and a downstream LLM call synthesises the answer. The vector store's job is to narrow the search space, not to answer the question.

Finally, small corpora don't need vector indexes at all. Below about 10,000 documents, a brute-force scan with SIMD comparisons is faster than any ANN structure once you account for build time, memory overhead, and quantisation error. The tipping point varies with vector dimension and CPU, and can sit an order of magnitude higher, but the rule holds: don't build infrastructure you won't use. Many production RAG systems would be simpler, faster, and more accurate if they replaced their vector store with a single in-memory float32 array and a tight loop.

The pattern that has emerged across 2023–2025 production deployments is: keyword index for exact match, vector index for semantic match, structured database for filters, and an LLM call to compose the answer. The vector index is one component, not the answer to the whole problem. Treating it as the whole answer, by building a sophisticated HNSW pipeline before validating that vectors are even the right primitive, is the most common architectural mistake in early-stage RAG projects.