ML & Math (also: Kullback-Leibler)

KL divergence

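D_KL(P‖Q) = Σ_x P(x) log(P(x)/Q(x)). The expected number of extra bits (with log base 2) needed to encode samples from P using a code optimized for Q instead of P.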
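As a rough illustration of the formula, here is a minimal Python sketch for discrete distributions given as lists of probabilities. The function name `kl_divergence`, the `base` parameter, and the example distributions are assumptions made for this sketch, not part of the original entry.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for two discrete distributions of equal length.

    With base=2 the result is in bits; with base=math.e it is in nats.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) contributes nothing
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total

# Illustrative distributions (made up for this example).
p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # roughly 0.2075 bits
```

Encoding a fair coin with a code built for a 25/75 coin costs about 0.21 extra bits per symbol on average.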


In plain terms

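Asymmetric: in general D_KL(P‖Q) ≠ D_KL(Q‖P), so it is not a true distance. Cross-entropy = entropy + KL, i.e. H(P, Q) = H(P) + D_KL(P‖Q). Since H(P) does not depend on Q, minimizing cross-entropy loss over Q also minimizes the KL divergence.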
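Both claims can be checked numerically. The sketch below uses hypothetical helper names (`entropy`, `cross_entropy`, `kl`) and made-up distributions, and assumes Q is nonzero wherever P is nonzero.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions (made up for this example).
p = [0.5, 0.5]
q = [0.25, 0.75]

forward = kl(p, q)  # D_KL(P || Q)
reverse = kl(q, p)  # D_KL(Q || P)
print(forward, reverse)  # the two directions give different values

# Cross-entropy decomposes into entropy plus KL (up to float rounding).
gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))
print(abs(gap) < 1e-12)
```

For one-hot labels H(P) is zero, so the cross-entropy loss and the KL divergence coincide exactly.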

Origin

Solomon Kullback and Richard Leibler, "On Information and Sufficiency," 1951. Fundamental to information theory, variational inference, and modern deep learning loss functions.

Where it shows up in production
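  • Variational autoencoders: the KL divergence between the encoder's distribution and the prior is one of the two terms in the loss, alongside the reconstruction term.
  • Reinforcement learning (PPO): limiting the KL divergence between the new and old policy keeps each update small. This is the "proximal" in PPO; in practice it is enforced with a KL penalty or approximated by a clipped objective.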
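The two uses above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular library implements them: the VAE part assumes a diagonal Gaussian encoder and a standard normal prior (a common choice, which gives a closed form), the PPO part assumes a discrete action space, and the function names and sample numbers are hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    """
    return sum(
        0.5 * (m * m + math.exp(lv) - 1.0 - lv)
        for m, lv in zip(mu, log_var)
    )

def mean_policy_kl(old_probs, new_probs):
    """Average KL(old || new) in nats over a batch of states.

    Each element is a list of action probabilities for one state;
    assumes the new policy is nonzero wherever the old one is.
    """
    per_state = [
        sum(o * math.log(o / n) for o, n in zip(old, new) if o > 0)
        for old, new in zip(old_probs, new_probs)
    ]
    return sum(per_state) / len(per_state)

# VAE: an encoder output that matches the prior incurs no KL cost.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting the mean away from the prior is penalized.
print(gaussian_kl_to_standard_normal([1.0], [0.0]))            # 0.5

# PPO-style check: how far did the policy move on two sampled states?
old = [[0.5, 0.5], [0.9, 0.1]]
new = [[0.6, 0.4], [0.8, 0.2]]
print(mean_policy_kl(old, new))  # small positive number
```

In a training loop, the first quantity would be added to the reconstruction loss, and the second would be monitored or penalized to keep policy updates small.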