KL divergence
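D_KL(P‖Q) = Σ_x p(x) log(p(x)/q(x)). The expected number of extra bits (with a base-2 logarithm) needed to encode samples from P using a code optimized for Q.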
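The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kl_divergence`, the `base` parameter, and the example distributions are assumptions made here, and the handling of zero probabilities follows the usual conventions (a term with p(x) = 0 contributes nothing; q(x) = 0 where p(x) > 0 gives infinity).

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for two discrete distributions given as equal-length sequences.

    Hypothetical helper for illustration; base=2.0 reports the result in bits.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total

p = [0.5, 0.5]  # example distribution, chosen arbitrarily
q = [0.9, 0.1]  # example distribution, chosen arbitrarily

print(kl_divergence(p, q))  # roughly 0.737 bits
print(kl_divergence(q, p))  # roughly 0.531 bits
```

The two printed values differ, which shows that KL divergence is not symmetric: D_KL(P‖Q) and D_KL(Q‖P) are different quantities in general.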
Origin
Introduced by Solomon Kullback and Richard Leibler in "On Information and Sufficiency" (1951). It is fundamental to information theory and variational inference, and it appears in many modern deep learning loss functions.
Where it shows up in production
- Variational autoencoders: the KL divergence between the encoder distribution and the prior is one of the two terms in the loss function, alongside the reconstruction term.
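- Reinforcement learning (PPO): a KL penalty or constraint on policy updates keeps the new policy close to the old one. This closeness is the "proximal" in PPO; the widely used clipped variant pursues the same goal without an explicit KL term.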
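Both uses above reduce to short formulas, sketched below in Python. Several assumptions apply that the text does not specify: the VAE helper assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL term has the closed form 0.5 Σ (μ² + σ² − log σ² − 1); the policy helper is a simple sample estimate of KL(old‖new) from log-probabilities of actions drawn from the old policy. The function names are hypothetical, and results are in nats.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    Assumes a diagonal Gaussian encoder and a standard normal prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def approx_policy_kl(old_log_probs, new_log_probs):
    """Sample estimate of KL(old || new).

    Averages log pi_old(a|s) - log pi_new(a|s) over actions that were
    sampled from the old policy.
    """
    diffs = [o - n for o, n in zip(old_log_probs, new_log_probs)]
    return sum(diffs) / len(diffs)

# An encoder output that already matches the prior incurs no KL cost.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Identical log-probabilities mean the policy has not moved.
print(approx_policy_kl([-1.0, -2.0], [-1.0, -2.0]))  # 0.0
```

In a training loop, the first quantity would be added to the reconstruction loss, and the second would typically be monitored or penalized to limit how far a policy update moves.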