Here is another view, which also explains where the logarithm comes from.
The KL between two distributions of a random variable, say Kl[p|q], says that if you made a perfect compression algorithm for samples from distribution q, how many extra bits/nats you expect to need to code samples that actually come from p instead if you use that compression algorithm.
And compression is all about keeping only the true information that is encoded in a sample.
The KL between two distributions of a random variable, say Kl[p|q], says that if you made a perfect compression algorithm for samples from distribution q, how many extra bits/nats you expect to need to code samples that actually come from p instead if you use that compression algorithm.
And compression is all about keeping only the true information that is encoded in a sample.