Here is another view, which also explains where the logarithm comes from.

The KL divergence between two distributions of a random variable, say KL[p|q], answers the following question: if you built a perfect compression algorithm for samples from distribution q, how many extra bits (or nats) should you expect to need, on average, when you use that algorithm to code samples that actually come from p instead?

And compression is all about keeping only the true information that is encoded in a sample.
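
To make the coding picture concrete, here is a small sketch in Python. It assumes discrete distributions given as lists of probabilities, and it uses the standard fact that an ideal code for q spends -log2 q(x) bits on outcome x, which is where the logarithm enters. The function names and the example distributions are made up for illustration.

```python
import math

def expected_code_length(p, q):
    """Average bits per sample when samples come from p
    but the code is ideal for q (each x costs -log2 q(x) bits)."""
    return sum(px * -math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_bits(p, q):
    """Extra bits per sample paid for using q's code on samples from p."""
    return expected_code_length(p, q) - expected_code_length(p, p)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(expected_code_length(p, p))  # cost with the code matched to p
print(expected_code_length(p, q))  # cost with the code built for q
print(kl_bits(p, q))               # the difference is KL[p|q] in bits
```

Swapping `math.log2` for `math.log` gives the same quantity in nats. Using the code matched to p itself costs nothing extra, so `kl_bits(p, p)` is zero.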