Forgive my peasant mind, I've heard about K-L divergence before; I just don't know what it's used for or what's special about it compared to other metrics?
I think people who love math expressed in nomenclature will never understand those who don't ...
K-L divergence is pretty much the really dumb, obvious thing you would probably think up if someone asked you to compare two probability distributions.
You would start by going "oh I suppose I will look at how far apart they are at all the different values" and add it up. Then you would say, "oh, but 0.1 and 0.01 are really an order of magnitude apart, just like 0.01 and 0.001. Perhaps it will work better if I use the log of the probability". Then you would pause for thought and say, "Hmmmm hang on some of the values are really extreme but almost never happen, shouldn't it be weighted by how frequently they occur?".
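In code, that chain of thought ends up as just a few lines. A toy sketch (the distributions p and q here are made up purely for illustration):

```python
import math

# Two made-up discrete distributions over the same four outcomes.
p = [0.5, 0.3, 0.1, 0.1]   # the "real" distribution
q = [0.4, 0.4, 0.1, 0.1]   # the approximation being compared to it

# "How far apart are they at each value?" -> use log probabilities, so that
# 0.1 vs 0.01 counts the same as 0.01 vs 0.001.
# "Shouldn't it be weighted by how often each value occurs?" -> weight each
# difference by p.
kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
print(kl)  # a small positive number; exactly zero only when p == q
```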
But of course paragraphs of mathematical symbols are the way many people prefer to express this.
> K-L divergence is pretty much the really dumb, obvious thing you would probably think up if someone asked you to compare two probability distributions.
I don't think that if someone were simply asked to "compare two probability distributions" they'd come up with something asymmetric like the KL divergence. You'd need to at least add that one of the distributions is the "real" distribution.
I agree; however, the next step would be to symmetrize the KL divergence, because otherwise we get a result that depends on the order in which we take the two distributions.
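To make the asymmetry (and one easy way to symmetrize it) concrete, here's a toy example with made-up numbers:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

print(kl(p, q))  # roughly 0.34 ...
print(kl(q, p))  # ... roughly 0.38: the order matters.

# One simple way to symmetrize: average the two directions.
# (The Jensen-Shannon divergence is another common, better-behaved choice.)
print(0.5 * (kl(p, q) + kl(q, p)))
```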
Practically speaking, it's a simple measure of how similar two probability distributions are, minimised (with value zero) when they are the same. So it's often used as a loss term in optimisations when you want two distributions to be pushed towards being similar. Sometimes this is motivated by clever reasoning about information/probability ... but often it's more just "slap a KL on it", because it tends to work.
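For instance, a bare-bones sketch of using KL as a loss (pure Python, crude numerical gradients, made-up target distribution; real code would use an autodiff framework):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

p = [0.6, 0.3, 0.1]        # target distribution we want to match
logits = [0.0, 0.0, 0.0]   # parameters of the distribution being trained

lr, eps = 0.5, 1e-5
for step in range(200):
    # crude forward-difference gradient of the KL "loss" w.r.t. each logit
    grad = []
    base = kl(p, softmax(logits))
    for i in range(len(logits)):
        bumped = logits[:]
        bumped[i] += eps
        grad.append((kl(p, softmax(bumped)) - base) / eps)
    logits = [z - lr * g for z, g in zip(logits, grad)]

print(softmax(logits))     # close to p: the KL term pulled the model onto it
```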