27-March-2018 (edited 14-October-2020)
A Zero-Maths Interactive Introduction to KL-Divergence
KL-(Kullback–Leibler)-divergence is a common measure used in machine learning, neuroscience, statistics and information theory that represents the similarity between two probability distributions. This post provides some interactive tools for getting a feel for how the measure works without having to delve into any of the maths. At the end you see what it's like to use KL-divergence to attempt to match a three component Gaussian mixture model to a hidden distribution (skip to the final tool).
Below is a graph showing a normal distribution (we will call it \(q\)) with adjustable sliders for moving it about. There is also an (initially hidden) target distribution (we will call \(p\)) that you are trying to match. The KL-divergence between your current guess and the target is shown at the top of the graph. Your task is to reduce the KL-divergence as low as possible. A value of 0 indicates that the two distributions exactly match. You can hit reveal to see how close you are, after which you can still play about with the sliders or restart to set a new random target.
You have just performed variational inference! Machine learning algorithms are often used to adjust multiple parameters (just like you did) to reduce the divergence and match different distributions. As you've seen, with the KL-divergence available for feedback, it's quite easy. Of course, these distribution can be much more complicated than the simple one-dimensional Gaussians used above. As long as it's possible to calculate (or even estimate) the KL-divergence, it can be used as a guide.
In the previous example it was possible to get an exact match. But what happens in a more complex setting? In the next example the hidden distribution is more complicated. It may not be possible to match it exactly. The value will also change less predictably as you modify the parameters. Try to explore the space and get the value as low as possible before revealing the target.
The hidden model is now a combination of two Gaussians. It is impossible to match it exactly using only the tools provided. You should find that the best approximations involve covering one of the peaks as much as possible. This is because the most important aspect of \(KL(q,p)\) is minimizing the area of \(q\) that is not overlapping with \(p\). In other words you are punished for each part of your guess that isn't over \(p\). If the two peaks are far apart then the best you can do is cover one as much as possible. If they are close together then you want to make sure you also cover some of the second one.
KL-divergence is not symmetric. In most cases \(KL(p,q) \neq KL(q,p)\). It matters which order you do the calculation. It is most common to use \(KL(q,p)\) as it tends to be easier to compute or estimate (sometimes it is the only one it is possible to compute/estimate). Below you will see what happens if you try to minimize \(KL(p,q)\) instead.
You should find that this way round you end up with one sweet spot in between the two peaks. Intuitively, this is because \(KL(p,q)\) punishes any part of p that is not covered by q. You have to spread your guess out to cover as much of the two peaks as possible.
Finally, the big finale involves matching a 3x Gaussian mixture model to a hidden distribution. You can use both \(KL(p,q)\) or \(KL(q, p)\) to guide you. There are no tricks here, it should be possible to get an exact match where both measures go to 0.
With only a slightly more complicated model it's already getting hard to find an exact fit. However, the same general methodology used in machine learning helps here. By continually making small adjustments that head in the right direction it's possible to eventually get there.
I hope you learned something about KL-divergence from this post. If you want to delve into the maths behind it there are plenty of technical introductions out there (so many I couldn't decide which to link to).
Next in this series:
- An introduction to gradient descent using KL-Divergence.
- Gradient descent on a 6-variable model.
If you got something out of this (or didn't), or have any suggestions please let me know (twitter and email work) so I can work out if it's worth producing more content like this and how to improve them in the future (this took way longer than it probably should have).