Before we can understand what the Kullback-Leibler (or KL) divergence is, we need to understand what the term entropy means or represents in the fields of information theory and statistics.

Entropy
This term first found its way into a myriad of statistical applications (ignoring the relation to thermodynamics) through a paper published by the legendary Claude Shannon in 1948 called "A Mathematical Theory of Communication"[pdf]. The main problem being targeted there was the communication of bits over a network. With the advent of the digital age in the mid-1900s, it became increasingly crucial to efficiently send bits (or information) from a source to a destination.

We know that a 2-digit binary number can represent 2² pieces of information. If we add another digit to it, we get 2³ ways of representing information. In a nutshell, each additional bit reduces our uncertainty by a factor of 2, or equivalently, doubles the number of messages we can represent. However, in our bit sequence there may be certain combinations which aren't used at all, or some of them may represent an error code of some sort. What if we used a 3-bit uniform sequence to represent only 6 pieces of information? Not that efficient, right? Wouldn't it be better if we devise a code to reduce
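The counting argument above can be sketched in a few lines of Python (the message count of 6 is the example from the text; the variable names are my own):

```python
import math

# n bits of a fixed-length binary code can distinguish 2**n messages.
for n in (2, 3):
    print(f"{n} bits -> {2**n} combinations")

# To distinguish 6 messages, the information content is log2(6) bits.
messages = 6
bits_needed = math.log2(messages)    # ~2.585 bits of information
fixed_code = math.ceil(bits_needed)  # a uniform code must round up to 3 bits
unused = 2**fixed_code - messages    # 2 of the 8 combinations go unused
print(bits_needed, fixed_code, unused)
```

The gap between log2(6) ≈ 2.585 and the 3 bits a uniform code forces us to spend is exactly the inefficiency the paragraph hints at, and it motivates variable-length codes.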