Suppose you invent a function $f$ that operates on some domain $D$. An incredibly fruitful exercise in mathematics is to try and see if $f$ distributes in some way over some operator in the domain. Consider these familiar examples:
- $a(b + c) = ab + ac$; multiplication distributes over addition
- $P(A \cap B) = P(A)\,P(B)$; probability distributes over the intersection of events when $A$ and $B$ are independent
- $g(ab)g^{-1} = (gag^{-1})(gbg^{-1})$; conjugation by $g$ in a group distributes over the product
Sometimes, you are only fortunate enough to get an inequality:

- $\|x + y\| \le \|x\| + \|y\|$, the triangle inequality, which says that the straight-line distance between two points is always the shortest. Here the distance function is distributed over addition.
- $H(X, Y) \le H(X) + H(Y)$, the joint entropy of $X$ and $Y$ is never larger than the sum of their individual entropies. A variation of the above.
- $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$, Jensen's inequality, where $f$ is a convex function and $\mathbb{E}[X]$ is the expectation of a random variable $X$. This is incredibly useful when it comes to maximizing the probability of the data in machine learning, as sketched just below.
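To make that last point concrete, here is one standard illustration (a sketch, not from the original post; the model $p$ and the distribution $q$ are notation I am introducing): for a latent-variable model $p(x) = \sum_z p(x, z)$ and any distribution $q(z)$, the concavity of $\log$ flips Jensen's inequality and yields a tractable lower bound on the log-likelihood,

$$\log p(x) = \log \sum_z q(z)\,\frac{p(x, z)}{q(z)} \ge \sum_z q(z)\,\log \frac{p(x, z)}{q(z)},$$

which is exactly the bound maximized in EM and variational inference.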
Why all this motivation? Well, there is one inequality that everyone would have seen many times, and one that I have constantly failed to remember or to work out why it holds. And that is the Cauchy-Schwarz inequality. In the case of probability (the Cauchy-Bunyakovskii inequality) it is the following:

$$(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$$
Now, what’s this all about? I motivated this post by saying that it’s often useful to investigate if some interesting function (in this case $\mathbb{E}$) distributes over other operators. See, we already know that $\mathbb{E}$ is a linear operator, and hence we get $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.
But can we get something if we multiply expectations? Let’s try. Consider squaring an expectation: $(\mathbb{E}[X])^2$. When you expand this out you will see that it is an expectation over the cartesian product of the underlying distribution. This means we can’t relate it to $\mathbb{E}[X^2]$.
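Concretely (writing this out for a discrete random variable $X$ with mass function $p$, notation of my choosing):

$$(\mathbb{E}[X])^2 = \Big(\sum_x x\,p(x)\Big)\Big(\sum_{x'} x'\,p(x')\Big) = \sum_x \sum_{x'} x\,x'\,p(x)\,p(x'),$$

a sum over pairs $(x, x')$ weighted by the product distribution $p(x)\,p(x')$, whereas $\mathbb{E}[X^2] = \sum_x x^2\,p(x)$ sums over single outcomes only.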
So, what will work? How about this: $\mathbb{E}[X \cdot X]$? This is equal to $\mathbb{E}[X^2]$. Though trivial, we seem to be getting somewhere if we look at the expectation of the square of the random variable. Now the general case:

$$(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$$
Unfortunately, the proof in the book is not very intuitive/revealing, and I may come back to this later when I find one.
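In the meantime, here is one standard argument (a sketch, possibly different from the book's): for any real $t$,

$$0 \le \mathbb{E}\big[(X - tY)^2\big] = \mathbb{E}[X^2] - 2t\,\mathbb{E}[XY] + t^2\,\mathbb{E}[Y^2].$$

A quadratic in $t$ that is never negative must have a non-positive discriminant, so

$$4\,(\mathbb{E}[XY])^2 - 4\,\mathbb{E}[X^2]\,\mathbb{E}[Y^2] \le 0,$$

which is exactly $(\mathbb{E}[XY])^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$.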