I’ve recently been working through the problems in the book Probability by A. N. Shiryayev. I’ll probably end up posting some answers to interesting exercises here as I go through them. Regrettably, although I did take many math courses at university, I never took one on probability.
There is a related gripe to be had here, which I may explore in a later post, but the gist of it is that when it comes to machine learning research, even chapter 2 of a book on the foundations of probability seems inapplicable in this field. The major technical aspects of machine learning seem to require only the tools of optimization theory and not much else. Nevertheless, given that probability is our best understanding of uncertainty, it's worth studying in some depth, and working through this book is my attempt.
The most basic way to characterize uncertainty in a set of points $x_1, \dots, x_n$ that you may have observed from some system is with a single point $c$. You then quantify this point $c$ by stating its variance against the set of points. There is not just one way to characterize this variance, but a common approach is to define the variance as

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - c)^2.$$

More generally, if $x_1, \dots, x_n$ is a set of points and $p_1, \dots, p_n$ is a density function (i.e. weights such that the sum over the $p_i$ is 1), we write the variance as the following weighted sum

$$V(c) = \sum_{i=1}^{n} p_i (x_i - c)^2.$$
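To make the weighted definition concrete, here is a minimal sketch in Python; the points `x`, the weights `p`, and the candidate point `c` are made-up values for illustration:

```python
import numpy as np

# Hypothetical observed points and density weights (the weights sum to 1).
x = np.array([1.0, 2.0, 4.0, 7.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

def weighted_variance(c, x, p):
    """Weighted sum of squared deviations of the points x from c."""
    return np.sum(p * (x - c) ** 2)

print(weighted_variance(3.0, x, p))  # the spread of the points around c = 3
```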
Now, why is this a good or interesting way to characterize a summary point $c$? Now that we have a way to characterize a special point, we can ask: what is the best such point? And you should see that such a point should minimize the variance $V(c)$ defined above.
Now optimize it by taking the derivative and setting it to zero:

$$\frac{dV}{dc} = -2 \sum_{i=1}^{n} p_i (x_i - c) = 0 \quad \Longrightarrow \quad c = \sum_{i=1}^{n} p_i x_i,$$

where the last step uses the fact that the weights $p_i$ sum to 1.
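As a quick numerical sanity check, here is a sketch (reusing the hypothetical `x` and `p` from above) confirming that the minimizer of $V(c)$ coincides with the weighted mean:

```python
import numpy as np

# Same hypothetical points and weights as before.
x = np.array([1.0, 2.0, 4.0, 7.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

# Evaluate V(c) on a fine grid and locate the minimizer.
cs = np.linspace(x.min(), x.max(), 100001)
V = np.array([np.sum(p * (x - c) ** 2) for c in cs])

print(cs[np.argmin(V)])          # ~4.5, the grid minimizer of V(c)
print(np.average(x, weights=p))  # 4.5, the weighted mean
```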
We see that the special point has a clean form and is nothing more than the weighted mean we all know, and that is the relationship between the mean and the variance. If you try to use a different definition of variance, say by raising the difference to a higher power instead of $2$, you will not end up with as simple a result as this (a quick numerical check below illustrates this). You might consider taking the absolute value of the difference instead, but I will come to this in the next post.
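Here is that check: a sketch under the same hypothetical `x` and `p`, using the fourth power as the alternative, showing that the minimizer drifts away from the weighted mean:

```python
import numpy as np

# Same hypothetical points and weights as before.
x = np.array([1.0, 2.0, 4.0, 7.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

# Grid-minimize the 4th-power analogue of the variance.
cs = np.linspace(x.min(), x.max(), 100001)
loss4 = np.array([np.sum(p * (x - c) ** 4) for c in cs])

print(cs[np.argmin(loss4)])      # ~4.42, the minimizer of the 4th-power loss
print(np.average(x, weights=p))  # 4.5, the weighted mean -- they differ
```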
So, let me write what we have seen in the notation commonly used in probability. The variance we defined is written as $\mathrm{Var}(X) = \mathrm{E}\left[(X - \mathrm{E}X)^2\right]$, and the mean as $\mathrm{E}X$, and we did the following optimization:

$$\mathrm{E}X = \arg\min_{c} \, \mathrm{E}\left[(X - c)^2\right].$$