Dirichlet priors (on their own, as mixtures, or arranged in a hierarchy) are by no means the only option for controlling the sparsity of topic mixtures. Entropic priors stand out as an interesting alternative. Given a probability distribution $\theta = (\theta_1, \ldots, \theta_K)$, the entropic prior is defined as

$$ P(\theta) \;\propto\; e^{-\alpha H(\theta)}, \qquad H(\theta) = -\sum_{i} \theta_i \log \theta_i. $$

Positive values of $\alpha$ favor lower entropies (sparser, more deterministic distributions), while negative values favor higher entropies.
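To make the definition concrete, here is a minimal NumPy sketch of the (unnormalized) log-prior; the helper names and the two example mixtures are my own, not from [1]:

```python
import numpy as np

def entropy(theta):
    """Shannon entropy H(theta) = -sum_i theta_i log theta_i (0 log 0 := 0)."""
    theta = np.asarray(theta, dtype=float)
    nz = theta > 0
    return -np.sum(theta[nz] * np.log(theta[nz]))

def entropic_log_prior(theta, alpha):
    """Unnormalized log of the entropic prior, log P(theta) = -alpha * H(theta) + const."""
    return -alpha * entropy(theta)

sparse = [0.90, 0.05, 0.03, 0.02]   # low-entropy mixture
flat   = [0.25, 0.25, 0.25, 0.25]   # high-entropy mixture

alpha = 1.0   # positive alpha favors low entropy
print(entropic_log_prior(sparse, alpha) > entropic_log_prior(flat, alpha))    # True
print(entropic_log_prior(sparse, -alpha) > entropic_log_prior(flat, -alpha))  # False
```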
The paper “Sparse Overcomplete Latent Variable Decomposition of Counts Data” [1] applies entropic priors to count data such as words in documents or colors in images. Its main claim is that entropic priors provide increasingly finer, yet non-trivial, refinements of the data as the number of components is increased. This is in contrast to the diminishing returns one sees when the number of topics is increased in LDA or clustering models.
The paper on “Pattern Discovery via Entropy Minimization” [2] offers an in-depth analysis of the consequences of using entropic priors. Two points are claimed:
- An entropic prior
  - “is a bias for compact models having less ambiguity, more determinism, and therefore more structure.”
  - “is invariant to reparametrizations of the model, because the entropy is defined in terms of the model’s joint and/or factored distributions.”
The first point is illustrated with experiments in [1] and [2]. The second point is an interesting property. A reparametrization of a model is a change of variables given by a bijective function $\phi$. Take for example the reparametrization

$$ \phi(\theta_1, \ldots, \theta_K) \;=\; (\theta_K, \ldots, \theta_1) $$

that creates a bijection by reversing the co-ordinates. We can clearly see that

$$ H(\phi(\theta)) \;=\; H(\theta), $$

but

$$ \phi(\theta) \;\neq\; \theta \quad \text{in general}. $$

The entropic prior depends on $\theta$ only through its entropy, so it assigns the same density before and after the reparametrization even though the parameter vector itself changes.
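A quick numeric check of the example above, in the same spirit as the earlier sketch (the specific vector is an arbitrary choice of mine):

```python
import numpy as np

theta = np.array([0.7, 0.2, 0.1])
phi_theta = theta[::-1]               # the coordinate-reversing reparametrization

H = lambda p: -np.sum(p * np.log(p))  # entropy (all entries positive here)

print(np.allclose(H(theta), H(phi_theta)))  # True: entropy, hence the entropic prior, is unchanged
print(np.array_equal(theta, phi_theta))     # False: the parameter vector itself has changed
```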
Inference
I mainly want to look at inference for models with entropic priors, and [1] provides the setup. Consider that we are working with $D$ documents, $K$ topics, $\theta_d = (\theta_{d1}, \ldots, \theta_{dK})$ as the document-level mixture distributions, and $\beta_z = (\beta_{z1}, \ldots, \beta_{zV})$ as the word distributions over a vocabulary of $V$ words, with entropic priors (sharing the same $\alpha$ for simplicity) on both. The paper performs an EM derivation, so we'll start with the expected complete log-likelihood together with the log of the entropic priors, where $n_{dw}$ is the count of word $w$ in document $d$ and $P(z \mid d, w)$ is the topic posterior supplied by the E-step:

$$ Q \;=\; \sum_{d,w,z} n_{dw}\, P(z \mid d, w)\, \big[\log \theta_{dz} + \log \beta_{zw}\big] \;+\; \alpha \sum_{d,z} \theta_{dz} \log \theta_{dz} \;+\; \alpha \sum_{z,w} \beta_{zw} \log \beta_{zw}. $$
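As a reference point for the derivations below, a minimal NumPy sketch of this objective; the array names and shapes are my own assumptions, and zero probabilities would need masking in practice:

```python
import numpy as np

def objective(theta, beta, counts, posterior, alpha):
    """Expected complete log-likelihood plus the log entropic priors (up to constants).

    theta:     (D, K) document-topic mixtures
    beta:      (K, V) topic-word distributions
    counts:    (D, V) word counts n_dw
    posterior: (D, K, V) E-step responsibilities P(z | d, w)
    """
    log_params = np.log(theta)[:, :, None] + np.log(beta)[None, :, :]  # log theta_dz + log beta_zw
    expected_ll = np.einsum('dw,dzw,dzw->', counts, posterior, log_params)
    log_prior = alpha * np.sum(theta * np.log(theta)) + alpha * np.sum(beta * np.log(beta))
    return expected_ll + log_prior
```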
The E-step is given by

$$ P(z \mid d, w) \;=\; \frac{\theta_{dz}\, \beta_{zw}}{\sum_{z'} \theta_{dz'}\, \beta_{z'w}}. $$
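A minimal NumPy sketch of this E-step (array names are mine; shapes match the sketch above). It also accumulates the expected counts $\omega_{dz}$ that the M-step below will need:

```python
import numpy as np

def e_step(theta, beta):
    """Posterior over topics, P(z | d, w), for every (document, word) pair.

    theta: (D, K) document-level topic mixtures, rows sum to 1
    beta:  (K, V) per-topic word distributions, rows sum to 1
    returns: (D, K, V) array with posterior[d, z, w] = P(z | d, w)
    """
    joint = theta[:, :, None] * beta[None, :, :]      # theta_dz * beta_zw
    return joint / joint.sum(axis=1, keepdims=True)   # normalize over topics z

def expected_counts(posterior, counts):
    """omega[d, z] = sum_w n_dw * P(z | d, w), the quantity the M-step needs."""
    return np.einsum('dw,dzw->dz', counts, posterior)
```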
For the M-step we take the derivative of $Q$ w.r.t. $\beta_{zw}$ and $\theta_{dz}$ (only the latter is shown below, as the other is very similar) under the constraints $\sum_z \theta_{dz} = 1$ and $\sum_w \beta_{zw} = 1$ (Lagrange multipliers $\lambda_d$ and $\mu_z$ below). Writing $\omega_{dz} = \sum_w n_{dw}\, P(z \mid d, w)$ for the expected counts, setting the derivative of the Lagrangian to zero gives, for every $d$ and $z$,

$$ \frac{\omega_{dz}}{\theta_{dz}} \;+\; \alpha \log \theta_{dz} \;+\; \alpha \;+\; \lambda_d \;=\; 0. $$

Normally, at this point we'd be able to solve this set of equations by writing $\theta_{dz}$ in terms of $\omega_{dz}$ and $\lambda_d$ and then eliminating $\lambda_d$ with the normalization constraint, but the $\alpha \log \theta_{dz}$ term is problematic. The involvement of $\theta_{dz}$ both inside and outside the logarithm makes these into a set of transcendental equations.
Fixed point equations
The solution, as pointed to in the paper, is to employ the Lambert $W$ function and solve a set of fixed-point equations. The derivation is explained in [2] and I'll follow it here with a bit of background. Why do we want this? Well, in situations such as this, where we need to find an $x$ such that $f(x) = 0$, one way to solve the equation is to use fixed-point iteration as follows (a small worked example appears after the list):
- Convert the equation to the form $x = g(x)$.
- Start with an initial guess $x_0$.
- Iterate, using $x_{n+1} = g(x_n)$ for $n = 0, 1, 2, \ldots$; when $g$ is continuous and the limit $x^{*} = \lim_{n \to \infty} x_n$ exists, this limit is a fixed point of $g$ and hence a root of $f$.
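As a concrete toy example of this recipe (not from the papers), solving $f(x) = x - \cos x = 0$ by iterating $g(x) = \cos x$:

```python
import math

def fixed_point(g, x0, tol=1e-10, max_iter=1000):
    """Iterate x_{n+1} = g(x_n) until successive iterates stop changing."""
    x = x0
    for _ in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# f(x) = x - cos(x) = 0 rewritten as x = g(x) with g(x) = cos(x)
root = fixed_point(math.cos, x0=1.0)
print(root, math.cos(root) - root)   # ~0.7390851, residual ~0
```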
In our set of equations we have two unknowns per document, $\theta_d$ and $\lambda_d$. To apply the fixed-point method here we need to come up with two functions $g_1$ and $g_2$ so that we may iterate

$$ \lambda_d \leftarrow g_1(\theta_d), \qquad \theta_d \leftarrow g_2(\lambda_d). $$
Writing $\lambda_d$ as a function of $\theta_d$ (our $g_1$) is obtained by rearranging: multiplying the stationarity condition by $\theta_{dz}$, summing over $z$, and using $\sum_z \theta_{dz} = 1$ gives

$$ \lambda_d \;=\; -\sum_z \omega_{dz} \;-\; \alpha \;-\; \alpha \sum_z \theta_{dz} \log \theta_{dz} \;=\; -\sum_z \omega_{dz} \;-\; \alpha \;+\; \alpha H(\theta_d). $$

To write $\theta_{dz}$ as a function of $\lambda_d$ (our $g_2$) requires help. Consider the Lambert $W$ function, known to satisfy the following identity:

$$ W(y)\, e^{W(y)} \;=\; y. $$

The structure of this identity is very similar to what we have. Rearranging the stationarity condition as

$$ \theta_{dz} \;=\; e^{-1 - \lambda_d/\alpha}\, e^{-\omega_{dz}/(\alpha \theta_{dz})} $$

gives an equation in which $\theta_{dz}$ appears both on its own and inside an exponential, so if we could coax the identity to look exactly like our equation, with $W(y)$ taking the place of the term involving $\theta_{dz}$ and $y$ built only from $\omega_{dz}$, $\alpha$ and $\lambda_d$, then we'll end up writing $\theta_{dz}$ in terms of $\lambda_d$. Let

$$ W(y) \;=\; -\frac{\omega_{dz}}{\alpha\, \theta_{dz}}, \qquad \text{i.e.} \qquad \theta_{dz} \;=\; \frac{-\,\omega_{dz}/\alpha}{W(y)}. $$

Starting with the identity above and substituting this in to get rid of the $\theta_{dz}$ terms, we have

$$ W(y)\, e^{W(y)} \;=\; -\frac{\omega_{dz}}{\alpha}\, e^{\,1 + \lambda_d/\alpha}. $$

Setting $y = -\frac{\omega_{dz}}{\alpha}\, e^{\,1 + \lambda_d/\alpha}$ so that both sides match the identity, we arrive at

$$ \theta_{dz} \;=\; \frac{-\,\omega_{dz}/\alpha}{W\!\left(-\dfrac{\omega_{dz}}{\alpha}\, e^{\,1 + \lambda_d/\alpha}\right)}. $$

Thus, we have arrived at the fixed-point equations for $\theta_{dz}$ and $\lambda_d$. I leave it to [2] to show that these fixed-point equations do indeed converge.
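To make the final step concrete, here is a minimal sketch of the per-document M-step as I understand it: a fixed-point loop alternating the two updates above, using `scipy.special.lambertw`. The function and variable names are my own, the branch choice `k=-1` is an assumption that works for the sparsifying case $\alpha > 0$, and this is a sketch rather than a drop-in reimplementation of [1]:

```python
import numpy as np
from scipy.special import lambertw

def m_step_theta(omega, alpha, n_iter=50):
    """MAP update of one document's topic mixture theta under the entropic prior.

    omega: length-K vector, omega_z = sum_w n_dw * P(z | d, w)
    alpha: strength of the entropic prior (alpha > 0 assumed here)
    Solves  omega_z / theta_z + alpha * log(theta_z) + alpha + lam = 0,
    sum_z theta_z = 1, via the fixed-point iteration described above.
    """
    theta = omega / omega.sum()                    # start from the ML estimate
    for _ in range(n_iter):
        # lambda as a function of theta (rearranged stationarity condition)
        lam = -omega.sum() - alpha - alpha * np.sum(theta * np.log(theta))
        # theta as a function of lambda, via the Lambert W function
        y = -(omega / alpha) * np.exp(1.0 + lam / alpha)
        theta = np.real(-(omega / alpha) / lambertw(y, k=-1))
        theta = theta / theta.sum()                # a no-op at the exact fixed point; stabilizes intermediate iterates
    return theta

# toy example: expected counts concentrated on one topic get sharpened further
omega = np.array([5.0, 2.0, 1.0, 0.5])
print(np.round(omega / omega.sum(), 3))            # ML estimate
print(np.round(m_step_theta(omega, alpha=1.0), 3)) # sparser MAP estimate
```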
[1] Madhusudana Shashanka, Bhiksha Raj, and Paris Smaragdis. 2008. “Sparse Overcomplete Latent Variable Decomposition of Counts Data.” Advances in Neural Information Processing Systems 20.
[2] Matthew Brand. 1999. “Pattern Discovery via Entropy Minimization.” Uncertainty 1999.