Dirichlet (either by itself, or as a mixture of, or as a hierarchy of) priors are by no means the only option of controlling sparsity of topic mixtures. *Entropic priors* stand out as an interesting alternative. Given a probability distribution , the entropic prior is defined as

Positive and negative values of control the favorability towards higher or lower entropies.

The paper “Sparse Overcomplete Latent Variable Decomposition of Counts Data” [1] explores the use of entropic priors to counts data such as words in documents or colors in images. Their main claim is that entropic priors provide increasingly finer, yet non-trival, refinements of data as the number of components are increased. This is in contrast to the diminishing returns one sees when the number of topics are increased in LDA or clustering models.

The paper on “Pattern Discovery via Entropy Minimization” [2] offers an in-depth analysis of the consequences of using entropic priors. Two points are claimed:

- An entropic prior “is a bias for compact models having less ambiguity, more determinism, and therefore more structure.”
- “is invariant to reparametrizations of the model, because the entropy is defined in terms of the model’s joint and/or factored distributions.”

The first point is illustrated with experiments in [1] and [2]. The second point is an interesting property. A reparametrization of a model is a change of variables given by a bijective function . Take for example the reparametrization that creates a bijection by reversing the co-ordinates. We can clearly see that but .

## Inference

I mainly want to look at the inference of models with entropic priors and [1] provides the setup. Consider that we are working with documents, topics with as the document-level mixture distributions, and as the word distributions. The paper performs an EM-derivation – so we’ll start with the complete log-likelihood equation (where is the count of word in document )

The E-step is given by

For the M-step we take the derivative of w.r.t. and (only this is shown below as the other is very similar) under the constraints and (lagrange multipliers below).

Normally, at this point we’re able to solve these set of equations by writing in terms of but the is problematic. The involvement of in this fashion makes these into a set of *transcendental equations*.

### Fixed point equations

The solution, as pointed to in the paper, is to employ the Lambert function and solve a set of fixed-point equations. The derivation is explained in [2] and i’ll follow it here with a bit of background. Why do want this? Well, in situations such as this where we need to find an such that one way to solve these is to use fixed-point iterations as follows:

- Convert the equation to the form
- Start with an initial guess
- Iterate, using for

and when is continuous and the limit exists, then, this limit is a root of .

In our set of equations we have a function of two variables and . To apply the fixed point method here we need to come up with two functions and so that we may iterate .

Writing as a function of is obtained by rearranging . To write as a function of requires help. Consider the Lambert function known to satisfy the following

The structure of this identity is very similar to what we have in that if we could coax this identity to look exactly like our equation with taking the place of and involved in then we’ll end up writing in terms of . Let . Starting with the identity above and setting to get rid of the term we have

Setting and we arrive at

Thus, we have arrived at the fixed-point equations for and . I leave it to [2] to show that these fixed-point equations do indeed converge.

[1] Madhusudana Shashanka and Bhiksha Raj and Smaragdis. 2008. “Sparse Overcomplete Latent Variable Decomposition of Counts Data.” *Advances in Neural Information Processing Systems 20*.

[2] Matthew Brand. 1999. “Pattern Discovery via Entropy Minimization.” *Uncertainty 1999*.