Entropic Priors

Dirichlet (either by itself, or as a mixture of, or as a hierarchy of) priors are by no means the only option of controlling sparsity of topic mixtures. Entropic priors stand out as an interesting alternative. Given a probability distribution \theta, the entropic prior is defined as

\displaystyle  P_e(\theta) \propto e^{-\alpha \mathcal{H}(\theta)} \\  \mathcal{H}(\theta) = - \sum_i \theta_i \log \theta_i

Positive and negative values of \alpha control the favorability towards higher or lower entropies.

The paper “Sparse Overcomplete Latent Variable Decomposition of Counts Data” [1] explores the use of entropic priors to counts data such as words in documents or colors in images. Their main claim is that entropic priors provide increasingly finer, yet non-trival, refinements of data as the number of components are increased. This is in contrast to the diminishing returns one sees when the number of topics are increased in LDA or clustering models.

The paper on “Pattern Discovery via Entropy Minimization” [2] offers an in-depth analysis of the consequences of using entropic priors. Two points are claimed:

  1. An entropic prior P_e(\cdot) “is a bias for compact models having less ambiguity, more determinism, and therefore more structure.”
  2. P_e(\cdot) “is invariant to reparametrizations of the model, because the entropy is defined in terms of the model’s joint and/or factored distributions.”

The first point is illustrated with experiments in [1] and [2]. The second point is an interesting property. A reparametrization of a model is a change of variables given by a bijective function f : \theta \mapsto \phi. Take for example the reparametrization f(\theta = \langle \theta_1, \dots, \theta_N \rangle) = \langle \theta_N,\dots,\theta_1 \rangle that creates a bijection by reversing the co-ordinates. We can clearly see that P_e(\theta) = P_e(f(\theta)) but \text{Dirichlet}(\theta | \alpha_1,\dots,\alpha_N) \ne \text{Dirichlet}(f(\theta) | \alpha_1,\dots,\alpha_N).


I mainly want to look at the inference of models with entropic priors and [1] provides the setup. Consider that we are working with M documents, K topics with \theta_{i} as the document-level mixture distributions, and \omega_k as the word distributions. The paper performs an EM-derivation – so we’ll start with the complete log-likelihood equation (where V_{ij} is the count of word j in document i)

\displaystyle  \Lambda \propto \prod_{k=1}^K P_e(\omega_k;\beta)  \times \prod_{i=1}^M P_e(\theta_i;\alpha)  \prod_{j=1}^{W} \left( \theta_{ik} \omega_{k \mathbf{w}_{ij}} \right)^{V_{ij} \gamma_{ijk}} \\  \log \Lambda \propto \alpha \sum_i \sum_k \theta_{ik} \log \theta_{ik}  + \beta \sum_k \sum_j \omega_{kj} \log \omega_{kj}  + \sum_i \sum_j V_{ij} \gamma_{ijk} \left( \log \theta_{ik} + \log \omega_{kj} \right)

The E-step is given by

\displaystyle  \gamma_{ijk} \propto \frac{ \theta_{ik} \omega_{kj} }{ \sum_k \theta_{ik} \omega_{kj} }

For the M-step we take the derivative of \Lambda w.r.t. \theta_{ik} and \omega_{kj} (only this is shown below as the other is very similar) under the constraints \sum_k \theta_{ik} = 1 and \sum_j \omega_{kj} = 1 (lagrange multipliers \lambda_k below).

\displaystyle  \frac{\partial \Lambda}{\partial \omega_{kj}} = \beta + \log \omega_{kj} + \sum_i V_{ij} \frac{\gamma_{ijk}}{\omega_{kj}} + \lambda_k = 0

Normally, at this point we’re able to solve these set of equations by writing \omega_{kj} in terms of \lambda_k but the \log \omega_{kj} is problematic. The involvement of \log in this fashion makes these into a set of transcendental equations.

Fixed point equations

The solution, as pointed to in the paper, is to employ the Lambert \mathcal{W} function and solve a set of fixed-point equations. The derivation is explained in [2] and i’ll follow it here with a bit of background. Why do want this? Well, in situations such as this where we need to find an x such that f(x)=0 one way to solve these is to use fixed-point iterations as follows:

  1. Convert the equation to the form x = g(x)
  2. Start with an initial guess x_0
  3. Iterate, using x_{n+1} := g(x_n) for n = 0,1,2 \dots

and when f is continuous and the limit (x_n)_0^\infty exists, then, this limit is a root of f.

In our set of equations we have a function of two variables \omega_{kj} and \lambda_k. To apply the fixed point method here we need to come up with two functions \omega_{kj} = g(\lambda_k) and \lambda_k = h(\omega_{ki}) so that we may iterate \lambda_k^{(0)} \mapsto \{ \omega_{k1}^{(0)},\dots,\omega_{kW}^{(0)} \} \mapsto \lambda_k^{(1)} \mapsto \dots \mapsto \{ \omega_{k1}^{(n)},\dots,\omega_{kW}^{(n)} \}.

Writing \lambda_k as a function of \omega_{ki} is obtained by rearranging \frac{\partial \Lambda}{\partial \omega_{kj}} = 0. To write \omega_{kj} as a function of \lambda_k requires help. Consider the Lambert \mathcal{W} function known to satisfy the following

\displaystyle  \mathcal{W}(y) e^{\mathcal{W}(y)} = y \\  \log \mathcal{W}(y) + \mathcal{W}(y) = \log y

The structure of this identity is very similar to what we have in that if we could coax this identity to look exactly like our equation with \mathcal{W}(y) taking the place of \omega_{kj} and \lambda_k involved in y then we’ll end up writing \omega_{kj} in terms of \lambda_k. Let \sum_i V_{ij} \gamma_{ijk} = \xi. Starting with the identity above and setting y = e^x to get rid of the \log y term we have

\displaystyle  0 = -\mathcal{W}(e^x) - \log \mathcal{W}(e^x) + x \\  = \frac{-1}{1 / \mathcal{W}(e^x)} - \log \mathcal{W}(e^x) + x + \log q - \log q \\  = \frac{-q}{q / \mathcal{W}(e^x)} + \log q / \mathcal(W)(e^x) + x - \log q

Setting x = 1 + \lambda_k / \beta + \log q and q = - \xi / \beta we arrive at

\displaystyle  \omega_{kj} = \frac {- \xi / \beta}{\mathcal{W}(-\xi e^{1 + \lambda_k / \beta} / \beta)}

Thus, we have arrived at the fixed-point equations for \omega_{k\cdot} and \lambda. I leave it to [2] to show that these fixed-point equations do indeed converge.

[1] Madhusudana Shashanka and Bhiksha Raj and Smaragdis. 2008. “Sparse Overcomplete Latent Variable Decomposition of Counts Data.” Advances in Neural Information Processing Systems 20.

[2] Matthew Brand. 1999. “Pattern Discovery via Entropy Minimization.” Uncertainty 1999.

This entry was posted in optimization, statistics and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s