In the previous post on posterior regularization we saw how to specify constraints during the E-step of expectation maximization that would otherwise be difficult to incorporate into the model itself. The constraints took the form

$$\mathbb{E}_q[\mathbf{f}(\mathbf{x}, \mathbf{z})] \leq \mathbf{b},$$

where we specified our constraints as expectations over the distribution $q$ of hidden variables. As an example, we took GMM clustering and used this framework to specify the constraint that at most two of the points in $\{x_1, x_2, x_3\}$ should be in the same cluster:

$$\mathbb{E}_q\!\left[\sum_{i=1}^{3} \mathbf{1}[z_i = j]\right] \leq 2 \quad \text{for every cluster } j.$$
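To make the expectation concrete, here is a minimal numpy sketch (with made-up responsibilities; not code from the previous post) that evaluates this constraint from the soft assignments a GMM E-step produces:

```python
import numpy as np

# Responsibilities q[i, j] = q(z_i = j) for the three constrained points,
# as produced by a GMM E-step (numbers below are made up for illustration).
q = np.array([[0.90, 0.05, 0.05],
              [0.80, 0.10, 0.10],
              [0.70, 0.20, 0.10]])

# E_q[ sum_i 1[z_i = j] ]: expected number of the three points in cluster j.
expected_counts = q.sum(axis=0)           # -> [2.4, 0.35, 0.25]

# The "at most two in the same cluster" constraint from above.
satisfied = np.all(expected_counts <= 2)  # -> False: cluster 0 gets 2.4 in expectation
print(expected_counts, satisfied)
```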
What if we wanted to specify instead that those points should be separated into as many clusters as possible, ideally with each of the three points in a different cluster? This poses a problem because we do not know a priori whether the data allows (in a sensible way) for the three points to be separated; all we may know is that they are very likely well-separated.
A solution is presented in “Posterior vs. Parameter Sparsity in Latent Variable Models” [1], where the posterior regularization framework is relaxed by allowing $\mathbf{b}$ to be variable while penalizing it with a function $R(\mathbf{b})$. Specifically, the E-step now looks like

$$\min_{q,\,\mathbf{b}} \; \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big) + R(\mathbf{b}) \quad \text{s.t.} \quad \mathbb{E}_q[\mathbf{f}(\mathbf{x}, \mathbf{z})] \leq \mathbf{b}.$$
With this setup, a simple way to encourage our points to live in separate clusters is to let $R(\mathbf{b}) = \sigma \max_j b_j$, where $b_j$ bounds the expected number of the three points falling into cluster $j$ and $\sigma$ is the strength of the regularization, whose value must be ascertained experimentally. The experimental section of the paper gives a feel for the range of values of $\sigma$ that work in practice.
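To see what the relaxed E-step is doing, here is a small sketch assuming the $R(\mathbf{b}) = \sigma \max_j b_j$ penalty above and the toy three-point setting; it parameterizes $q$ with a row-wise softmax and hands the penalized KL objective to a generic optimizer. The posteriors, the value of $\sigma$, and the function name are all illustrative rather than the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def relaxed_pr_objective(logits, p, sigma):
    """KL(q || p) plus sigma * max_j E_q[# of the three points in cluster j]."""
    q = softmax(logits.reshape(p.shape), axis=1)           # rows of q sum to 1
    kl = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12)))
    penalty = sigma * np.max(q.sum(axis=0))                 # sigma * max_j b_j, with b tight
    return kl + penalty

# Model posteriors p_theta(z_i = j | x) that lump all three points into cluster 0
# (made-up numbers for illustration).
p = np.array([[0.85, 0.10, 0.05],
              [0.80, 0.15, 0.05],
              [0.75, 0.10, 0.15]])

res = minimize(relaxed_pr_objective, x0=np.zeros(p.size), args=(p, 5.0),
               method="Nelder-Mead", options={"maxiter": 5000, "xatol": 1e-8})
q_star = softmax(res.x.reshape(p.shape), axis=1)
print(np.round(q_star, 2))   # q is pulled away from p, spreading the points out
```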
Experiments
The experiments in this paper are conducted on part-of-speech (POS) tagging, which warrants its own look later on. For now, let’s look at a useful number computed to compare models. The experiments in the paper aim to show that an HMM augmented with this framework manages to assign fewer POS tags to a word than it otherwise would. To show this they consider words occurring sufficiently often in the data and compute

$$\frac{1}{|W|} \sum_{w \in W} \frac{\#(w)}{\max_t \#(w, t)},$$

where $W$ is this set of frequent words, $\#(w)$ is the number of times $w$ occurred, and $\#(w, t)$ is the number of times $w$ is assigned tag $t$. The closer this value is to $1$, the sparser the tag assignments.
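For completeness, here is one way this number could be computed from a tagger's output, assuming the assignments come as (word, tag) pairs; the frequency threshold and the function name are placeholders rather than the paper's exact setup:

```python
from collections import Counter, defaultdict

def tag_sparsity(assignments, min_count=10):
    """Average of #(w) / max_t #(w, t) over words occurring more than min_count times.

    assignments : iterable of (word, tag) pairs produced by the tagger.
    The value is 1.0 exactly when every frequent word always receives the same tag.
    """
    assignments = list(assignments)
    word_counts = Counter(w for w, _ in assignments)
    word_tag_counts = defaultdict(Counter)
    for w, t in assignments:
        word_tag_counts[w][t] += 1

    frequent = [w for w, c in word_counts.items() if c > min_count]
    ratios = [word_counts[w] / max(word_tag_counts[w].values()) for w in frequent]
    return sum(ratios) / len(ratios)

# "the" always tagged DT, "run" split between VB and NN:
pairs = [("the", "DT")] * 12 + [("run", "VB")] * 8 + [("run", "NN")] * 4
print(tag_sparsity(pairs))   # (12/12 + 12/8) / 2 = 1.25
```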
[1] João Graça, Kuzman Ganchev, et al. 2009. “Posterior vs. Parameter Sparsity in Latent Variable Models”. Advances in Neural Information Processing Systems 22.