The paper “DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification” [1] describes a model that not only generates documents but also discriminates among them by associating each document with a label. The label of a document is predicted from its generative representation (i.e. its topic mixture). This is part of a broader theme of generative models that do more than provide density estimates.
This paper interests me for two reasons. First, it describes a model that combines two modes of learning, the generative and the discriminative: it generates the documents but only discriminates the labels, unlike supervised LDA [2], which generates both the documents and the responses. Second, combining the two modes yields a slightly different inference procedure, one that maximizes the conditional likelihood of the labels given the documents rather than the joint likelihood of both the labels and the documents.
The model works in the following manner:

- Each document $d$ is associated with a topic mixture $\theta_d$, same as in LDA.
- Given a label $y_d$ for document $d$, the topic mixture is transformed to $T^{y_d}\theta_d$. Thus, $T^{y}$ can be viewed as a transformation matrix that linearly transforms the document-level topic mixture $\theta_d$ to a topic mixture $T^{y_d}\theta_d$ that is dependent on the document label $y_d$. The key is that this transformation is constrained by the same set of shared word distributions $\Phi$.
- Words are generated as $z_{dn} \sim \mathrm{Mult}(T^{y_d}\theta_d)$ and $w_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}})$ (see the sketch after this list).
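To make the generative process concrete, here is a minimal sketch of how a single labelled document could be sampled under this model, written in Python/NumPy. The dimensions, the hyperparameter values, and all variable names (`K`, `L`, `V`, `alpha`, `T`, `Phi`) are illustrative assumptions of mine rather than settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): K document-level topics,
# L shared word-emitting topics, V vocabulary words, two labels.
K, L, V, num_labels = 5, 10, 1000, 2
alpha = 0.1

# Shared word distributions Phi: one distribution over the vocabulary per topic.
Phi = rng.dirichlet(np.ones(V) * 0.01, size=L)          # shape (L, V)

# One L x K transformation matrix T^y per label; each column lies on the simplex,
# so T^y maps a K-dimensional mixture theta_d to an L-dimensional mixture.
T = {y: rng.dirichlet(np.ones(L), size=K).T for y in range(num_labels)}

def generate_document(y, n_words):
    """Sample one document with label y from the generative process."""
    theta = rng.dirichlet(np.full(K, alpha))             # document mixture theta_d
    mixed = T[y] @ theta                                 # transformed mixture T^y theta_d
    z = rng.choice(L, size=n_words, p=mixed)             # z_dn ~ Mult(T^y theta_d)
    words = np.array([rng.choice(V, p=Phi[topic]) for topic in z])  # w_dn ~ Mult(phi_{z_dn})
    return theta, z, words

theta, z, words = generate_document(y=1, n_words=50)
```

The only difference from vanilla LDA in this sketch is the `T[y] @ theta` line: the label enters the model solely through that linear transformation of the document's topic mixture.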
Inference
The main part I want to look at is the learning of these transformation matrices $T^y$ for each label $y$. The inference follows an EM procedure, so we start by stating the conditional log-likelihood (after introducing the hidden variables $z_d$):

$$\mathcal{L}(T) = \sum_d \log p(y_d \mid w_d, T, \Phi) = \sum_d \log \frac{p(y_d)\, p(w_d \mid y_d, T, \Phi)}{\sum_{y'} p(y')\, p(w_d \mid y', T, \Phi)}$$

$$p(w_d \mid y, T, \Phi) = \int \prod_{n=1}^{N_d} \Big( \sum_{z_{dn}=1}^{L} \big(T^{y}\theta_d\big)_{z_{dn}}\, \phi_{z_{dn},\, w_{dn}} \Big)\, p(\theta_d \mid \alpha)\, d\theta_d$$
where $N_d$ is the number of words in document $d$ and $p(y)$ is the distribution over labels. Taking the derivative with respect to a component of $T^y$ (after adding Lagrange multipliers to keep the columns of $T^y$ on the simplex) gives the $M$-step of the EM algorithm below, which the paper solves using gradient descent:

$$\frac{\partial \mathcal{L}}{\partial T^{y}} = \sum_{d:\, y_d = y} \mathbb{E}_{p(z_d \mid w_d, y, T)}\!\left[\frac{\partial \log p(w_d, z_d \mid y, T)}{\partial T^{y}}\right] \;-\; \sum_{d} p(y \mid w_d, T)\, \mathbb{E}_{p(z_d \mid w_d, y, T)}\!\left[\frac{\partial \log p(w_d, z_d \mid y, T)}{\partial T^{y}}\right]$$
The $E$-step is given by computing the posterior over the hidden variables under the current parameters,

$$q(z_d) = p(z_d \mid w_d, y_d, T, \Phi).$$
Here comes the difference from a vanilla EM. To actually compute this expectation, we'll require samples of $z_d$, which can be taken from a Gibbs sampling procedure that iterates between sampling the per-word topic assignments and the document-level topic mixture $\theta_d$.
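As a rough illustration of that alternation, here is a minimal sketch of one Gibbs sweep for a single document with its label held fixed, written in Python/NumPy. To keep the $\theta_d$ update conjugate it resamples a per-word document-level topic index with the shared word-emitting topic summed out; this particular augmentation, the function name `gibbs_sweep`, and all other names are my own illustrative assumptions rather than code from the paper, and the gradient update of $T$ is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(words, y, theta, k_assign, T, Phi, alpha):
    """One Gibbs sweep over the hidden variables of a single document.

    words    : (N,) word indices of the document
    y        : the document's (fixed) label
    theta    : (K,) current document-level topic mixture theta_d
    k_assign : (N,) current document-level topic index for each word
    T        : dict mapping label -> (L, K) transformation matrix T^y
    Phi      : (L, V) shared topic-word distributions
    alpha    : symmetric Dirichlet hyperparameter for theta_d
    """
    L, K = T[y].shape

    # Resample each word's document-level topic given theta, with the shared
    # (word-emitting) topic summed out:
    #   p(k | w, theta) is proportional to theta_k * sum_z T^y[z, k] * Phi[z, w]
    for n, w in enumerate(words):
        probs = theta * (T[y].T @ Phi[:, w])
        probs /= probs.sum()
        k_assign[n] = rng.choice(K, p=probs)

    # Resample theta given the per-word topics; the Dirichlet prior is conjugate here.
    counts = np.bincount(k_assign, minlength=K)
    theta = rng.dirichlet(alpha + counts)

    return theta, k_assign

# Illustrative usage with made-up sizes (K, L, V and all parameters are random).
K, L, V, alpha, N = 5, 10, 1000, 0.1, 50
Phi = rng.dirichlet(np.ones(V) * 0.01, size=L)
T = {1: rng.dirichlet(np.ones(L), size=K).T}
words = rng.integers(V, size=N)
theta = rng.dirichlet(np.full(K, alpha))
k_assign = rng.integers(K, size=N)
for _ in range(100):  # burn-in sweeps; later samples approximate the E-step expectations
    theta, k_assign = gibbs_sweep(words, 1, theta, k_assign, T, Phi, alpha)
```

Samples of the per-word assignments collected after burn-in are what stand in for the intractable E-step expectations when estimating the gradient with respect to $T$.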
Notes
It’s interesting that we can make discriminative modifications to a generative model. It would have been useful to see how this approach compares with supervised LDA [2], which additionally supports non-discrete responses.
[1] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. 2009. “DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification.” Advances in Neural Information Processing Systems 21.
[2] Jon D. McAuliffe and David M. Blei. 2008. “Supervised Topic Models.” Advances in Neural Information Processing Systems 20.