The paper “DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification” [1] describes a model that not only generates documents but also discriminates among them by associating each document with a label. Classification of a document is a function of its generative representation (i.e., its topic mixture) and its label. This fits a broader theme of generative models that do more than provide density estimates.
This paper interests me for two reasons. First, it describes a model that combines two modes of learning, the generative and the discriminative: it generates the documents but only discriminates the labels, unlike the supervised LDA paper [2], which generates both the documents and the responses. Second, this combination yields a slightly different inference procedure, one that maximizes the conditional likelihood of the responses rather than the joint likelihood of both the responses and the documents.
The model works in the following manner:

Each document $d$ is associated with a topic mixture $\theta_d \sim \mathrm{Dir}(\alpha)$. Same as LDA.

Given a label $y_d$ for document $d$, the topic mixture $\theta_d$ is transformed to $T^{y_d}\theta_d$. Thus, $T^{y_d}$ can be viewed as a transformation matrix that linearly maps the document-level topic mixture $\theta_d$ to a topic mixture that depends on the document label $y_d$. The key point is that this transformation is constrained by the same set of shared word distributions $\Phi$.

Words are generated as

$$z_{d,n} \sim \mathrm{Mult}(T^{y_d}\theta_d), \qquad w_{d,n} \sim \mathrm{Mult}(\phi_{z_{d,n}}).$$
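The generative process above can be sketched in a few lines of numpy. This is a toy setup, not the paper's learned model: the dimensions (`K`, `V`, `L`) and the parameters `phi`, `T`, and `alpha` are random stand-ins chosen only so the sampling steps run.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, L = 4, 50, 2             # topics, vocabulary size, number of labels
alpha = np.full(K, 0.5)        # Dirichlet prior on the topic mixture theta

# Shared word distributions phi: one distribution over V words per topic (K x V)
phi = rng.dirichlet(np.ones(V), size=K)

# One K x K transformation matrix per label; each column sums to one,
# so T[y] @ theta is again a valid topic mixture
T = rng.dirichlet(np.ones(K), size=(L, K)).transpose(0, 2, 1)

def generate_document(y, n_words=100):
    theta = rng.dirichlet(alpha)              # document-level topic mixture
    mixed = T[y] @ theta                      # label-dependent topic mixture
    z = rng.choice(K, size=n_words, p=mixed)  # per-word topic assignments
    words = np.array([rng.choice(V, p=phi[k]) for k in z])
    return words

doc = generate_document(y=1)
```

The only difference from vanilla LDA's generative process is the single extra matrix-vector product `T[y] @ theta` before topics are drawn.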
Inference
The main part I want to look at is the learning of the transformation matrices $T^y$, one per label $y$. Inference follows an EM procedure, so we start by stating the conditional log-likelihood (after introducing the hidden topic assignments $z_d$):

$$\ell(T) = \sum_d \log p(y_d \mid w_d; T, \Phi) = \sum_d \log \frac{\sum_{z_d} p(y_d, z_d, w_d; T, \Phi)}{\sum_{y}\sum_{z_d} p(y, z_d, w_d; T, \Phi)}$$
where the joint factorizes (for a given $\theta_d$) as $p(y, z_d, w_d) = p(y)\prod_{n=1}^{N_d} (T^{y}\theta_d)_{z_{d,n}}\,\phi_{z_{d,n},\,w_{d,n}}$, $N_d$ is the number of words in document $d$, and $p(y)$ is the distribution over labels. Taking the derivative with respect to a component of $T^y$ after adding Lagrange multipliers (each column of $T^y$ must sum to one) gives the M step of the EM algorithm below (the paper solves this using gradient descent).
The E step is given by the posterior over the hidden topic assignments, $q(z_d) = p(z_d \mid w_d, y_d; T, \Phi)$, under which the expected complete log-likelihood in the M step is evaluated.
Here comes the difference from a vanilla EM. To actually compute this expectation, we require samples of $z_d$, which can be obtained from a Gibbs sampling procedure that alternates between sampling $z_d$ and $\theta_d$.
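A sweep of such a sampler might look like the sketch below, for one document with a fixed label. It uses the paper's augmented representation, in which an auxiliary per-word variable $u_n \sim \mathrm{Mult}(\theta_d)$ is introduced and $z_n$ is drawn from column $u_n$ of $T^y$; this restores Dirichlet conjugacy for the $\theta_d$ update. All parameters here are random placeholders, not learned values.

```python
import numpy as np

rng = np.random.default_rng(2)
K, V = 4, 50
alpha = np.full(K, 0.5)
phi = rng.dirichlet(np.ones(V), size=K)        # shared topics, K x V
Ty = rng.dirichlet(np.ones(K), size=K).T       # T^y for one label; columns sum to 1
words = rng.choice(V, size=30)                 # one observed document

theta = rng.dirichlet(alpha)                   # initialize the topic mixture
z = np.empty(len(words), dtype=int)
u = np.empty(len(words), dtype=int)
for sweep in range(200):
    # Sample (u_n, z_n) jointly: p(u, z | w_n) propto theta[u] * Ty[z, u] * phi[z, w_n]
    for n, w in enumerate(words):
        joint = theta[None, :] * Ty * phi[:, w][:, None]   # joint[z, u]
        idx = rng.choice(K * K, p=joint.ravel() / joint.sum())
        z[n], u[n] = idx // K, idx % K
    # Conjugate update: theta | u ~ Dirichlet(alpha + counts(u))
    counts = np.bincount(u, minlength=K)
    theta = rng.dirichlet(alpha + counts)
```

Samples of $z_d$ collected across sweeps (after burn-in) are what get plugged into the expectation in the E step.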
Notes
It’s interesting that we can make discriminative modifications to a generative model. It would have been useful to see how this approach compares with supervised LDA, which additionally supports non-discrete document responses.
[1] Lacoste-Julien, Simon, Fei Sha, and Michael I. Jordan. 2009. “DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification.” Advances in Neural Information Processing Systems 21.
[2] McAuliffe, Jon D., and David M. Blei. 2008. “Supervised Topic Models.” Advances in Neural Information Processing Systems 20.