Category Archives: statistics

Starting Probabilistic Document Retrieval

I want to work through some papers on probabilistic document retrieval, mainly to find out how far generative models have infiltrated this area. Note that literature refers to … Continue reading

Posted in modeling, statistics

Reservoir Sampling

If you want to uniformly sample a handful of elements from a very large stream of data you probably don’t want to read it all into memory first. It would be ideal if you could sample while streaming the data. … Continue reading
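As a rough illustration of sampling while streaming, here is a minimal sketch of the classic single-pass reservoir scheme (often called Algorithm R). It is not necessarily the approach the full post takes; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items drawn uniformly, without replacement, from stream.

    Hypothetical helper for illustration: only the k-item reservoir is held
    in memory, and the stream is read exactly once.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # pick a slot uniformly from 0..i and replace only if it
            # falls inside the reservoir.
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # The stream can be any iterable, e.g. a generator over a large file.
    print(reservoir_sample((n * n for n in range(1_000_000)), k=5))
```

If the stream holds fewer than k items, the sketch simply returns all of them in order.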

Posted in statistics

Regression-guided Generative Models

A generative model is pretty pointless on its own unless the generative structure itself holds intrinsic interest. Hence, papers justify their generative models either by comparing their predictive performance against that of another model or by extending the model to accommodate … Continue reading

Posted in modeling, statistics

Topic Coherence

Evaluating unsupervised topic models is a tricky business. If the resulting model is not employed in retrieval, classification, or regression, there really is no way of convincing someone of the model’s worth. You may, rightly, say that there is no use … Continue reading

Posted in modeling, statistics

Starting Part-of-Speech Tagging

This is by no means the latest on the subject of probabilistic part-of-speech tagging of documents but nevertheless provides a good starting point to look at the basic model along with training and testing data. This paper [1] takes a … Continue reading

Posted in modeling, statistics

Adding (more Relaxed) Constraints during Model Inference

In the previous post on posterior regularization we saw how to specify constraints during the E-step of expectation maximization that would otherwise be difficult to incorporate into the model itself. The constraints took the following form where we specified our … Continue reading

Posted in modeling, statistics | 1 Comment

Adding Constraints during Model Inference

Coming up with a probabilistic model and its inference procedure is only half the work because it’s well known that just a single run of the inference procedure is hardly likely to give you a satisfactory answer. Out of the … Continue reading

Posted in modeling, statistics | 1 Comment

Modeling and Indexing

It has been well tested in the real world, and is generally accepted, that simple indexing models perform really well. They have no problems scaling or dealing with gigantic vocabularies. The biggest downside to them is that they can only match … Continue reading

Posted in modeling, statistics

Modeling atop a document representation

The paper “DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification” [1] describes a model that not only generates documents but learns them by associating each document with a label. The discrimination of a document is a function of the generative … Continue reading

Posted in optimization, statistics

Entropic Priors

Dirichlet priors (alone, as a mixture, or as a hierarchy) are by no means the only option for controlling the sparsity of topic mixtures. Entropic priors stand out as an interesting alternative. Given a probability distribution … Continue reading

Posted in optimization, statistics