Today’s post is a brief overview of one of my research projects; one that’s unfortunately already been done before. It is going to be a little mathematical, but I’ll try and provide sufficient intuitive reasoning that you don’t have to really be at one with the mathematics.

So, these days, every Tom, Dick and Harry can crawl the web or some other incomprehensibly large source of news, views and… I’d rather not continue. What could you do with this kind of data? Well, you could use it to classify/cluster web-pages for use in search engines like DuckDuckGo, or for recommending books, etc. While analysing these documents text, single words on their own like “files” could have several different connotations, depending on whether you’re talking about computers, bureaucracy or breaking out of a prison. However, you might be able to disambiguate which “topic” is being talked about depending on other words you saw like “screen”, “touchpad” and “Steve Jobs”.

It turns out that extracting out these latent (hidden) semantics of which category the document is in (e.g. “computer-related”, “bureaucracy” or “prison-related”) can actually improve the accuracy of the above techniques. This the rough motivation for topic modelling, which an area that studies models of how categories (topics) are (a) represented and (b) distributed in text.

The most common approach to representing “topics” is as a ‘bag of words’ from which you pick out words at random. Suppose you had a “sports” topic bag; the model says that you’d pick out words like “champion” or “tournament” with a much higher probability than words like “ethernet” or “antelope”, but each of these words are drawn *independently* (using a multinomial distribution). This additionally implies that the words that are drawn are *exchangable*; any rearrangement of the words is effectively the same. Sure, this totally ignores grammar and such, but at the level we’re modelling, I think its a fair simplification.

Most models also assume that documents are a mixture of topics; to generate a word, you first choose a “topic bag” at random, and then draw a word from that bag, also at random. The important questions now are: how are the topic bags themselves created, and how are topics distributed within a document?

From a statistical point of view, the “natural” prior for a multinomial distribution is the Dirichlet. Before describing a Dirichlet distribution, I’ll talk about Binomial and Beta distributions. A binomial gives you the probability of flipping \(N\) heads and \(M\) tails, given the probability of getting a head is \(p\). A Beta distribution on the other hand gives you the *probability that the probability of getting a head is ‘\(p\)’* given you flipped \(N\) heads and \(M\) tails. I know that’s a bit of a twister, let’s give it some sanity checks. The “most likely value” of ‘\(p\)’ is \(\frac{N}{N+M}\). Furthermore, the distribution is more “spread out” when \(N+M\) is small, and becomes more peaked as \(N+M\) increases till it approaches the \(\delta\) function, in accordance with the law of large numbers. The Dirichlet distribution is a generalisation of the Beta distribution, in the same way that the multinomial distribution extends a coin (Binomial distribution) to an N-sided die.

The Dirichlet is a great assumption for how the topic bags are created. In fact, the model that possibly started the entire field of topic modelling (because it was successful), Latent Dirichlet Allocation^{1}, assumes that both the bag of words for each topic, and the bag of topics themselves are drawn from Dirichlet distributions. There have been several fancy extensions as well, for example modelling the evolution of a topic over time^{2} (how have the keywords for the internet changed over time?).

However, a drawback of using the Dirichlet distribution is that it can not model *correlations between topics*, i.e. that I’m likely to talk about cricket and Tendulkar together, and unlikely to talk about religion and neutron stars at the same time^{3}. This limitation can also apply to the words within a topic itself^{4}, though it seems like this is less common of a problem.

The correlated topic model^{5} proposed by Blei, et. al, is one approach for modelling topic correlations. It assumes that the bag of topics, i.e. the multinomial probabilities (\(\theta\)), come from a Normal (Gaussian) distribution,

\[ \eta \sim \textrm{Normal}(\mu, \Sigma) \theta_j = \frac{ \exp(\eta_{j}) }{ \sum_{j'=0}^{K} \exp{\eta_{j'}}} \]

Unfortunately, the logistic function (\(\frac{ \exp(\eta_{j}) }{ \sum_{j'=0}^{K} \exp{\eta_{j'}}}\)) is notoriously hard to deal with mathematically. I haven’t spoken at all about *how* people learn the topic distributions (also called parameter learning) or how people find out the topic proportions in a document (inference); it’s hard to without going into quite a lot of prerequisites, so I’m sorry if I lose you completely in the following.

The two most common approaches to inference are (i) the all-powerful “Monte-Carlo Markov Chain” sampling, of which Gibbs sampling is a special instance that is much more efficient, and (ii) variational inference, an approximate technique that gives you a handle on convergence, etc. In a short detour, I’m going to describe the general idea of Gibbs sampling. We want to find the most likely setting of a number of random variables \(X_1, X_2, \dots\), and what makes this difficult is that the most likely value of \(X_1\) depends directly on \(X_2\) and \(X_3\), which in turn are dependent on other variables. However, to list out every possible setting of \(X_1, \dots X_N\), compute the probability of each setting, and choose the maximum is just not tractable. Consider the case where we just have 10 documents, each with 10 words from a vocabulary of 10 words (very very small numbers), then we’d have in total, 100 variables (1 for each word), each with 10 choices. That means, 10^{100} possible combinations (which incidentally is a Googol).

What Gibbs sampling proposes instead is the following simple steps: 1. Initialise all your variables \(X_1\), \(X_2\), … to random values 2. For each \(X_i\), choose the most likely \(X_i\) based on it’s immediate dependencies; the number of such dependencies is usually small, of the order of 2-3. 3. Repeat! And when you do so, count the number of times a particular sequence of \(X_i\) occurs

The cool thing is that, subject to some conditions, counting the most often occurring set of \(X\) is *guaranteed to converge* to the actual answer.

Back to our problem, the Gibbs sampling approach is fairly simple, *if* you can sample \(X_i\) given the other values of \(X\). In the case of a Dirichlet and multinomial, this probability forms a nice closed form answer, and inference works nicely. Unfortunately, the mathematical messiness of the logistic function makes it hard to use a simple Gibbs sampler (it’s possible, just not simple^{6}, and variational inference is also rather involved^{7}.

The methods for inference described above try to find the most likely assignment of topics to words *as well as the most likely topic proportion itself*. When I said the Dirichlet and multinomial were a nice pair, I didn’t describe how nice; you can in fact average over *all possible topic proportions* for a particular set of topic assignments. Thus, the number of variables you have to deal with reduces to just the set of topic assignments to words. This approach is called “collapsed” inference. As you can guess, exact collapsed inference is impossible for the correlated topic model because of the aforementioned mathematical nastiness.

This is where I enter. For almost a year now, along with Kirtika Ruchandani, and my professor, Balaraman Ravindran, I’ve been working off and on (more off than on) on an efficient (though approximate) collapsed inference scheme for the correlated topic model.

After a lot of time beating around the bush, I realised that the denominator logistic function that was causing all the problems could just be approximated with a quadratic function. This is not a new idea - several people have studied this approximation before^{8}; just not in this context. Anyways, if you plug in the quadratic approximation, you are left with a beautiful Gaussian integral, which is a cinch to compute^{9}.

\[ \int dx ~ \exp( - \frac{1}{2} x^T A x + b^T x ) =|2 \pi A|^{\frac{1}{2}} \exp( \frac{1}{2} b^T A^{-1} b ) \].

Ultimately, we get simple(-ish) set of deterministic updates for each document that can be massively fielded out onto GPUs^{10}. I thought it was quite elegant (even if it wasn’t particularly innovative).

I’ve been looking at this problem for a long time, and of course, checking to see if anyone else tackled it; since nobody else had, I was targeting to submit this to EMNLP 2012. This time however, I realised all of the trick lied in that quadratic approximation of the logistic function, and I had a more general search query. Alas, to my consternation, it turns out that Ahmed and Xing did this back in 2007, in their paper titled, “On Tight Approximate Inference of the Logistic-Normal Topic Admixture Model”^{11}, uses an obscure (to me) name for the correlated topic model^{12}, and an obscurer reference to collapsed variational inference (generalised mean field message passing). I wasn’t too surprised to see that someone else figured this out, but it did come as a bummer. On the plus side, my idea actually works in practice, and was novel enough for a paper.

For me, it’s back to the drawing board.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 2003. ↩

C. Wang, D. Blei, and D. Heckerman, Continuous Time Dynamic Topic Models, UAI, 2008.↩

I said unlikely, not impossible. And on a side note, I mourn the end of the most lovely (and entertaining) Calamities of Nature.↩

K. Than and T.-bao Ho, Modeling the diversity and log-normality of data, 2011.↩

D. Mimno, H. M. Wallach, and A. Mccallum, Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors, in NIPS, 2008.↩

D. Mimno, H. M. Wallach, and A. Mccallum, Gibbs Sampling for Logistic Normal Topic Models with Graph-Based Priors, in NIPS, 2008.↩

D. M. Blei and J. D. Lafferty, A Correlated Topic Model of Science, Annals of Applied Statistics, 2007.↩

G. Bouchard and D. Maupertuis, Efficient Bounds for the Softmax Function and Applications to Approximate Inference in Hybrid models, in NIPS, 2007.↩

And I must credit Siva for teaching me about Gaussian integrals. ↩

I hopefully will collate all of our work into a tech-report sometime and post it for the world to see.↩

A. Ahmed and E. P. Xing, On Tight Approximate Inference of the Logistic-Normal Topic Admixture Model, in Artificial Intelligence and Statistics, 2007.↩

In retrospect, Blei and Lafferty’s paper also came out at about the same time, so perhaps the name hadn’t spread yet.↩