The Correlated Topic Model
This note is for Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17–35.
Abstract
Background
- Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data.
- LDA assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary.
- Limitation of LDA: the inability to model topic correlation, which stems from the use of the Dirichlet distribution to model the variability among the topic proportions.
Proposal
- Develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution.
- Derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial.
- Apply the CTM to the articles from Science published from 1990 to 1999.
Introduction
- The CTM explicitly models the correlation between the latent topics in the collection, enabling the construction of topic graphs and document browsers that allow a user to navigate the collection in a topic-guided manner.
- Applied to Science, founded in 1880 by Thomas Edison and still one of the most influential scientific journals.
- CTM builds on the earlier latent Dirichlet allocation (LDA), which is an instance of a general family of mixed membership models for decomposing data into multiple latent components.
- The starting point here is a perceived limitation of topic models such as LDA: they fail to directly model correlation between topics.
- Modeling correlation between topics sacrifices some of the computational conveniences that LDA affords.
- Develop a fast variational inference procedure for carrying out approximate inference in the CTM. Variational inference trades the unbiased estimates of MCMC procedures for potentially biased but computationally efficient algorithms whose numerical convergence is easy to assess.
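Schematically, the quantity being optimized is the per-document lower bound from Jensen's inequality, with the factorized variational family the paper uses (a Gaussian over each $\eta_i$, a multinomial over each $z_n$); the notation below is my adaptation, not copied verbatim:

$$
\log p(\w_d \mid \bmu, \bSigma, \bbeta_{1:K}) \ge
\mathbb{E}_q\big[\log p(\bfeta_d \mid \bmu, \bSigma)\big]
+ \sum_{n=1}^{N_d} \mathbb{E}_q\big[\log p(Z_{d,n} \mid \bfeta_d)\big]
+ \sum_{n=1}^{N_d} \mathbb{E}_q\big[\log p(w_{d,n} \mid Z_{d,n}, \bbeta_{1:K})\big]
+ \mathrm{H}(q)
$$

The non-conjugacy shows up in the second term, whose expectation has no closed form under the Gaussian $q$; the paper bounds it further.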
The correlated topic model
- A hierarchical model of document collections.
- The words of each document arise from a mixture model; the mixture components (topics) are shared by all documents in the collection, while the mixture proportions are document-specific random variables.
Terminology:
- Words and documents: $w_{d,n}$, $\w_d$
- Topics: $\bbeta_k$ for topic $k$, collectively $\bbeta_{1:K}$
- Topic assignments: $z_{d,n}$
- Topic proportions: $\btheta_d$
Assume that an $N_d$-word document arises from the following generative process (a code sketch follows the list). Given topics $\bbeta_{1:K}$, a $K$-vector $\bmu$, and a $K\times K$ covariance matrix $\bSigma$:
- Draw $\bfeta_d\mid\{\bmu,\bSigma\}\sim N(\bmu, \bSigma)$
- For $n=1,\ldots,N_d$:
    - Draw topic assignment $Z_{d,n}\mid \bfeta_d$ from $\Mult(f(\bfeta_d))$, where $f$ maps the natural parameters to the simplex, $f(\bfeta_d)_i = \exp(\eta_{d,i})/\sum_j \exp(\eta_{d,j})$. (The same draw is repeated independently for each $n$, all conditioned on the single $\bfeta_d$.)
    - Draw word $W_{d,n}\mid \{z_{d,n}, \bbeta_{1:K}\}$ from $\Mult(\bbeta_{z_{d,n}})$
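A minimal sketch of this generative process in Python/NumPy; the topic count, vocabulary size, and all parameter values below are made-up assumptions for illustration:

```python
import numpy as np

def sample_ctm_document(mu, Sigma, beta, n_words, rng):
    """Sample one document from the CTM generative process.

    mu:    (K,) mean of the logistic normal prior
    Sigma: (K, K) covariance of the logistic normal prior
    beta:  (K, V) topics, each row a distribution over the vocabulary
    """
    # Draw eta_d ~ N(mu, Sigma), then map to the simplex with f (softmax).
    eta = rng.multivariate_normal(mu, Sigma)
    theta = np.exp(eta) / np.exp(eta).sum()  # topic proportions

    words = np.empty(n_words, dtype=int)
    for n in range(n_words):
        z = rng.choice(len(theta), p=theta)              # topic assignment z_{d,n}
        words[n] = rng.choice(beta.shape[1], p=beta[z])  # word w_{d,n}
    return words

# Toy example: K = 3 topics, V = 5 vocabulary terms (all values illustrative).
rng = np.random.default_rng(0)
K, V = 3, 5
mu = np.zeros(K)
Sigma = np.eye(K)
beta = rng.dirichlet(np.ones(V), size=K)  # arbitrary topics for the demo
print(sample_ctm_document(mu, Sigma, beta, n_words=10, rng=rng))
```

Swapping the first two lines of the function for a single Dirichlet draw of `theta` recovers LDA; the only structural change in the CTM is where the topic proportions come from.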
- The logistic normal was originally studied in the context of analyzing observed compositional data, such as proportions of minerals in geological samples. The CTM uses it to model the latent composition of topics associated with each document.
- One may find strong correlations between the latent topics; e.g., a document about geology is more likely to also be about archaeology than genetics. The aim is to use the covariance matrix of the logistic normal to capture such relationships. (I didn't get why we need a covariance matrix from this example; it seems proportions alone could work. If some topics tend to occur together a covariance makes sense, but there is no further discussion. Even so, why not directly use independent topics? The sketch below probes this.)
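As a rough answer to my own question, here is a small simulation, entirely my own construction, contrasting a diagonal with an off-diagonal $\bSigma$. With independent components of $\bfeta_d$, the softmax normalization still induces mild negative dependence between proportions; only an off-diagonal covariance lets two topics be positively correlated:

```python
import numpy as np

def correlation_of_proportions(Sigma, n_samples=50_000, seed=0):
    """Empirical correlation between theta_0 and theta_1 under a
    zero-mean logistic normal with covariance Sigma."""
    rng = np.random.default_rng(seed)
    eta = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n_samples)
    theta = np.exp(eta)
    theta /= theta.sum(axis=1, keepdims=True)  # softmax -> simplex
    return np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]

K = 3
diag = np.eye(K)                  # independent eta components
coupled = np.eye(K)
coupled[0, 1] = coupled[1, 0] = 0.9  # topics 0 and 1 strongly coupled

print("diagonal Sigma:    ", correlation_of_proportions(diag))     # negative
print("off-diagonal Sigma:", correlation_of_proportions(coupled))  # positive
```

For comparison, any Dirichlet forces a negative covariance between distinct components, so it cannot express the geology–archaeology coupling at all; that is what the covariance matrix buys.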