In the previous lecture, you saw that the main problem with PLSA is its large number of parameters, which grows linearly with the number of documents. Estimating these parameters is not impossible, but it is computationally very expensive.
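To make the linear growth concrete, here is a rough count. It assumes the usual PLSA parameterisation, with one topic distribution per document and one term distribution per topic. The function name and the corpus sizes are hypothetical and chosen only for illustration.

```python
def plsa_parameter_count(num_docs, num_topics, vocab_size):
    """Rough count of the probabilities PLSA has to estimate.

    Assumes one P(topic | document) table per document and one
    P(term | topic) table per topic. Normalisation constraints are ignored.
    """
    doc_topic = num_docs * num_topics      # grows with the size of the corpus
    topic_term = num_topics * vocab_size   # fixed once the vocabulary is fixed
    return doc_topic + topic_term


# Hypothetical sizes: 10 topics and a 5,000-term vocabulary.
for num_docs in (1_000, 10_000, 100_000):
    print(num_docs, plsa_parameter_count(num_docs, 10, 5_000))
```

Only the document-topic part depends on the number of documents, and that is the part that keeps growing as the corpus grows.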
LDA (Latent Dirichlet Allocation) is an alternative topic model that addresses this problem. Unlike PLSA, LDA is a parametric model, i.e. you do not have to learn all the individual probabilities. Rather, you assume that the probabilities come from an underlying probability distribution (the ‘Dirichlet’ distribution), which you can model using a handful of parameters. For example, the normal distribution is parameterized by only two parameters: the mean and the standard deviation. Modelling data (such as the ages of N people) using this distribution means estimating these two parameters. Similarly, in LDA, we assume that the document-topic and topic-term distributions are drawn from Dirichlet distributions (parameterized by some variables), and we want to infer these two distributions.
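As a small illustration of what estimating two parameters means in the normal-distribution example, the sketch below summarises a list of ages by its mean and standard deviation. The ages are made-up values, and using the sample standard deviation is an assumption made here, not something the lecture prescribes.

```python
import statistics

# Hypothetical ages of N = 7 people (illustrative values only).
ages = [23, 35, 31, 42, 28, 39, 30]

# Fitting a normal distribution amounts to estimating just these two numbers.
mean_age = statistics.mean(ages)
std_age = statistics.stdev(ages)  # sample standard deviation

print(f"mean = {mean_age:.2f}, standard deviation = {std_age:.2f}")
```

However many people are in the list, the fitted model is still described by two numbers. LDA relies on the same idea, with the Dirichlet distribution in place of the normal distribution.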
In the subsequent lectures, you will study the learning algorithm of LDA, which can be seen as a generalised form of PLSA. Recall that in PLSA, we assumed a generative process: each document is a distribution over topics and each topic is a distribution over terms.
Let’s have Rahim explain this generative process.
To summarise, the generative process is assumed to be as follows:
For each term position in a document, you first pick a topic from the document-topic distribution, and then pick a term from the topic-term distribution of the chosen topic. You do this for all documents to create the corpus. Both the document-topic and the topic-term distributions are drawn from Dirichlet distributions.
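The generative process summarised above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the learning algorithm: it uses symmetric Dirichlet priors (a single value each for alpha and eta), fixed document lengths, and a toy vocabulary. The names sample_dirichlet and generate_corpus are hypothetical helpers written for this illustration.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one probability vector from a Dirichlet distribution.

    Uses the standard construction: draw independent Gamma(a_i, 1)
    values and normalise them so that they sum to 1.
    """
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, vocab, num_topics, alpha, eta, seed=0):
    """Simulate a corpus following the assumed generative process."""
    rng = random.Random(seed)
    # One topic-term distribution per topic, drawn from Dirichlet(eta).
    topic_term = [sample_dirichlet([eta] * len(vocab), rng)
                  for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One document-topic distribution per document, drawn from Dirichlet(alpha).
        doc_topic = sample_dirichlet([alpha] * num_topics, rng)
        document = []
        for _ in range(doc_length):
            # Step 1: pick a topic from the document-topic distribution.
            topic = rng.choices(range(num_topics), weights=doc_topic)[0]
            # Step 2: pick a term from the chosen topic's term distribution.
            term = rng.choices(vocab, weights=topic_term[topic])[0]
            document.append(term)
        corpus.append(document)
    return corpus


# Toy vocabulary and sizes, chosen only for illustration.
toy_vocab = ["ball", "goal", "vote", "party", "stock", "market"]
toy_corpus = generate_corpus(num_docs=3, doc_length=8, vocab=toy_vocab,
                             num_topics=2, alpha=0.5, eta=0.5)
for document in toy_corpus:
    print(" ".join(document))
```

Learning in LDA runs in the opposite direction: given only the generated documents, you try to infer the document-topic and topic-term distributions that could have produced them.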
Let’s now understand the various parameters involved in LDA modelling.
Note that alpha is the parameter of the Dirichlet distribution from which the document-topic distributions are drawn, while eta is the parameter of the Dirichlet distribution from which the topic-term distributions are drawn. The effect of alpha and eta on the shape of the Dirichlet distributions is explained in the optional content provided above.
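One way to get a feel for alpha is to sample document-topic distributions for a few values and see how concentrated they are. The sketch below assumes a symmetric Dirichlet (the same alpha for every topic) and measures the average share of the most probable topic. The helper names and the chosen values of alpha are hypothetical.

```python
import random


def symmetric_dirichlet_sample(alpha, num_topics, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def average_largest_share(alpha, num_topics=5, trials=2000, seed=0):
    """Average probability of the single most likely topic over many draws."""
    rng = random.Random(seed)
    largest = [max(symmetric_dirichlet_sample(alpha, num_topics, rng))
               for _ in range(trials)]
    return sum(largest) / trials


for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(average_largest_share(alpha), 2))
```

With a small alpha, most of the probability mass tends to fall on one or two topics, so the largest share is high. With a large alpha, the draws move towards an even split across topics. The same reasoning applies to eta and the topic-term distributions.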