
Introduction to Probabilistic Latent Semantic Analysis (PLSA)

Let’s now discuss another approach to topic modelling: PLSA. It can be viewed as a probabilistic generalisation of LSA. You can study topic modelling (PLSA and LDA) in detail in the next (optional) session; the following lectures serve as a primer for it.

To summarise, the basic idea of PLSA is this:

We are given a list of documents and we want to identify the topics being talked about in each document. For example, if the documents are news articles, each article can cover a mix of topics such as elections, democracy and the economy. Similarly, technical documents such as research papers can cover topics such as hypertension, diabetes and molecular biology.

PLSA is a probabilistic technique for topic modelling. First, we fix the number of topics, which is a hyperparameter (say, 20 topics across all documents). The basic model we assume is this: each document is a collection of some topics, and each topic is a collection of some terms.

For example, a topic t1 can be a collection of terms such as (hypertension, sugar, insulin, …), and a topic t2 can be (numpy, variance, learning, …). The topics are, of course, not given to us; we are only given the documents and the terms.

That is, we do not know:

  1. How many topics there are in each document (we only fix the total number of topics across all documents).
  2. What each ‘topic’ c is, i.e. which terms represent each topic.
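To make these two unknowns concrete, here is a minimal sketch of what a solution could look like once it is found. The vocabulary, the two topics and all probability values are hypothetical numbers chosen only for illustration, and the variable names (`topic_word`, `doc_topic`) are not from the lecture.

```python
import numpy as np

# Hypothetical toy vocabulary, reusing the example terms from the text.
vocab = ["hypertension", "sugar", "insulin", "numpy", "variance", "learning"]

# Unknown 2: which terms represent each topic, written as P(w | c).
# One row per topic; each row sums to 1. Values are made up.
topic_word = np.array([
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # t1: medical terms
    [0.00, 0.00, 0.00, 0.30, 0.30, 0.40],  # t2: machine learning terms
])

# Unknown 1: how much of each topic a document contains, written as P(c | d).
# One row per document; each row sums to 1. Values are made up.
doc_topic = np.array([
    [0.9, 0.1],  # document 1 is mostly about t1
    [0.2, 0.8],  # document 2 is mostly about t2
])

# Mixing the topics gives the word distribution of each document:
# P(w | d) = sum over c of P(c | d) * P(w | c)
doc_word = doc_topic @ topic_word
```

In practice we only observe word counts per document; the two matrices above are exactly what the algorithm has to estimate.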

The task of the PLSA algorithm is to figure out the set of topics c. PLSA is often represented as a graphical model in which shaded nodes represent observed random variables (the document d and the term w) and unshaded nodes represent unobserved random variables (the topic c). The basic idea for setting up the optimisation routine is to find the set of topics c that maximises the joint probability P(d, w) of the observed document-term pairs.
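One common way to carry out this maximisation is the expectation-maximisation (EM) algorithm, using the factorisation P(d, w) = P(d) Σ_c P(c | d) P(w | c). The sketch below is a minimal illustration under that assumption, not the implementation used in the course: the function names `plsa` and `log_likelihood`, the toy count matrix and the default settings are all hypothetical. Since P(d) is fixed by the data, maximising P(d, w) over the corpus amounts to maximising the sum of n(d, w) · log P(w | d).

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with EM on a (n_docs, n_words) matrix of term counts.

    Returns P(c | d) with shape (n_docs, n_topics) and
    P(w | c) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random starting point; each row is normalised to a distribution.
    p_c_d = rng.random((n_docs, n_topics))
    p_c_d /= p_c_d.sum(axis=1, keepdims=True)
    p_w_c = rng.random((n_topics, n_words))
    p_w_c /= p_w_c.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(c | d, w) is proportional to P(c | d) * P(w | c).
        # Shape: (n_docs, n_topics, n_words).
        joint = p_c_d[:, :, None] * p_w_c[None, :, :]
        resp = joint / (joint.sum(axis=1, keepdims=True) + EPS)

        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * resp
        p_w_c = weighted.sum(axis=0)
        p_w_c /= p_w_c.sum(axis=1, keepdims=True) + EPS
        p_c_d = weighted.sum(axis=2)
        p_c_d /= p_c_d.sum(axis=1, keepdims=True) + EPS

    return p_c_d, p_w_c

def log_likelihood(counts, p_c_d, p_w_c):
    """Sum over (d, w) of n(d, w) * log P(w | d)."""
    p_w_d = p_c_d @ p_w_c
    return float(np.sum(counts * np.log(p_w_d + EPS)))

# Hypothetical toy corpus: 4 documents over a 6-term vocabulary.
counts = np.array([
    [4, 3, 3, 0, 0, 0],
    [3, 4, 2, 0, 0, 1],
    [0, 0, 0, 4, 3, 3],
    [0, 1, 0, 3, 4, 2],
], dtype=float)

doc_topic, topic_word = plsa(counts, n_topics=2)
print(np.round(doc_topic, 2))   # estimated P(c | d), one row per document
print(np.round(topic_word, 2))  # estimated P(w | c), one row per topic
```

EM only finds a local maximum, so the result depends on the random starting point, and the order of the learned topics is arbitrary. The E-step here builds a dense three-dimensional array, which is fine for a toy example but would need a sparse formulation for a real corpus.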

Also note that the term ‘explicit’ in ESA (explicit semantic analysis) indicates that the topics are represented by explicit terms such as hypertension, machine learning etc., rather than by ‘latent’ topics such as those used by PLSA.

You can learn PLSA (and other techniques in topic modelling) in the next session on topic modelling.
