In the previous segment, you learnt that a topic is a distribution over terms, i.e. each term has a certain ‘weight’ in each topic (which can also be zero). But is that the only way to define topics? What are the other ways in which we could define topics, and what are their pros and cons?
Let’s discuss these questions in the following lecture.
To summarise, there are two major tasks in topic modelling:
- Estimating the topic-term distribution: in this case, we have defined each topic as a single term (though we’ll change that definition soon).
- Estimating the coverage of topics in a document, i.e. the document-topic distribution:
  $$\text{coverage of topic } j \text{ in document } i = \frac{n_{ij}}{\sum_{j'} n_{ij'}}$$

  where $n_{ij}$ denotes the frequency of topic $j$ in document $i$.
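To make this formula concrete, here is a minimal Python sketch that computes the document-topic distribution, assuming each word in a document has already been mapped to a topic label (the labels below are illustrative placeholders, not from the lecture):

```python
from collections import Counter

def topic_coverage(topic_assignments):
    """Compute the document-topic distribution from per-word topic labels."""
    counts = Counter(topic_assignments)   # n_ij: frequency of topic j in document i
    total = sum(counts.values())          # sum of n_ij over all topics j
    return {topic: n / total for topic, n in counts.items()}

# Example: a five-word document whose words were assigned these (made-up) topics
print(topic_coverage(["food", "food", "travel", "food", "travel"]))
# {'food': 0.6, 'travel': 0.4}
```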
Some problems with defining a topic as a single term are:

- Synonymy: if a document contains several words with the same meaning (such as lunch, food, cuisine, etc.), the model would choose only one of them (say, food) as the topic and ignore all the others.
- Word sense disambiguation: a word with multiple meanings, such as ‘stars’, would be inferred as representing only one topic, even though the document could actually contain both topics (movie stars and astronomical stars).
Thus, we need a richer definition of a topic to solve the problems of synonymy and word sense disambiguation. Let’s discuss the approach used by topic models such as PLSA and LDA.
To summarise, there are multiple advantages of defining a topic as a distribution over terms.
Consider two topics – ‘magic’ and ‘science’. The term ‘magic’ would have a very high weight in the topic ‘magic’ and a very low weight in the topic ‘science’. That is, a word can now have different weights in different topics. You can also represent more complex topics which are hard to define via a single term.
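As a quick illustration, the following sketch represents these two topics as distributions over a tiny five-term vocabulary; all the weights are made up purely for illustration:

```python
# Illustrative (made-up) topic-term weights: each topic is a probability
# distribution over the vocabulary, so the weights within a topic sum to 1.
topics = {
    "magic":   {"magic": 0.40, "wand": 0.25, "spell": 0.20, "experiment": 0.05, "theory": 0.10},
    "science": {"magic": 0.01, "wand": 0.01, "spell": 0.02, "experiment": 0.48, "theory": 0.48},
}

# The same term carries a different weight in each topic
for name, dist in topics.items():
    print(f"P('magic' | topic='{name}') = {dist['magic']:.2f}")
# P('magic' | topic='magic') = 0.40
# P('magic' | topic='science') = 0.01
```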
There are multiple models that represent topics in this manner. Let’s first briefly study the matrix factorisation approach to topic modelling.
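As a preview, here is a minimal sketch of that approach using scikit-learn’s non-negative matrix factorisation (NMF), which factorises the document-term matrix into a document-topic matrix and a topic-term matrix. The toy corpus and the choice of two topics are assumptions made purely for illustration:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative); real topic models need far larger corpora
docs = [
    "the wizard cast a magic spell with his wand",
    "the scientist ran an experiment to test the theory",
    "magic spells and wands fill the wizard tower",
    "the theory was confirmed by a careful experiment",
]

# Build the document-term matrix X (documents x vocabulary terms)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Factorise X ~ W @ H:
#   W is the document-topic matrix (coverage of each topic in each document)
#   H is the topic-term matrix (weight of each term in each topic)
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)   # shape: (n_docs, n_topics)
H = nmf.components_        # shape: (n_topics, n_terms)

# Show the highest-weight terms in each topic
terms = vectorizer.get_feature_names_out()
for k, row in enumerate(H):
    top = row.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```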