This segment covers the mathematics of the algorithms of topic modelling.
The three main types of algorithms in topic modelling are as follows:
- Non-negative matrix factorisation
- Latent dirichlet allocation
- Latent semantic analysis
In this session, we will concentrate on a widely used algorithm called non-negative matrix factorisation.
NMF approximates a matrix X as a product of two matrices W, H such that X ~ WH where W >=0 and H>=0.
This is done by minimising the square difference between X and W * H (also know as Frobenius Norm). This is an optimisation problem, as the equation tries to get as close to the minimum value as possible.
min ∥X−WH∥2 such that W ≥ 0 and H ≥ 0
You can manipulate the values in the below Excel sheet and try to decrease the frobenius norm.
To calculate the frobenius norm, we calculate the sum of the square of all the values of the X-WH matrix.
The formula for calculating the frobenius norm for matrix A is as follows where m and n represent number of rows and columns respectively.
∥A∥2F=∑mi=1∑nj=1|ai,j|2
The dimensions of W, H depend on the number of topics. The original matrix X is of the size (number of documents X number of terms). This is broken down into W of the size (number of documents X number of topics) and H of the size (number of topics X number of terms).
Let us solve the following questions based on our understanding:
Now that you learnt how the NMF algorithm works, let’s take a look at an example.
You learnt how NMF works using an example wherein we found topics from a text corpus of the TV series scripts.
You need to keep in mind that NMF or any other topic modelling algorithm does not give you the topics explicitly. Topics are neither coherent or self-contained nor meaningful ideas or concepts. A topic is a bag of words. It is the responsibility of data scientists to assign topics using domain knowledge and logic.
You learnt how the NMF algorithm works and applied it to TV series scripts.
In the next segment, you will learn how to implement the NMF algorithm using Python.