
Latent Semantic Analysis (LSA)

Let’s now discuss a frequency-based approach to generating word embeddings: Latent Semantic Analysis (LSA).

Latent Semantic Analysis (LSA) uses Singular Value Decomposition (SVD) to reduce the dimensionality of the document-term matrix. Let’s now visualize the process of latent semantic analysis.

In LSA, you take a noisy, higher-dimensional vector of a word and project it onto a lower-dimensional space. The lower-dimensional space is a much richer representation of the semantics of the word.
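To make this concrete, here is a minimal sketch of the projection using NumPy. The toy document-term count matrix below is hypothetical (purely for illustration); SVD factorises it, and keeping only the top-k singular values and vectors yields k-dimensional latent representations of both documents and terms:

import numpy as np

# Hypothetical toy document-term count matrix: 4 documents x 5 terms.
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float)

# Full SVD: A = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values/vectors (the latent dimensions).
k = 2
doc_vecs = U[:, :k] * s[:k]    # each row: a document in the k-dim LSA space
term_vecs = Vt[:k].T * s[:k]   # each row: a term in the k-dim LSA space

print(doc_vecs.shape)   # (4, 2): 4 documents, each a 2-dimensional vector
print(term_vecs.shape)  # (5, 2): 5 terms, each a 2-dimensional vector

Discarding the smaller singular values throws away much of the noise, which is why the truncated space is often a better semantic representation than the raw counts.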

LSA is widely used when processing large sets of documents for various purposes, such as document clustering and classification (in the lower-dimensional space), comparing similarity between documents (e.g. recommending books similar to ones a user has liked), and finding relations between terms (such as synonymy and polysemy).

Apart from its many advantages, LSA has some drawbacks as well. One is that the resulting dimensions are not interpretable (a typical disadvantage of any matrix-factorisation-based technique, such as PCA). Also, LSA cannot deal with issues such as polysemy. For example, we mentioned earlier that the term ‘Java’ has three senses; the representation of the term in the lower-dimensional space will capture some sort of ‘average meaning’ of the term rather than three different meanings.

However, the convenience offered by LSA probably outweighs its disadvantages, and thus it is a commonly used technique in semantic processing (you’ll study one use case on the next page).

Let’s now learn to implement LSA in Python. 

The task here is to reduce the number of dimensions of the document-term matrix, i.e. reduce the #documents x #terms matrix to a #documents x #LSA_dimensions matrix. Each document vector will then be represented by a smaller number of dimensions. A minimal sketch of this is given below.
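One common way to do this in Python is with scikit-learn, using TfidfVectorizer to build the document-term matrix and TruncatedSVD to reduce it. The toy corpus below is hypothetical; substitute your own documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; any list of document strings works here.
documents = [
    "java is a programming language",
    "python is a popular programming language",
    "java is an island in indonesia",
    "the island has beautiful beaches",
]

# Build the #documents x #terms matrix (tf-idf weighted).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
print(X.shape)        # (#documents, #terms)

# Reduce to #documents x #LSA_dimensions via truncated SVD.
lsa = TruncatedSVD(n_components=2, random_state=42)
X_lsa = lsa.fit_transform(X)
print(X_lsa.shape)    # (#documents, 2)

# Documents can now be compared in the lower-dimensional space,
# e.g. for clustering or recommending similar documents.
print(cosine_similarity(X_lsa))

Note that TruncatedSVD works directly on the sparse matrix produced by the vectorizer, so the full dense document-term matrix never needs to be materialised; this matters for large document collections.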

Let’s now work through a comprehension on LSA on the next page, which will give you more clarity on the topic.
