
Comprehension: Latent Semantic Analysis

Say you work at a media company (e.g. the Times Group, Free Press, The Hindu etc.) and have a set of 100,000 news articles (documents) printed over the past few months. You want to conduct various types of analyses on these articles, such as recommending relevant articles to a given user or comparing the similarity between articles. The size of the vocabulary is 20,000 words.

You create two matrices to capture the data: a vanilla term-document matrix A, where each row represents a document d and each column represents a term w, and a second matrix B, obtained by performing LSA with k = 300 dimensions on A.
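The sizes of the two matrices follow from the numbers given above. The relation below is only a sketch under one common convention: it assumes LSA is carried out as a rank-k truncated SVD of A and that B holds the document coordinates U_k Σ_k (equivalently A V_k). The symbols U_k, Σ_k and V_k are introduced here for illustration, and some treatments use U_k alone instead.

```latex
A \in \mathbb{R}^{100{,}000 \times 20{,}000}, \qquad
A \approx U_k \Sigma_k V_k^{\top}, \qquad
B = U_k \Sigma_k = A V_k \in \mathbb{R}^{100{,}000 \times 300} \quad (k = 300)
```

Under this convention, each document is described by 300 numbers in B rather than 20,000 in A.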

Assume that the matrix A is constructed using tf-idf frequencies. Also, assume that each document has a label representing its news category (sports, stock market, politics, startups etc.).

            t1      t2      t20000   label
d1          0.67    1.34    2.11     sports
d2          0.87    0.02    0.00     finance
…           0.00    3.22    0.56     …
d100,000    0.07    0.00    0.00     startups
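To make the construction of A and B concrete, here is a minimal sketch in Python with NumPy. Everything specific in it is an assumption for illustration: the four-sentence toy corpus stands in for the 100,000 articles, the smoothed idf formula is just one common tf-idf variant, k is set to 2 instead of 300, and B is taken to be U_k Σ_k from a truncated SVD of A.

```python
import numpy as np

# Hypothetical toy corpus standing in for the 100,000 news articles.
docs = [
    "team wins the cricket match",
    "cricket team loses the final match",
    "stock market falls as shares drop",
    "shares rise and the stock market rallies",
]

# Vocabulary: one column per distinct term.
vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# Raw term counts: rows are documents, columns are terms.
tf = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d.split():
        tf[i, col[w]] += 1

# One common tf-idf variant (smoothed idf); other weightings exist.
df = (tf > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + df)) + 1
A = tf * idf  # shape: (n_docs, n_terms)

# LSA as a truncated SVD of A, keeping the top k singular directions.
k = 2  # the text uses k = 300; 2 is enough for this toy corpus
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k] * s[:k]  # shape: (n_docs, k)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Compare two sports-like documents with a sports/finance pair in LSA space.
sim_same = cosine(B[0], B[1])
sim_diff = cosine(B[0], B[2])
print(A.shape, B.shape)
print(sim_same, sim_diff)
```

On a real corpus of this size, A would normally be stored as a sparse matrix and the decomposition computed with a truncated solver (for example scikit-learn's TruncatedSVD) rather than a full dense SVD as done here.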
