Say you work at a media company (e.g. The Times Group, the Free Press, The Hindu, etc.) and have a set of 100,000 news articles (documents) printed over the past few months. You want to conduct various types of analyses on these articles, such as recommending relevant articles to a given user, comparing the similarity between articles, etc. The size of the vocabulary is 20,000 words.
You create two matrices to capture the data: a vanilla document-term matrix A, where each row represents a document d and each column represents a term w, and a second matrix B, obtained by performing LSA (latent semantic analysis) with k=300 dimensions on matrix A.
Assume that matrix A is constructed using tf-idf weights. Also assume that each document has a label representing its news category (sports, stock market, politics, startups, etc.), as shown below:
|          | t1   | t2   | … | t20000 | label    |
|----------|------|------|---|--------|----------|
| d1       | 0.67 | 1.34 | … | 2.11   | sports   |
| d2       | 0.87 | 0.02 | … | 0.00   | finance  |
| …        | 0.00 | 3.22 | … | 0.56   | …        |
| d100,000 | 0.07 | 0.00 | … | 0.00   | startups |
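The construction of A and B described above can be sketched with scikit-learn on a tiny toy corpus; the real setting would have 100,000 documents, a 20,000-word vocabulary, and k=300 rather than the small numbers used here (the documents, k=2, and variable names are illustrative assumptions, not part of the original problem):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# A toy stand-in for the 100,000-article corpus.
docs = [
    "the team won the championship game",
    "stocks fell as the market closed lower",
    "the startup raised a new funding round",
    "the striker scored twice in the final game",
]

# Matrix A: rows are documents, columns are vocabulary terms, entries are tf-idf weights.
vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(docs)   # shape: (n_docs, vocab_size)

# Matrix B: LSA is a truncated SVD of A; k=2 here stands in for k=300.
svd = TruncatedSVD(n_components=2, random_state=0)
B = svd.fit_transform(A)             # shape: (n_docs, 2)

# One of the analyses mentioned above: pairwise document similarity,
# computed in the reduced LSA space.
sim = cosine_similarity(B)           # shape: (n_docs, n_docs)

print(A.shape, B.shape, sim.shape)
```

Computing similarity on B rather than A is the usual motivation for LSA: the 300-dimensional representation groups terms that co-occur, so two articles can be similar even when they share few exact words.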