In this session, you studied the idea of distributional semantics and word vectors in detail. You learnt that words can be represented as vectors and that the usual vector algebra operations – such as addition, dot products and similarity measures – can be performed on these vectors.
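As a minimal sketch of such vector operations, the snippet below computes cosine similarity between toy word vectors. The three-dimensional vectors and their values are made up purely for illustration; real word vectors typically have tens of thousands of (sparse) dimensions, or a few hundred dense ones.

```python
import numpy as np

# Toy 3-dimensional word vectors (values are invented for illustration;
# real word vectors are far higher-dimensional).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words should point in similar directions.
print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much smaller
```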
Word vectors can be represented as matrices in broadly two ways – using term-document (occurrence-context) matrices or term-term co-occurrence matrices. Further, there are various techniques to create co-occurrence matrices, such as context-window-based co-occurrence counts, skip-grams, etc.
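To make the idea of a term-term co-occurrence matrix concrete, here is a small sketch that counts, for a toy two-sentence corpus, how often word pairs appear within a fixed context window of each other. The corpus, window size and function name are all illustrative choices, not part of any standard library.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered pair of words appears within
    `window` positions of each other (a term-term co-occurrence matrix
    stored sparsely as a dict of pair -> count)."""
    counts = defaultdict(int)
    vocab = set()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            vocab.add(w)
            for j in range(max(0, i - window),
                           min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(w, tokens[j])] += 1
    return counts, sorted(vocab)

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
counts, vocab = cooccurrence_counts(sentences)
print(counts[("cat", "sat")])  # "cat" and "sat" fall in the same window
```

Note that most word pairs never co-occur, which is exactly why the full vocabulary-by-vocabulary matrix is sparse.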
You studied that word vectors created using both of the above techniques (term-document/occurrence-context matrices and term-term co-occurrence matrices) are high-dimensional and sparse.
Word embeddings are a lower-dimensional representation of the word vectors. There are broadly two ways to generate word embeddings – frequency-based and prediction-based:
- In the frequency-based approach, you take the high-dimensional occurrence-context or co-occurrence matrix and generate word embeddings by performing dimensionality reduction on it using matrix factorisation (e.g. LSA, which uses singular value decomposition).
- The prediction-based approach involves training a shallow neural network that learns to predict the words in the context of a given input word. The two widely used prediction-based models are the skip-gram model and the Continuous Bag of Words (CBOW) model. In the skip-gram model, the input is the current (target) word and the outputs are the context words; in CBOW, the context words are the input and the target word is the output. The embeddings are then given by the weight matrix between the input layer and the hidden layer. word2vec and GloVe are two of the most popular pre-trained word embeddings available for use.
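The frequency-based route above can be sketched in a few lines: take a (toy) term-term co-occurrence matrix and truncate its singular value decomposition, as LSA does, to obtain dense low-dimensional embeddings. The five-word vocabulary and the counts in the matrix are made-up stand-ins for a matrix built from a real corpus.

```python
import numpy as np

# A tiny symmetric term-term co-occurrence matrix (5 words,
# made-up counts) standing in for a large sparse corpus matrix.
vocab = ["cat", "dog", "sat", "mat", "rug"]
M = np.array([
    [0, 2, 4, 3, 0],
    [2, 0, 4, 0, 3],
    [4, 4, 0, 2, 2],
    [3, 0, 2, 0, 1],
    [0, 3, 2, 1, 0],
], dtype=float)

# Truncated SVD: keep only the top-k singular directions, as in LSA.
# In practice k (e.g. 100-300) is far smaller than the vocabulary size.
k = 2
U, s, Vt = np.linalg.svd(M)
embeddings = U[:, :k] * s[:k]  # one dense k-dimensional row per word

print(embeddings.shape)  # each word is now a 2-dimensional dense vector
```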
You also studied the notion of aboutness and the task of topic modelling – a piece of text is usually about one or more ‘topics’. There are multiple techniques used for topic modelling, such as ESA, PLSA, LDA, etc. You will study PLSA and LDA in detail in the next session.
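Ahead of that detailed treatment, the sketch below shows what a topic model produces in practice, using scikit-learn's LDA implementation on a four-document toy corpus (the documents and the choice of two topics are illustrative assumptions). Each document comes out as a distribution over topics, capturing the idea that text is about one or more topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus loosely mixing two themes (sport and cooking);
# real topic modelling needs far more, and longer, documents.
docs = [
    "the team won the football match",
    "the striker scored a goal in the match",
    "add salt and pepper to the soup",
    "simmer the soup and season with salt",
]

# Turn the documents into a document-term count matrix, then fit LDA.
X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is one document's probability distribution over the 2 topics.
doc_topics = lda.transform(X)
print(doc_topics.shape)  # (4, 2)
```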