
Word Embeddings

In the previous segment, you saw how to create different kinds of word vectors. One thing you may have noticed is that the occurrence and co-occurrence matrices have very large dimensions (equal to the size of the vocabulary V). This is a problem because such huge matrices are almost impractical to work with. You will now see how to tackle this problem.

Let’s first summarise all that you have learnt about word vectors till now.

You already know that the occurrence and co-occurrence matrices are sparse (really sparse!) and high-dimensional. Speaking of high dimensionality, why not reduce it using a matrix factorization technique such as SVD?

This is exactly what word embeddings aim to do. Word embeddings are a compressed, low-dimensional version of the mammoth-sized occurrence and co-occurrence matrices.

Each row (i.e. each word) gets a much shorter vector (of size, say, 100 rather than tens of thousands) that is dense, i.e. most entries are non-zero, and you still retain most of the information that the full-size sparse matrix holds.

Let’s see how you can create such dense word vectors.

What are the different ways in which you can generate word embeddings? Let's find out in the following lecture.

Word embeddings can be generated using the following two broad approaches:

  1. Frequency-based approach: Reduce the term-document matrix (which could equally be a tf-idf or incidence matrix) using a dimensionality reduction technique such as SVD (a small sketch of this appears right after this list).
  2. Prediction-based approach: Here, the input is a single word (or a combination of words) and the output is a combination of context words (or a single word). A shallow neural network learns the embeddings such that the output words can be predicted from the input words (a toy sketch of this idea follows a little further below).

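To make the frequency-based route concrete, here is a minimal sketch (not a reference implementation from this course) that builds a term-document occurrence matrix with scikit-learn's CountVectorizer and compresses it with TruncatedSVD. The toy corpus and the choice of 2 components are made up purely for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus, purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

# Term-document (occurrence) matrix: one row per term, one column per document
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(corpus).T

# Compress the sparse matrix with truncated SVD; real embeddings typically use
# ~100-300 components, here we keep only 2 because the toy corpus is tiny
svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(term_doc)  # one dense row (vector) per word

for word, idx in sorted(vectorizer.vocabulary_.items()):
    print(f"{word:>6}: {np.round(embeddings[idx], 3)}")
```

The same recipe works for a tf-idf or co-occurrence matrix: only the input matrix changes, while the factorization step stays the same.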
You will learn prediction-based approaches in more detail shortly.
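Before that, here is a rough sketch of the underlying idea in NumPy: a toy skip-gram-style model in which a word's embedding is adjusted so that its context words become easier to predict. Everything here (the corpus, the vector size of 4, the window, the learning rate) is an assumption made for illustration; real implementations such as Word2Vec use far larger corpora and efficiency tricks like negative sampling.

```python
import numpy as np

# Toy corpus and hyperparameters, chosen only for illustration
corpus = ["the quick brown fox jumps over the lazy dog".split()]
vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}
V, d, window, lr = len(vocab), 4, 2, 0.05

rng = np.random.default_rng(42)
W_in = rng.normal(scale=0.1, size=(V, d))   # input (embedding) matrix: one row per word
W_out = rng.normal(scale=0.1, size=(d, V))  # output (context) matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (centre word, context word) training pairs taken from a sliding window
pairs = []
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs.append((word2id[w], word2id[sent[j]]))

# Plain gradient descent on the softmax cross-entropy loss
for epoch in range(200):
    for centre, context in pairs:
        h = W_in[centre]               # "hidden layer" = embedding of the centre word
        y_hat = softmax(W_out.T @ h)   # predicted distribution over context words
        err = y_hat.copy()
        err[context] -= 1.0            # gradient of the loss w.r.t. the output scores
        grad_h = W_out @ err
        W_out -= lr * np.outer(h, err)
        W_in[centre] -= lr * grad_h

# Each row of W_in is now a dense, low-dimensional vector for one word
print(np.round(W_in[word2id["fox"]], 2))
```

The useful output is not the predictions themselves but the learned matrix W_in: its rows are the dense word embeddings.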
