You already know that neural networks can only process numeric data. In RNNs, you need to feed data in a 3-dimensional tensor where the dimensions are: number of samples, number of timesteps and number of features.
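For instance, here is a minimal sketch (using NumPy, with made-up sizes) of what such a 3-dimensional input looks like:

```python
import numpy as np

# A toy batch: 32 samples, each a sequence of 10 timesteps,
# where each timestep is described by 8 numeric features
batch = np.zeros((32, 10, 8))

print(batch.shape)  # (32, 10, 8) -> (#samples, #timesteps, #features)
```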
The number of samples and the number of timesteps are clear from their names. But what about the number of features? What are the features of an entity? Features are the numeric representation of an entity. Sequences can be of different types: video, time series, text, etc. The features of images and videos are pretty straightforward, that is, you don’t need to derive them: you can use the pixel values of each frame as its features. So for a video, the dimensions will be: (#videos, #frames in each video, #pixels in each frame). Similarly, in most other domains, the features of an entity are pretty clear.
Professor Raghavan talks about how entities are represented in some of these scenarios in the following video.
But what about sequences involving words? How do you represent words? In the case of text, the dimensions of the data will be: (#sequences, #words in each sequence, #features). Let’s look at these one by one:
- #sequences: The number of sequences depends on the sequence length that you consider, which eventually depends on the problem that you’re solving. You can either treat each sentence as a sequence, or you can take a moving window over the text. You’ll look at both of these techniques in the next session while building a part-of-speech (POS) tagger and a code generator using RNNs.
- #sequence length: The number of entities in each sequence is the sequence length. If the sequences do not all contain the same number of entities, you need to make them equal, typically by padding the shorter ones (there is a short padding sketch right after this list). You’ll see how to do that in the next session while building a POS tagger.
- #features: The features that you use are what we will discuss in this section below.
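As a preview of the padding mentioned above, here is a minimal sketch, assuming you use the Keras `pad_sequences` utility (the sequences themselves are made up):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Three integer-encoded sequences of unequal length (made-up data)
sequences = [[1, 2, 3], [4, 5], [6]]

# Pad with zeros at the end so that every sequence has the same length
padded = pad_sequences(sequences, maxlen=3, padding='post')
print(padded)
# [[1 2 3]
#  [4 5 0]
#  [6 0 0]]
```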
However, things are quite different in the case of textual data. The most naive way to represent text would be to replace each word with an integer, called integer encoding. Consider the following sentence:
“Recurrent neural networks are sequence models”
You could represent the above sentence as an array of the form [1, 2, 3, 4, 5, 6], where 1 represents the word “recurrent”, 2 stands for “neural” and so on. However, this is a very naive representation, more of a quick hack just to make the network run.
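A minimal sketch of this kind of integer encoding (the word-to-index mapping used here is just one arbitrary choice):

```python
sentence = "recurrent neural networks are sequence models"

# Assign an integer to each word in the order it appears (arbitrary mapping)
word_to_index = {word: i + 1 for i, word in enumerate(sentence.split())}

encoded = [word_to_index[word] for word in sentence.split()]
print(encoded)  # [1, 2, 3, 4, 5, 6]
```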
Another technique that people often try is to represent each word with its one-hot encoding. Here, you represent each word with a vector whose length is the size of the entire vocabulary. The vector contains zeros everywhere except at one position, which holds a 1. For the sentence “Recurrent neural networks are sequence models”, the one-hot encodings will look like this:
recurrent = [1, 0, 0, 0, 0, 0]
neural = [0, 1, 0, 0, 0, 0]
networks = [0, 0, 1, 0, 0, 0]
are = [0, 0, 0, 1, 0, 0]
sequence = [0, 0, 0, 0, 1, 0]
models = [0, 0, 0, 0, 0, 1]
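A minimal sketch of how these one-hot vectors could be constructed (any indexing scheme works, as long as each word gets a unique position):

```python
import numpy as np

vocabulary = ["recurrent", "neural", "networks", "are", "sequence", "models"]
vocab_size = len(vocabulary)

# One vector per word: a 1 at that word's index and 0 everywhere else
one_hot = {word: np.eye(vocab_size, dtype=int)[i]
           for i, word in enumerate(vocabulary)}

print(one_hot["recurrent"])  # [1 0 0 0 0 0]
print(one_hot["models"])     # [0 0 0 0 0 1]
```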
Each of these vectors has length six, which is the vocabulary size (the number of unique words) of our tiny corpus. Now, it turns out that this is a very popular technique, but it is also a ridiculously sparse representation. As you can see in the above example, each word vector has five 0s and a single 1. Now imagine a corpus with, say, a vocabulary of a million words: each word vector would have 999,999 zeros. There are a couple of problems with the sparse representation produced by one-hot encoding:
- One-hot encoding is unable to represent words meaningfully and, thus, it is unable to capture meaningful relationships among them. While representing images, each pixel can take a value between 0 and 255, and that number tells you about the darkness of the pixel: a value of 0 is black, whereas a value of 255 is white. There is a relationship between pixel values; you can compare two values and tell which pixel is darker and which is lighter. One-hot encoding, however, is not able to capture any kind of relationship between words. Suppose I give you two word vectors, [1, 0, 0] and [0, 0, 1]. Will you be able to tell what kind of words these are or what kind of relationship they have? Would you be able to tell whether the words are singular, plural, masculine, feminine, nouns, adjectives or articles? The answer is no.
- Another big problem with sparse matrices is that they take a lot of computational resources. A one-hot encoding of a corpus is a square matrix (if there are no repeated words) whose side is the vocabulary size. Imagine a corpus with a vocabulary of 20 thousand words. Do you know how much memory a square matrix of size 20k x 20k will take? Run the following Python code to find out. Also, try replacing the vocabulary size with a million and check the result again.
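A minimal sketch of that calculation, assuming the matrix is stored as dense float64 values (the original exercise may have used a slightly different setup):

```python
import numpy as np

vocab_size = 20_000  # try 1_000_000 as well

# A dense one-hot matrix has one row per word and one column per vocabulary
# entry; with float64 values, each entry takes 8 bytes of memory.
memory_bytes = vocab_size * vocab_size * np.dtype(np.float64).itemsize
print(memory_bytes / 1024 ** 3, "GB")  # ~2.98 GB for a 20k x 20k matrix
```

For a vocabulary of a million words, the same calculation gives several terabytes, far more than the RAM of a typical machine.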
In a nutshell, by using one-hot vectors, we waste a lot of resources and still aren’t able to capture any kind of meaning.
There’s another way of representing words which is a recent innovation in the field of NLP. This representation is called a word embedding. In the following video, let’s look at what a word embedding is and how it overcomes the problems with one-hot encoding.
A word embedding is able to represent a word meaningfully, which was the primary drawback of a one-hot encoding. You’ll look at the effectiveness of a word embedding in the next session while building the POS tagger.
Moreover, an embedding saves a lot of memory as compared to a one-hot encoding.
The two most popular embeddings are:
- Word2vec: It was developed at Google. It learns word vectors by training a shallow neural network to predict a word from its surrounding context (or the context from the word).
- GloVe: It was developed at Stanford University. It learns word vectors from the global word-word co-occurrence statistics of the corpus.
Now, the size of a word embedding is (vocabulary size, dimension of the embedding). Vocabulary size is the number of unique words present in the corpus, and dimension is the length of each word vector. For example, suppose a corpus has just three words: “See you tomorrow.” If the dimension of the embedding, that is, the length of each word vector, is 5, then the size of the word embedding will be (3, 5). The following table shows the embedding of the words present in this corpus.
The blue table has size (3, 5). Also, note that each word vector in the embedding is unique. No two words can have the same embedding, because that would mean they are the same word.
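As a minimal sketch, assuming you use a Keras `Embedding` layer (with the vocabulary size and dimension from the example above), this is how the shape of the embedding matrix comes out:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 3      # "see", "you", "tomorrow"
embedding_dim = 5

# The layer holds one trainable vector of length 5 for each of the 3 words
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

# Look up the vectors for the integer-encoded sentence [0, 1, 2]
vectors = embedding(np.array([[0, 1, 2]]))

print(embedding.get_weights()[0].shape)  # (3, 5) -> (vocab size, dimension)
```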
Now, if you look at a distance measure such as the Euclidean distance or the Manhattan distance between a pair of similar words, it will be much smaller than the distance between a pair of unrelated words. For example, the distance between ‘soccer’ and ‘goal’ will be less than the distance between ‘soccer’ and ‘bottle’. Another measure of the relationship between words is cosine similarity, a number between -1 and 1. A pair of similar words will have a higher cosine similarity than a pair of unrelated words.
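A minimal sketch of both measures, using made-up 3-dimensional vectors purely for illustration (real embeddings for ‘soccer’, ‘goal’ and ‘bottle’ would come from a trained model):

```python
import numpy as np

# Made-up vectors, not real embeddings
soccer = np.array([0.9, 0.8, 0.1])
goal   = np.array([0.8, 0.9, 0.2])
bottle = np.array([0.1, 0.2, 0.9])

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar words: small distance, high cosine similarity
print(euclidean(soccer, goal), cosine_similarity(soccer, goal))      # ~0.17, ~0.99
# Unrelated words: large distance, low cosine similarity
print(euclidean(soccer, bottle), cosine_similarity(soccer, bottle))  # ~1.28, ~0.30
```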
In the next section, you’ll learn about bidirectional RNNs, which are an even more powerful way to use RNNs.