
Understanding the Encoder

The role of an encoder is to observe an entire input sequence in a single shot and, using the attention mechanism, map the relative importance of each token with respect to every other token in the sequence.

In the next video, Ankush will introduce you to the components inside an encoder block and how they process each input.

As explained in the video, any input fed into an encoder undergoes the following steps:

  • A given input sequence (“a black dog on blue mat”) is converted into its respective tokens (‘a’, ‘black’, ‘dog’, ‘on’, ‘blue’, ‘mat’).
  • Each token is then translated into its numerical representation (token ID) by a tokenizer, and these IDs are passed on to the embedding layer.
  • The input embedding layer generates a meaningful embedding of dimension 512 for each token. For example, for the six tokens above (shape 6*1), the input embedding layer converts the entire input into the shape 6*512 (one vector of size 512 for each token); see the sketch after this list.
  • Words of a similar family have similar embeddings. Therefore, if we get the words ‘blue’ and ‘black’, their embeddings should be similar. However, the same would not be the case for the words ‘black’ and ‘on’.
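To make these steps concrete, here is a minimal sketch in Python. The vocabulary and the randomly initialised embedding matrix are toy assumptions for illustration; in a real model, the tokenizer is far richer and the embedding weights are learnt during training.

```python
import numpy as np

# Toy vocabulary and embedding table (hypothetical; a real tokenizer and
# embedding layer are configured/learnt as part of the model).
vocab = {"a": 0, "black": 1, "dog": 2, "on": 3, "blue": 4, "mat": 5}
d_model = 512

# Randomly initialised embedding matrix of shape (vocab_size, d_model).
embedding_matrix = np.random.randn(len(vocab), d_model)

sentence = "a black dog on blue mat"
tokens = sentence.split()                # ['a', 'black', 'dog', 'on', 'blue', 'mat']
token_ids = [vocab[t] for t in tokens]   # numerical representation of each token

# Embedding lookup: one 512-dimensional vector per token -> shape (6, 512).
input_embeddings = embedding_matrix[token_ids]
print(input_embeddings.shape)            # (6, 512)
```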

However, if these inputs are provided in parallel, the information about their positions is lost, and the embedding layer cannot account for where any two words appear. So, how can we tell the encoder that the words ‘black’ and ‘blue’ are located far apart in the given sentence, even though their embeddings are similar?

  • To resolve this problem, we pass another embedding, along with the input embedding, so that the information about the position of every token is also considered. These embeddings are called positional embeddings.
  • You can think of these embeddings as a disturbance added to the original embeddings, one that pushes tokens apart according to their positions.
  • However, if we push the words too far apart, the positional information will overwhelm the semantic information. Using Fourier analysis, the authors of the Transformer model proposed an index-dependent function that encodes the position of each word as a sinusoidal wave while keeping the values small.

The diagram above represents the addition of these disturbances, or the ‘push’ that the position vectors exert on the embeddings. The arrows in blue push towards one cluster, while the arrows in red push towards another cluster.

Let’s understand the concept of positional embeddings in detail in the next video.

The addition of these vectors should satisfy certain conditions for the efficient processing of information:

  • It should output a unique encoding for each time step (word’s position in a sentence).
  • The distance between any two time steps should be consistent across sentences of different lengths.
  • Our model should generalise to longer sentences without any effort. Its values should be bounded.
  • It must be deterministic.

To satisfy all these constraints, the authors of the Transformer model suggested using sine and cosine waves of different frequencies. Using Fourier analysis, they proposed an index-dependent function that encodes the position of each word as a sinusoidal/cosine wave while keeping the values bounded.

$$PE_{(pos,\,2i)}=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Here, pos is the position of the word in the sequence, i is the i-th dimension of the word embedding, and d_model is the number of dimensions in the embeddings.

In summary, for every token at index ‘pos’ in the output of the ‘input embedding’ layer, we generate a position vector of dimension 512 (i.e., d_model, which is equal to the embedding dimension of each token). For every even embedding index (0, 2, 4, …, 510), we follow the first formula, and for every odd index (1, 3, 5, …, 511), we follow the second one.
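The following is a minimal NumPy sketch of these two formulas; the function name positional_encoding is our own choice for illustration, and d_model is assumed to be 512 as in the text.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int = 512) -> np.ndarray:
    """Sinusoidal positional encodings, following the two formulas above."""
    pos = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # 2i = 0, 2, ..., 510
    angles = pos / np.power(10000, even_dims / d_model)  # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even embedding indices use the sine formula
    pe[:, 1::2] = np.cos(angles)   # odd embedding indices use the cosine formula
    return pe

# For the six tokens in our example, the position vectors have shape (6, 512)
# and are added element-wise to the input embeddings before the encoder:
# encoder_input = input_embeddings + positional_encoding(seq_len=6)
```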

Having a periodic function ensures that the values are mapped to a fixed interval. It also avoids the saturation of values seen with sigmoid or hyperbolic tangent functions. Each position is completely identified by the frequency and offset of its sine/cosine waves.
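As a quick sanity check, the sketch below (reusing the positional_encoding function from the previous snippet) verifies the properties listed earlier: the values are bounded, each position gets a unique, deterministic encoding, and the dot product between encodings a fixed number of steps apart does not depend on the absolute position.

```python
pe = positional_encoding(seq_len=50)

# Bounded: every value lies in [-1, 1], since sine and cosine are bounded.
assert pe.min() >= -1.0 and pe.max() <= 1.0

# Unique and deterministic: no two positions share the same encoding.
assert len(np.unique(pe, axis=0)) == 50

# Consistent distance: the dot product between encodings k steps apart
# depends only on k, not on the absolute position.
k = 3
print(np.dot(pe[0], pe[0 + k]))    # prints the same value ...
print(np.dot(pe[20], pe[20 + k]))  # ... as this line (up to floating point)
```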

Therefore, by adding a positional encoding block before passing the embedded output to the encoder, we can successfully pass on the positional information of the respective tokens.

Once the positional embeddings are added, next comes the core of the encoder: attention. In the next segment, we will take a deep dive into the self-attention mechanism in the Transformer architecture.
