Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence to one another. As the transformer model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the sequence for clues that can lead to a better encoding of that word.
Let’s dive right in and understand how self-attention is applied in the transformer architecture in the next video.
Self-attention starts with the creation of three different vectors from each of the encoder’s input vectors. So, for each word, we create the following:
- Query vector (Q)
- Key vector (K)
- Value vector (V)
These vectors are abstractions of the input vectors, created by multiplying the input embedding with three weight matrices, Wq, Wk and Wv, respectively. The values of these matrices are learnt during the training process.
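This projection step can be sketched with NumPy as follows. The dimensions and the randomly initialised weight matrices here are illustrative toy values; in a real transformer, Wq, Wk and Wv are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4  # toy sizes; the original paper uses d_model=512, d_k=64
x = rng.standard_normal((3, d_model))  # input embeddings for a 3-token sequence

# Projection matrices (randomly initialised here purely for illustration;
# in practice these are learned parameters)
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q = x @ W_q  # one query vector per token
K = x @ W_k  # one key vector per token
V = x @ W_v  # one value vector per token

print(Q.shape, K.shape, V.shape)  # each is (3, 4)
```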
The authors of the Transformer model proposed ‘scaled dot-product attention’ and then built on it to propose multi-head attention. Within the context of neural machine translation, the queries, keys and values used as inputs to these attention mechanisms are different projections of the same input sentence. In summary, to arrive at the final context vector, we must perform the following mathematical operations:
- Scaled dot-product attention first computes the dot product of each query (Q) with all of the keys (K). This produces a square matrix that captures the similarity score of each token with every other token in the input sequence.
- A scaling factor of 1/√dk (where dk is the dimensionality of the key vectors) is introduced to shrink these dot products. Without it, large dot products push the softmax operation that follows into regions where its gradients are extremely small, which can ultimately lead to a vanishing gradient problem.
- The scores generated earlier should add up to 1, i.e., they should form a probability distribution. Therefore, we apply a softmax operation to normalise them, which gives you the attention weight for each token.
- For a given word/token, the resulting weights are then used to compute a weighted sum of the value vectors, producing the context vector.
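The four steps above can be sketched end-to-end with NumPy. This is a didactic sketch with toy shapes and random inputs, not an optimised implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # Step 1: dot product of each query with all keys -> similarity scores
    scores = Q @ K.T                                   # shape (seq_len, seq_len)
    # Step 2: scale by 1/sqrt(d_k) to keep softmax out of its saturated region
    scores = scores / np.sqrt(d_k)
    # Step 3: row-wise softmax so each row of weights sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 4: weighted sum of the value vectors -> one context vector per token
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape)         # (3, 4): one context vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Subtracting the row maximum before exponentiating is a standard numerical-stability trick; it does not change the softmax result.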
When given a set of value vectors and a query vector, the attention function computes a weighted sum of the values that depends on the query. This weighted sum (the context vector) is a selective summary of the information carried by the values, to which the query attends.
The scaled dot-product attention is formulated as shown below:
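In the notation above, with dk denoting the dimensionality of the key vectors, the formula from the original Transformer paper is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```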