The Attention mechanism like all other critical model breakthroughs exploits underlying structure in the data. NLP is not a random mixture of word, and therefore a mechanism that can more easily represent the data will be able to perform more powerful functionality.

The attention function is a map between an query and a set of key-value pairs to an output. The output is a weighted sum of the values, where the weight is computed by a compatibility function of the query with the corresponding key.

The paper uses Multi-Head Attention which is composed of the Scaled Dot-Product Attention mechanism:

## Scaled Dot-Product Attention

The left of figure 2 shows the inputs of the Scaled Dot-Product Attention mechanism with a Query and Key-Value pairs, denoted by their respective letters. The output is computed as: $\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$

The two most often used attention functions are additive or dot-product attention. The transformer uses the multiplicative attention with the only difference being a scaling factor of $\frac{1}{\sqrt{d_k}}$. In practice, all attention computations are executed on a matrix of multiple attentions stacked into a single unit.

## Multi-Head Attention

This paper found instead of using a single attention function a model wide, uses a variety of projected dimensions. With this, they can perform operations on all attention heads in parallel, yielding $d_v$-dimensional output values to be concatenated and once again projected to final resulting values. This is shown on the right of fig2. In this work, they used 8 parallel attention heads. With $h=8$ the dimension of each head is the model dimension divided down to $64$.

## Additional notes:

**Position-wise Feed-Forward Networks**: Each encoder or decoder output is equipped with a very simple feed forward network with ReLU activation between two linear layers.

**Embeddings**: Similar to transduction models, the Transformer uses learned embeddings to convert the tokens to a $n$ dimensional vector space.

**Positional Encoding**: In order to compensate the lack of embedded position information inherent in RNNs, they added ‘positional encodings’ to the input embeddings at the bottom of the encoder and decoder stacks.