
'Improving Language Understanding by Generative Pre-Training' Reviewed

Written by Felipe on 18 Aug 2024
  • Introduces an unsupervised pre-training step on a large text corpus

  • Fine-tunes the model on a task-specific dataset with only small additions to the model

  • The model shows improved performance on many different tasks

This paper describes, at a high level, how the authors leveraged a neural network to achieve high-quality results on different benchmarks by first training it in an unsupervised step on a large dataset and then fine-tuning it on a task-specific dataset.

The first step is learning a language model on a large corpus by optimizing the following function:

\[L_1(U) = \sum_{i} \log P(u_i | u_{i - k}, ..., u_{i - 1}; \theta)\]

where \(P\) is the probability that the token \(u_i\) comes after the preceding sequence of size \(k\). The probabilities are modeled by the parameters \(\theta\).
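
For example, with a context window of size \(k = 2\), each term of the sum conditions one token on its two predecessors:

\[\log P(u_3 \mid u_1, u_2; \theta) + \log P(u_4 \mid u_2, u_3; \theta) + \log P(u_5 \mid u_3, u_4; \theta) + \ldots\]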

The network architecture is composed of twelve layers of multi-headed attention blocks, as in the equations below:

\[h_0 = U W_e + W_p\]

\[h_i = transformer(h_{i - 1}) \quad \forall i \in [1, n]\]

\[P(u) = softmax(h_n W_e^T)\]

  • Where \(U\) is the sequence \((u_{i - k}, \ldots, u_{i - 1})\)
  • \(W_p\) is the position embedding matrix
  • \(W_e\) is the token embedding matrix
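
As a minimal sketch of these equations (my own illustration, not the authors' released code), here is a PyTorch version that uses `TransformerEncoderLayer` with a causal mask as a stand-in for the paper's transformer decoder block, with the paper's sizes of 12 layers, 768-dimensional states, and 12 attention heads:

```python
# Sketch only: class and variable names are illustrative.
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=40000, ctx_len=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.W_p = nn.Embedding(ctx_len, d_model)     # position embeddings
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=3072, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, u):                              # u: (batch, seq) token ids
        pos = torch.arange(u.size(1), device=u.device)
        h = self.W_e(u) + self.W_p(pos)                # h_0 = U W_e + W_p
        causal = torch.triu(                           # each token attends only to its past
            torch.ones(u.size(1), u.size(1), dtype=torch.bool, device=u.device), diagonal=1)
        for block in self.blocks:                      # h_i = transformer(h_{i-1})
            h = block(h, src_mask=causal)
        return torch.softmax(h @ self.W_e.weight.T, dim=-1)  # P(u) = softmax(h_n W_e^T)
```

Returning probabilities directly mirrors the last equation; in practice one would usually return the logits and let a cross-entropy loss apply the softmax.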

The paper doesn't go into much detail about how the attention block is composed, which optimization method is used, or how the network weights are adjusted.

During fine-tuning, the network is put through a supervised step to predict the label \(y\), given the added parameters \(W_y\):

\[P(y|x^1,...,x^m) = softmax(h_l^m W_y)\]

Maximizing the objective:

\[L_2(C) = \sum_{(x, y)} \log P(y | x^1,..., x^m)\]
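
A rough sketch of this step, assuming the final transformer activation of the last input token, \(h_l^m\), is already available (names like `d_model` and `n_classes` are my own):

```python
# Illustrative only: the added linear layer W_y and the supervised loss L2.
import torch.nn as nn
import torch.nn.functional as F

class TaskHead(nn.Module):
    def __init__(self, d_model=768, n_classes=2):
        super().__init__()
        self.W_y = nn.Linear(d_model, n_classes, bias=False)  # the only new parameters

    def forward(self, h_last):        # h_last: (batch, d_model) = h_l^m
        return self.W_y(h_last)       # logits; softmax gives P(y | x^1..x^m)

def l2_loss(logits, labels):
    # maximizing L2 is the same as minimizing the cross-entropy of the true labels
    return F.cross_entropy(logits, labels)
```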

An additional objective is used to accelerate convergence, combining the supervised and unsupervised objectives, although the authors don't explain how this acceleration comes about.

\[L_3(C) = L_2(C) + \lambda L_1(C)\]

The authors set \(\lambda\) to 0.5.
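
A minimal sketch of that combination, assuming the model produces both the classification logits and the language-modeling logits for the same batch (function and argument names are mine):

```python
# Illustrative combination of the two losses: L3 = L2 + lambda * L1.
import torch.nn.functional as F

def l3_loss(cls_logits, labels, lm_logits, tokens, lam=0.5):
    l2 = F.cross_entropy(cls_logits, labels)               # supervised task loss
    l1 = F.cross_entropy(                                   # auxiliary language-modeling loss
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),  # predictions at each position
        tokens[:, 1:].reshape(-1))                          # next-token targets
    return l2 + lam * l1
```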

For the task-specific datasets, the inputs are transformed into different arrangements that the model can process (one such transformation is sketched below), but there is no explanation of how these transformations affect the training algorithm.
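
For example, the entailment transformation described in the paper concatenates the ordered sentence pair around a delimiter token, plus start and extract tokens. A rough sketch (the token spellings here are my own):

```python
# Sketch of one input transformation: <start> premise <delim> hypothesis <extract>.
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # illustrative special tokens

def build_entailment_input(premise_tokens, hypothesis_tokens):
    """Join the sentence pair into one sequence so the unchanged pre-trained
    model can process it; the task head reads the activation at the final
    extract token."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]
```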

The paper shows how relevant pre-training is for achieving good results: removing it causes a drop of about 14% in performance.