Decoding paper - Parameter-Efficient Transfer Learning for NLP

Paper explanation
Author

Yash Surange

Published

April 9, 2023

In this post I discuss the paper Parameter-Efficient Transfer Learning for NLP. I will go through the intuition behind the paper and write down my understanding of it. If you find any problems, don’t hesitate to contact me.

1 Let’s begin 🏁

1.0.1 Abstract

The authors note that fine-tuning pretrained models is an effective way of doing transfer learning in NLP. However, it becomes parameter inefficient as the number of downstream tasks increases, because an entirely new model has to be created for each task. To make transfer parameter efficient, the authors propose adapter modules.

For each new task, adapters add only a few trainable parameters, and the parameters of the pretrained model remain unchanged. To demonstrate the effectiveness of this approach, the authors transfer BERT to 26 text classification tasks and come within 0.4% of the performance of full fine-tuning, while adding only 3.6% parameters per task.
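To make the storage argument concrete, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that every task adds the same 3.6% of the base model's parameters; the function names are my own and not from the paper.

```python
def full_fine_tuning_cost(num_tasks):
    """Total parameters stored, in units of the base model size |w|.

    Full fine-tuning keeps one complete copy of the model per task.
    """
    return float(num_tasks)


def adapter_cost(num_tasks, added_fraction):
    """Total parameters stored, in units of |w|.

    One shared frozen base model plus a small adapter set per task.
    """
    return 1.0 + num_tasks * added_fraction


# Illustrative numbers: 26 tasks, 3.6% extra parameters per task (assumed uniform).
num_tasks = 26
added_fraction = 0.036

fine_tuning_total = full_fine_tuning_cost(num_tasks)
adapter_total = adapter_cost(num_tasks, added_fraction)

print(f"Full fine-tuning: {fine_tuning_total:.2f} x base model")
print(f"Adapters:         {adapter_total:.2f} x base model")
```

Under these assumptions, full fine-tuning stores 26 copies of the model, while the adapter approach stays below two copies in total.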

1.0.1.1 Points to note:

There are different methods of transfer learning in NLP. These are as follows:

  1. Training the entire architecture (Full fine tuning): The pretrained model is trained in its entirety. All the trainable parameters are updated during backpropagation.

  2. Training some layers and freezing others: The initial layers are frozen and the later layers are trained. In this case, we have to determine experimentally which layers should be frozen.

  3. Freezing the entire architecture: We freeze all the layers of the architecture and add new layers on top of them. We train only the additional layers.
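The three strategies above differ only in which parameters are allowed to change. The toy sketch below makes that explicit. The `Layer` class, the layer names and the parameter counts are all hypothetical and chosen for illustration; a real framework would flip a per-parameter "trainable" flag in a similar way.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    num_params: int
    trainable: bool = True


def build_model():
    """Hypothetical pretrained stack; the sizes are made up."""
    return [Layer("embedding", 1000), Layer("block_1", 500), Layer("block_2", 500)]


def full_fine_tuning(layers):
    """Strategy 1: every pretrained layer is updated."""
    for layer in layers:
        layer.trainable = True
    return layers


def freeze_initial_layers(layers, num_frozen):
    """Strategy 2: freeze the first `num_frozen` layers, train the rest."""
    for index, layer in enumerate(layers):
        layer.trainable = index >= num_frozen
    return layers


def freeze_all_and_add_head(layers, head_params):
    """Strategy 3: freeze everything and train only a newly added layer."""
    for layer in layers:
        layer.trainable = False
    return layers + [Layer("new_head", head_params, trainable=True)]


def trainable_params(layers):
    """Count the parameters that would be updated during training."""
    return sum(layer.num_params for layer in layers if layer.trainable)


print(trainable_params(full_fine_tuning(build_model())))
print(trainable_params(freeze_initial_layers(build_model(), num_frozen=2)))
print(trainable_params(freeze_all_and_add_head(build_model(), head_params=50)))
```

How many layers to freeze in the second strategy (`num_frozen` here) is exactly the choice that has to be found by experiment.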

Reference : Transfer Learning for NLP: Fine-Tuning BERT for Text Classification

1.0.2 Introduction

In this paper, the authors target the online setting, in which tasks arrive in a stream. An example makes this easier to understand. Take Google Translate: the process involves detecting the language and then translating it, and these tasks have to be performed on the go. Logically, we want a model that needs as few extra parameters as possible to adapt to each new task. We also want to make sure that what was learned for earlier tasks is not lost when we train on new ones.

The proposed adapter modules are inserted between the layers of the pretrained model.

The authors introduce the concept behind adapter based fine tuning by first explaining two common methods of transfer learning: feature based transfer learning and fine tuning.

Let’s look at equations to understand feature based transfer learning, fine tuning and adapter based fine tuning.

  1. Feature based transfer learning: Consider a function φ_w(x) (a neural network with parameters w). This method composes φ_w with a new function χ_v to produce χ_v(φ_w(x)). Only the new parameters v are trained.

  2. Fine tuning: For each new task, the original parameters w are updated.

  3. Adapter based fine tuning: A new function ψ_{w,v}(x) is created, where the parameters w are taken from the pretrained model. The parameters v are initialised such that ψ_{w,v}(x) ≈ φ_w(x). Only v is changed during training. The authors also note that if |v| ≪ |w|, then the model requires roughly |w| parameters in total, even for many tasks. Adapter based fine tuning thus enables the model to be extended to many tasks without affecting previous ones.
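The adapter idea can be sketched in plain Python. This is a minimal toy, not the paper's implementation: the pretrained network is stood in for by a single linear map, the adapter is a small bottleneck with a skip connection, and the up-projection starts at exactly zero so that the new function matches the old one at initialisation. The ReLU, the missing biases, the zero initialisation and all names and sizes are my own simplifying assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def phi(w, x):
    """Stand-in for the pretrained network: one linear map with weights w."""
    return matvec(w, x)


def make_adapter(dim, bottleneck, rng):
    """New parameters v: a down-projection and an up-projection.

    The up-projection is all zeros (an assumption of this sketch),
    so the adapter initially adds nothing to its input.
    """
    down = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(bottleneck)]
    up = [[0.0] * bottleneck for _ in range(dim)]
    return {"down": down, "up": up}


def psi(w, v, x):
    """Adapter-augmented function: phi followed by a residual bottleneck."""
    h = phi(w, x)
    z = [max(0.0, a) for a in matvec(v["down"], h)]  # ReLU nonlinearity
    delta = matvec(v["up"], z)
    return [h_i + d_i for h_i, d_i in zip(h, delta)]  # skip connection


rng = random.Random(0)
dim, bottleneck = 8, 2

w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]  # frozen
v = make_adapter(dim, bottleneck, rng)  # the only part that would be trained
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

num_w = dim * dim             # parameters in the pretrained map
num_v = 2 * dim * bottleneck  # parameters in the adapter

print("psi equals phi at initialisation:", psi(w, v, x) == phi(w, x))
print("adapter parameters:", num_v, "vs pretrained parameters:", num_w)
```

Because `w` is never modified, each new task only needs its own copy of `v`, which is why earlier tasks are unaffected when a new adapter is trained.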