Fine-tuning vs Instruction-tuning

Zero-shot inference: Is Fine-tuning or Instruction-tuning better?

NLP is getting all the headlines, largely because modern language models can generate coherent text that approaches the human-written standard. Beyond generation, they can also solve a variety of other tasks such as question answering, linguistic acceptability, summarization, and semantic textual similarity. In this article, I will introduce the popular T5 transformer, the 1.5B-parameter GPT-2, and the 137B-parameter FLAN, and discuss the differences between these architectures, their methodologies, and some recommendations.

Before starting the debate, I would like to introduce T5, the Text-to-Text Transfer Transformer, so that we can better understand the architectures that follow. T5 was first introduced in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". The paper proposes a unified architecture that solves a number of text-based language problems by converting them into a text-to-text format. To give an example, consider translation: the problem is formulated as the input "translate English to German: That is good." and the model's job is to output "Das ist gut.". Similarly, a linguistic acceptability (CoLA) problem is formulated as "cola sentence: The course is jumping well." and the model's job is to predict "not acceptable". The main thing to remember is that a task-specific prefix is added to every input, which is how a single model can be fine-tuned on a mixture of tasks and datasets.
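
To make this text-to-text interface concrete, here is a minimal sketch using Hugging Face Transformers and the public t5-small checkpoint (my choice of size for illustration; the paper trains much larger variants). The two prompts reuse the prefixes from the examples above.

```python
# Minimal sketch of T5's text-to-text interface (assumes the
# `transformers` and `sentencepiece` packages are installed).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is plain text with a task-specific prefix.
prompts = [
    "translate English to German: That is good.",
    "cola sentence: The course is jumping well.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(prompt, "->", tokenizer.decode(outputs[0], skip_special_tokens=True))
```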

Now that we have covered the T5 architecture briefly, let us proceed to GPT-2. By now, we have all heard about the GPT family of large language models. GPT-2 was first introduced in the paper "Language Models are Unsupervised Multitask Learners". Back in 2019, the creators of GPT-2 claimed that their model achieved state-of-the-art results on 7 out of 8 language modeling datasets in a zero-shot setting.

Unlike T5, where a task-specific prefix is added to each training instance, GPT-2 formulates every task as a language modeling task. The paper argues that training and fine-tuning on narrow, single-domain downstream datasets is a major contributor to the lack of generalization in large language models. To address this, the authors collect a wide variety of data that contains natural language demonstrations of many different tasks. The resulting corpus, "WebText", was built by scraping outbound links from Reddit rather than using the Common Crawl web archive directly; the cleaned text is stripped of HTML and excludes Wikipedia pages. GPT-2's zero-shot performance is compared against the prior state of the art on eight language modeling datasets such as LAMBADA, CBT-CN, WikiText, and enwik8.

So, what makes GPT-2 achieve these results?

A language model is generally formulated as estimating \(P(output \mid input)\). GPT-2 modifies this slightly: a system that has to handle many tasks needs to condition not only on the input but also on the task to be performed, so the problem becomes \(P(output \mid input, task)\). Crucially, the task itself is specified as a sequence of symbols; for example, a translation example can be represented as (translate to French, English text, French text). Experiments show that large language models can effectively perform multitask learning in this framework.
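
As a rough illustration of conditioning on the task through plain text, the sketch below prompts the publicly released 124M-parameter gpt2 checkpoint via Hugging Face Transformers. The prompt wording is my own, and this small checkpoint will translate far less reliably than the 1.5B model reported in the paper.

```python
# Rough sketch of P(output | input, task): the task description is just
# more text prepended to the input, and the model continues the sequence.
# The prompt format here is my own illustration, not from the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # 124M checkpoint

prompt = "Translate English to French:\nEnglish: That is good.\nFrench:"
result = generator(prompt, max_new_tokens=10, do_sample=False)
print(result[0]["generated_text"])
```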

The next architecture in the lineup is a 137B-parameter giant. This model was first introduced in the paper "Finetuned Language Models are Zero-Shot Learners". The paper presents a method, instruction tuning, for improving zero-shot learning in large language models: fine-tuning a language model on a collection of datasets described via natural language instructions boosts its zero-shot performance on unseen tasks.

For instruction tuning, 62 publicly available datasets from TensorFlow Datasets, covering language understanding and generation, are collected. Each dataset is assigned to one of twelve task clusters based on its task type, and for each dataset ten templates are written, including a few variations that rephrase or turn the task around to diversify the instructions. The base model, LaMDA-PT, is a 137B-parameter decoder-only transformer pretrained on web documents, dialog data, and Wikipedia. FLAN (Finetuned Language Net) is the instruction-tuned version of LaMDA-PT: it is fine-tuned on the mixture of these 60+ NLP datasets verbalized via the natural language instruction templates, with a randomly selected template used for each training example.
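
To give a feel for what such templates look like, here is an illustrative sketch for a natural language inference dataset. These phrasings are my own paraphrases in the spirit of FLAN, not the exact templates released with the paper.

```python
# Illustrative instruction templates for a natural language inference
# example, in the spirit of FLAN (my own paraphrases, not the released ones).
import random

def nli_templates(premise, hypothesis):
    """Return several natural-language phrasings of the same NLI example."""
    return [
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? OPTIONS: yes, maybe, no",
        "Read the following and decide whether the hypothesis follows:\n"
        f"Text: {premise}\nHypothesis: {hypothesis}\nOPTIONS: yes, maybe, no",
        f'{premise}\nBased on the sentence above, is it true that "{hypothesis}"?\n'
        "OPTIONS: yes, maybe, no",
    ]

# During instruction tuning, one template is sampled at random per example.
templates = nli_templates("The dog is sleeping on the couch.", "An animal is resting.")
print(random.choice(templates))
```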

FLAN is then evaluated zero-shot on held-out task clusters after being trained on the remaining mixture of datasets. To balance dataset sizes, training examples are limited to 30K per dataset, and an examples-proportional mixing scheme with a mixing rate maximum of 3,000 is applied.
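
Below is a small sketch of how such a capped, examples-proportional mixture could be computed; the dataset names and sizes are made up purely for illustration.

```python
# Sketch of capped, examples-proportional mixing: each dataset is limited
# to 30k training examples, and its contribution to the sampling weight is
# capped at 3,000 (the mixing rate maximum). Dataset names/sizes are made up.

def mixing_weights(dataset_sizes, cap=30_000, rate_max=3_000):
    capped = {name: min(size, cap) for name, size in dataset_sizes.items()}
    rates = {name: min(size, rate_max) for name, size in capped.items()}
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}

sizes = {"rte": 2_500, "squad_v1": 88_000, "wmt16_en_de": 4_500_000}
print(mixing_weights(sizes))  # small sets keep full weight, large sets are capped
```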

The model was tested on a variety of tasks including natural language inference, reading comprehension, closed-book QA, translation, commonsense reasoning, coreference resolution, and struct-to-text. For each dataset, performance is reported as the mean over all of its (up to ten) instruction templates. While several task clusters showed strong results, instruction tuning did not help on tasks such as commonsense reasoning and coreference resolution, which are naturally phrased as sentence completion: FLAN outperformed LaMDA-PT on only three of those seven tasks, indicating that instruction tuning is of limited use when the downstream task already looks like the original language modeling objective.

Auto-regressive (causal) language models such as GPT-2 let us train on large amounts of unlabeled data and serve a single general-purpose model. They are attractive when we want zero-shot or few-shot inference, but they require a huge dataset and a huge architecture to start with. Encoder-decoder models fine-tuned in the T5 style, on the other hand, give us a lighter-weight solution, but one that is more task-specific. Understanding these trade-offs between architectures is often harder than the problem itself.

Thank you for reading. See you around with more on large language models.