# Instruction Tuning: T0, Flan, and InstructGPT Compared
Level: Intermediate–Advanced
This document compares three landmark instruction-tuning approaches — T0 [Sanh2022], Flan [Chung2022], and InstructGPT [Ouyang2022] — and explains how they shape the models that prompt engineers work with today.
## Why This Matters for Prompt Engineering
Instruction tuning is the process of fine-tuning a pretrained LLM on a dataset of (instruction, response) pairs so that the model learns to follow natural-language instructions. Understanding how models were instruction-tuned helps prompt engineers write better prompts, because it reveals what kinds of instructions the model was trained to follow and where its instruction-following capabilities may break down.
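As a concrete illustration, the sketch below shows what a single (instruction, response) training record might look like and how it could be serialized into one text sequence for supervised fine-tuning. The field names and the serialization template are illustrative assumptions, not the exact format used by any of the papers discussed here.

```python
# Illustrative only: the field names and serialization template are assumptions,
# not the exact format used by T0, Flan, or InstructGPT.
example = {
    "instruction": "Summarize the following article in one sentence.",
    "input": "The city council voted on Tuesday to expand the bike-lane network ...",
    "response": "The city council approved an expansion of the bike-lane network.",
}

def to_training_text(record: dict) -> str:
    """Serialize an (instruction, response) pair into a single training sequence.
    During supervised fine-tuning, the loss is typically computed only on the
    response tokens."""
    return (
        f"Instruction: {record['instruction']}\n"
        f"Input: {record['input']}\n"
        f"Response: {record['response']}"
    )

print(to_training_text(example))
```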
## Approach 1: T0 (Multitask Prompted Training)
Source: Sanh et al. (2022) [Sanh2022]
Base model: T5 (11 billion parameters, encoder-decoder architecture).
Training method: T0 was trained on a diverse set of NLP tasks (sentiment analysis, question answering, summarization, etc.) reformatted as natural-language prompts using templates from PromptSource [Bach2022]. The key innovation was using a large variety of prompt templates per task, exposing the model to many different ways of expressing the same instruction.
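The sketch below illustrates the template idea: one labeled example is rendered through several differently worded prompts. The templates here are hypothetical stand-ins, not the actual PromptSource templates.

```python
# Hypothetical templates in the spirit of PromptSource; the real T0 templates differ.
review = "The plot was predictable, but the performances were outstanding."

templates = [
    "Is the following review positive or negative?\n{review}",
    "Review: {review}\nDid the reviewer enjoy the movie? Answer yes or no.",
    "{review}\nWhat is the sentiment of this review?",
]

# One underlying example becomes several distinct training prompts, which is what
# exposes the model to many phrasings of the same instruction.
for template in templates:
    print(template.format(review=review))
    print("---")
```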
Key findings:

- T0 achieved zero-shot performance competitive with GPT-3's few-shot performance on several benchmarks, despite being approximately 16× smaller.
- Exposure to diverse prompt phrasings during training improved robustness: T0 was less sensitive to the exact wording of prompts than models trained on a single template per task.
- Performance improved with both the number of training tasks and the diversity of prompt templates.
Implications for prompt engineering:

- Models trained with T0-style methods are relatively robust to prompt phrasing variation.
- These models respond well to direct task descriptions without examples (see the sketch after this list).
- The training methodology favors classification and short-answer tasks; long-form generation was less emphasized.
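Below is a minimal sketch of the kind of direct, zero-shot task description these models were tuned on. The wording is illustrative, not a template from the T0 training mixture.

```python
# Illustrative zero-shot prompt for a natural language inference task;
# the phrasing is an assumption, not an actual PromptSource template.
premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."

prompt = (
    f"Premise: {premise}\n"
    f"Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe."
)
print(prompt)
```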
## Approach 2: Flan (Scaling Instruction-Finetuned Models)
Source: Chung et al. (2022) [Chung2022]
Base models: T5, PaLM (including PaLM 540B).
Training method: Flan extended the T0 approach by significantly scaling the number of tasks (1,800+ tasks), adding chain-of-thought training data, and applying instruction tuning to much larger models. Flan-PaLM was trained on both standard instruction-following tasks and tasks requiring explicit reasoning steps.
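The sketch below shows the general shape of a chain-of-thought training record, in which the target includes a worked rationale before the final answer. The field names are assumptions, and the arithmetic problem is the widely cited tennis-ball example from the CoT literature rather than an actual record from Flan's data.

```python
# Illustrative CoT-style training record; field names are assumptions and the
# exact format of Flan's CoT mixture is not reproduced here.
cot_record = {
    "question": (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    "rationale": "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
    "answer": "11",
}

# During CoT fine-tuning, the target sequence contains the rationale followed by
# the final answer, so the model learns to emit reasoning steps before answering.
target = f"{cot_record['rationale']} The answer is {cot_record['answer']}."
print(target)
```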
Key findings:

- Scaling the number and diversity of instruction-tuning tasks continued to improve performance.
- Including chain-of-thought (CoT) reasoning data during fine-tuning improved the model's ability to produce reasoning traces at inference time, even on tasks not seen during training.
- Flan-PaLM achieved strong results across a wide range of benchmarks, outperforming the base PaLM model on both standard and reasoning-intensive tasks.
Implications for prompt engineering:

- Flan-style models respond well to both direct instructions and "think step by step" reasoning requests (a minimal example follows this list).
- The inclusion of CoT data during training means that chain-of-thought prompting (Module 3, §3.4) is particularly effective with these models.
- Larger, more diversely tuned models are generally more responsive to complex, multi-requirement prompts.
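A minimal example of eliciting that behavior at inference time is simply appending a reasoning request to an otherwise ordinary question. The trigger phrase below is one common choice, not the only wording that works.

```python
# Illustrative zero-shot chain-of-thought prompt; question and trigger phrase
# are example wording, not taken from the Flan paper.
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
prompt = f"{question}\nLet's think step by step."
print(prompt)
```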
## Approach 3: InstructGPT (RLHF Alignment)
Source: Ouyang et al. (2022) [Ouyang2022]
Base model: GPT-3 (various sizes, up to 175B parameters, decoder-only architecture).
Training method: InstructGPT used a three-stage process:

1. Supervised fine-tuning (SFT): GPT-3 was fine-tuned on a dataset of (prompt, ideal response) pairs written by human annotators.
2. Reward model training: Human raters ranked multiple model outputs for the same prompt. A reward model was trained to predict these rankings.
3. Reinforcement learning from human feedback (RLHF): The SFT model was further optimized using PPO (Proximal Policy Optimization) to maximize the reward model's score.
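In stage 2, the reward model is trained with a pairwise ranking loss: it should score the human-preferred response above the rejected one for the same prompt. The sketch below implements that loss for a single comparison; the scalar inputs stand in for reward-model scores.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for a single comparison:
    loss = -log(sigmoid(r_chosen - r_rejected)),
    where r_chosen / r_rejected are reward-model scores for the human-preferred
    and rejected responses to the same prompt."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Low loss when the reward model already ranks the preferred response higher,
# high loss when it prefers the rejected one.
print(round(reward_model_loss(2.0, 0.5), 3))  # ~0.201
print(round(reward_model_loss(0.5, 2.0), 3))  # ~1.701
```

Stage 3 then optimizes the SFT policy with PPO to maximize this learned reward, with a KL penalty toward the SFT model to keep generations from drifting too far from it.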
Key findings:

- InstructGPT 1.3B (a much smaller model) was preferred by human raters over the base GPT-3 175B, demonstrating that alignment training can be more impactful than raw model scale.
- InstructGPT produced fewer harmful, untruthful, and unhelpful outputs compared to the base GPT-3.
- The model showed improved instruction following, especially for nuanced requests involving tone, format, and constraint adherence.
Implications for prompt engineering:

- RLHF-aligned models respond well to natural, conversational instructions; they were trained to interpret human intent even when instructions are imprecise.
- These models are generally better at following negative constraints ("do not ...") and format specifications.
- The alignment process may introduce a tendency toward verbose, cautious responses ("hedging"), which can be counteracted with explicit brevity instructions (see the example after this list).
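An explicit brevity constraint of the kind mentioned above might look like the following; the wording is illustrative.

```python
# Illustrative prompt pairing a question with explicit brevity and negative
# constraints to counteract the verbose, hedging tendency of RLHF-aligned models.
prompt = (
    "Explain what a reward model does in RLHF.\n"
    "Answer in at most two sentences. Do not add caveats or disclaimers."
)
print(prompt)
```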
## Side-by-Side Comparison
| Dimension | T0 [Sanh2022] | Flan [Chung2022] | InstructGPT [Ouyang2022] |
|---|---|---|---|
| Architecture | Encoder-decoder (T5) | Encoder-decoder (T5) + decoder-only (PaLM) | Decoder-only (GPT-3) |
| Key training signal | Diverse prompted tasks | Scaled prompted tasks + CoT data | Human preferences via RLHF |
| Scale demonstrated | 11B parameters | Up to 540B parameters | Up to 175B parameters |
| Zero-shot strength | Classification, short QA | Reasoning, diverse tasks | Instruction following, safety |
| Prompt sensitivity | Lower (diverse templates) | Lower (diverse tasks) | Lower (human-preference tuning) |
| Reasoning capability | Moderate | Strong (CoT training) | Moderate (not CoT-trained) |
| Alignment / safety | Not specifically targeted | Not specifically targeted | Core objective |
| Open-source availability | Yes (T0 weights available) | Partially (Flan-T5 available) | No (proprietary) |
## How These Approaches Relate to Modern Models
Most modern instruction-following LLMs combine elements from all three approaches:
- GPT-4, Claude, and Gemini use RLHF-style preference training (the InstructGPT lineage) for alignment and safety, combined with diverse instruction-tuning data (the T0/Flan lineage) for broad task competence.
- Open-source models (Llama, Mistral fine-tunes) often use supervised instruction tuning with community-generated datasets, following the T0/Flan methodology, sometimes augmented with RLHF or DPO (Direct Preference Optimization).
For prompt engineers, the practical takeaway is that modern models are responsive to both explicit instructions (T0/Flan heritage) and natural conversational guidance (InstructGPT heritage). The most effective prompts leverage both: explicit structural instructions for format and task specification, combined with natural-language guidance for tone, role, and reasoning style.
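A sketch of a prompt that combines both heritages: explicit structural requirements (T0/Flan lineage) plus a conversational role and tone instruction (InstructGPT lineage). Everything in it, including the task and the `pr_description` placeholder, is hypothetical.

```python
# Illustrative prompt combining explicit structure with conversational role/tone
# guidance; the task, wording, and pr_description field are hypothetical.
template = """You are a careful technical reviewer writing for a release changelog.

Task: Summarize the pull request description below.

Requirements:
- Output exactly 3 bullet points.
- Keep each bullet under 20 words.
- Use plain, non-promotional language.

Pull request description:
{pr_description}
"""

print(template.format(
    pr_description="Adds retry logic with exponential backoff to the HTTP client."
))
```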
## Cross-References
- Module 1 (01-introduction.md, §1.1) introduces why prompt engineering matters in the context of instruction-following models.
- Module 3 (03-patterns.md) catalogs patterns that work because of instruction tuning, especially zero-shot (§3.2) and chain-of-thought (§3.4).
- The PromptSource Comparison examines the template system used to train T0.
- The Chain-of-Thought Comparison explores CoT in detail, including its connection to Flan's CoT training data.
## References
- [Sanh2022] Sanh, V., et al. (2022). Multitask prompted training enables zero-shot task generalization. ICLR.
- [Chung2022] Chung, H. W., et al. (2022). Scaling instruction-finetuned language models. arXiv preprint.
- [Ouyang2022] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 35, 27730–27744.
- [Bach2022] Bach, S. H., et al. (2022). PromptSource: An integrated development environment and repository for natural language prompts. ACL 2022 System Demonstrations, 93–104.
See references.md for full citations with DOIs.