
ADR-001: Few-shot prompting over fine-tuning for classification

Status: Accepted
Date: 2026-02-15
Context module: Core Principles | Patterns

Evidence & Reproducibility Metadata

Evidence type: Internal pilot evaluation (project-specific)
Dataset scope: 1,000 support tickets total (800 development + 200 held-out test)
Model family: Claude instruction-tuned model (provider-hosted)
Evaluation metric: Agreement with human labellers on held-out set
As-of date: 2026-02-15
External reproducibility: Not yet published as an open artifact

Interpretation rule: Numerical values in this ADR are production decision inputs for one internal deployment, not universal benchmarks. If you adopt this pattern, re-run the evaluation on your own dataset before treating the figures as targets.

Context

Our team needed to classify incoming customer support tickets into one of 12 categories (billing, account-access, bug-report, feature-request, etc.). The classifier feeds a routing system that assigns tickets to the correct specialist queue.

Key constraints:

  • The category taxonomy changes roughly once per quarter as the product evolves.
  • We had approximately 800 labelled examples at the time of the decision, unevenly distributed across categories (some had fewer than 20 examples).
  • The system needed to be operational within two weeks.
  • The team had prompt engineering experience but limited ML-ops infrastructure for training and serving custom models.
  • Accuracy target: at least 90 % agreement with human labellers on a held-out test set of 200 tickets.

Decision

Use few-shot prompting with a foundation model (Claude) rather than fine-tuning a custom classifier. The prompt includes:

  1. A system message defining the role and the complete taxonomy with one-line descriptions of each category.
  2. Eight carefully selected exemplar tickets (few-shot examples) covering edge cases and commonly confused categories.
  3. An instruction to output only the category label, followed by a one-sentence justification to enable lightweight auditing.
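
A minimal sketch of how such a prompt can be assembled and sent, assuming the provider-hosted Anthropic Python SDK; the model name, category descriptions, and exemplars shown are illustrative placeholders, not the production values.

```python
# Sketch: few-shot ticket classification as a single provider-hosted API call.
# Assumes the Anthropic Python SDK; the model name, category descriptions, and
# exemplars below are illustrative placeholders, not the production values.
import anthropic

CATEGORIES = {
    "billing": "Charges, invoices, refunds, payment methods.",
    "account-access": "Login problems, password resets, locked accounts.",
    "bug-report": "Existing functionality behaving incorrectly.",
    "feature-request": "A request for new or changed functionality.",
    # ... remaining categories, each with a one-line description
}

# Static exemplars covering edge cases and commonly confused pairs (illustrative).
EXEMPLARS = [
    ("The app crashes when I open the invoice page.", "bug-report",
     "Describes existing functionality failing, not a request for new behaviour."),
    ("It would be great if invoices could be exported as CSV.", "feature-request",
     "Asks for functionality that does not exist yet."),
    # ... six more exemplars in the production prompt
]

SYSTEM = (
    "You are a support-ticket classifier. Assign each ticket exactly one category "
    "from the taxonomy below, then give a one-sentence justification.\n\n"
    + "\n".join(f"- {name}: {desc}" for name, desc in CATEGORIES.items())
    + "\n\nOutput format:\ncategory: <label>\njustification: <one sentence>"
)

def build_messages(ticket_text: str) -> list[dict]:
    """Interleave exemplars as user/assistant turns, then append the real ticket."""
    messages = []
    for text, label, why in EXEMPLARS:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant",
                         "content": f"category: {label}\njustification: {why}"})
    messages.append({"role": "user", "content": ticket_text})
    return messages

def classify(ticket_text: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; pin the version you evaluated
        max_tokens=100,
        system=SYSTEM,
        messages=build_messages(ticket_text),
    )
    return response.content[0].text
```

One possible layout, shown here, interleaves the exemplars as prior conversation turns rather than folding them into the system message; this keeps the taxonomy definitions and the exemplars separately editable, so a quarterly taxonomy change is an edit to CATEGORIES plus, if needed, a new exemplar.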

Rationale

Few-shot prompting was favoured for three main reasons:

  1. Iteration speed. Updating the taxonomy requires editing the prompt, not retraining a model. Given quarterly taxonomy changes, this reduces ongoing maintenance from days to minutes. This aligns with the principle of keeping prompts easily auditable and version-controllable [Brown2020].

  2. Low data regime. Several categories had fewer than 20 examples -- insufficient for reliable fine-tuning. Few-shot prompting leverages the model's pre-trained knowledge to generalise from a handful of exemplars [Brown2020].

  3. Infrastructure simplicity. The team could deploy the classifier as a single API call without building training pipelines, model registries, or GPU serving infrastructure.

On the held-out test set, the few-shot prompt achieved 93 % accuracy, exceeding the 90 % target. We identified the three most frequently confused category pairs from the error analysis and added targeted exemplars for each of those pairs.
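
As an illustration, a minimal sketch of that evaluation, assuming the gold labels and model predictions for the held-out tickets are available as parallel lists; `gold` and `pred` below hold toy values only.

```python
# Sketch: score the held-out set and surface the most frequently confused label pairs.
from collections import Counter

def evaluate(gold: list[str], predicted: list[str], top_k: int = 3):
    """Return (accuracy, top_k most common confused label pairs)."""
    assert len(gold) == len(predicted)
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    # Count unordered pairs of (gold, predicted) labels that disagree.
    confusions = Counter(
        tuple(sorted((g, p))) for g, p in zip(gold, predicted) if g != p
    )
    return accuracy, confusions.most_common(top_k)

# Toy usage (not the production test set):
gold = ["bug-report", "feature-request", "billing", "bug-report"]
pred = ["bug-report", "bug-report", "billing", "bug-report"]
accuracy, worst_pairs = evaluate(gold, pred)
print(f"accuracy: {accuracy:.1%}")          # accuracy: 75.0%
print("most confused pairs:", worst_pairs)  # [(('bug-report', 'feature-request'), 1)]
```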

Alternatives Considered

Alternative A: Fine-tuning a smaller model

Fine-tuning a model like distilbert-base on the 800 labelled examples would have produced a fast, cheap-to-serve classifier. However:

  • The uneven label distribution would have required data augmentation or class-weighting strategies the team was not experienced with.
  • Every taxonomy change would trigger a retrain-evaluate-deploy cycle.
  • The two-week timeline did not leave room for ML-ops setup.

Rejected because maintenance cost and timeline risk were too high.

Alternative B: Zero-shot classification (no exemplars)

A simpler prompt containing only the taxonomy definitions and no examples. Early testing showed 84 % accuracy -- below the 90 % target -- with most errors on the subtle distinction between "bug-report" and "feature-request" tickets.

Rejected because accuracy was insufficient without exemplars to anchor the model's understanding of boundary cases.

Alternative C: Retrieval-augmented few-shot (dynamic example selection)

Instead of static exemplars, retrieve the most similar past tickets from a vector store and inject them at inference time. This is a strong approach but:

  • Required building and maintaining an embedding index.
  • Added latency (~200 ms for retrieval on top of LLM inference).
  • Introduced a dependency on the vector store's availability.

Deferred as a future enhancement if accuracy degrades when the taxonomy grows beyond 20 categories.
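
For reference, a minimal sketch of the deferred selection step, assuming ticket embeddings have already been produced by whichever embedding model is chosen; only the nearest-neighbour lookup is shown, and the names below are illustrative.

```python
# Sketch: choose the k most similar labelled tickets as exemplars at inference time.
# Assumes `exemplar_vecs` (one row per labelled ticket) and `query_vec` come from
# the same embedding model; the embedding call and vector store are not shown.
import numpy as np

def select_exemplars(query_vec: np.ndarray,
                     exemplar_vecs: np.ndarray,
                     exemplars: list[tuple[str, str]],
                     k: int = 8) -> list[tuple[str, str]]:
    """Return the k (ticket_text, label) pairs most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = exemplar_vecs / np.linalg.norm(exemplar_vecs, axis=1, keepdims=True)
    top = np.argsort(m @ q)[::-1][:k]
    return [exemplars[i] for i in top]
```

The selected pairs would simply replace the static exemplars in the prompt sketch under Decision; the ~200 ms retrieval overhead cited above is what this step would add on top of the LLM call.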

Consequences

Positive

  • Achieved 93 % accuracy, exceeding the target.
  • Deployed within one week, well ahead of the two-week deadline.
  • First taxonomy update (adding a "security-incident" category) took 15 minutes: add the definition, add one exemplar, run the test set.
  • The one-sentence justification output enables human auditors to spot-check the classifier's reasoning at scale.

Negative

  • Per-ticket inference cost is higher than for a fine-tuned model (~$0.002 per ticket vs. ~$0.0001). At current volume (5,000 tickets/day) this is approximately $10/day, which is acceptable but not negligible.
  • Latency is ~1.2 seconds per classification vs. ~50 ms for a fine-tuned model. Acceptable for an async routing system but would be problematic for real-time, user-facing classification.

Risks

  • If ticket volume grows 10x, cost becomes significant. Mitigated by monitoring monthly spend and re-evaluating fine-tuning if spend exceeds an agreed monthly threshold.
  • Prompt injection: a malicious ticket could attempt to override the classification instructions. Mitigated by input sanitisation and the constrained output format. See ADR-003 for the safety-gate pattern.
  • Model updates could shift classification behaviour. Mitigated by running the held-out test set after each model version change.
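
A minimal sketch of that last mitigation, reusing the hypothetical `classify` and `evaluate` helpers from the sketches above and gating on the 90 % target from the Context section; the label-parsing line assumes the `category: <label>` output format defined in the prompt sketch.

```python
# Sketch: re-run the held-out test set after a model version change and gate the rollout.
# Assumes `classify` and `evaluate` from the sketches above, and a stored held-out set
# of (ticket_text, gold_label) pairs.
ACCURACY_TARGET = 0.90  # the agreement target from the Context section

def regression_check(held_out: list[tuple[str, str]]) -> bool:
    gold = [label for _, label in held_out]
    predicted = [
        classify(text).splitlines()[0].removeprefix("category: ").strip()
        for text, _ in held_out
    ]
    accuracy, worst_pairs = evaluate(gold, predicted)
    print(f"accuracy after model update: {accuracy:.1%}; worst pairs: {worst_pairs}")
    return accuracy >= ACCURACY_TARGET  # block the rollout if the target is missed
```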