Fine-tuning vs RAG vs prompting: a decision guide

Three ways to make a model do what you want — and most teams reach for the heaviest one first. Here is how to choose in the right order.

research · 2026-04-20 10:42 KST · Lead Editor · 7 min read

When a language model does not behave the way you want, you have three broad levers to pull: change the prompt, give the model the right material to read, or change the model's weights. These are prompting, retrieval-augmented generation (RAG), and fine-tuning. They are often discussed as competitors, but they answer different questions, and most teams reach for the heaviest, most expensive one first when a lighter one would have worked.

This guide is about choosing in the right order. The order matters because each lever has a different cost, a different failure mode, and a different problem it actually solves. Get the diagnosis right and the choice usually makes itself.

What each lever actually changes

It helps to be precise about what you are modifying.

Prompting changes the instructions and context you send at request time. The model is untouched; you are steering a fixed system with words and examples.
RAG changes the knowledge available to the model at request time. You fetch relevant documents and place them in the model's context before it answers. The model is still untouched; you have changed what it gets to read.
Fine-tuning changes the model itself. You continue training on your own examples so the weights shift toward your desired behavior. This is the only lever that alters the model.

Notice that two of the three leave the model alone. That is the central insight: most problems are not problems with the model. They are problems with what you asked or what you supplied.
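One way to make the distinction concrete is to model a request as three fields and see which field each lever touches. This is a minimal sketch, not any provider's API: the `Request` class, its field names, and the model names are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Request:
    """Hypothetical view of one model call."""
    model: str            # which weights answer
    instructions: str     # what you ask
    context: tuple = ()   # what the model gets to read


base = Request(model="base-model", instructions="Answer the question.")

# Prompting: only the instructions change.
prompted = replace(base, instructions="Answer in one sentence. Say 'unknown' if unsure.")

# RAG: only the supplied reading material changes.
with_rag = replace(base, context=("Refunds are accepted within 30 days.",))

# Fine-tuning: only the model changes (the tuned model name is made up).
fine_tuned = replace(base, model="base-model-tuned")


def changed_fields(before, after):
    """List the request fields that differ between two requests."""
    names = ("model", "instructions", "context")
    return [name for name in names if getattr(before, name) != getattr(after, name)]


for label, request in [("prompting", prompted), ("rag", with_rag), ("fine-tuning", fine_tuned)]:
    print(label, changed_fields(base, request))
```

Only the fine-tuned request differs in its `model` field; the other two differ only in what is sent alongside it.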

Prompting: the default, not the consolation prize

Prompting has a reputation as the cheap option you use before doing the "real" work. That framing is backwards. Prompting is the first thing to try because it is fast, reversible, and shockingly capable. A clear instruction, a worked example or two, a defined output format, and an explicit statement of what to do when unsure — these resolve a large share of "the model is acting weird" complaints.
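As an illustration, the four ingredients named above can be assembled into a single prompt. This is only a sketch: the `build_prompt` helper, the section labels, and the support-ticket task are assumptions made for the example, not a prescribed template.

```python
def build_prompt(instruction, examples, output_format, fallback, user_input):
    """Assemble a prompt from an instruction, worked examples, an output
    format, and a fallback for uncertain cases (hypothetical layout)."""
    parts = [f"Instruction: {instruction}"]
    for example_input, example_output in examples:
        parts.append(f"Example input: {example_input}\nExample output: {example_output}")
    parts.append(f"Output format: {output_format}")
    parts.append(f"If unsure: {fallback}")
    parts.append(f"Input: {user_input}")
    return "\n\n".join(parts)


prompt = build_prompt(
    instruction="Classify the support ticket by urgency.",
    examples=[
        ("Site is down for all users", "high"),
        ("Typo on the pricing page", "low"),
    ],
    output_format="One word: high, medium, or low.",
    fallback="Answer 'unknown' rather than guessing.",
    user_input="Checkout fails for some customers",
)
print(prompt)
```

Keeping the pieces as separate arguments makes it easy to change one of them, re-run the same test cases, and see which piece actually moved the results.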

Prompting is the right tool when the model already has the capability and knowledge, and you simply need to elicit it reliably. Its limits are also clear. It cannot teach the model facts it never learned, and it cannot reliably enforce behavior across thousands of varied inputs if the instruction is long and brittle. When your prompt grows into a sprawling rulebook that still leaks edge cases, that is a signal you may need a different lever — but you should reach that conclusion by exhausting prompting first, not by skipping it.

RAG: when the problem is knowledge

If the model's failures are about what it knows (it lacks your private documents, it cannot see recent information, it invents specifics it was never given), the problem is knowledge, and the answer is usually RAG, not fine-tuning. This is one of the most common misdiagnoses. Teams feel the model "doesn't know our domain" and assume they must retrain it, when in fact they need to hand it the right pages to read.

RAG shines because knowledge that lives in retrievable documents stays current, auditable, and easy to correct. Update a document and the model's answers update with it. Show which passages were used and a human can verify the answer. Fine-tuning, by contrast, bakes knowledge into weights where it is hard to inspect, hard to update, and prone to drifting out of date. As a rule: if the answer should change when your documents change, use retrieval, not training.
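The "update a document and the answers update with it" property can be shown with a toy retriever. This sketch makes several assumptions: scoring is plain word overlap (a real system would more likely use embeddings or a search index), and the document store, ids, and prompt wording are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(question_words & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_rag_prompt(question, documents):
    """Place the retrieved passages, tagged with their ids, ahead of the question."""
    passages = retrieve(question, documents)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the passages below and cite their ids.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical document store.
documents = {
    "refunds": "Refunds are accepted within 30 days of purchase.",
    "shipping": "Standard shipping takes 5 business days.",
}
question = "Within how many days of purchase are refunds accepted?"

before = build_rag_prompt(question, documents)

# Correct the policy document; no model is retrained.
documents["refunds"] = "Refunds are accepted within 14 days of purchase."
after = build_rag_prompt(question, documents)

print(after)
```

The passage id in the prompt is what makes the answer auditable: a reviewer can trace a claim back to the document it came from.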

Fine-tuning: when the problem is behavior, not facts

Fine-tuning earns its place when you need to change how the model behaves in a way that prompting cannot reliably reach: a consistent tone or format across enormous volume, a narrow specialized task the base model handles clumsily, or a structured output it keeps deviating from despite clear instructions. The signal for fine-tuning is a behavior you can demonstrate with many examples but cannot capture in a short instruction.

It is the heaviest lever for good reason. It requires curated training data, a training run, evaluation, and a maintenance commitment, because a fine-tuned model is a thing you now own and must keep aligned as your needs evolve. Crucially, fine-tuning is poor at teaching facts. It shifts tendencies and styles far more reliably than it implants a knowledge base. Reaching for fine-tuning to fix a knowledge gap is the expensive way to get a fragile result.
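What "a behavior you can demonstrate with many examples" looks like in practice is a dataset of input and output pairs that all exhibit the same pattern. The sketch below is hedged accordingly: the `prompt` and `completion` field names and the JSON Lines layout are assumptions for illustration, and any real training tool defines its own schema. The consistency check is one simple curation step, not a full data pipeline.

```python
import json

# Hypothetical demonstrations of one target behavior: every reply uses the
# same JSON shape. The field names "prompt" and "completion" are assumptions.
examples = [
    {"prompt": "Ticket: Site is down",
     "completion": json.dumps({"urgency": "high", "team": "infra"})},
    {"prompt": "Ticket: Typo on pricing page",
     "completion": json.dumps({"urgency": "low", "team": "web"})},
    {"prompt": "Ticket: Invoice shows wrong total",
     "completion": json.dumps({"urgency": "medium", "team": "billing"})},
]


def completion_keys(example):
    """Return the sorted keys of the JSON object in an example's completion."""
    return tuple(sorted(json.loads(example["completion"]).keys()))


def is_consistent(dataset):
    """True when every completion has the same set of keys."""
    return len({completion_keys(example) for example in dataset}) == 1


def to_jsonl(dataset):
    """Serialize the dataset as one JSON object per line."""
    return "\n".join(json.dumps(example) for example in dataset)


jsonl_text = to_jsonl(examples)
print(is_consistent(examples))
print(jsonl_text)
```

Note that the examples teach a shape of answer, not a body of facts, which matches what this lever is good at.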

The decision in order

A practical sequence, cheapest and most reversible first:

1. Start with prompting. Write the clearest instruction you can, add a few examples, define the output, and state the fallback when the model is unsure. Measure on real cases.
2. If failures are about knowledge, add RAG. Missing facts, stale information, private documents, invented specifics: give the model the right material to read.
3. If failures are about consistent behavior, consider fine-tuning. A demonstrable pattern you cannot compress into an instruction, repeated at scale.
4. Combine when warranted. These are not mutually exclusive. A common mature setup pairs a fine-tuned model with RAG and a careful prompt, each doing the job it is best at.

Most teams should travel down this list, not jump to the bottom. The order is a cost gradient: each step further down the list demands more effort, more data, and more ongoing maintenance.

How to tell which problem you have

The fastest way to choose is to diagnose the failure honestly. Ask of a bad answer: would the right document have fixed this? If yes, it is a knowledge problem, and RAG is your lever. Ask: would a clearer instruction or example have fixed this? If yes, it is a prompting problem. Ask: is this a pattern the model gets wrong consistently, that I can show in many examples but not say in a sentence? If yes, fine-tuning is on the table.

When more than one is true, fix the cheapest one first and re-measure. Often the cheaper fix resolves enough of the problem that the expensive one becomes unnecessary. The teams that struggle are usually the ones that picked a lever based on which sounded most serious, rather than on what the failures were actually made of.
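The three diagnostic questions and the "cheapest first" rule can be sketched as a small triage routine over a set of reviewed failures. Everything here is illustrative: the dictionary keys, the sample failures, and the `diagnose` and `next_fix` helpers are hypothetical, and the yes/no answers still have to come from a person reading each bad answer.

```python
from collections import Counter

# Levers ordered cheapest first, following the sequence in this guide.
COST_ORDER = ["prompting", "rag", "fine-tuning"]


def diagnose(failure):
    """Map the answers to the three diagnostic questions onto levers."""
    levers = []
    if failure["clearer_instruction_would_fix"]:
        levers.append("prompting")
    if failure["right_document_would_fix"]:
        levers.append("rag")
    if failure["consistent_pattern_hard_to_state"]:
        levers.append("fine-tuning")
    return levers


def next_fix(failures):
    """Return the cheapest lever any failure points at, plus the full tally."""
    tally = Counter(lever for failure in failures for lever in diagnose(failure))
    for lever in COST_ORDER:
        if tally[lever] > 0:
            return lever, tally
    return None, tally


# Hypothetical review of three bad answers.
failures = [
    {"clearer_instruction_would_fix": False,
     "right_document_would_fix": True,
     "consistent_pattern_hard_to_state": False},
    {"clearer_instruction_would_fix": False,
     "right_document_would_fix": True,
     "consistent_pattern_hard_to_state": True},
    {"clearer_instruction_would_fix": False,
     "right_document_would_fix": False,
     "consistent_pattern_hard_to_state": True},
]

lever, tally = next_fix(failures)
print(lever, dict(tally))
```

After applying the suggested fix, the same review would be repeated on fresh failures, since the cheaper fix may shrink the tally for the more expensive lever.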

What none of them fix

No lever turns a model into something it is not. Prompting cannot summon knowledge that was never present. RAG grounds answers in supplied text but does not make the model reason better, and it inherits the errors of your documents. Fine-tuning shifts behavior but does not reliably install facts and will not rescue a task the underlying model fundamentally cannot do. All three improve elicitation, grounding, or tendency — none manufacture capability from nothing. Knowing the ceiling of each keeps you from spending weeks on the wrong one.

The takeaway

Prompting, RAG, and fine-tuning are not rivals; they are answers to three different questions. Prompting fixes how you asked. RAG fixes what the model can read. Fine-tuning fixes how the model behaves. Diagnose the failure, then climb the cost gradient only as far as the problem demands — start with the prompt, add retrieval when the gap is knowledge, and reserve fine-tuning for behavior you can demonstrate but cannot state. The cheapest lever that solves your problem is the right one.


Primary sources

Hugging Face: fine-tuning documentation
Anthropic: prompt engineering overview