Data extraction with LLMs: turning messy text into tables
Turning unstructured text into clean rows and columns is where LLMs quietly shine — if you define the schema, validate every field, and plan for the messy inputs.
A great deal of useful information lives in text that no spreadsheet can read: emails, invoices, resumes, support tickets, contracts, notes. The dream is to pour all of it into clean rows and columns, and this is one of the tasks where language models genuinely deliver. Unlike free-form generation, extraction has a right answer — the value is either in the text or it isn't — which makes it both more useful and more checkable than most AI use cases. This is a conceptual walkthrough of how to do it well, and where it goes wrong.
Start from the schema, not the text
The most common mistake is to ask the model to "pull out the important information." That instruction has no right answer, so you get inconsistent fields that change shape from one document to the next and can never be loaded into a table. Begin instead by defining the target schema precisely: the exact fields you want, the type of each (text, number, date, category, boolean), and what each one means. "Invoice total as a number, currency as a three-letter code, due date as YYYY-MM-DD" is a target the model can hit consistently. "The money stuff" is not. The schema is the specification; everything downstream depends on getting it explicit first.
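To make this concrete, here is a minimal sketch of a schema written down as data and rendered into explicit instructions. The field names, types, and meanings (INVOICE_SCHEMA, schema_to_instructions) are illustrative assumptions for an invoice use case, not a prescribed format.

```python
# Hypothetical target schema for invoices. Field names and meanings are
# illustrative assumptions, not taken from any real system.
INVOICE_SCHEMA = {
    "invoice_number": {"type": "text", "meaning": "identifier printed on the invoice"},
    "total": {"type": "number", "meaning": "invoice total, digits only"},
    "currency": {"type": "text", "meaning": "three-letter currency code"},
    "due_date": {"type": "date", "meaning": "payment due date as YYYY-MM-DD"},
    "is_paid": {"type": "boolean", "meaning": "true only if the text says it was paid"},
}


def schema_to_instructions(schema: dict) -> str:
    """Render the schema as explicit, field-by-field instructions."""
    lines = [
        f"- {name} ({spec['type']}): {spec['meaning']}"
        for name, spec in schema.items()
    ]
    return "Extract exactly these fields:\n" + "\n".join(lines)


print(schema_to_instructions(INVOICE_SCHEMA))
```

Keeping the schema in one place means the prompt, the validation code, and the table definition can all be derived from the same source instead of drifting apart.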
Make absence a first-class value
Real documents are incomplete. A field you expect will often be missing, and this is the single most important thing to handle, because a model under pressure to fill a field will invent a plausible value rather than leave it blank. An invented invoice number looks exactly like a real one and is far more dangerous than an empty cell, because nothing flags it as wrong. So define explicitly what "not present" looks like — a null, an empty string, a specific marker — and instruct the model to use it whenever the value genuinely is not in the text. "Return null if the field is not present; do not guess" is one of the most valuable sentences in any extraction prompt.
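One way to enforce a single representation of absence after the model replies is sketched below. It assumes the model returns a JSON object, and the list of stand-in strings in ABSENT_MARKERS is a guess at what a model might emit instead of null; adjust it to what you actually observe.

```python
import json

# The instruction quoted above, kept as a constant so every prompt uses it.
NO_GUESS_RULE = "Return null if the field is not present; do not guess."

# Assumption: strings a model might emit instead of a real null.
ABSENT_MARKERS = {"", "n/a", "none", "null", "unknown"}


def normalize_absent(record: dict, expected_fields: list) -> dict:
    """Map every stand-in for 'not present' to None, and add missing keys as None."""
    cleaned = {}
    for name in expected_fields:
        value = record.get(name)  # a key the model omitted becomes None
        if isinstance(value, str) and value.strip().lower() in ABSENT_MARKERS:
            value = None
        cleaned[name] = value
    return cleaned


reply = '{"invoice_number": "N/A", "total": 120.5}'
row = normalize_absent(json.loads(reply), ["invoice_number", "total", "due_date"])
print(row)
```

This only standardizes how absence is stored. It cannot tell an invented value from a real one, which is why the no-guess instruction and later validation still matter.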
Request structured output and constrain it
Ask for the data in a structured format — typically JSON matching your schema — rather than prose you then have to parse. Most current models and serving tools support constrained or structured output that conforms to a schema you provide, and the mechanics are well documented in resources like the Hugging Face documentation. Provide the field names exactly as you want them, specify the type and allowed values for each, and for categorical fields give the closed list of options so the model picks from your taxonomy instead of inventing labels. The tighter the constraint, the more consistent the output, and consistency is the whole point — you are building rows that have to line up.
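As an illustration, the constraint can be written as a JSON Schema, sketched here as a Python dictionary. The field names and the closed list of document types are hypothetical. How the schema is passed to a model depends on the provider or serving tool and is not shown, and support for individual keywords such as pattern varies, so treat this as a starting point to check against your tool's documentation.

```python
import json

# Hypothetical JSON Schema for one invoice record.
INVOICE_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        # Closed list: the model must pick one of these labels.
        "document_type": {"type": "string", "enum": ["invoice", "credit_note", "receipt"]},
        # Nullable fields: null is the agreed marker for "not present".
        "invoice_number": {"type": ["string", "null"]},
        "total": {"type": ["number", "null"]},
        "currency": {"type": ["string", "null"], "pattern": "^[A-Z]{3}$"},
        "due_date": {"type": ["string", "null"], "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
    },
    "required": ["document_type", "invoice_number", "total", "currency", "due_date"],
    "additionalProperties": False,  # no invented extra fields
}

# Serialized form, ready to hand to whatever structured-output feature you use.
schema_text = json.dumps(INVOICE_JSON_SCHEMA, indent=2)
print(schema_text)
```

Requiring every field while allowing null keeps each row the same shape, so absence shows up as an explicit null rather than a missing key.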
Show, don't just tell
Extraction quality jumps when you include a couple of worked examples in the instructions: a snippet of representative input and the exact structured output you want from it. Examples communicate the edge cases that prose struggles to — how to format a partial date, how to handle two values when you expect one, what to do with a field that is present but ambiguous. Choose examples that cover the awkward cases, not just the clean ones, because the clean cases were never the problem. A handful of well-chosen examples is often worth more than several paragraphs of rules, and it costs you almost nothing to add them.
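A small sketch of assembling such a prompt follows. The example documents and the conventions they encode (a partial date becomes null, and when two amounts appear only an explicitly labeled one is taken) are assumptions chosen for illustration; your own rules may differ, and the examples should encode whatever you decide.

```python
import json

# Hypothetical worked examples covering awkward cases, not just clean ones.
EXAMPLES = [
    (
        "Invoice INV-204. Total: EUR 1,250.00. Payment due March 2024.",
        # Assumed convention: a date without a day is treated as not present.
        {"invoice_number": "INV-204", "total": 1250.0, "currency": "EUR", "due_date": None},
    ),
    (
        "Ref 7781. Amount due: 80.00 USD (approx. 74 EUR). Due 2024-05-01.",
        # Assumed convention: take the amount labeled as due, ignore the conversion.
        {"invoice_number": "7781", "total": 80.0, "currency": "USD", "due_date": "2024-05-01"},
    ),
]


def build_prompt(instructions: str, examples: list, document: str) -> str:
    """Concatenate instructions, worked examples, and the new document."""
    parts = [instructions]
    for text, expected in examples:
        parts.append("Input:\n" + text)
        parts.append("Output:\n" + json.dumps(expected))
    parts.append("Input:\n" + document)
    parts.append("Output:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Extract the invoice fields as JSON. Return null if a field is not present; do not guess.",
    EXAMPLES,
    "Invoice A-17, total 40 GBP.",
)
print(prompt)
```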
Validate every field after extraction
Never trust extracted data on faith. Validate it the moment it comes back, with ordinary code rather than another model. Check that types match: a date field parses as a date, a number field is numeric, a category is one of the allowed values. Check ranges and formats: a quantity isn't negative, an email contains an "@", a code matches its expected pattern. Cross-check internal consistency where you can: do the line items sum to the stated total? Validation is where you catch both model errors and genuinely malformed inputs, and it turns silent bad data into a visible, routable exception. Anything that fails validation should be flagged for a human, not loaded into the table.
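The checks described here can be sketched in plain Python using only the standard library. The field names, the allowed currency list, and the rounding tolerance are assumptions for an invoice example; the function returns a list of error messages, and an empty list means the record passed.

```python
import re
from datetime import date

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed closed list
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_number(value) -> bool:
    """True for int or float, but not bool (which Python treats as an int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate(record: dict) -> list:
    """Return a list of problems; None means 'not present' and is allowed."""
    errors = []

    total = record.get("total")
    if total is not None:
        if not is_number(total):
            errors.append("total: not a number")
        elif total < 0:
            errors.append("total: negative")

    currency = record.get("currency")
    if currency is not None:
        if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
            errors.append("currency: not an allowed value")

    due = record.get("due_date")
    if due is not None:
        try:
            if not isinstance(due, str) or not DATE_PATTERN.fullmatch(due):
                raise ValueError(due)
            date.fromisoformat(due)  # rejects impossible dates such as 2024-02-30
        except ValueError:
            errors.append("due_date: not a valid YYYY-MM-DD date")

    # Internal consistency: line items should add up to the stated total.
    items = record.get("line_items")
    if isinstance(items, list) and items and is_number(total):
        if all(is_number(x) for x in items) and abs(sum(items) - total) > 0.005:
            errors.append("line_items: do not sum to total")

    return errors


print(validate({"total": -5, "currency": "euros", "due_date": "2024-02-30"}))
```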
Plan the path for what fails
Even a strong pipeline will not extract everything correctly, and pretending otherwise is how bad data poisons a database. Decide in advance what happens when extraction or validation fails: route low-confidence or invalid records to a human review queue rather than dropping them or accepting them blindly. This human-in-the-loop fallback is the difference between a system you can trust and one that quietly corrupts your data over time. It is also exactly the kind of consequence-aware control that frameworks like the NIST AI Risk Management Framework encourage: when an error has downstream cost, a person stays in the loop. The review queue is not a sign of failure; it is the safety valve that lets you automate the easy majority of documents with confidence.
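In code, the routing decision can be very small. The sketch below assumes validation has already produced a list of error messages for each record, and it uses in-memory lists as stand-ins for the real table and review queue; the names route, table, and review_queue are hypothetical.

```python
def route(record: dict, errors: list, table: list, review_queue: list) -> str:
    """Load clean records; send anything with errors to human review."""
    if errors:
        # Keep the reasons with the record so the reviewer sees what failed.
        review_queue.append({"record": record, "errors": errors})
        return "review"
    table.append(record)
    return "loaded"


table, review_queue = [], []
route({"invoice_number": "INV-1", "total": 40.0}, [], table, review_queue)
route({"invoice_number": "INV-2", "total": -3.0}, ["total: negative"], table, review_queue)

print(len(table), "loaded;", len(review_queue), "sent to review")
```

Nothing is dropped and nothing invalid is loaded: every record ends up in exactly one of the two places.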
Measure against a labeled sample
Before you trust a pipeline at scale, measure it. Hand-label a representative sample of documents with the correct extraction, then run the pipeline and compare field by field. Per-field accuracy tells you which fields are reliable and which need work — and the answer is almost always uneven, with clean fields near-perfect and messy ones struggling. This lets you make an informed decision: automate the reliable fields, route the unreliable ones to review. Re-run the measurement whenever you change the schema, the prompt, the examples, or the model, because each of those can silently degrade fields that used to work. Extraction is one of the few AI tasks where ground truth is concrete, so there is no excuse not to measure.
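Per-field accuracy is simple to compute once the labeled sample exists. This sketch assumes predictions and labels are lists of dictionaries in the same document order and uses exact equality as the match rule, which is a simplification; some fields may deserve a more forgiving comparison. The sample data is made up for illustration.

```python
def per_field_accuracy(predicted: list, labeled: list, field_names: list) -> dict:
    """Fraction of documents where each field exactly matches the hand label."""
    if not labeled or len(predicted) != len(labeled):
        raise ValueError("need equally sized, non-empty prediction and label lists")
    scores = {}
    for name in field_names:
        correct = sum(
            1 for p, g in zip(predicted, labeled) if p.get(name) == g.get(name)
        )
        scores[name] = correct / len(labeled)
    return scores


# Made-up sample: two documents, the second due date was extracted wrongly.
labeled = [
    {"total": 120.5, "due_date": "2024-03-31"},
    {"total": 80.0, "due_date": None},
]
predicted = [
    {"total": 120.5, "due_date": "2024-03-31"},
    {"total": 80.0, "due_date": "2024-01-01"},
]

scores = per_field_accuracy(predicted, labeled, ["total", "due_date"])
print(scores)
```

Note that the second document counts as an error precisely because the model filled a field that should have been null, so this measurement also catches invented values.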
Mind the inputs before you blame the model
A surprising share of extraction trouble has nothing to do with the model and everything to do with how the text arrived. Documents reach you as clean digital text, as scanned images, as exports that mangle layout, or as formats where the visual structure carries meaning the raw text loses. When a table is flattened into a run-on line, or a scan is converted to text with character errors, the model is extracting from garbage and will produce garbage no matter how good your schema is. Before you tune prompts, look at what the model is actually being given — the input as the pipeline sees it, not as it looks to you on screen. Often the highest-leverage fix is upstream: better conversion, preserved structure, or handling images as images rather than forcing them through a text channel that destroys the very layout that held the answer.
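A quick way to look at the input as the pipeline sees it is to print a raw preview along with a few crude signals. The sketch below is only a starting point: the signals chosen (line count, longest line, count of the Unicode replacement character U+FFFD) are assumptions about what tends to reveal flattened layout or decoding damage, and they will not catch every problem.

```python
def input_report(text: str) -> dict:
    """Summarize the raw text handed to the model, with whitespace made visible."""
    lines = text.splitlines()
    return {
        "chars": len(text),
        "lines": len(lines),
        # A single very long line can indicate a table flattened into a run-on.
        "longest_line": max((len(line) for line in lines), default=0),
        # U+FFFD marks characters that failed to decode.
        "replacement_chars": text.count("\ufffd"),
        # repr() shows tabs, newlines, and odd characters explicitly.
        "preview": repr(text[:200]),
    }


flattened = "Item Qty Price Widget 2 9.99 Gadget 1 4.50 Total 24.4\ufffd"
report = input_report(flattened)
for key, value in report.items():
    print(key, "=", value)
```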
It also pays to think about cost and scale early, because extraction is frequently a high-volume job. Running every document through the most capable, most expensive model is rarely necessary; many fields are easy and a smaller, cheaper model handles them reliably, with the harder fields or harder documents escalated to a stronger one. Batch what you can, and measure cost per document the same way you measure accuracy. A pipeline that is accurate but uneconomical at your real volume is not actually deployable — and the time to discover that is during design, not after you have pointed it at a million records.
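The escalation idea and the cost measurement can be sketched together. Everything specific here is hypothetical: the two extractors are stubs standing in for real model calls, the acceptance check is a placeholder for your validation code, and the per-call prices are invented numbers used only to show the arithmetic.

```python
# Invented per-call prices, for illustration only.
PRICE_PER_CALL = {"cheap": 0.001, "strong": 0.02}


def extract_with_escalation(document, cheap_extract, strong_extract, is_acceptable):
    """Try the cheap extractor first; escalate only if its output is rejected."""
    record = cheap_extract(document)
    if is_acceptable(record):
        return record, ["cheap"]
    # An escalated document pays for both calls.
    return strong_extract(document), ["cheap", "strong"]


def cost_per_document(calls_per_document: list) -> float:
    """Average cost across documents, given the tiers each one used."""
    total = sum(PRICE_PER_CALL[tier] for calls in calls_per_document for tier in calls)
    return total / len(calls_per_document)


# Stubs standing in for real model calls.
def cheap_stub(doc):
    return {"total": 12.0} if "Total:" in doc else {"total": "ninety-nine"}


def strong_stub(doc):
    return {"total": 99.0}


def total_is_clean(record):
    value = record["total"]
    return value is None or (isinstance(value, (int, float)) and not isinstance(value, bool))


results = [
    extract_with_escalation(doc, cheap_stub, strong_stub, total_is_clean)
    for doc in ["Total: 12.00", "amount owed is ninety-nine dollars"]
]
average_cost = cost_per_document([calls for _, calls in results])
print(results, average_cost)
```

Tracking which tier each document used gives you the escalation rate as well as the cost, and both are worth watching as volume grows.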
The takeaway
Turning messy text into tables is one of the most reliable, checkable, and quietly valuable things you can do with a language model — if you treat it as engineering, not magic. Define the schema first, make absence explicit so the model never invents a value, request constrained structured output, teach with examples, validate every field in code, route failures to a human, and measure against labeled data. Skip the validation and the review queue and you get a fast machine for generating plausible, wrong rows. Build them in, and you get clean data from text that no spreadsheet could ever read.
