
Structured Outputs Guarantee Shape, Not Semantics

JSON mode doesn't mean your LLM output is safe to use. Here's why you need a schema contract layer on top of it.

the false sense of safety

A client came to us six months ago with a document processing pipeline that had been silently corrupting their PostgreSQL database for three weeks. They were running GPT-4o with response_format={"type": "json_object"}, piping the output straight into a Django model's save() call, and feeling good about it because the JSON never failed to parse. The bug was a confidence score field that the model occasionally returned as 0 instead of something between 0.1 and 0.99, which then triggered a division-by-zero in a downstream scoring function that was swallowed by a bare except clause, which then wrote a null into a column that was supposed to drive billing calculations. Three weeks. Nobody noticed until a customer called.

Engineers keep conflating structural validity with semantic correctness. JSON mode, OpenAI's structured outputs feature, Anthropic's tool use with schemas, and Gemini's responseMimeType: 'application/json' all guarantee, at most, that you get parseable JSON conforming to the shape you asked for. That's it. The model can still return an age field of -7, a probability of 1.4, an enum value of "UNKNOWN" when your code only handles "ACTIVE" and "INACTIVE", a date string of "2024-13-45", or a required field set to null when the spec said it would never be null. All of these will parse. All of these will blow up your application logic in ways that are genuinely hard to debug, because the failure happens far from the source.
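
To make that concrete, here's a small illustration (the field names are hypothetical): every one of these values sails through json.loads and would satisfy a shape-only check.

python
import json

# Parses cleanly and matches a plausible shape, but every value is semantically wrong:
# negative age, probability above 1, an enum value the code doesn't handle, an impossible date.
raw = '{"age": -7, "probability": 1.4, "status": "UNKNOWN", "signup_date": "2024-13-45"}'
record = json.loads(raw)  # no exception raised; the corruption starts here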

what structured outputs actually guarantee

OpenAI's structured outputs, which they distinguish from raw JSON mode, do go further than just parsing. With a proper JSON Schema passed to the response_format parameter as of their late-2024 API versions, you get guaranteed adherence to the schema's structural constraints: required fields will be present, types will match, enum fields will only contain listed values. That's genuinely useful. It eliminates a whole class of bugs.
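
For reference, here's roughly what that looks like with the openai Python SDK; the model name and schema are illustrative, and the key detail is the json_schema response format with strict set to true.

python
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["ACTIVE", "INACTIVE", "PENDING"]},
        "confidence": {"type": "number"},
    },
    "required": ["status", "confidence"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the document status and a confidence score."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "document_extraction", "strict": True, "schema": schema},
    },
)
# Guaranteed: parseable JSON, both fields present, status drawn from the enum.
# Not guaranteed: that the confidence value makes any sense.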

But JSON Schema's constraint vocabulary is limited in ways that matter enormously for real pipelines. You can say minimum: 0, maximum: 1 for a float field, and OpenAI's structured outputs will honor that... sometimes. The enforcement is model-side, meaning it's part of the constrained decoding process, and there are edge cases where extremely long outputs or certain schema configurations cause the constraint enforcement to degrade. More importantly, JSON Schema can't express cross-field invariants. You can't say "if status is REFUNDED, then refund_amount must be greater than zero." You can't say "end_date must be after start_date." You can't say "exactly one of user_id and guest_token must be non-null." These are the constraints that actually matter for keeping your data coherent, and no model provider's structured output feature enforces them, because JSON Schema fundamentally can't express them.

Anthropic's situation is similar. Their tool use API is a solid way to extract structured data, but you're still getting JSON that you're responsible for validating before it touches your business logic. The shape might be right. The semantics are the model's best guess.
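
A sketch of the Anthropic equivalent, assuming the anthropic Python SDK (the model name and tool definition are illustrative): you force a tool call, the tool input matches your input_schema's shape, and the resulting dict is still yours to validate.

python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    tools=[{
        "name": "record_extraction",
        "description": "Record the extracted document fields.",
        "input_schema": {
            "type": "object",
            "properties": {
                "status": {"type": "string", "enum": ["ACTIVE", "INACTIVE", "PENDING"]},
                "confidence": {"type": "number"},
            },
            "required": ["status", "confidence"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_extraction"},
    messages=[{"role": "user", "content": "Extract the document status and a confidence score."}],
)

# The tool input arrives as a plain dict with the right shape; its values are unvalidated.
extracted = next(block.input for block in response.content if block.type == "tool_use")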

the contract layer you actually need

The fix is a dedicated validation and coercion layer that sits between your LLM call and your application logic: not a one-liner, not a try/except around a dict access, but a proper schema contract with Pydantic v2.

Pydantic v2 (currently at 2.11.x as of early 2026) is genuinely good for this. The model_validator and field_validator decorators give you a clean place to encode every invariant you care about, and the error messages it generates are structured and machine-readable, which means you can log them, alert on them, and even feed them back to the model for a retry loop. Newer Python releases add nice-to-haves like the @deprecated decorator (new in 3.13) and further type system improvements, but the core pattern works the same from 3.11 upward.

Here's what a contract layer actually looks like in practice:

python
from datetime import date
from typing import Literal, Optional

from pydantic import BaseModel, Field, field_validator, model_validator


class DocumentExtractionResult(BaseModel):
    status: Literal["ACTIVE", "INACTIVE", "PENDING", "REFUNDED"]
    confidence: float = Field(ge=0.01, le=1.0)
    user_id: Optional[int] = None
    guest_token: Optional[str] = None
    start_date: date
    end_date: date
    refund_amount: Optional[float] = Field(default=None, ge=0.0)

    @field_validator("confidence", mode="before")
    @classmethod
    def reject_zero_confidence(cls, v):
        # Models sometimes return exactly 0, which breaks downstream math
        if v == 0:
            raise ValueError("confidence must be non-zero; got 0 from model output")
        return v

    @model_validator(mode="after")
    def check_cross_field_invariants(self):
        if self.user_id is None and self.guest_token is None:
            raise ValueError("exactly one of user_id or guest_token must be set")
        if self.user_id is not None and self.guest_token is not None:
            raise ValueError("user_id and guest_token are mutually exclusive")
        if self.end_date <= self.start_date:
            raise ValueError(f"end_date {self.end_date} must be after start_date {self.start_date}")
        if self.status == "REFUNDED" and (self.refund_amount is None or self.refund_amount == 0):
            raise ValueError("status REFUNDED requires a positive refund_amount")
        return self

This is maybe 30 lines of code, and it catches the entire category of bugs that structured outputs miss. When validation fails, you get a ValidationError with a list of exactly which fields failed and why, which you can log with full context including the raw model output.
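
At the call site, the contract is one method call and one exception type. A minimal sketch (the function name is illustrative):

python
import logging

from pydantic import ValidationError

logger = logging.getLogger(__name__)


def parse_extraction(llm_response_text: str) -> DocumentExtractionResult:
    try:
        # Parses the raw JSON string and enforces every validator in one step.
        return DocumentExtractionResult.model_validate_json(llm_response_text)
    except ValidationError as e:
        # Log the raw output alongside the structured error list, then decide:
        # retry, route to a human, or fail the job. Never write to the database here.
        logger.warning("LLM output failed contract validation: %s", e.errors(),
                       extra={"raw_output": llm_response_text})
        raise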

building a retry loop with validation feedback

Once you have the contract layer, you can close the loop by feeding validation errors back to the model. This works better than you'd expect, because modern models (GPT-4.1, Claude 3.7 Sonnet, Gemini 2.5 Pro) are genuinely good at correcting their output when you tell them specifically what was wrong.

The pattern looks roughly like this: call the model, parse the JSON, run it through your Pydantic model, catch ValidationError, extract the error detail, append it to the conversation as a user message saying something like "Your previous response failed validation with these errors: [errors]. Please correct only the fields that failed.", call the model again with a max retry count of two or three. I'd cap it at three retries and raise a hard exception if you're still getting invalid output after that, because at that point you have a data problem that a human needs to see.

One thing worth doing is structuring your error feedback carefully. Don't just dump str(validation_error) into the prompt. The Pydantic v2 ValidationError.errors() method returns a list of dicts with loc, msg, and type keys, and you can format those into something much more parseable for the model:

python
from pydantic import ValidationError


def format_validation_errors(e: ValidationError) -> str:
    lines = []
    for error in e.errors():
        field = " -> ".join(str(x) for x in error["loc"])
        lines.append(f"Field '{field}': {error['msg']} (error type: {error['type']})")
    return "\n".join(lines)
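
Putting the pieces together, here's a minimal sketch of that retry loop. call_model is a hypothetical stand-in for whatever client wrapper you use; it takes the message list and returns the model's raw text response.

python
from pydantic import ValidationError

MAX_RETRIES = 3


def extract_with_retries(messages: list[dict], call_model) -> DocumentExtractionResult:
    for _ in range(MAX_RETRIES):
        raw = call_model(messages)
        try:
            # Parse and validate in one step; success ends the loop.
            return DocumentExtractionResult.model_validate_json(raw)
        except ValidationError as e:
            # Feed the failed output and the structured errors back, then try again.
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": (
                    "Your previous response failed validation with these errors:\n"
                    f"{format_validation_errors(e)}\n"
                    "Please correct only the fields that failed and return the full JSON object."
                )},
            ]
    # Still invalid after the cap: surface it instead of burning more tokens.
    raise RuntimeError(f"model output failed contract validation after {MAX_RETRIES} attempts")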

This retry loop adds latency, obviously. A second model call costs time and money. But it's dramatically cheaper than silent data corruption that you discover three weeks later. The correct mental model is that the validation layer with retry is your circuit breaker, and if a particular prompt is consistently hitting retries, that's a signal to fix the prompt or tighten the schema, not to remove the validation.

where people get this wrong

The most common mistake I see is treating Pydantic as a parsing tool rather than a contract tool. People write models that match the shape of what the LLM returns, with every field Optional and no validators, and then wonder why they still have bad data. If every field is optional and there are no cross-field validators, you're just doing JSON deserialization with extra steps.

The second mistake is coercing bad data silently instead of rejecting it. Pydantic v2 will, by default, try to coerce types, and that's usually what you want for something like an int that arrives as a string. But coercing a confidence score of 1.7 down to 1.0 silently, or accepting an empty string where you expected a non-empty string, hides the fact that your model is producing garbage and makes it much harder to notice when prompt quality degrades. Be explicit about what you'll coerce and what you'll reject. Use strict=True on fields where coercion would mask a real problem.
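
As a sketch of what that distinction looks like in the model definition (the field names are hypothetical):

python
from pydantic import BaseModel, Field


class StrictWhereItCounts(BaseModel):
    # "42" -> 42 is a harmless coercion for a page number, so default lax mode is fine here.
    source_page: int
    # A stringified or otherwise mistyped confidence would mask a prompt problem, so no coercion.
    confidence: float = Field(ge=0.01, le=1.0, strict=True)
    # An empty string would hide a failed extraction rather than surface it.
    customer_name: str = Field(min_length=1)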

Third, don't put this validation layer inside your LLM client wrapper. It doesn't belong there. It belongs as close to your database write as possible, as a separate concern, so that the contract between your LLM output and your application logic is explicit and reviewable. We've shipped several AI automation projects at steezr where the Pydantic contract models are the most important documentation we hand over to the client's engineering team, because they encode every assumption the system makes about model output in a form that's also executable code.

Fourth, version your contracts. When you change a prompt in a way that changes the output shape or semantics, bump the contract model and keep the old one around for a migration period. LLM outputs are data, and you wouldn't change a database schema without a migration.

a note on monitoring

Validation failures at the contract layer are production signals, not just errors to handle and move on from. Every ValidationError you catch should be logged with the raw model output, the model version, the prompt hash or template version, and the specific errors. Over time this gives you a distribution of what kinds of invariant violations your model produces, which tells you where to focus prompt engineering effort and whether model upgrades are actually improving extraction quality.
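
A minimal sketch of that logging call, assuming plain stdlib logging; the function name and field choices are illustrative, and in practice this is also where you'd write the row into the Postgres table described below.

python
import hashlib
import logging

from pydantic import ValidationError

logger = logging.getLogger("llm.contract")


def log_validation_failure(raw_output: str, model_version: str,
                           prompt_template: str, e: ValidationError) -> None:
    # Enough context to reconstruct the failure later: raw output, model version,
    # prompt identity, and the structured error list.
    logger.warning(
        "llm_validation_failure",
        extra={
            "raw_output": raw_output,
            "model_version": model_version,
            "prompt_hash": hashlib.sha256(prompt_template.encode()).hexdigest()[:12],
            "errors": e.errors(),
        },
    )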

We've used simple Postgres tables for this in projects where the client already had Postgres running: just an llm_validation_failures table with a JSONB column for the raw output and another for the error list. Nothing fancy. The point is that you have it, that you're looking at it, and that a spike in failures triggers an alert before customers notice something is wrong.

This whole approach adds maybe two days of engineering time to a pipeline that would otherwise ship without it. Every AI pipeline we've built that skipped this layer has eventually paid for it with a production incident. The ones that have it tend to be boring in the best possible way.
