Say you are trying to pipe an LLM’s response directly into a downstream function — parsing it into a database record, feeding it to another API, rendering it in a UI component that expects a fixed shape. You write a prompt asking for JSON, you get back something that looks like JSON, and then your JSON.parse() call throws on the third request out of ten. The model isn’t malfunctioning. It’s doing exactly what an autoregressive text generator does when you haven’t constrained it enough: producing plausible-looking output instead of a validated data structure.

Structured output is one of those areas where the failure modes are extremely consistent across models and providers. Once you’ve seen each one a few times, you can diagnose them almost on sight. Below is the troubleshooting reference I actually use — organized by symptom, so you can jump to whatever’s breaking in your pipeline right now.


Symptom: The response isn’t valid JSON at all

You get something like “Sure! Here’s the JSON you requested:” followed by a code block, and your parser chokes on the prose wrapper before it even reaches the braces.

Cause: The model is treating this as a conversational turn rather than an API contract. Nothing in your prompt or request configuration told it that prose is not an acceptable response format — so it defaults to being helpful in the way it was trained to be helpful, which includes narrating what it’s about to do.

Fix: Use the provider’s actual JSON mode or structured output feature if one exists: OpenAI’s response_format: { type: "json_object" }, Anthropic’s forced tool use, or equivalent constrained decoding options elsewhere. Where these are backed by constrained decoding, they work at the token-sampling level, masking out any token that would break the JSON grammar, which is a fundamentally stronger guarantee than asking nicely in the prompt. Not every such feature is implemented that way, so check what your provider actually promises. If you’re on a model or endpoint without native JSON mode, add an explicit instruction: “Respond with raw JSON only. Do not include markdown formatting, code fences, or any explanatory text before or after the JSON object.” Then strip whitespace and check that the first character is { or [ before you even attempt a parse. It’s a one-line guard that catches most residual violations.
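The first-character guard described above can be sketched in a few lines of Python. The function name parse_if_json_like is made up for illustration, and returning None on failure is just one possible convention.

```python
import json

def parse_if_json_like(raw: str):
    """Return the parsed value, or None if the text is not a JSON object or array."""
    text = raw.strip()
    # Cheap guard: reject prose wrappers before attempting a parse.
    if not text or text[0] not in "{[":
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Starts like JSON but is still malformed (for example, cut off early).
        return None
```

A None result is the signal to retry or log the raw string, rather than letting an exception surface from deep inside the pipeline.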


Symptom: Valid JSON, but wrapped in a markdown code fence

Parsing fails with an unexpected token error, and when you print the raw string, you see triple backticks and a json language tag surrounding an otherwise well-formed object.

Cause: Chat-tuned models default to markdown formatting almost reflexively, since that’s the convention for the vast majority of their training and fine-tuning data. JSON mode reduces this, but doesn’t eliminate it on every model.

Fix: Add a stripping step to your parsing layer regardless of which model you’re using — treat it as defensive code, not a workaround for a bug you’ll eventually fix upstream. A simple regex that strips a leading ```json and trailing ``` before parsing handles the overwhelming majority of cases. Don’t rely on prompt instructions alone here; it’s cheaper and more reliable to normalize the string in code than to keep tuning wording and hoping the model complies on every single call.
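A minimal sketch of that stripping step, assuming the fence is either bare or tagged json. The names strip_code_fence and parse_model_json are hypothetical, and the fence string is built from single backticks only to keep the listing easy to read.

```python
import json
import re

TICKS = "`" * 3  # a triple-backtick fence

# Opening fence with an optional "json" tag, the payload, then a closing fence.
_FENCED = re.compile(
    r"^\s*" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + r"\s*$",
    re.DOTALL | re.IGNORECASE,
)

def strip_code_fence(raw: str) -> str:
    """Return the payload inside a markdown fence, or the trimmed input if unfenced."""
    match = _FENCED.match(raw)
    return match.group(1) if match else raw.strip()

def parse_model_json(raw: str):
    """Normalize the string first, then parse it."""
    return json.loads(strip_code_fence(raw))
```

Unfenced input passes through unchanged apart from trimming, so the same function can sit in front of every parse.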


Symptom: Fields are missing or renamed from what you specified

Your schema calls for user_id, and the response comes back with userId, or id, or the field is simply absent from the object entirely.

Cause: A natural-language description of a schema is a suggestion, not a contract. If you wrote “include the user’s ID,” the model is inferring both the key name and whether that field is required, and it has no way of knowing your downstream code expects the literal string user_id.

Fix: Stop describing your schema in prose and pass an actual schema. Many providers now support a json_schema parameter or a function-calling / tool-definition format where you specify field names, types, and a required array explicitly. Where the provider enforces that schema during decoding, this moves enforcement from “the model probably remembers the field name” to “the sampling process is constrained against a schema object,” which is a categorically more reliable mechanism. If your provider doesn’t support schema enforcement natively, include a literal example JSON object in the prompt with exact key names filled in. Models are markedly better at copying an example’s structure than at reconstructing one from a verbal description.
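As a rough illustration, an explicit schema might look like the dictionary below. Only user_id comes from the example above; the email field and the key_problems helper are hypothetical. The helper checks key names only and is not a full JSON Schema validator.

```python
# Hypothetical schema in JSON Schema style; only user_id comes from the text above.
USER_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["user_id", "email"],
    "additionalProperties": False,
}

def key_problems(obj: dict, schema: dict) -> list:
    """List missing required keys and keys the schema does not declare."""
    declared = set(schema["properties"])
    problems = [f"missing required key: {k}" for k in schema["required"] if k not in obj]
    problems += [f"unexpected key: {k}" for k in sorted(set(obj) - declared)]
    return problems
```

Even when the provider enforces the schema, a check like this on the receiving side makes a renamed key such as userId show up as two explicit problems instead of a silent KeyError later.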


Symptom: Correct keys, but wrong data types

The schema expects an integer for age, and you get the string "32". Or a boolean field comes back as the string "true" instead of the literal true.

Cause: Language models generate text one token at a time, and there’s no innate concept of a “type system” unless the decoding is explicitly constrained to one. Left to its own devices, the model treats every value as a token sequence to be produced, and the difference between 32 and "32" is just a pair of quote characters, either of which is a plausible continuation. This is one of the more common breakages in production output, especially on longer responses where formatting consistency drifts over the course of generation.

Fix: If you’re using a schema-enforcement feature (JSON Schema, Pydantic-backed tool calls, etc.), specify the type per field, such as "type": "integer", rather than relying on prompt language like “as a number.” Constrained decoding will refuse to emit a quoted string in a slot typed as an integer. If schema enforcement isn’t available, add a validation and coercion layer on your end: cast known-numeric fields, and treat any cast failure as a signal to retry the request rather than silently passing a bad value further down your pipeline.
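A coercion layer along those lines could look like the following sketch. The function names and the casts mapping are hypothetical. A ValueError is the signal to retry.

```python
def coerce_int(value):
    """Return an int for ints and integer-looking strings; raise ValueError otherwise."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise ValueError(f"expected integer, got boolean {value!r}")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        return int(value.strip())  # raises ValueError on text like "thirty-two"
    raise ValueError(f"expected integer, got {type(value).__name__}")

def coerce_bool(value):
    """Return a bool for bools and the strings "true" / "false"; raise ValueError otherwise."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.strip().lower() in ("true", "false"):
        return value.strip().lower() == "true"
    raise ValueError(f"expected boolean, got {value!r}")

def coerce_record(record: dict, casts: dict) -> dict:
    """Apply per-field casts; fields without a cast pass through unchanged."""
    return {key: casts[key](val) if key in casts else val for key, val in record.items()}
```

For example, coerce_record({"age": "32", "active": "true"}, {"age": coerce_int, "active": coerce_bool}) yields proper int and bool values, while an uncastable value raises instead of flowing downstream.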


Symptom: Output truncates mid-object

You get a JSON string that starts fine and simply stops — no closing brace, sometimes cut off mid-value.

Cause: This is almost never a JSON-formatting issue. It’s a max_tokens ceiling being hit before generation naturally completes, which cuts the output off at an arbitrary token boundary regardless of whether that boundary lands on a syntactically valid stopping point.

Fix: Check your max_tokens or equivalent output-length parameter first, before touching your prompt at all. This is the fix people skip because it feels too simple, but it resolves this symptom more often than any prompt adjustment does. Most APIs also report why generation stopped, so a length-related stop reason on the response confirms the diagnosis. Estimate a generous ceiling based on your expected output size, with real margin, not the minimum you think you’ll need. If the object is genuinely large, consider restructuring the request into smaller, targeted calls rather than one call that has to hold the entire structure in a single generation pass. Smaller completions are also lower latency, which is a side benefit worth having regardless.
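One way to separate truncation from ordinary malformed output is to look at the stop reason that came back with the response before blaming the JSON. The sketch below assumes the caller has already pulled that value out of the provider response. The set of length-related values is an assumption to verify against your provider’s documentation, and classify_response is a hypothetical name.

```python
import json
from typing import Optional

# Stop-reason values that indicate the output limit was reached.
# Assumption: adjust this set to whatever your provider actually returns.
LENGTH_STOP_REASONS = {"length", "max_tokens"}

def classify_response(raw: str, stop_reason: Optional[str]) -> str:
    """Return "truncated", "malformed", or "ok" for a raw model response."""
    if stop_reason in LENGTH_STOP_REASONS:
        # The model was cut off; raising the limit is the fix, not a new prompt.
        return "truncated"
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return "malformed"
    return "ok"
```

Logging this label alongside each failure makes it obvious when a batch of parse errors is really a token-limit problem.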


Symptom: The JSON is well-formed, but the content is wrong or hallucinated

Parsing succeeds. Types are correct. But a field like total_amount doesn’t match the source document, or a summary field contains information not present anywhere in your input.

Cause: This isn’t a formatting failure at all — it’s a reasoning or grounding failure wearing a formatting costume. Structured output constraints control shape, not truthfulness. A model can produce a perfectly valid JSON object that is confidently, precisely wrong.

Fix: This needs a different toolkit than the fixes above. Add explicit grounding instructions — “only include values that appear verbatim in the source text; use null for anything not present” — rather than letting the model infer plausible-sounding fills for gaps. For high-stakes fields, add a validation pass that checks extracted values against the source input programmatically before you trust them downstream. If the volume justifies it, a second LLM call whose sole job is verifying the first call’s output against the source can catch a meaningful fraction of these errors, at the cost of added latency and an extra API call per request — usually a reasonable trade for anything touching billing, medical, or legal data.
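The programmatic check can start as simply as verifying that each extracted value appears verbatim in the source. This sketch assumes extracted values are strings or null, as the grounding instruction above requests. The function name and sample fields are hypothetical.

```python
def ungrounded_fields(extracted: dict, source_text: str) -> list:
    """Return names of non-null fields whose value does not appear verbatim in the source."""
    suspect = []
    for field, value in extracted.items():
        if value is None:
            continue  # null means "not present in the source", which the prompt allows
        if str(value) not in source_text:
            suspect.append(field)
    return suspect
```

Exact substring matching is deliberately strict. Values the model reformats, such as dates or currency amounts, would need normalization before comparison, and free-text fields like summaries need a different kind of check entirely.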


Symptom: It works in testing, then breaks intermittently in production

Your test suite passes reliably. In production, some small percentage of requests (often under 2%) still fail to parse or fail validation, with no obvious pattern.

Cause: LLM output is probabilistic by construction. Even with a temperature of 0 and a schema-constrained call, there is a long tail of edge cases: unusual input triggering an unusual completion, provider-side model updates shifting behavior slightly, or rare interactions between your prompt and specific input content you didn’t test against.

Fix: Design for this tail instead of trying to eliminate it entirely, which isn’t a realistic target for any current model. Wrap the call in a retry loop with validation: parse, validate against your schema, and on failure, retry with the original response appended and an explicit note about what was invalid — “Your previous response had a string in the count field; return an integer instead.” This self-correction pattern resolves a large share of tail failures on the first retry. Log every failure with the raw output attached, not just an error code, so you can identify emerging patterns instead of treating each one as an isolated fluke.
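The retry-with-validation pattern can be sketched as below. Here call_model stands in for whatever client function sends a prompt and returns the raw text, and validate checks only the count field from the example. Both are placeholders for your own client and schema check.

```python
import json
from typing import Callable, Optional

def validate(obj) -> Optional[str]:
    """Return a description of the first problem, or None if the object is acceptable."""
    if not isinstance(obj, dict):
        return "the response must be a JSON object"
    count = obj.get("count")
    if isinstance(count, bool) or not isinstance(count, int):
        return "the count field must be an integer"
    return None

def call_with_retries(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    failures = []  # raw outputs and reasons, kept so they can be logged
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            obj = json.loads(raw)
            problem = validate(obj)
        except json.JSONDecodeError as exc:
            obj, problem = None, f"the response was not valid JSON ({exc.msg})"
        if problem is None:
            return obj
        failures.append((raw, problem))
        # Feed the bad output and the specific complaint back to the model.
        current_prompt = (
            f"{prompt}\n\nYour previous response was:\n{raw}\n\n"
            f"Problem: {problem}. Return corrected JSON only."
        )
    raise ValueError(f"no valid response after {max_attempts} attempts: {failures}")
```

The failures list carries the raw output for every rejected attempt, which is the material worth logging when the final attempt also fails.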


Quick Reference

| Symptom | Most Likely Cause | First Thing to Check |
| --- | --- | --- |
| Not valid JSON at all | No format constraint applied | Enable native JSON mode / structured output |
| Wrapped in code fences | Default markdown habit | Strip fences in your parsing code |
| Missing or renamed fields | Prose schema instead of a real one | Pass an explicit JSON Schema or function definition |
| Wrong data types | No type constraint on generation | Set explicit per-field types in the schema |
| Truncated mid-object | Output hit the token ceiling | Raise max_tokens, or split the request |
| Valid but wrong content | Formatting fix applied to a grounding problem | Add source-grounding instructions and validation |
| Intermittent production failures | Long-tail probabilistic behavior | Add retry-with-validation logic, log raw failures |

Most of these fixes have nothing to do with clever prompting and everything to do with treating the LLM call like any other unreliable network dependency — one that needs schema validation, retries, and logging around it, not just a well-worded request. Which of these symptoms is actually hitting your pipeline right now? That’s usually the faster question to answer than trying to prompt-engineer your way to zero failures.