Is this a joke? Am I going crazy?
I don't like this future we're going towards where we have to trick our software (which we can no longer understand the workings of) into doing what we tell it to by asking it nicely, or by putting another black box on the end to "fix" the output. This is the opposite of engineering. This is negotiation with a genie trapped in silicon.
Very confused. When you enable structured output, the response should adhere to the JSON schema EXACTLY, not best effort, by constraining the output via guided decoding. This is even documented in OpenRouter's structured output doc:
> The model will respond with a JSON object that strictly follows your schema
Gemini is listed as a model supporting structured output, and yet its failure rate is 0.39% (Gemini 2.0 Flash)!! I get that structured output has a high performance cost, but advertising it as supported when in reality it's not is a massive red flag.
Worse yet, response healing only fixes JSON syntax errors, not schema adherence. This is only mentioned at the end of the article, which people are clearly not going to read.
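To the parent's point, a syntactically repaired response can still violate the schema. Here is a minimal sketch of the distinction using the third-party jsonschema package; the schema and values are made up for illustration:

```python
# Sketch: a response that parses as JSON can still break the declared schema,
# so syntax "healing" alone doesn't buy schema adherence.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {"price": {"type": "number"}},
    "required": ["price"],
    "additionalProperties": False,
}

healed = '{"price": "twelve dollars"}'  # valid JSON, so syntax repair is "done"

data = json.loads(healed)  # no JSONDecodeError here
try:
    validate(instance=data, schema=schema)
except ValidationError as e:
    print("schema violation:", e.message)  # 'twelve dollars' is not of type 'number'
```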
WTF
> What about XML? The plugin can heal XML output as well - contact us if you’d like access.
Isn't this exactly how we got weird html parsing logic in the first place, with "autohealing" logic for mismatched closing tags or quotes?
I have various workflows in which I use Claude (Sonnet 4.5 mostly) to emit JSON, and it’s literally more common to get an API error than defective JSON out - I can’t recall it happening at all in the last six months, probably much longer. I remember a customer over a year ago who was using GPT-4o and had some logic to fix JSON errors, so evidently it happened for them.
I recognize this is more of an issue with smaller, cheaper models now, though if the model can’t produce proper JSON, what else is it breaking?
> Here's something most developers overlook: if an LLM has a 2% JSON defect rate, and Response Healing drops that to 1%, you haven't just made a 1% improvement. You've cut your defects, bugs, and support tickets in half.
If part of my system can't even manage to output JSON reliably, it needs way more "healing" than syntax munging. This comes across as naive.
"it's not just X, it's Y"
Don't you worry about Planet Express, let me worry about blank.
Plus, that claim isn't even true. A 1% and a 2% JSON defect rate are going to annoy a similar number of people into filing bugs and tickets.
Sounds like we are twice as close to AGI!
But, but, you've just cut your defects, bugs, and support tickets in half!
One of the best shitposts I have ever seen, by far. Absurdism taken to its finest form.
Dear OpenRouter blog authors, could you please stop writing your blog posts with LLMs?
The content of your posts is really insightful and interesting, but it feels like junk quality because of the way LLMs write blog posts.
What was your prompt?
A lot of it was finger written -- curious which part sounded like LLM to you?
> > Here's something most developers overlook: if an LLM has a 2% JSON defect rate, and Response Healing drops that to 1%, you haven't just made a 1% improvement. You've cut your defects, bugs, and support tickets in half.
This sounds AI written.
Meaning parts were LLM written? With no disclosure?
"With no disclosure?"
Why do you have an expectation that a company will disclose to you when they use AI for their copywriting? Do you want them to disclose the software they used to draft and publish? If a manager reviewed the blog post before it went live?
Why not just publish the prompt? I can then take an LLM of my choice and reformat it the way I want.
Basically, I'm asking for open source blogging!
Next up: blog healing
I thought structured output was done by only allowing tokens that would produce valid output. For their example of a missing closing bracket, the end token wouldn't be allowed, and it would only accept tokens that contain a digit, comma, or closing bracket. I guess that must not be the case, though. Doing that seems like a better way to address this.
That is a way of doing it, but it's quite expensive computationally. There are some companies that can make it feasible [0], but it's often not a perfect process, and different inference providers implement it in different ways.
[0] https://dottxt.ai/
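For illustration, here is a toy, brute-force version of the token-filtering idea from the parent comments; real implementations (like the one linked above) compile the schema or grammar into a token-level mask rather than testing every candidate string:

```python
# Toy sketch of constrained ("guided") decoding: only keep candidate tokens
# whose addition leaves a prefix that can still be completed to valid JSON.
# This brute-force check is illustrative only; real engines track parser state.
import json

def could_become_valid_json(prefix: str) -> bool:
    # Crude test: can `prefix` be finished by appending a few closing characters?
    for suffix in ("", '"', "]", "}", '"}', '"]', "]}", "}]"):
        try:
            json.loads(prefix + suffix)
            return True
        except json.JSONDecodeError:
            continue
    return False

def mask_tokens(prefix: str, candidates: list[str]) -> list[str]:
    return [tok for tok in candidates if could_become_valid_json(prefix + tok)]

# The missing-closing-bracket example: after "[1, 2, 3" the end-of-text token
# is never allowed, but a digit, a comma plus a digit, or "]" is.
print(mask_tokens("[1, 2, 3", ["<eos>", "4", ", 4", "]"]))  # ['4', ', 4', ']']
```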
I have used structured outputs with both OpenAI and the Gemini models. In the beginning they had some rough edges, but lately it's been smooth sailing.
Seems like OpenRouter also supports structured outputs.
https://openrouter.ai/docs/guides/features/structured-output...
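For reference, a request using that feature looks roughly like this. It is a sketch against OpenRouter's OpenAI-compatible endpoint; the model slug and schema are placeholders, so check the linked doc for the exact field names before relying on it:

```python
# Sketch: requesting schema-constrained output from OpenRouter via response_format.
# Field names follow OpenRouter's structured-outputs doc; model and schema are examples.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
    json={
        "model": "google/gemini-2.0-flash-001",  # placeholder: any model flagged as supporting structured outputs
        "messages": [{"role": "user", "content": "List three prime numbers."}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "primes",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"primes": {"type": "array", "items": {"type": "integer"}}},
                    "required": ["primes"],
                    "additionalProperties": False,
                },
            },
        },
    },
)
print(resp.json()["choices"][0]["message"]["content"])  # should parse as {"primes": [...]}
```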
This really gets at the heart of my instinctive dislike of how LLMs are being deployed. A core feature of computers, and tools in general, is reliability. I like software because you can set something up, run it, and (ideally) know that it will do the same job the same way each subsequent time you run it. I want a button that is clearly labeled, and when pressed, does a specific thing, acting like a limb, an extension of my will. I do not, in almost all cases, want my computer to be another distinct entity that I conduct social interactions with.
Maybe people got used to computers being unreliable and unpredictable as the UIs we shipped became more distracting, less learnable, always shifting and hiding information, popping up suggestions and displaying non-deterministic-seeming behavior. We trained users to treat their devices like unruly animals that they can never quite trust. So now the idea of a machine that embodies a more clever (but still unreliable) animal to wrangle sounds like a clear upgrade.
But as someone who's spent an inordinate amount of time tweaking and tuning his computing environment to prune out flakey components and fine-tune bindings and navigation, the idea of integrating a tool into my workflow that does amazing things but fails utterly even 1% of the time sounds like a nightmare, a sort of perpetual torture of low-grade anxiety.
> We trained users to treat their devices like unruly animals that they can never quite trust. So now the idea of a machine that embodies a more clever (but still unreliable) animal to wrangle sounds like a clear upgrade.
I wish I didn't agree with this, but I think you're exactly right. Even engineers dealing with systems we know are deterministic will joke about making the right sacrifices to the tech gods to get such-and-such working. Take that a step further and maybe it doesn't feel too bad to some people for the system to actually not be deterministic, if you have a way to "convince" it to do what you want. How depressing.
How do they know the output needs to be in JSON format?
This is good. Is there a Python library to do this?
This is incredible!