It's an interesting paper, but I'd like to see a lot more about the types of errors that the LLM makes. Are they happening in the forward pass or the inverse pass? My guess is the inverse pass.
I really liked the evaluation method here - testing fidelity by round-tripping through chains of invertible steps. It was striking how even frontier models accumulated errors on seemingly computer-friendly tasks.
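If I understand the setup, it's roughly this shape: run a chain of invertible steps forward, then invert them in reverse order, and check whether the original text comes back unchanged. This is just my sketch of the structure, not the paper's actual harness; `apply_step` is a hypothetical stand-in for the model call, and the example chain is mine, not theirs.

```python
def apply_step(instruction: str, text: str) -> str:
    # Hypothetical placeholder for an LLM call that performs `instruction`
    # on `text`; wire in your own client here.
    raise NotImplementedError(instruction)


def round_trip_ok(text: str, steps: list[tuple[str, str]]) -> bool:
    """Each element of `steps` is a (forward_instruction, inverse_instruction) pair."""
    out = text
    for forward, _ in steps:             # forward pass through the chain
        out = apply_step(forward, out)
    for _, inverse in reversed(steps):   # inverse pass, in reverse order
        out = apply_step(inverse, out)
    return out == text                   # exact-match fidelity check


# Illustrative chain (my examples, not taken from the paper):
chain = [
    ("Base64-encode the text", "Base64-decode the text"),
    ("Translate to French", "Translate the French back to English"),
]
```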
It would be interesting to know whether the stronger results on Python are just an artefact of the Python-specific evaluation, whether they carry over to other common general-purpose languages, and whether they are driven by something specific in the training process.
I played around with a local LLM to try to build a wiki-like DAG. It made a lot of stupid errors, from vague generic things like inferring content from file names, to not following redirects and just putting the redirect response in instead.
I've also had them convert something like an Excel-formatted document to Markdown. It worked pretty well as long as I was examining the output. But the longer it ran in one context, the more likely it was to slip in things that seemed related but weren't part of the breakdown.
The only way I've found to mitigate some of it is to make every file a small, purpose-built doc. That way you can use git to revert changes, and you also limit the damage to that small context every time they touch a file.
Anyone who thinks they're a genius at creating or updating docs isn't actually reading the output.
Yeah I've been saying this for a while: AI-washing any text will degrade it, compounding with each pass.
"Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...
I've been calling it meanwit reversion
By „with each pass” do you mean within the same session, or with a new session (context window) each time?