"In this post, I’ll cover a third, not-so-obvious approach: building ways for the agent to validate more of its own work before a human has to step in. "
"In this post, I’ll cover a third, not-so-obvious approach: building ways for the agent to validate more of its own work before a human has to step in. "
This has been an obvious thing to do since at least January (since Geoffrey Huntley published "everything is a ralph loop"), and this is how I've been working: build enough orchestration tooling to automate everything: development container bring-up, the build, the unit tests, integration testing, and eventually using the software as an end user would. Then, to iterate, set performance goals on an already solid basis so the automated agent ("gym") can go and iterate autonomously and let you know when it's "done".
I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.
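A minimal sketch of the kind of gate described above, assuming each stage (build, unit tests, integration tests) is just a command that exits non-zero on failure. The names `run_checks`, `ready_for_human` and `CheckResult` are hypothetical, and the stage commands in the demo are placeholders rather than a real toolchain:

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str  # combined stdout/stderr, fed back to the agent on failure


def run_checks(steps):
    """Run (name, argv) stages in order and stop at the first failure."""
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append(CheckResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # later stages are pointless until this one is fixed
    return results


def ready_for_human(results, expected_steps):
    """Escalate to a reviewer only when every stage ran and passed."""
    return len(results) == expected_steps and all(r.passed for r in results)


if __name__ == "__main__":
    # Placeholder stages: swap in your real build and test commands.
    ok = [sys.executable, "-c", "print('ok')"]
    stages = [("build", ok), ("unit", ok), ("integration", ok)]
    results = run_checks(stages)
    print(ready_for_human(results, len(stages)))
```

The agent loop would call `run_checks` after each change, feed the failing stage's `output` back into the next prompt, and only notify a human once `ready_for_human` returns true.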
What license do you use then?
You can pay purely by volume ("API pricing").
Isn't this a bit of an incorrect usage of the term "backpressure"?
OP quoted the correct definition right at the start:
> In systems engineering, backpressure is the mechanism by which a downstream component signals upstream that it can't accept more work
(the "downstream component" being the human reviewer in this case)
But the measures they propose don't actually do that. They are more like fixed throttle elements, which would slow an agent's rate of submissions and weed out some low-quality submissions before they hit "downstream".
I'm missing the connection to the actual capacity (or will) that the human developers have to review the submissions.
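To make the distinction concrete, here is a rough sketch of what backpressure in the quoted sense could look like: a bounded review queue whose capacity is the signal sent upstream. `ReviewQueue`, `submit` and `review_one` are hypothetical names for illustration, not anything from the article:

```python
import queue


class ReviewQueue:
    """Bounded queue: the reviewer's remaining capacity is the upstream signal."""

    def __init__(self, capacity):
        self._pending = queue.Queue(maxsize=capacity)

    def submit(self, pr):
        """Return True if accepted, False if the reviewer is saturated."""
        try:
            self._pending.put_nowait(pr)
            return True
        except queue.Full:
            return False  # backpressure: the agent must hold off

    def review_one(self):
        """The human finishes a review, which frees one slot."""
        return self._pending.get_nowait()


if __name__ == "__main__":
    reviews = ReviewQueue(capacity=2)
    print(reviews.submit("pr-1"), reviews.submit("pr-2"))  # both accepted
    print(reviews.submit("pr-3"))  # rejected until a review completes
    reviews.review_one()
    print(reviews.submit("pr-3"))  # accepted now that a slot is free
```

A fixed throttle or a quality gate never consults the reviewer's state, whereas here the agent's submission rate is bounded by how fast `review_one` is actually called.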
Author here. Well noted. I do think backpressure might not be the ideal analogy/term.
It comes from previous posts I’ve come across, but I hadn’t considered exactly what you mentioned. That’s on me.
lol your comment sounds like a Claude apology
I had a colleague called Claude. Half of the company was blaming everything on him...
It is an incorrect use of what was already a flawed metaphor. Pressure is isotropic. Directed pressure makes no sense, like all other fluid analogies in unrelated fields of engineering.
Wait so cross ventilation, where a breeze will flow through a house if windows are open on opposite sides at a much greater rate than if windows are only open on the upwind side… isn’t really a thing?
The act of "making pressure" means applying a force and is completely directional.
If the system's invariants are well defined, and a suite of conformance + requirements tests (ensuring the invariants are respected) is defined, wouldn't this be a broad, _'base case'_ approach in general?
A very long post about a simple and very obvious idea with many different implementations.
The three main problems are: 1) API usage is deadly expensive, 2) Claude is about to make all automation very expensive, 3) all the flows where a model has the initiative are strictly biased towards unwarranted stops (checkpointing).
Also, I wouldn't call that "backpressure"; that's a discussion about organizational principles for systems which consist of multiple unreliable but very complex components, and this "backpressure" is just one of the aspects. Personally I find the viable system model framework productive as both a mental model and a literal implementation guideline.
A lesser problem is that agent SDKs are bad and building a custom harness is hard.
https://en.wikipedia.org/wiki/Back_pressure
The problem is not "backpressure", that's just one of the tools and there are different approaches with the same effect.
This seems to be the coding agents 101: build a strong feedback loop. Am I missing something?
No, there is nothing new in this article. This methodology is already adopted, and further optimized, by the software community.
Yeah I don’t really see the backpressure analogy here - it implies that the agent is constantly producing new stuff, which isn’t really possible since the solution is very detailed specs/goals.
Looks like plenty of recent prior art on this:
https://pura.xyz
https://github.com/puraxyz/puraxyz/blob/main/docs/paper/main...
> It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself.
Oh boy.
Care to elaborate?
That quote shows an utter disregard for basic human decency.
It is the responsibility of the person running the coding agent to make sure the resulting PRs are high quality. Putting that on your teammates, or worse, random open source project maintainers on the internet, is the definition of an extractive contribution.
They are probably reacting to the laughable idea that by making PRs 20% better (or whatever), devs will continue to review the code with sufficient rigor to catch even the bugs they're supposedly now preventing. Assuming such rigor was ever present in their work!
Put another way, who are they supposed to hire to tell these low quality PRs apart from the high quality ones? Who even knows how to do something like that?!
Interesting ideas for generalizing goals to reduce human labor in human <—> agent interactions. That said, maybe it is better to set up customized skills and infrastructure for large projects? At our early stage of trying to capture value of agentic systems, the good ideas in this article might be premature optimization.
Oh this is 101. Anyone not doing this? If not do it now!
Because your token use explodes?
I’m willing to be wrong but this industry-wide emphasis on AI creative/coding workflows seems way over-engineered.
IME, successful creative execution looks like micro-iterations where each output informs the next creative move.
I can build something incredibly fast from essentially caveman grunt instructions through an LLM harness, iterating as I go.
Optimizing for feeding a huge plan to an agent sounds to me like a net waste of time. And looking over the shoulder of industry peers trying to do this, I don’t see in their outputs or throughput any remarkable improvement over what I can produce with minimal-fanfare usage.
For me it's usually that I start with a single agent, but then I don't have anything to do while it is churning, and other ideas/features that I want to do keep building up, so I need to scale. While I'm scaling I need to start having those workflows, so eventually I end up with many agents, most of which are autonomous and working in their own worktrees, plus one specific agent that I talk to more iteratively.
So e.g. I may have 1 agent that I ask and iterate with directly, and 9 agents that work separately on their own.
I will utilize this 1 agent on features I care most about and want to guide and iterate on in as much detail as possible.
100% agree. Took me ages of working with the agents to circle back around this, which was the best way to get work done before AI automation anyways.
Yeah it's wild watching so many people decide waterfall is great all of a sudden.
Never mind stumbling into proper engineering principles like having documented, testable requirements specifications.
I suspect that letting agents spin away unattended for long stretches of time will become less and less popular as more and more companies blow their token budgets and start requiring some answers to difficult questions before agreeing to further loosen the purse strings.
The trick is to have 5 of those huge plans running in parallel.
I agree. I have gotten an incredible amount of work done iterating with 5-30 minute long agent tasks. But it requires I stay engaged, and not go chill on the beach, which I guess is a lot of agentmaxxers’ goal.
you do not need micro iterations. you can set macro goals and let the agent/LLM/model, whatever you want to call it, figure it out.
it works.
interesting idea, unfortunately programming the structure is equivalent (P=NP) to just programming itself. same as TDD.
as usual, the tool isn't really doing what's listed on its label.
however, people are different so this might improve someone's capability to deploy LLMs. might even provide better evidence where actual brain power is needed.