Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc, so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
There are higher and lower leverage ways to do that, for instance reviewing tests and QA'ing software via use vs reading original code, but you can't get away from doing it entirely.
I agree with this almost completely. The hard part isn’t generation anymore, it’s validation of intent vs outcome. Especially once decisions are high-stakes or irreversible, think pkg updates or large scale tx
What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.
Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
“Anymore?” After 40 years in software I’ll say that validation of intent vs. outcome has always been a hard problem. There are and have been no shortcuts other than determined human effort.
I don’t disagree. After decades, it’s still hard which is exactly why I think treating validation as a system problem matters.
We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.
This obviously depends on what you are trying to achieve but it’s worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I have suspicions we are currently underutilizing them (because historically they weren’t very fun to write, but if everything is just tokens then who cares).
And “define the spec concretely“ (and how to exploit emerging behaviors) becomes the new definition of what programming is.
(and unambiguously. and completely. For various depths of those)
This always has been the crux of programming. Just has been drowned in closer-to-the-machine more-deterministic verbosities, be it assembly, C, prolog, js, python, html, what-have-you
There have been a never ending attempts to reduce that to more away-from-machine representation. Low-code/no-code (anyone remember Last-one for Apple ][ ?), interpreting-and/or-generating-off DSLs of various level of abstraction, further to esperanto-like artificial reduced-ambiguity languages... some even english-like..
For some domains, above worked/works - and the (business)-analysts became new programmers. Some companies have such internal languages. For most others, not really.
And not that long ago, the SW-Engineer job was called Analyst-programmer.
AI also quickly goes off the rails, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. Worst thing is that if you let it, it will just grow tech debt on top of tech debt.
Tests are only rigorous if the correct intent is encoded in them. Perfectly working software can be wrong if the intent was inferred incorrectly. I leverage BDD heavily, and there a lot of little details it's possible to misinterpret going from spec -> code. If the spec was sufficient to fully specify the program, it would be the program, so there's lots of room for error in the transformation.
> You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And from my experience, it is very trustable in these aspects.
Can you detail a scenario by which an LLM can get the scenario wrong?
I do not trust the LLM to do it correctly. We do not have the same experience with them, and should not assume everyone does. To me, your question makes no sense to ask.
We should be able to measure this. I think verifying things is something an llm can do better than a human.
You and I disagree on this specific point.
Edit: I find your comment a bit distasteful. If you can provide a scenario where it can get it incorrect, that’s a good discussion point. I don’t see many places where LLMs can’t verify as good as humans. If I developed a new business logic like - users from country X should not be able to use this feature - LLM can very easily verify this by generating its own sample api call and checking the response.
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language. Having it write tests doesn't change this, it's only asserting that its view of what you want is internally consistent, it is still just as likely to be an incorrect interpretation of your intent.
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language.
Then it seems like the only workable solution from your perspective is a solo member team working on a product they came up with. Because as soon as there's more than one person on something, they have to use "lossy natural language" to communicate it between themselves.
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems to be a breakdown of the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.
This is an interesting point but if I may offer a different perspective:
Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
Now I've worked with many junior to mid-junior level SDEs and sadly 80% does not do a better job than Claude. (I've also worked with staff level SDEs who writes worse code than AI, but they offset that usually with domain knowledge and TL responsibilities)
I do see AI transform software engineering into even more of a pyramid with very few human on top.
Important too, a fully loaded salary costs the company far more than the actual salary that the employee receives. That would tip this balancing point towards 120k salaries, which is well into the realm of non-FAANG
It would depend on the speed of execution, if you can do the same amount of work in 5 days with spending 5k, vs spending a month and 5k on a human the math makes more sense.
Tokens will become significantly more expensive in the short term actually. This is not stemming from some sort of anti-AI sentiment. You have two ramps that are going to drive this. 1. Increase demand, linear growth at least but likely this is already exponential. 2. Scaling laws demand, well, more scale.
Future better models will both demand higher compute use AND higher energy. We cannot underestimate the slowness of energy production growth and also the supplies required for simply hooking things up. Some labs are commissioning their own power plants on site, but this is not a true accelerator for power grid growth limits. You're using the same supply chain to build your own power plant.
If inference cost is not dramatically reduced and models don't start meaningfully helping with innovations that make energy production faster and inference/training demand less power, the only way to control demand is to raise prices. Current inference costs, do not pay for training costs. They can probably continue to do that on funding alone, but once the demand curve hits the power production limits, only one thing can slow demand and that's raising the cost of use.
$1,000 is maybe 5$ per workday. I measure my own usage and am on the way to $6,000 for a full year. I'm still at the stage where I like to look at the code I produce, but I do believe we'll head to a state of software development where one day we won't need to.
Scroll further down (specifically to the section titled "Wait, $1,000/day per engineer?"). The quote in the quoted article (so from the original source in factory.strongdm.ai) could potentially be read either way, but Simon Willison (the direct link) absolutely is interpreting it as $1000/dev/day. I also think $1000/dev/day is the intended meaning in the strongdm article.
> That idea of treating scenarios as holdout sets—used to evaluate the software but not stored where the coding agents can see them—is fascinating. It imitates aggressive testing by an external QA team—an expensive but highly effective way of ensuring quality in traditional software.
This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.
The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.
Question for people who are already doing this: How much are you spending on tokens?
That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.
Re: $1k/day on tokens - you can also build a local rig, nothing "fancy". There was a recent thread here re: the utility of local models, even on not-so-fancy hardware. Agents were a big part of it - you just set a task and it's done at some point, while you sleep or you're off to somewhere or working on something else entirely or reading a book or whatever. Turn off notifications to avoid context switches.
> In rule form:
- Code must not be written by humans
- Code must not be reviewed by humans
as a previous strongDM customer, i will never recommend their offering again. for a core security product, this is not the flex they think it is
also mimicking other products behavior and staying in sync is a fools task. you certainly won't be able to do it just off the API documentation. you may get close, but never perfect and you're going to experience constant breakage
Important to note that this is the approach taken by their AI research lab over the past six months, it's not (yet) reflective of how they build the core product.
Thanks for the reply (always enjoy your sqlite content). It's definitely going to be interesting to see how all these AI labs playout when they are how the core business is built.
> As I understood it the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation.
This is still the same problem -- just pushed back a layer. Since the generated API is wrong, the QA outcomes will be wrong, too. Also, QAing things is an effective way to ensure that they work _after_ they've been reviewed by an engineer. A QA tester is not going to test for a vulnerability like a SQL injection unless they're guided by engineering judgement which comes from an understanding of the properties of the code under test.
The output is also essentially the definition of a derivative work, so it's probably not legally defensible (not that that's ever been a concern with LLMs).
The solution to this problem is not throwing everything at AI. To get good results from any AI model, you need an architect (human) instructing it from the top. And the logic behind this is that AI has been trained on millions of opinions on getting a particular task done. If you ask a human, they almost always have one opinionated approach for a given task. The human's opinion is a derivative of their lived experience, sometimes foreseeing all the way to the end result an AI cannot foresee. Eg. I want a database column a certain type because I'm thinking about adding an E-Commerce feature to my CMS later. An AI might not have this insight.
Of course, you can't always tell the model what to do, especially if it is a repeated task. It turns out, we already solved this decades ago using algorithms. Repeatable, reproducible, reliable. The challenge (and the reward) lies in separating the problem statement into algorithmic and agentic. Once you achieve this, the $1000 token usage is not needed at all.
I have a working prototype of the above and I'm currently productizing it (shameless plug):
However - I need to emphasize, the language you use to apply the pattern above matters. I use Elixir specifically for this, and it works really, really well.
It works based off starting with the architect. You. It feeds off specs and uses algorithms as much as possible to automate code generation (eg. Scaffolding) and only uses AI sparsely when needed.
Of course, the downside of this approach is that you can't just simply say "build me a social network". You can however say something like "Build me a social network where users can share photos, repost, like and comment on them".
Once you nail the models used in the MVC pattern, their relationships, the software design is pretty much 50% battle won. This is really good for v1 prototypes where you really want best practices enforced, OSWAP compliant code, security-first software output which is where a pure agentic/AI approach would mess up.
IT perspective here. Simon hits the nail on the head as to what I'm genuinely looking forward to:
> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!
This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants for SaaS product suites to handle configuration and integrations, while not gone, are certainly under threat by LLMs that can ingest user requirements and produce functional code to do a similar thing at a fraction of the price.
What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.
This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?
For full-fledged software, there's genuine benefit to be had with human intervention and creativity; for the multitude of integrations and pipelines that were previously farmed out to pricey consultants, LLMs will more than suffice for all but the biggest or most complex situations.
Stuff comes in from an API goes out to a different API.
With a semi-decent agent I can build what took me a week or two in hours just because it can iterate the solution faster than any human can type.
A new field in the API could’ve been a two day ordeal of patching it through umpteen layers of enterprise frameworks. Now I can just tell Claude to add it, it’ll do it up to the database in minutes - and update the tests at the same time.
And because these are all APIs, we can brute-force it with read-only operations with minimal review times. If the read works, the write almost always will, and then it's just a matter of reading and documenting the integration before testing it in dev or staging.
So much of enterprise IT nowadays is spent hammering or needling vendors for basic API documentation so we can write a one-off that hooks DB1 into ServiceNow that's also pulling from NewRelic just to do ITAM. Consultants would salivate over such a basic integration because it'd be their yearly salary over a three month project.
Now we can do this ourselves with an LLM in a single sprint.
On the cxdb “product” page one reason they give against rolling your own is that it would be “months of work”. Slipped into an archaic off-brand mindset there, no?
I like the idea but I'm not so sure this problem can be solved generally.
As an example: imagine someone writing a data pipeline for training a machine learning model. Anyone who's done this knows that such a task involves lots data wrangling work like cleaning data, changing columns and some ad hoc stuff.
The only way to verify that things work is if the eventual model that is trained performs well.
In this case, scenario testing doesn't scale up because the feedback loop is extremely large - you have to wait until the model is trained and tested on hold out data.
Scenario testing clearly can not work on the smaller parts of the work like data wrangling.
In this model the spec/scenarios are the code. These are curated and managed by humans just like code.
They say "non interactive". But of course their work is interactive. AI agents take a few minutes-hours whereas you can see code change result in seconds. That doesn't mean AI agents aren't interactive.
I'm very AI-positive, and what they're doing is different, but they are basically just lying. It's a new word for a new instance of the same old type of thing. It's not a new type of thing.
The common anti-AI trope is "AI just looked at <human output> to do this." The common AI trope from the StrongDM is "look, the agent is working without human input." Both of these takes are fundamentally flawed.
AI will always depend on humans to produce relevant results for humans. It's not a flaw of AI, it's more of a flaw of humans. Consequently, "AI needs human input to produce results we want to see" should not detract from the intelligence of AI.
Why is this true? At a certain point you just have Kolmogorov complexity, AI having fixed memory and fixed prompt size, pigeonhole principle, not every output is possible to be produced even with any input given specific model weights.
Recursive self-improvement doesn't get around this problem. Where does it get the data for next iteration? From interactions with humans.
With the infinite complexity of mathematics, for instance solving Busy Beaver numbers, this is a proof that AI can in fact not solve every problem. Humans seem to be limited in this regard as well, but there is no proof that humans are fundamentally limited this way like AI. This lack of proof of the limitations of humans is the precise advantage in intelligence that humans will always have over AI.
That's a genuine problem now. If you launch a new feature and your competition can ship their own copy a few hours later the competitive dynamics get really challenging!
My hunch is that the thing that's going to matter is network effects and other forms of soft lockin. Features alone won't cut it - you need to build something where value accumulates to your user over time in a way that discourages them from leaving.
The interesting part about that is both of those things require some sort of time to start.
If I launch a new product, and 4 hours later competitors pop up, then there's not enough time for network effects or lockin.
I'm guessing what is really going to be needed is something that can't be just copied. Non-public data, business contracts, something outside of software.
Wouldn't the incumbents with their fantastic distribution channels, brand, lockin, marketing, capital and own models just wipe the floor with everyone as talent no longer matters?
I recently passed 40,000 but my Substack is free so it's not a revenue source for me. I haven't really looked at who they are - at some point it would be interesting to export the CSV of the subscribers and count by domains, I guess.
My content revenue comes from ads on my blog via https://www.ethicalads.io/ - rarely more than $1,000 in a given month - and sponsors on GitHub: https://github.com/sponsors/simonw - which is adding up to quite good money now. Those people get my sponsors-only monthly newsletter which looks like this: https://gist.github.com/simonw/13e595a236218afce002e9aeafd75... - it's effectively the edited highlights from my blog because a lot of people are too busy to read everything I put out there!
Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc, so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
There are higher and lower leverage ways to do that, for instance reviewing tests and QA'ing software via use vs reading original code, but you can't get away from doing it entirely.
I agree with this almost completely. The hard part isn’t generation anymore, it’s validation of intent vs outcome. Especially once decisions are high-stakes or irreversible, think pkg updates or large scale tx
What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.
Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
“Anymore?” After 40 years in software I’ll say that validation of intent vs. outcome has always been a hard problem. There are and have been no shortcuts other than determined human effort.
I don’t disagree. After decades, it’s still hard which is exactly why I think treating validation as a system problem matters.
We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.
This obviously depends on what you are trying to achieve but it’s worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I have suspicions we are currently underutilizing them (because historically they weren’t very fun to write, but if everything is just tokens then who cares).
And “define the spec concretely“ (and how to exploit emerging behaviors) becomes the new definition of what programming is.
> “define the spec concretely“
(and unambiguously. and completely. For various depths of those)
This always has been the crux of programming. Just has been drowned in closer-to-the-machine more-deterministic verbosities, be it assembly, C, prolog, js, python, html, what-have-you
There have been a never ending attempts to reduce that to more away-from-machine representation. Low-code/no-code (anyone remember Last-one for Apple ][ ?), interpreting-and/or-generating-off DSLs of various level of abstraction, further to esperanto-like artificial reduced-ambiguity languages... some even english-like..
For some domains, above worked/works - and the (business)-analysts became new programmers. Some companies have such internal languages. For most others, not really. And not that long ago, the SW-Engineer job was called Analyst-programmer.
But still, the frontier is there to cross..
AI also quickly goes off the rails, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. Worst thing is that if you let it, it will just grow tech debt on top of tech debt.
did you read the article?
>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).
Tests are only rigorous if the correct intent is encoded in them. Perfectly working software can be wrong if the intent was inferred incorrectly. I leverage BDD heavily, and there a lot of little details it's possible to misinterpret going from spec -> code. If the spec was sufficient to fully specify the program, it would be the program, so there's lots of room for error in the transformation.
Then I disagree with you
> You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And from my experience, it is very trustable in these aspects.
Can you detail a scenario by which an LLM can get the scenario wrong?
I do not trust the LLM to do it correctly. We do not have the same experience with them, and should not assume everyone does. To me, your question makes no sense to ask.
We should be able to measure this. I think verifying things is something an llm can do better than a human.
You and I disagree on this specific point.
Edit: I find your comment a bit distasteful. If you can provide a scenario where it can get it incorrect, that’s a good discussion point. I don’t see many places where LLMs can’t verify as good as humans. If I developed a new business logic like - users from country X should not be able to use this feature - LLM can very easily verify this by generating its own sample api call and checking the response.
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language. Having it write tests doesn't change this, it's only asserting that its view of what you want is internally consistent, it is still just as likely to be an incorrect interpretation of your intent.
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language.
Then it seems like the only workable solution from your perspective is a solo member team working on a product they came up with. Because as soon as there's more than one person on something, they have to use "lossy natural language" to communicate it between themselves.
Coworkers are absolutely an ongoing point of friction everywhere :)
On the plus side, IMO nonverbal cues make it way easier to tell when a human doesn't understand things than an agent.
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems to be a breakdown of the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.
Yeah I'm going to update my piece to talk more about that.
Edit: here's that section: https://simonwillison.net/2026/Feb/7/software-factory/#wait-...
This is an interesting point but if I may offer a different perspective:
Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
Now I've worked with many junior to mid-junior level SDEs and sadly 80% does not do a better job than Claude. (I've also worked with staff level SDEs who writes worse code than AI, but they offset that usually with domain knowledge and TL responsibilities)
I do see AI transform software engineering into even more of a pyramid with very few human on top.
Important too, a fully loaded salary costs the company far more than the actual salary that the employee receives. That would tip this balancing point towards 120k salaries, which is well into the realm of non-FAANG
Original claim was:
> At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans
You say
> Assuming 20 working days a month: that's 20k x 12 == 240k a year. So about a fresh grad's TC at FANG.
So you both are in agreement on that part at least.
It would depend on the speed of execution, if you can do the same amount of work in 5 days with spending 5k, vs spending a month and 5k on a human the math makes more sense.
You won't know which path has larger long term costs, for a example, what if the AI version costs 10x to run?
If the output is (dis)proportionally larger, the cost trade off might be the right thing to do.
And it might be the tokens will become cheaper.
Tokens will become significantly more expensive in the short term actually. This is not stemming from some sort of anti-AI sentiment. You have two ramps that are going to drive this. 1. Increase demand, linear growth at least but likely this is already exponential. 2. Scaling laws demand, well, more scale.
Future better models will both demand higher compute use AND higher energy. We cannot underestimate the slowness of energy production growth and also the supplies required for simply hooking things up. Some labs are commissioning their own power plants on site, but this is not a true accelerator for power grid growth limits. You're using the same supply chain to build your own power plant.
If inference cost is not dramatically reduced and models don't start meaningfully helping with innovations that make energy production faster and inference/training demand less power, the only way to control demand is to raise prices. Current inference costs, do not pay for training costs. They can probably continue to do that on funding alone, but once the demand curve hits the power production limits, only one thing can slow demand and that's raising the cost of use.
$1,000 is maybe 5$ per workday. I measure my own usage and am on the way to $6,000 for a full year. I'm still at the stage where I like to look at the code I produce, but I do believe we'll head to a state of software development where one day we won't need to.
Maybe read that quote again. The figure is 1000 per day
The quote is if you haven't spent $1000 per dev today
which sounds more like if you haven't reached this point you don't have enough experience yet, keep going
At least that's how I read the quote
Scroll further down (specifically to the section titled "Wait, $1,000/day per engineer?"). The quote in the quoted article (so from the original source in factory.strongdm.ai) could potentially be read either way, but Simon Willison (the direct link) absolutely is interpreting it as $1000/dev/day. I also think $1000/dev/day is the intended meaning in the strongdm article.
> That idea of treating scenarios as holdout sets—used to evaluate the software but not stored where the coding agents can see them—is fascinating. It imitates aggressive testing by an external QA team—an expensive but highly effective way of ensuring quality in traditional software.
This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.
The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.
Question for people who are already doing this: How much are you spending on tokens?
That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.
Re: $1k/day on tokens - you can also build a local rig, nothing "fancy". There was a recent thread here re: the utility of local models, even on not-so-fancy hardware. Agents were a big part of it - you just set a task and it's done at some point, while you sleep or you're off to somewhere or working on something else entirely or reading a book or whatever. Turn off notifications to avoid context switches.
Check it: https://news.ycombinator.com/item?id=46838946
I wouldn't be surprised if agents start "bribing" each other.
Do you know what those hold out twats should look like before thoroughly iterating on the problem?
I think people are burning money on tokens letting these things fumble about until they arrive at some working set of files.
I'm staying in the loop more than this, building up rather than tuning out
> In rule form: - Code must not be written by humans - Code must not be reviewed by humans
as a previous strongDM customer, i will never recommend their offering again. for a core security product, this is not the flex they think it is
also mimicking other products behavior and staying in sync is a fools task. you certainly won't be able to do it just off the API documentation. you may get close, but never perfect and you're going to experience constant breakage
Important to note that this is the approach taken by their AI research lab over the past six months, it's not (yet) reflective of how they build the core product.
Right but how many unsuspecting customers like you do they need to have before they can exit?
They actually "exited" a few weeks ago - acquired by Delinea: https://delinea.com/news/delinea-strongdm-to-unite-redefine-...
From what I've heard the acquisition was unrelated to their AI lab work, it was about the core business.
Thanks for the reply (always enjoy your sqlite content). It's definitely going to be interesting to see how all these AI labs playout when they are how the core business is built.
> As I understood it the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation.
This is still the same problem -- just pushed back a layer. Since the generated API is wrong, the QA outcomes will be wrong, too. Also, QAing things is an effective way to ensure that they work _after_ they've been reviewed by an engineer. A QA tester is not going to test for a vulnerability like a SQL injection unless they're guided by engineering judgement which comes from an understanding of the properties of the code under test.
The output is also essentially the definition of a derivative work, so it's probably not legally defensible (not that that's ever been a concern with LLMs).
The solution to this problem is not throwing everything at AI. To get good results from any AI model, you need an architect (human) instructing it from the top. And the logic behind this is that AI has been trained on millions of opinions on getting a particular task done. If you ask a human, they almost always have one opinionated approach for a given task. The human's opinion is a derivative of their lived experience, sometimes foreseeing all the way to the end result an AI cannot foresee. Eg. I want a database column a certain type because I'm thinking about adding an E-Commerce feature to my CMS later. An AI might not have this insight.
Of course, you can't always tell the model what to do, especially if it is a repeated task. It turns out, we already solved this decades ago using algorithms. Repeatable, reproducible, reliable. The challenge (and the reward) lies in separating the problem statement into algorithmic and agentic. Once you achieve this, the $1000 token usage is not needed at all.
I have a working prototype of the above and I'm currently productizing it (shameless plug):
https://designflo.ai
However - I need to emphasize, the language you use to apply the pattern above matters. I use Elixir specifically for this, and it works really, really well.
It works based off starting with the architect. You. It feeds off specs and uses algorithms as much as possible to automate code generation (eg. Scaffolding) and only uses AI sparsely when needed.
Of course, the downside of this approach is that you can't just simply say "build me a social network". You can however say something like "Build me a social network where users can share photos, repost, like and comment on them".
Once you nail the models used in the MVC pattern, their relationships, the software design is pretty much 50% battle won. This is really good for v1 prototypes where you really want best practices enforced, OSWAP compliant code, security-first software output which is where a pure agentic/AI approach would mess up.
IT perspective here. Simon hits the nail on the head as to what I'm genuinely looking forward to:
> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!
This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants for SaaS product suites to handle configuration and integrations, while not gone, are certainly under threat by LLMs that can ingest user requirements and produce functional code to do a similar thing at a fraction of the price.
What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.
This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?
For full-fledged software, there's genuine benefit to be had with human intervention and creativity; for the multitude of integrations and pipelines that were previously farmed out to pricey consultants, LLMs will more than suffice for all but the biggest or most complex situations.
“API Glue” is what I’ve called it since forever
Stuff comes in from an API goes out to a different API.
With a semi-decent agent I can build what took me a week or two in hours just because it can iterate the solution faster than any human can type.
A new field in the API could’ve been a two day ordeal of patching it through umpteen layers of enterprise frameworks. Now I can just tell Claude to add it, it’ll do it up to the database in minutes - and update the tests at the same time.
And because these are all APIs, we can brute-force it with read-only operations with minimal review times. If the read works, the write almost always will, and then it's just a matter of reading and documenting the integration before testing it in dev or staging.
So much of enterprise IT nowadays is spent hammering or needling vendors for basic API documentation so we can write a one-off that hooks DB1 into ServiceNow that's also pulling from NewRelic just to do ITAM. Consultants would salivate over such a basic integration because it'd be their yearly salary over a three month project.
Now we can do this ourselves with an LLM in a single sprint.
That's a Pandora's Box moment right there.
On the cxdb “product” page one reason they give against rolling your own is that it would be “months of work”. Slipped into an archaic off-brand mindset there, no?
We make this great, just don't use it to build the same thing we offer
Heat death of the SaaSiverse
I like the idea but I'm not so sure this problem can be solved generally.
As an example: imagine someone writing a data pipeline for training a machine learning model. Anyone who's done this knows that such a task involves lots data wrangling work like cleaning data, changing columns and some ad hoc stuff.
The only way to verify that things work is if the eventual model that is trained performs well.
In this case, scenario testing doesn't scale up because the feedback loop is extremely large - you have to wait until the model is trained and tested on hold out data.
Scenario testing clearly can not work on the smaller parts of the work like data wrangling.
I can't tell if this is genius or terrifying given what their software does. Probably a bit of both.
I wonder what the security teams at companies that use StrongDM will think about this.
I doubt this would be allowed in regulated industries like healthcare
This is just sleight of hand.
In this model the spec/scenarios are the code. These are curated and managed by humans just like code.
They say "non interactive". But of course their work is interactive. AI agents take a few minutes-hours whereas you can see code change result in seconds. That doesn't mean AI agents aren't interactive.
I'm very AI-positive, and what they're doing is different, but they are basically just lying. It's a new word for a new instance of the same old type of thing. It's not a new type of thing.
The common anti-AI trope is "AI just looked at <human output> to do this." The common AI trope from the StrongDM is "look, the agent is working without human input." Both of these takes are fundamentally flawed.
AI will always depend on humans to produce relevant results for humans. It's not a flaw of AI, it's more of a flaw of humans. Consequently, "AI needs human input to produce results we want to see" should not detract from the intelligence of AI.
Why is this true? At a certain point you just have Kolmogorov complexity, AI having fixed memory and fixed prompt size, pigeonhole principle, not every output is possible to be produced even with any input given specific model weights.
Recursive self-improvement doesn't get around this problem. Where does it get the data for next iteration? From interactions with humans.
With the infinite complexity of mathematics, for instance solving Busy Beaver numbers, this is a proof that AI can in fact not solve every problem. Humans seem to be limited in this regard as well, but there is no proof that humans are fundamentally limited this way like AI. This lack of proof of the limitations of humans is the precise advantage in intelligence that humans will always have over AI.
Serious question: what's keeping a competitor from doing the same thing and doing it better than you?
That's a genuine problem now. If you launch a new feature and your competition can ship their own copy a few hours later the competitive dynamics get really challenging!
My hunch is that the thing that's going to matter is network effects and other forms of soft lockin. Features alone won't cut it - you need to build something where value accumulates to your user over time in a way that discourages them from leaving.
The interesting part about that is both of those things require some sort of time to start.
If I launch a new product, and 4 hours later competitors pop up, then there's not enough time for network effects or lockin.
I'm guessing what is really going to be needed is something that can't be just copied. Non-public data, business contracts, something outside of software.
Marketing and brand are still the most important, though I personally hope for a world where business is more indie and less winner take all
You can see the first waves of this trend in HN new.
Wouldn't the incumbents with their fantastic distribution channels, brand, lockin, marketing, capital and own models just wipe the floor with everyone as talent no longer matters?
Can you disclose the number of Substack subscriptions and whether there is an unusual amount of bulk subscriptions from certain entities?
I recently passed 40,000 but my Substack is free so it's not a revenue source for me. I haven't really looked at who they are - at some point it would be interesting to export the CSV of the subscribers and count by domains, I guess.
My content revenue comes from ads on my blog via https://www.ethicalads.io/ - rarely more than $1,000 in a given month - and sponsors on GitHub: https://github.com/sponsors/simonw - which is adding up to quite good money now. Those people get my sponsors-only monthly newsletter which looks like this: https://gist.github.com/simonw/13e595a236218afce002e9aeafd75... - it's effectively the edited highlights from my blog because a lot of people are too busy to read everything I put out there!
I try to keep my disclosures updated on the about page of my blog: https://simonwillison.net/about/#disclosures
Is this article sponsored by StrongDM?
No.