> The worktree system removed the friction of context-switching - juggling multiple streams of work without them colliding.
I'm so conflicted about this. On the one hand I love the buzz of feeling so productive and working on many different threads. On the other hand my brain gets so fried, and I think this is a big contributor.
I would like some research regarding multi agent flows and impact on speed and correctness, because I have a feeling that it's like a texting and driving situation, where self perception of skill loss and measured skill loss diverge.
I also have nothing to back it up, but it fits my mental models. When juggling multiple things as humans, it eats up your context window (working memory). After a long day, your coherence degrades and your context window needs flushing (sleeping) and you need to start a new session (new day, or post-nap afternoon).
You do lose context, but if you generate a plan beforehand and save it, it's easier to regain that context when you come back. I've been able to get things out a lot more quickly this way, because instead of "working" that day, I'll just review the work that's been queued up and focus on it one item at a time. I'm still the bottleneck, but it has allowed me to move more quickly at times.
Is constant juggling of multiple agents productive? I haven't seen the allure (except maybe with 2 agents sometimes). I guess it depends on what kind of tasks one is doing and I can imagine it working if doing large, long-running tasks, but then reviewing those large changes and refactoring becomes more difficult. And if you're juggling multiple agents, there's the mental context switching and tooling overhead for managing them. Maybe predictable and repetitive tasks can work well.
I prefer focusing mostly on 1 task at a time (sometimes 2 for a short time, or asking another agent some questions simultaneously) and doing the task in chunks so it doesn't take much time until you have something to review. Then I review it, maybe ask for some refactoring and let it continue to the next step (maybe let it continue a bit before finishing review if feeling confident about the code). It's easier to review smaller self-contained chunks, and easier to refer to code and tell the AI what needs changing, because there are fewer relevant lines.
The way I handle this is that I just create pull requests (tell the agent to do it at the end), and then I'll come back at a later time to review, so I always have stuff queued up to review.
I do parallel agents in worktrees and I don't always constantly keep an eye on them like a fry cook flipping 20 burgers at once. Sometimes it's just nice to know that I can spin one up, come back tomorrow, and some progress has been made without breaking my current flow.
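For anyone who hasn't tried the worktree setup described above, here is a minimal sketch of what "one worktree per agent task" can look like. It only builds the `git worktree add` commands and does not run them; the `agent/<task>` branch naming and the sibling-directory layout are assumptions made up for illustration, not anyone's actual setup.

```python
from pathlib import Path


def worktree_commands(repo: Path, tasks: list[str]) -> list[list[str]]:
    """Build one `git worktree add` command per task (nothing is executed)."""
    commands = []
    for task in tasks:
        branch = f"agent/{task}"                  # hypothetical branch naming
        path = repo.parent / f"{repo.name}-{task}"  # hypothetical sibling directory
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]
        )
    return commands


if __name__ == "__main__":
    for cmd in worktree_commands(Path("/tmp/app"), ["fix-login", "add-export"]):
        print(" ".join(cmd))
```

Each command creates a new branch checked out in its own directory, so two agents never edit the same working copy. Passing a list to `subprocess.run(cmd, check=True)` would execute it.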
I have a little ai-commit.sh wired up as "send" in package.json, which describes my changes and commits them. Formatting has been solved by linters already. Neither my approach nor OP's approach is ground-breaking, but I think mine is faster. You can also run !p send (p is my alias for pnpm) from inside Claude, so there's no need for it to make a skill and create overhead.
Thinking about it, a PR skill is pretty much an antipattern; even telling the AI to just create a PR is faster.
I think some vibe coders should let AI teach them some CLI tooling.
I don't understand the "being more productive" part. Like, sure, LLMs make us iterate faster but our managers know we're using them! They don't naively think we suddenly became 10x engineers. Companies pay for these tools and every engineer has access to them. So if everyone is equally productive, the baseline just shifted up... same as always, no?
Mentioning LLM usage as a distinction is like bragging about using a modern compiler instead of writing assembly. Yeah it's faster, but so is everyone else's code...
Besides, I wouldn't brag about being more productive with LLMs because it's a double-edged sword: it's very easy to use them, and nobody is reviewing all the lines of code you are pushing to prod (really, when was the last time you reviewed a PR generated by AI that changed 20+ files and added/removed thousands of lines of code?), so you don't know what the long game of your changes is; they seem to work now but who knows how it will turn out later?
Sometimes outcomes and achievements and work product are useful beyond just... stack ranking yourself against your peers. Seems so odd to me that this is your mentality unless you're earlier in your career.
Fair enough. I've been in software longer than I would like to admit. And the longer I'm in, the less I care about achievements in a work environment. All I care about is that the company pays me every month, because companies don't care about me (they care about my outcome per hour/week/month). So it's essential to rank yourself high against your peers (while being ethical and the like, ofc), otherwise you are out in the next layoff. I know not every company is like this, but the vast majority of tech companies are.
Outside of work, yeah, everything is fine and there's nothing but the pure pursuit of knowledge and joy.
I don't know. Claude helped me implement a ton of features I had been procrastinating for months in a matter of days. I'm implementing features in my project faster than I can blog about them. It definitely manifested as a huge commit spike.
And it's not like I'm blindly committing LLM output. I often write everything myself because I want to understand what I'm doing. Claude often comments that my version is better and cleaner. It's just that the tasks seemed so monumental I felt paralyzed and had difficulty even starting. Claude broke things down into manageable steps that were easy to do. Having a code review partner was also invaluable.
This right here is the big value I see in LLMs as well. I specifically suffer from analysis paralysis when starting something big and just getting skeletonized cheap code out quick as a template then refining it is much more to my strengths. I am ADHD and task breakdown is a known difficulty for that disorder so it has been hugely helpful.
That said, by the time I'm happy with it, all the AI stuff outside very boilerplate ops/config stuff has been rewritten and refined. I just find it quite helpful to get over that initial hump of "I have nothing but a dream" to the stage of "I have a thing that compiles but is terrible". Once I can compile it, I can refine, which is where my strengths lie.
I have quite a lot of skepticism about that as well. I didn't mean to imply I believed it. I was trying to express that I wasn't lazily copy pasting the LLM output.
I'm taking the time to understand what it is proposing. I'm pushing back and asking for clarifications. When I implement things, I do it myself in my own way. Even so I experienced a huge increase in my ability to make the cool stuff I've always wanted to make.
I can't even fathom how productive the people who have Claude Code cranking out features on multiple git worktrees in parallel must be. I can totally understand doing that in a job setting.
In Claude's world, every user is a generational genius up there with Gauss and Euler, every new suggestion, no matter how banal, is a mind boggling Copernican turn that upends epistemology as we know it.
My fav metric for codebase improvement (not feature improvement) is negative LOC. Nothing beats a patch that only deletes things without breaking anything or simply removing tests. Just dead code deletion.
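If you want to actually track this, net LOC for a patch can be read off `git diff --numstat`, which prints tab-separated added, deleted, and path columns per file (binary files show `-` for both counts). A minimal sketch; the function name and the sample paths are made up for illustration.

```python
def net_loc(numstat_output: str) -> int:
    """Sum (added - deleted) over the lines of `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files have no line counts
        total += int(added) - int(deleted)
    return total


# Hypothetical output for a dead-code cleanup patch.
SAMPLE = (
    "0\t120\tsrc/legacy_export.py\n"
    "3\t45\tsrc/report.py\n"
    "-\t-\tassets/logo.png\n"
)

if __name__ == "__main__":
    print(net_loc(SAMPLE))  # negative means the patch shrank the codebase
```

A negative number alone doesn't tell you whether the deleted lines were dead code or tests, so it still needs a human look at which paths shrank.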
> It's about on par with the ridiculousness of LOC implying code quality.
The most effective engineers on the brownfield projects I've worked on usually deleted more LOC than they added, because they were always looking to simplify the code and replace it with useful (and often shorter) abstractions.
Yeah it's very much the opposite of how Claude Code tends to approach a problem it hasn't seen before, which tends toward constructing an elaborate Rube Goldberg machine by just inserting more and more logic until it manages to produce the desired outcome. You can coax it into simplifying its output, but it's very time consuming to get something that is of a professional standard, and doesn't introduce technical debt.
Especially in brownfield settings, if you do use CC, you really should be spending something like a day refactoring the code for every 15 minutes of work it spends implementing new functionality. Otherwise the accumulation of technical debt will make the code base unworkable by both human and Claude hands in a fairly short time.
I think overall it can be a force for good, and a source of high quality code, but it requires a significant amount of human intervention.
Claude Code operating on unsupervised Claude code fairly rapidly generates a mess not even Claude Code can decode, resulting in a sort of technical debt Kessler syndrome, where the low quality makes the edits worse, which makes the quality worse, rinse and repeat.
> I’m not “using a tool that writes code.” I’m in a tight loop: kick off a task, the agent writes code, I check the preview, read the diff, give feedback or merge, kick off the next task
The assumption behind this workflow is that Claude Code can complete tasks with little or no oversight.
If the flow looks like review->accept, review->accept, it is manageable.
In my personal experience, Claude needs heavy guidance and multiple rounds of feedback before arriving at a mergeable solution (if it does at all).
Interleaving many long running tasks with multiple rounds of feedback does not scale well unfortunately.
I can only remember so much, and at some point I spend more time trying to understand what has been done so far to give accurate feedback than actually giving feedback for the next iteration.
This is basically the same workflow I've come to adopt. I don't use any "pre-built" skills; mine are actually still .md files in the .claude/commands/ folder because that's when I started. The workflow is so good, I'm the bottleneck.
I've started to use git worktrees to parallelize my work. I spend so much time waiting...why not wait less on 2 things? This is not a solved problem in my setup. I have a hard time managing just two agents and keeping them isolated. But again, I'm the bottleneck. I think I could use 5 agents if my brain were smarter........or if the tools were better.
I am also a PM by day and I'm in Claude Code for PM work almost 90% of my day.
I like Claude, at least when the user reviews the code before asking for a PR. But gods I hate tickets/feature requests written by Opus/Sonnet (or worse: Codex or Gemini). If you know/understand your product enough it's probably less of a problem for your team than it is for mine, but each time I see a feature request automagically written in the backlog I know I will have to spend at least 30 minutes rewriting it so that it doesn't take us one hour to refine it collectively.
> I hope they also get penalised when a lowly worker does a bad thing, even if the worker is an LLM silently misinterpreting a vague instruction.
Yeah the buck stops with the manager (IMO the direct manager). So I can do some constructive criticism with my dev if they make a mistake, but it's my fault in the larger org that it happened. Then it's my manager's job to work with me to make sure I create the environment where the same mistake doesn't happen again. Am I training well? Am I giving them well-scoped work? All that.
Yup, the manager gets implicit credit for the work their team does. In most cases, deservedly so. I don't see why it should be any different for engineers using LLMs as "direct reports". Not all engineers will be the same level of "good" with LLM tools so the better you are (as with any other skill as well) the more credit you would receive.
Are you kidding? What else would managers get credit for? They don't produce anything the company is interested in. They steer, they manage, and so if the ones being managed produce the thing the company is interested in, then sure all the credit goes to the team (including the manager!). As it usually happens, getting credit means nothing if not accompanied by a salary bump or something like that. And as it usually happens, not the whole team can get a salary bump. So the ones who get the bump are usually one or two seniors on the team, plus the manager of course... because the manager is the gatekeeper between upper management (the ones who approve salary bumps) and the ICs... and no sane manager would sacrifice a salary bump for themselves just to give it away to an IC. And that's not being a bad manager, that's simply being human. Also if you think about it, if the team succeeded in delivering "the thing", then the manager would think it's partially because of their managing, and so he/she would believe a salary bump is deserved.
When things go south, no penalization is made. A simple "post-mortem" is written in Confluence and people write "action items". So, yeah, no need for the manager to get the blame.
It's all very shitty, but it's always been like that.
I don't know if I am just in an unlucky A/B assignment or anything but I really don't understand people juggling multiple agent sessions. For me Opus 4.6 High performance went from unbelievable to mediocre. And this keeps happening making the whole agentic coding very unreliable and frustrating. I do use it but I have to babysit and I get overwhelmed even with a single session.
> What’s become more fun is building the infrastructure that makes the agents effective.
Solving new problems is a thing engineers get to do constantly, whereas building an agent infrastructure is mostly a one-ish time thing. Yes, it evolves, but I worry that once the fun of building an agentic engineering system is done, we’re stuck doing arguably the most tedious job in the SDLC, reviewing code. It’s like if you were a principal researcher who stopped doing research and instead only peer reviewed other people’s papers.
The silver lining is if the feeling of faster progress through these AI tools gives enough satisfaction to replace the missing satisfaction of problem-solving. Different people will derive different levels of contentment from this. For me, it has not been an obvious upgrade in satisfaction. I’m definitely spending less time in flow.
> The PR descriptions are more thorough than what I’d write, because it reads the full diff and summarises the changes properly. I’d gotten so used to the drudgery that I’d stopped noticing it was drudgery.
Who are you creating PR descriptions for, exactly? If you consider it "drudgery", how do you think your coworkers will feel having to read pages of generic "AI" text? If reviewing can be considered "drudgery" as well, can we also offload that to "AI"? In which case, why even bother with PRs at all? Why are you still participating in a ceremony that was useful for humans to share knowledge and improve the codebase, when machines don't need any of it?
> My role has changed. I used to derive joy from figuring out a complicated problem, spending hours crafting the perfect UI. [...] What’s become more fun is building the infrastructure that makes the agents effective. Being a manager of a team of ten versus being a solo dev.
Yeah, it's great that you enjoy being a "manager" now. Personally, that is not what I enjoy doing, nor why I joined this industry.
Quick question: do you think your manager role is safe from being automated away? If machines can write code and prose now better than you, couldn't they also manage other machines into producing useful output better than you? So which role is left for you, and would you enjoy doing it if "manager" is not available?
Purely rhetorical, of course, since I don't think the base premise is true, besides the fact that it's ignoring important factors in software development such as quality, reliability, maintainability, etc. This idea that the role of an IC has now shifted into management is amusing. It sounds like a coping mechanism for people to prove that they can still provide value while facing redundancy.
So many pretend they are more productive but so few are able to articulate what they actually produced.
Some say features. Well, are they used? Are they beneficial in any way for our society or humanity? Or are we producing junk for the sake of producing?
As an outsider it seems like agentic coders get buried in the weeds of running agents in parallel and churning out commits. (Even after a sheepish “commits are a bad metric but”) And every week there is a new orchestration, something, who even cares.
Is that the end game? Well why can’t the agents orchestrate the agents? Agents all the way down?
The whole agent coding scene seems like people selling their soul for very shiny inflatable balloons. Now you have twelve bespoke apps tailored for you that you don’t even care about.
> The PR descriptions are more thorough than what I’d write
Why do people do this? Why do they outsource to AI something that is meant to be written by a human, so that another human can actually understand what the first human wanted to do? It just doesn't make sense.
We have “Cursor Bot” enabled at work. It reviews our PRs (in addition to a human review).
One thing it does is add a PR summary to the PR description. It’s kind of helpful since it outlines a clear list of what changed in code. But it would be very lacking if it was the full PR description. It doesn’t include anything about _why_ the changes were made, what else was tried, what is coming next, etc.
I've been doing a lot of parallel work and it can be draining. It feels exciting to have 6 agents spinning on things, but unless you have very well scoped plans, you need to still check in frequently.
If you have the tokens for it, having a team of agents checking and improving on the work does help a lot and reduces the slop.
This is the "lines of code per week" metric from the 90s, repackaged. "I'm doing more PRs" is not evidence that AI is working; it's evidence that you are merging more. Whether that's good depends entirely on what you are merging. I use AI every day too. But treating throughput of code going to production as a success metric, without any mention of quality, bugs, or maintenance burden, is exactly the kind of thinking developers used to push back on when management proposed it.
Turns out we weren't opposed to bad metrics! We were just opposed to being measured! Given the chance to pick our own, we jumped straight to the same nonsense.
And the author has a blog post about burnout and anxiety. Maybe all of those things are related.
Working to the point of making yourself sick should not be seen as a mark of pride, it is a sign that something is broken. Not necessarily the individual, maybe the system the individual is in.
Lines of code are meaningful when taken in aggregate and useless as a metric for an individual’s contributions.
COCOMO, which considers lines of code, is generally accepted as being accurate (enough) at estimating the value of a software system, at least as far as how courts (in the US) are concerned.
https://en.wikipedia.org/wiki/COCOMO
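For context on what "considers lines of code" means here: the Basic COCOMO model estimates effort as E = a * KLOC^b person-months, with published coefficients per project class. A minimal sketch using those Basic COCOMO coefficients; the function name and the example sizes are illustrative and not taken from the thread.

```python
# Basic COCOMO (Boehm, 1981): effort in person-months from size in KLOC.
# (a, b) coefficients per project class, as published for the basic model.
COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def cocomo_effort(kloc: float, project_class: str = "organic") -> float:
    """Return estimated effort in person-months: a * KLOC ** b."""
    a, b = COEFFICIENTS[project_class]
    return a * kloc ** b


if __name__ == "__main__":
    for size in (10, 20):
        print(f"{size} KLOC -> {cocomo_effort(size):.1f} person-months")
```

Because the exponent is above 1, doubling the line count slightly more than doubles the estimate. The model takes size as a given input and has no notion of whether those lines were necessary.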
> Lines of code are meaningful when taken in aggregate
The linked article does not demonstrate this. It establishes no causal link. One can obviously bloat LOC to an arbitrary degree while maintaining feature parity. Very generously, assuming good faith participants, it might reflect a kind of average human efficiency within the fixed environment of the time.
Carrying the conclusions of this study from the 80s into the LLM age is not justified scientifically.
Maybe the author knows that too, but wants to talk about it nonetheless. First line of the article: “Commits are a terrible metric for output, but they're the most visible signal I have.”
What about number of working features or system completeness? Current state vs desired state is fairly visible.
How do you define system completeness? What if you ship one really big feature vs three really small ones?
I would posit that you need extra context to obtain meaning from those metrics, which inherently makes them less visible.
> Turns out we weren't opposed to bad metrics! We were just opposed to being measured! Given the chance to pick our own, we jumped straight to the same nonsense.
This seems like a distinction without a difference, unless there actually are any good metrics (which also requires them to be objectively and reliably quantifiable). I think most developers don't really want to measure themselves, it's just that pro-AI people think measurement is necessary to put forward a convincing argument that they've improved anything.
The only time metrics have been useful to me in the past is when they are kept private to each team, which is to say that I do think they are useful for measuring yourself, but not for others to measure you. Taken over time, they can eventually give you a really good idea of what you can deliver. Sandbag a bit (i.e., undershoot that number), communicate that to ye olde stakeholders, and everybody's happy that you can actually do what you say you'll do without being stressed out (obviously this doesn't work in startups).
I'm also trying everything to learn how to use Claude, everything is so new. And keep upgrading.
Honest question: if you're using multiple agents, it's usually not to produce a dozen lines of code. It's to produce a big enough feature spanning multiple files, modules and entry points, with tests and all. So far so good. But once that feature is written by the agents... wouldn't you review it? Like reading line by line what's going on and detecting if something is off? And wouldn't that part, the manual reviewing, take an enormous amount of time compared to the time it took the agents to produce it? (you know, it's more difficult to read other people's/machine code than to write it yourself)... meaning all the productivity gained is thrown out the door.
Unless you don't review every generated line manually, and instead rely on, let's say, UI e2e testing, or perhaps unit testing (that the agents also wrote). I don't know, perhaps we are past the phase of "double check what agents write" and are now in the phase of "ship it. if it breaks, let agents fix it, no manual debugging needed!" ?
This is the biggest bottleneck for me. What's worse is that LLMs have a bad habit of being very verbose and rewriting things that don't need to be touched, so the surface area for change is much larger.
It's kind of weird; I jumped on the vibe coding opencode bandwagon but using local 395+ w/128; qwen coder. Now, it takes a bit to get the first tokens flowing, and the cache works well enough to get it going, but it's not fast enough to just set it and forget it, and it's clear when it goes in an absurd direction and either deviates from my intention or simply loads some context where it should have followed a pattern, whatever.
I'm sure these larger models are both faster and more cogent, but it's also clear that what matters is managing its side tracks and cutting them short. Then I started seeing the deeper problematic pattern.
Agents aren't there to increase the multifactor of production; their real purpose is to shorten context to manageable levels. In effect, they basically try to reduce the odds of longer context poisoning.
So, if we boil down the probability of any given token triggering the wrong subcontext, it's clear that the greater the context, the greater the odds of a poison substitution.
Then that's really the problematic issue every model is going to contend with because there's zero reality in which a single model is good enough. So now you're onto agents, breaking a problem into more manageable subcontext and trying to put that back into the larger context gracefully, etc.
Then that fails, because there's zero consistent determinism, so you end up at the harness, trying to herd the cats. This is all before you realize that these businesses can't just keep throwing GPUs at everything, because the problem isn't computing bound, it's contextual/DAG the same way a brain is limited.
We all got intelligence and use several orders of magnitude less energy, doing mostly the same thing.
Here's what I suggest:
Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
Enforce single responsibility, cqrs, domain segregation, etc. Make the code as easy for you to reason about as possible. Enforce domain naming and function / variable naming conventions to make the code as easy to talk about as possible.
Use code review bots (Sourcery, CodeRabbit, and Codescene). They catch the small things (violations of contract, antipatterns, etc.) and the large (ux concerns, architectural flaws, etc.).
Go all in on linting. Make the rules as strict as possible, and tell the review bots to call out rule subversions. Write your own lints for the things the review bots are complaining about regularly that aren't caught by lints.
Use BDD alongside unit tests, read the .feature files before the build and give feedback. Use property testing as part of your normal testing strategy. Snapshot testing, e2e testing with mitm proxies, etc. For functions of any non-trivial complexity, consider bounded or unbounded proofs, model checking or undefined behaviour testing.
I'm looking into mutation testing and fuzzing too, but I am still learning.
Pause for frequent code audits. Ask an agent to audit for code duplication, redundancy, poor assumptions, architectural or domain violations, TOCTOU violations. Give yourself maintenance sprints where you pay down debt before resuming new features.
The beauty of agentic coding is, suddenly you have time for all of this.
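As a rough illustration of the "write your own lints" suggestion, here is a minimal sketch in Python using the standard `ast` module. The rule itself (banning direct calls to certain built-ins) is only a stand-in for whatever your review bots keep flagging; `find_banned_calls`, `SAMPLE` and the banned set are hypothetical names, not part of any tool mentioned above.

```python
import ast


def find_banned_calls(source: str, banned: set[str]) -> list[tuple[int, str]]:
    """Return (line number, function name) for every direct call to a banned name."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        # Only plain calls such as eval(...) are matched; attribute calls
        # like obj.eval(...) are deliberately ignored in this sketch.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in banned
        ):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)


# Hypothetical snippet that the custom rule should reject.
SAMPLE = """\
def load(path):
    raw = open(path).read()
    return eval(raw)
"""

if __name__ == "__main__":
    for lineno, name in find_banned_calls(SAMPLE, {"eval", "exec"}):
        print(f"line {lineno}: call to {name}() is not allowed in this project")
```

A check like this can run in CI next to the regular linter, so the review bots stop repeating the same complaint on every PR.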
> Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
I feel like I am a bit crazy or stupid to not be able to do this. My process is more iterative. I start working on a feature, then I discover some other function that's slightly related. I go refactor it into common code, then proceed with the original task. Sometimes I stop midway to see if this can be done with a library somewhere and go look at examples. I take many detours like these. I am never working on a single task like a robot. That seems so opposite of how my brain works.
What am I missing?
I use coding agents to produce a lot of code that I don’t ship. But I do ship the output of the code.
> The worktree system removed the friction of context-switching - juggling multiple streams of work without them colliding.
I'm so conflicted about this. On the one hand I love the buzz of feeling so productive and working on many different threads. On the other hand my brain gets so fried, and I think this is a big contributor.
I would like some research regarding multi agent flows and impact on speed and correctness, because I have a feeling that it's like a texting and driving situation, where self perception of skill loss and measured skill loss diverge.
I have nothing to back up the idea though.
Ooooh very interesting idea.
I also have nothing to back it up, but it fits my mental models. When juggling multiple things as humans, it eats up your context window (working memory). After a long day, your coherence degrades and your context window needs flushing (sleeping) and you need to start a new session (new day, or post-nap afternoon).
you do lose context, but if you generate a plan beforehand and save it, then it makes it easier to regain that context when you come back. I've been able to get things out a lot more quickly this way, because instead of "working" that day, I'll just review the work that's been queued up and focus on it one item at a time, so I'm still the bottleneck but it has allowed me to move more quickly at times.
Is constant juggling of multiple agents productive? I haven't seen the allure (except maybe with 2 agents sometimes). I guess it depends on what kind of tasks one is doing and I can imagine it working if doing large, long-running tasks, but then reviewing those large changes and refactoring becomes more difficult. And if you're juggling multiple agents, there's the mental context switching and tooling overhead for managing them. Maybe predictable and repetitive tasks can work well.
I prefer focusing mostly on 1 task at a time (sometimes 2 for a short time, or asking another agent some questions simultaneously) and doing the task in chunks so it doesn't take much time until you have something to review. Then I review it, maybe ask for some refactoring and let it continue to the next step (maybe let it continue a bit before finishing review if feeling confident about the code). It's easier to review smaller self-contained chunks and easier to refer to code and tell the AI what needs changing because there are fewer relevant lines.
the way I handle this is that I just create pull requests (tell the agent to do it at the end), and then I'll come back at a later time to review, so I always have stuff queued up to review.
I do parallel agents in worktrees and I don't always constantly keep an eye on them like a fry cook flipping 20 burgers at once. Sometimes it's just nice to know that I can spin one up, come back tomorrow, and some progress has been made without breaking my current flow.
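For anyone who hasn't tried this, a minimal sketch of the worktree setup might look like the following. It wraps the real `git worktree add -b <branch> <path>` command; the `agent/` branch prefix, the sibling-directory layout and the function names are my own assumptions, not anything the posters described.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, task: str) -> list[str]:
    """Build the git command that creates an isolated worktree for one task."""
    branch = f"agent/{task}"  # hypothetical branch naming scheme
    path = repo.parent / f"{repo.name}-{task}"  # sibling directory next to the repo
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]


def create_worktree(repo: Path, task: str) -> None:
    """Run the command; raises CalledProcessError if git refuses (e.g. branch exists)."""
    subprocess.run(worktree_add_command(repo, task), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(" ".join(worktree_add_command(Path.cwd(), "fix-login")))
```

Each agent then gets its own checkout and branch, so one can be left running overnight without touching the working directory you are actively using.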
I have a little ai-commit.sh wired up as "send" in package.json, which describes my changes and commits them. Formatting has already been solved by linters. Neither my approach nor OP's approach is ground-breaking, but I think mine is faster; you can also run !p send (p is an alias for pnpm) from inside claude, so there's no need for it to make a skill and create overhead.
Thinking about it, a PR skill is pretty much an antipattern; even just telling the AI to create a PR is faster.
I think some vibe coders should let AI teach them some CLI tooling.
I don't understand the "being more productive" part. Like, sure, LLMs make us iterate faster but our managers know we're using them! They don't naively think we suddenly became 10x engineers. Companies pay for these tools and every engineer has access to them. So if everyone is equally productive, the baseline just shifted up... same as always, no?
Mentioning LLM usage as a distinction is like bragging about using a modern compiler instead of writing assembly. Yeah it's faster, but so is everyone else's code... Besides, I wouldn't brag about being more productive with LLMs because it's a double-edged sword: it's very easy to use them, and nobody is reviewing all the lines of code you are pushing to prod (really, when was the last time you reviewed a PR generated by AI that changed 20+ files and added/removed thousands of lines of code?), so you don't know what the long game of your changes is; they seem to work now but who knows how it will turn out later?
Sometimes outcomes and achievements and work product are useful beyond just... stack ranking yourself against your peers. Seems so odd to me that this is your mentality unless you're earlier in your career.
Fair enough. I've been in software longer than I would like to admit. And the longer I'm in, the less I care about achievements in a work environment. All I care about is that the company pays me every month, because companies don't care about me (they care about my output per hour/week/month). So it's essential to rank yourself high against your peers (while being ethical and the like, ofc), otherwise you are out in the next layoff. I know not every company is like this, but the vast majority of tech companies are.
Outside of work, yeah, everything is fine and there's nothing but the pure pursuit of knowledge and joy.
Usually hedonic adaptation ends up catching up, and then it’s just the new baseline.
I like llms too, and I think they make me more productive..
but a chart of commits/contribs is such a lousy metric for productivity.
It's about on par with the ridiculousness of LOC implying code quality.
I don't know. Claude helped me implement a ton of features I had been procrastinating for months in a matter of days. I'm implementing features in my project faster than I can blog about them. It definitely manifested as a huge commit spike.
And it's not like I'm blindly committing LLM output. I often write everything myself because I want to understand what I'm doing. Claude often comments that my version is better and cleaner. It's just that the tasks seemed so monumental I felt paralyzed and had difficulty even starting. Claude broke things down into manageable steps that were easy to do. Having a code review partner was also invaluable.
This right here is the big value I see in LLMs as well. I specifically suffer from analysis paralysis when starting something big and just getting skeletonized cheap code out quick as a template then refining it is much more to my strengths. I am ADHD and task breakdown is a known difficulty for that disorder so it has been hugely helpful.
That said, by the time I'm happy with it, all the AI stuff outside very boilerplate ops/config stuff has been rewritten and refined. I just find it quite helpful to get over that initial hump of "I have nothing but a dream" to the stage of "I have a thing that compiles but is terrible". Once I can compile it, I can refine, which is where my strengths lie.
> Claude often comments that my version is better and cleaner.
Every comment I make is a "really perceptive observation" according to Claude and every question I ask is either "brilliant" or at least "good", so...
I have quite a lot of skepticism about that as well. I didn't mean to imply I believed it. I was trying to express that I wasn't lazily copy pasting the LLM output.
I'm taking the time to understand what it is proposing. I'm pushing back and asking for clarifications. When I implement things, I do it myself in my own way. Even so I experienced a huge increase in my ability to make the cool stuff I've always wanted to make.
I can't even fathom how productive the people who have Claude Code cranking out features on multiple git worktrees in parallel must be. I can totally understand doing that in a job setting.
In Claude's world, every user is a generational genius up there with Gauss and Euler, every new suggestion, no matter how banal, is a mind boggling Copernican turn that upends epistemology as we know it.
My fav metric for codebase improvement (not feature improvement) is negative LOC. Nothing beats a patch that only deletes things without breaking anything or simply removing tests. Just dead code deletion.
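If you want to actually track that, a rough sketch is to sum the output of `git diff --numstat`, which prints one tab-separated `added  deleted  path` line per file (with `-` in both columns for binary files). The `net_loc` helper and the sample data below are made up for illustration.

```python
def net_loc(numstat_output: str) -> int:
    """Sum (added - deleted) over the lines printed by `git diff --numstat`."""
    net = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files have no line counts
        net += int(added) - int(deleted)
    return net


# Made-up numstat output for a patch that mostly deletes dead code.
SAMPLE = "0\t120\tsrc/legacy.py\n3\t40\tsrc/app.py\n-\t-\tassets/logo.png\n"

if __name__ == "__main__":
    print(f"net LOC: {net_loc(SAMPLE)}")
```

A negative result only means the patch removed more lines than it added; it says nothing about whether tests were among the deleted lines, so that still needs a human look.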
> It's about on par with the ridiculousness of LOC implying code quality.
The most effective engineers on the brownfield projects I've worked on usually deleted more LOC than they added, because they were always looking to simplify the code and replace it with useful (and often shorter) abstractions.
Yeah it's very much the opposite of how Claude Code tends to approach a problem it hasn't seen before, which tends toward constructing an elaborate Rube Goldberg machine by just inserting more and more logic until it manages to produce the desired outcome. You can coax it into simplifying its output, but it's very time consuming to get something that is of a professional standard, and doesn't introduce technical debt.
Especially in brownfield settings, if you do use CC, you really should be spending something like a day refactoring the code for every 15 minutes of work it spends implementing new functionality. Otherwise the accumulation of technical debt will make the code base unworkable by both human and claude hands in a fairly short time.
I think overall it can be a force for good, and a source of high quality code, but it requires a significant amount of human intervention.
Claude Code operating on unsupervised Claude code fairly rapidly generates a mess not even Claude Code can decode, resulting in a sort of technical debt Kessler syndrome, where the low quality makes the edits worse, which makes the quality worse, rinse and repeat.
> I’m not “using a tool that writes code.” I’m in a tight loop: kick off a task, the agent writes code, I check the preview, read the diff, give feedback or merge, kick off the next task
The assumption behind this workflow is that claude code can complete tasks with little or no oversight.
If the flow looks like review->accept, review->accept, it is manageable.
In my personal experience, claude needs heavy guidance and multiple rounds of feedback before arriving at a mergeable solution (if it does at all).
Interleaving many long running tasks with multiple rounds of feedback does not scale well unfortunately.
I can only remember so much, and at some point I spend more time trying to understand what has been done so far to give accurate feedback than actually giving feedback for the next iteration.
This is basically the same workflow I've come to adopt. I don't use any "pre-built" skills, mine are actually still .md files in the .claude/command/ folder because that's when I started. The workflow is so good, I'm the bottleneck.
I've started to use git worktrees to parallelize my work. I spend so much time waiting...why not wait less on 2 things? This is not a solved problem in my setup. I have a hard time managing just two agents and keeping them isolated. But again, I'm the bottleneck. I think I could use 5 agents if my brain were smarter........or if the tools were better.
I am also a PM by day and I'm in Claude Code for PM work almost 90% of my day.
I like Claude, at least when the user reviews the code before asking for a PR. But gods I hate tickets/feature requests written by Opus/Sonnet (or worse: Codex or Gemini). If you know/understand your product enough it's probably less of a problem for your team than it is for mine, but each time I see a feature request automagically written in the backlog I know I will have to spend at least 30 minutes rewriting it so that it doesn't take us one hour to refine it collectively.
> And like any good manager, you get to claim credit for all the work your “team” does.
Is that how it works? Do managers claim credit for the work of those below them, despite not doing the work?
I hope they also get penalised when a lowly worker does a bad thing, even if the worker is an LLM silently misinterpreting a vague instruction.
> I hope they also get penalised when a lowly worker does a bad thing, even if the worker is an LLM silently misinterpreting a vague instruction.
Yeah the buck stops with the manager (IMO the direct manager). So I can do some constructive criticism with my dev if they make a mistake, but it's my fault in the larger org that it happened. Then it's my manager's job to work with me to make sure I create the environment where the same mistake doesn't happen again. Am I training well? Am I giving them well-scoped work? All that.
Yup, the manager gets implicit credit for the work their team does. In most cases, deservedly so. I don't see why it should be any different for engineers using LLMs as "direct reports". Not all engineers will be the same level of "good" with LLM tools so the better you are (as with any other skill as well) the more credit you would receive.
Are you kidding? What else would managers get credit for? They don't produce anything the company is interested in. They steer, they manage, and so if the ones being managed produce the thing the company is interested in, then sure all the credit goes to the team (including the manager!). As it usually happens, getting credit means nothing if not accompanied by a salary bump or something like that. And as it usually happens, not the whole team can get a salary bump. So the ones who get the bump are usually one or two seniors on the team, plus the manager of course... because the manager is the gatekeeper between upper management (the ones who approve salary bumps) and the ICs... and no sane manager would sacrifice a salary bump for themselves just to give it away to an IC. And that's not being a bad manager, that's simply being human. Also if you think about it, if the team succeeded in delivering "the thing", then the manager would think it's partially because of their managing, and so he/she would believe a salary bump is deserved.
When things go south, no penalization is made. A simple "post-mortem" is written in confluence and people write "action items". So, yeah, no need for the manager to get the blame.
It's all very shitty, but it's always been like that.
Yes. That is how management works. Although a good manager will focus some of that praise onto team members who deserve it.
I don't know if I am just in an unlucky A/B assignment or anything but I really don't understand people juggling multiple agent sessions. For me Opus 4.6 High performance went from unbelievable to mediocre. And this keeps happening making the whole agentic coding very unreliable and frustrating. I do use it but I have to babysit and I get overwhelmed even with a single session.
> What’s become more fun is building the infrastructure that makes the agents effective.
Solving new problems is a thing engineers get to do constantly, whereas building an agent infrastructure is mostly a one-ish time thing. Yes, it evolves, but I worry that once the fun of building an agentic engineering system is done, we’re stuck doing arguably the most tedious job in the SDLC, reviewing code. It’s like if you were a principal researcher who stopped doing research and instead only peer reviewed other people’s papers.
The silver lining is if the feeling of faster progress through these AI tools gives enough satisfaction to replace the missing satisfaction of problem-solving. Different people will derive different levels of contentment from this. For me, it has not been an obvious upgrade in satisfaction. I’m definitely spending less time in flow.
> The PR descriptions are more thorough than what I’d write, because it reads the full diff and summarises the changes properly. I’d gotten so used to the drudgery that I’d stopped noticing it was drudgery.
Who are you creating PR descriptions for, exactly? If you consider it "drudgery", how do you think your coworkers will feel having to read pages of generic "AI" text? If reviewing can be considered "drudgery" as well, can we also offload that to "AI"? In which case, why even bother with PRs at all? Why are you still participating in a ceremony that was useful for humans to share knowledge and improve the codebase, when machines don't need any of it?
> My role has changed. I used to derive joy from figuring out a complicated problem, spending hours crafting the perfect UI. [...] What’s become more fun is building the infrastructure that makes the agents effective. Being a manager of a team of ten versus being a solo dev.
Yeah, it's great that you enjoy being a "manager" now. Personally, that is not what I enjoy doing, nor why I joined this industry.
Quick question: do you think your manager role is safe from being automated away? If machines can write code and prose now better than you, couldn't they also manage other machines into producing useful output better than you? So which role is left for you, and would you enjoy doing it if "manager" is not available?
Purely rhetorical, of course, since I don't think the base premise is true, besides the fact that it's ignoring important factors in software development such as quality, reliability, maintainability, etc. This idea that the role of an IC has now shifted into management is amusing. It sounds like a coping mechanism for people to prove that they can still provide value while facing redundancy.
So many pretend they are more productive but so few are able to articulate what they actually produced.
Some say features. Well, are they used? Are they beneficial in any way for our society or humanity? Or are we producing junk for the sake of producing?
As an outsider it seems like agentic coders get buried in the weeds of running agents in parallel and churning out commits. (Even after a sheepish “commits are a bad metric but”) And every week there is a new orchestration, something, who even cares.
Is that the end game? Well why can’t the agents orchestrate the agents? Agents all the way down?
The whole agent coding scene seems like people selling their soul for very shiny inflatable balloons. Now you have twelve bespoke apps tailored for you that you don’t even care about.
> The PR descriptions are more thorough than what I’d write
Why do people do this? A PR description is meant to be written by a human, so that another human can actually understand what that first human wanted to do, so why do people outsource that to AI? It just doesn't make sense.
Yeah I agree.
We have “Cursor Bot” enabled at work. It reviews our PRs (in addition to a human review)
One thing it does is add a PR summary to the PR description. It’s kind of helpful since it outlines a clear list of what changed in code. But it would be very lacking if it was the full PR description. It doesn’t include anything about _why_ the changes were made, what else was tried, what is coming next, etc.
I've been doing a lot of parallel work and it can be draining. It feels exciting to have 6 agents spinning on things, but unless you have very well scoped plans, you need to still check in frequently.
If you have the tokens for it, having a team of agents checking and improving on the work does help a lot and reduces the slop.