The author partially acknowledges this later on, but lines of code is actually quite a useful metric. The only mistake is that people have it flipped. Lines of code are bad, and you should target fewer lines of code (though not at the expense of other considerations). I regularly track LoC, because if it goes up more than I predicted, I probably did something wrong.
> Bill Gates compared measuring programming progress by lines of code to measuring aircraft building progress by weight
Aircraft weight is also a very useful metric, and aircraft weight is also bad: lower is better. But we do measure it!
Dijkstra’s quote from 1988 is even better: "My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger."
I think the author is missing a key distinction.
Before, lines of code were (mis)used to try to measure individual developer productivity. And there was a collective realization that this fails, because good refactoring can reduce LoC, a better design may use fewer lines, etc.
But LoC never went away, for example, for estimating the overall level of complexity of a project. There's generally a valid distinction between an app that has 1K, 10K, 100K, or 1M lines of code.
Now, the author is describing LoC as a metric for determining the proportion of AI-generated code in a codebase. And just like estimating overall project complexity, there doesn't seem to be anything inherently problematic about this. It seems good to understand whether 5% or 50% of your code is written using AI, because that has gigantic implications for how the project is managed, particularly from a quality perspective.
Yes, as the author explains, if AI code is more repetitive and needs refactoring, then the AI proportion will overstate how much functionality that code actually contributes. But at the same time, it accurately reflects how that code may be a larger surface for bugs, exploits, etc.
And when the author talks about big tech companies bragging about the high percentage of LoC being generated with AI... who cares? It's obviously just for press. I would assume (hope) that code review practices haven't changed inside of Microsoft or Google. The point is, I don't see these numbers as being "targets" in the way that LoC once were for individual developer productivity... they're more just a description of how useful these tools are becoming, and a vanity metric for companies signaling to investors that they're using new tools efficiently.
> the overall level of complexity of a project
The overall level of complexity of a project is not an "up means good" kind of measure. If you can achieve the same amount of functionality, obtain the same user experience, and have the same reliability with less complexity, you should.
Accidental complexity, as defined by Brooks in No Silver Bullet, should be minimized.
So the question is, are these AI tools primarily creating inherent complexity, or is it a significant amount of accidental complexity?
And if AI tools are writing all of the code, does it even matter anymore?
> The overall level of complexity of a project is not an "up means good" kind of measure.
I never said it was. To the contrary, it's more of an indication of how much more complex large refactorings might be, how complex it might be to add a new feature that will wind up touching a lot of parts, or how long a security audit might take.
The point is, it's important to measure things. Not as a "target", but simply so you can make more informed decisions.
A bit of a nit, but accidental complexity is still complexity. So even if that 1M lines could in principle be reduced to 2k, it's still way more complex to maintain and patch than a codebase that's properly minimized at, say, 10k lines. (Even though this sounds unreasonable, I don't doubt it happens.)
If tech companies want to show they have a high percentage of LoC being generated by AI, it's likely they are going to encourage developers to use AI to further increase these numbers, at which point it does become a measure of productivity.
> It seems good to understand whether 5% or 50% of your code is written using AI, because that has gigantic implications for how the project is managed, particularly from a quality perspective.
I'd say you're operating on a higher plane of thought than the majority in this industry right now. Because the majority view roughly appears to be "Need bigger number!", with very little thought, let alone deep thought, employed towards the whys or wherefores thereof.
A higher plane of thought would be: "Was AI able to remove 5% or 50% of the code while keeping or adding functionality and not diminishing clarity, consistency, and correctness?"
I don't think the author is missing this distinction. It seems that you agree with him in his main point which is that companies bragging about LOCs generated by AI should be ignored by right-thinking people. It's just you buried that substantive agreement at the end of your "rebuttal".
> would assume (hope) that code review practices haven't changed inside of Microsoft or Google.
Google engineer perspective:
I'm actually thinking code reviews are one of the lowest hanging fruits for AI here. We have AI reviewers now in addition to the required human reviews, and it can do anything from being overly defensive at times, to flagging inconsistently named variables (helpful), to sometimes finding a pretty big footgun that might otherwise have been missed.
Even if it's not better than a human reviewer, the faster turnaround time for some small % of potential bugs is a big productivity boost.
I've noticed that it's super easy to end up with tons of extra lines of code when using AI. It will write code that's already available (whether that's in external libraries, internal libraries, or code already in the same project). I don't mind trying to keep dependencies down, but I also don't want every project to have its own poorly tested CSV parser.
It also often fails to clean up after itself. When you remove a feature (one that you may not have even explicitly asked for) it will sometimes just leave behind the unused code. This is really annoying when reviewing and you realize one of the files you read through is referenced nowhere.
You have to keep a close eye out to prevent bloat from these issues.
Absolutely. Even worse, when you ask AI to solve a problem, it almost always adds code, even if a better solution exists that removes code. If the AI's solution fails and you ask it to fix it, it throws in even more code, creates more mess, and introduces new unnecessary states. Rinse, repeat ad infinitum.
I did this a few times as an experiment while knowing how a problem could be solved. In difficult situations Cursor invariably adds code and creates even more mess.
I wonder if this can be mitigated somehow at the inference level because prompts don't seem to be helping with this problem.
This is kind of an aside, but nowadays we could at least be counting (lexer) tokens of code instead of lines of code. Or even number of AST edges.
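For Python source at least, the standard library makes both alternatives easy to sketch. This is purely illustrative (a real tool would need per-language lexers, and the "layout" token set below is my own assumption about what counts as pure formatting):

```python
import ast
import io
import tokenize

def count_metrics(source: str) -> dict:
    """Compare physical lines, lexer tokens, and AST edges for Python source."""
    lines = len(source.splitlines())
    # Skip tokens that only encode layout, so formatting doesn't inflate the count.
    layout = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
              tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = sum(1 for tok in tokenize.generate_tokens(io.StringIO(source).readline)
                 if tok.type not in layout)
    # A tree has (nodes - 1) edges, so identical programs give identical
    # edge counts no matter how they are formatted.
    edges = len(list(ast.walk(ast.parse(source)))) - 1
    return {"lines": lines, "tokens": tokens, "edges": edges}

dense = "x=1;y=2;z=x+y"
sparse = "x = 1\ny = 2\nz = x + y\n"
# Same program: the line counts differ 3x, the AST-edge counts are identical.
```

The AST-edge count is the most formatting-proof of the three, which is exactly why it is also the least convenient to compute across a polyglot codebase.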
It reminds me. When I had my consulting company 20 years ago, I defined these "metrics" to decide if a project was successful.
- Is the client happy?
- Are the team members growing (as in learning)?
- Were we able to make a profit?
Everything else was less relevant. For example: why do I care that the project took a bit longer, if at the end the client was happy with the result, and we can continue the relationship with new projects? It frees you from the cruelty of dates that are often set arbitrarily.
So perhaps we should evaluate AI coding tools the same. If we can deliver successful projects in a sustainable way, then we are good.
The code written by AI is in most cases throwaway code to be improved/refined later. It's likely to be large, verbose, and bloated. Some agents have "simplify/refactor" as a final step in their design to remedy this, but typically your average vibe coder will be satisfied that the code just compiles/passes the minimal tests. Lines of code are easy to grow. If you refine the AI code with iterative back-and-forth questions, the AI can in principle be forced to write a much more compact or elegant version, but you can't apply this to most large systems without breaking something, as the AI doesn't have context for what is actually changing: an isolated function can be improved easily, but the AI can't cope when abstractions stack and multiple systems interface, typically because it confuses states where global context is altered.
I've been working to overcome this exact problem. I believe it's fully tractable. With proper deterministic tooling, the right words in context to anchor latent space, and a pilot with the software design skills to do it themselves, AI can help the pilot write and iterate upon properly-designed code faster than typing and using a traditional IDE. And along the way, it serves as a better rubber duck (but worse than a skilled pair programmer).
LoC is a good code quality metric, only it has to be inverted. Not "it wrote a C compiler in 100,000 lines of code", but "in just 2,000 lines of code". Now that is impressive and deserves praise.
I don't want to maintain an information system written in the style of code golf competitions, either.
I like the author's proposed "Comprehension coverage" metric. It aligns well with Naur's Programming as Theory Building.
This is also easily gameable: in most languages, any program can be collapsed into a single line of code.
Forget about if LoC is (or ever was) a meaningful metric; in a world where humans - and now AI - essentially wires up massive amounts of existing functionality with relatively little glue code I don't even know how to measure it.
Lines of code changed are a very bad measure for humans because they can be gamed. It's an OK measure for "work done" with AI if you don't prompt the model to game it. It's useful because it's quick to calculate and concrete, but if you use it as more than a proxy it'll bite you.
Detection of copy-pasta is interesting. What it's calling out is not a deficiency in LLMs' ability to code but in the agentic rules in place, which should just remind the agent to refactor into a common function when appropriate.
>> that should just remind the agent to refactor into a common function when appropriate.
This off-the-cuff statement buries so much complexity. Sure, it catches new code that exactly implements existing code, but IME it is __way__ more common to need to slightly (or not so slightly) change existing code so that it can now be used by multiple consumers, and then delete the new "duplicate" code. That is not trivial and requires (1) judgement from your AI coder and (2) deep reviewer expertise from your human coder.
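To make that concrete, exact-duplicate detection really is the easy half. A naive sketch (function name, window size, and normalization are all made up for illustration) that hashes runs of normalized lines catches verbatim copy-paste, and misses exactly the near-duplicate case described above:

```python
import hashlib
from collections import defaultdict

def find_exact_duplicates(files, window=5):
    """Flag identical runs of `window` normalized lines across files.
    Normalization is deliberately trivial: strip whitespace, drop blank
    lines. A single renamed variable defeats it, which is the hard case
    that needs real judgement to refactor."""
    seen = defaultdict(list)
    for name, text in files.items():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        for i in range(len(lines) - window + 1):
            digest = hashlib.sha1("\n".join(lines[i:i + window]).encode()).hexdigest()
            seen[digest].append((name, i))
    return [locs for locs in seen.values() if len(locs) > 1]
```

A verbatim copy in a second file is flagged; rename one identifier and the duplicate silently disappears, so "refactor into a common function when appropriate" still needs someone, human or AI, to notice the near-miss.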
I should change my linkedin to vibe coder fixer
I do think that "yo, look at me pumping out 10kloc/day" brags are quite stupid, because it simply does not take that many lines of code to support a large, profitable product. I'd love to hear other people's experiences here, but I'd say that a product with 100K-500K LOC can support a profitable company of dozens of people and generate tens of millions of revenue.
Now 100k LOC is roughly 1M tokens, which cost a few dollars to generate, so how could something that costs single-digit dollars possibly be worth tens of millions in value? Clearly there's a substantial gap in how useful different pieces of code are, so bragging about how much of it you produce, without telling me how valuable it is, is useless. I guess it's a long-winded way of saying "show me the money".
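The back-of-envelope behind that claim, with every input marked as a rough assumption (the tokens-per-line ratio and the per-token price are guesses, not quoted figures):

```python
# Rough cost to *generate* a mid-sized product's codebase with an LLM.
LOC = 100_000           # product size from the comment above
TOKENS_PER_LINE = 10    # assumed average tokens per line of code
PRICE_PER_MTOK = 5.0    # assumed dollars per million output tokens

tokens = LOC * TOKENS_PER_LINE
cost = tokens / 1_000_000 * PRICE_PER_MTOK
# ~1M tokens for single-digit dollars, versus a product worth tens of
# millions: the value clearly isn't in the volume of text produced.
```

Swap in whatever ratio and price you believe; the gap stays several orders of magnitude wide, which is the whole point.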
I was enjoying the article, and then the author hit me with one of these:
> AI didn't just repeat the mistake. It broke the mistake open.
Come on bruh
When I was just beginning, all of the productivity measures would be 0 and I felt like a failure. The most attainable was lines of code. Currently it's not a great measure of productivity, as I'm achieving more advanced tasks. I've heard so many opinions about how LOC isn't a great measure, and then the same people get to trample on all of the work I've done, out of spite, because I've written more code than them. I think LOC is great because productivity measures are for beginners and for people who don't understand code. The audience doesn't know the difference between writing a hundred or thousands of lines of code; both are trophies to them.
These metrics for advanced roles are not applicable, no matter what you come up with. But even lines of code are good enough to see progress from a blank slate. Every developer or advanced AI agent must be judged on a case by case basis.
But if you remove 1k LOC and replace them with 25 lines of a much better function, the report shows -975 LOC. Does this count as negative productivity? Having brackets start and stop on their own lines could double LOC counts, but would that actually improve the codebase? Would it be a sign of doubled productivity?
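As a toy illustration of how brace placement alone moves the physical count (the C snippet is hypothetical, and the "logical" filter below is just one crude convention: ignore brace-only lines):

```python
# Two formattings of the same hypothetical C function, treated as plain text.
knr = """int clamp(int x) {
    if (x < 0) {
        return 0;
    }
    return x;
}"""

allman = """int clamp(int x)
{
    if (x < 0)
    {
        return 0;
    }
    return x;
}"""

def physical(src):
    return len(src.splitlines())

def logical(src):
    # Crude "logical" count: ignore lines that are nothing but a brace.
    return sum(1 for ln in src.splitlines() if ln.strip() not in ("{", "}"))
```

The physical counts are 6 versus 8 for the same function, while the brace-ignoring count is 4 for both, so any LoC-based productivity number quietly inherits whichever convention the formatter enforces.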
The OpenBSD project prides itself on producing very secure, bug-free software, and they largely trend towards as few lines of code as they can possibly get away with while maintaining readability (so no code-golf tricks, for the most part). I would rather we write secure, bug-free software than speed up the ability to output 10k LOC. Typing out the code isn't the difficult part in that scenario.
No one judges a painting by the amount of paint, or a wooden chair by the number of nails in it. The amount of LoC doesn’t matter. What matters is that the code is bug-free, readable, and maintainable.
But reducing the amount of LoC helps, just like using the correct word helps in writing text. That’s the craft part of software engineering, having the taste to write clear and good code.
And just like writing and any other craft, the best way to acquire such taste is to study others' works.