This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks):
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test-case as the author says "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly imrpessive IMO.
This is really pushing it, considering it’s trained on… internet, with all available c compilers. The work is already impressive enough, no need for such misleading statements.
The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage, it may remember certain over-represented parts, otherwise it has a knowledge about a lot of things but such knowledge, while about a huge amount of topics, is similar to the way you could remember things you know very well. And, indeed, if you give it access to internet or the source code of GCC and other compilers, it will implement such a project N times faster.
We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.
The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"
Besides, the fact an LLM may recall parts of certain documents, like I can recall incipits of certain novels, does not mean that when you ask LLM of doing other kind of work, that is not recalling stuff, the LLM will mix such things verbatim. The LLM knows what it is doing in a variety of contexts, and uses the knowledge to produce stuff. The fact that for many people LLMs being able to do things that replace humans is bitter does not mean (and is not true) that this happens mainly using memorization. What coding agents can do today have zero explanation with memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.
While I mostly agree with you, it worth noting modern llms are trained on 10-20-30T of tokens which is quite comparable to their size (especially given how compressible they are)
The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.
The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
Perhaps 4.5 could also do it? We don’t know really until we try. I don’t trust the marketing material as much. The fact that the previous version (smaller versions) couldn’t or could do it does not really disprove that claim.
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.
I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.
Cool project, but they really could have skipped the mention of clean room. Something trained on every copyrighted thing known to mankind is the opposite of clean room
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.
This is incredible!
But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.
To the best of my knowledge, there's no Rust-based compiler that comes anywhere close to 99% on the GCC torture test suite, or able to compile Doom. So even if it saw the internals of GCC and a lot of other compilers, the ability to recreate this step-by-step in Rust is extremely impressive to me.
Agreed, but the next step is of having an AI agent actually run the business and be able to get the business context it needs as a human would. Obviously we're not quite there, but with the rapid progress on benchmarks like Vending-Bench [0], and especially with this teams approach, it doesn't seem far fetched anymore.
As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.
It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).
First 10 commits, "git log --all --pretty=format:%s --reverse | head",
Initial commit: empty repo structure
Lock: initial compiler scaffold task
Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
Lock: implement array subscript and lvalue assignments
Implement array subscript, lvalue assignments, and short-circuit evaluation
Add idea: type-aware codegen for correct sized operations
Lock: type-aware codegen for correct sized operations
Implement type-aware codegen for correct sized operations
Lock: implement global variable support
Implement global variable support across all three backends
It's weird to see the expectation that the result should be perfect.
All said and done, that its even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!
A symptom of the increasing backlash against generative AI (both in creative industries and in coding) is that any flaw in the resulting product is predicate to call it AI slop, even if it's very explicitly upfront that it's an experimental demo/proof of concept and not the NEXT BIG THING being hyped by influencers. That nuance is dead even outside of social media.
AI companies set that expectation when their CEOs ran around telling anyone who would listen that their product is a generational paradigm shift that will completely restructure both labor markets and human cognition itself. There is no nuance in their own PR, so why should they benefit from any when their product can't meet those expectations?
Because it leads to poor and nonconstructive discourse that doesn't educate anyone about the implications of the tech, which is expected on social media but has annoyingly leaked to Hacker News.
There's been more than enough drive-by comments from new accounts/green names even in this HN submission alone.
This is like a working version of the Cursor blog. The evidence - it compiling the Linux kernel - is much more impressive than a browser that didn't even compile (until manually intervened)
It certainly slightly spoils what I was planning to be a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics, now it's all this.
I'll still work on it, of course. It just won't be so surprising.
How much of this result is effectively plagiarized open source compiler code? I don't understand how this is compelling at all: obviously it can regurgitate things that are nearly identical in capability to already existing code it was explicitly trained on...
It's very telling how all these examples are all "look, we made it recreate a shitter version of a thing that already exists in the training set".
This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...
Look at what those compilers are capable of compiling and to which targets, and compare it to what this compiler can do. Those are wonderful, and I have nothing but respect for them, but they aren't going to be compiling the Linux kernel.
Being written in rust is meaningless IMHO. There is absolutely zero inherent value to something being written in rust. Sometimes it's the right tool for the job, sometimes it isn't.
It means that it's not directly copying existing C compiler code which is overwhelmingly not written in Rust. Even if your argument is that it is plagiarizing C code and doing a direct translation to Rust, that's a pretty interesting capability for it to have.
Surely you agree that directly copying existing code into a different language is still plagiarism?
I completely agree that "reweite this existing codebase into a new language" could be a very powerful tool. But the article is making much bolder claims. And the result was more limited in capability, so you can't even really claim they've achieved the rewrite skill yet.
The fact it couldn't actually stick to the 16 bit ABI so it had to cheat and call out to GCC to get the system to boot says a lot.
Without enough examples to copy from (despite CPU manuals being available in the training set) the approach failed. I wonder how well it'll do when you throw it a new/imaginary instruction set/CPU architecture; I bet it'll fail in similar ways.
"Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.
And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.
IMHO a new architecture doesn't really make it any more interesting: there's too many examples of adding new architectures in the existing codebases. Maybe if the new machine had some bizarre novel property, I suppose, but I can't come up with a good example.
If the model were retrained without any of the existing compilers/toolchains in its training set, and it could still do something like this, that would be very compelling to me.
Honestly, probably not a lot. Not that many C compilers are compatible with all of GCC's weird features, and the ones that are, I don't think are written in Rust. Hell, even clang couldn't compile the Linux kernel until ~10 years ago. This is a very impressive project.
However it was achieved, building a such a complex project like a C compiler on a 20k $ budget in full autonomy is quite impressive.
Imho some commenters focus way too much on the (many, and honestly also shared by the blog post too) cons, that they forget to be genuinely impressed by the steps forward.
> To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.
If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
If we're just writing off the billions in up front investment costs, they can just send all that my way while we're at it. No problem. Everybody happy.
How about we get the LLM's to collaborate and design a perfect programming language for LLM coding, it would be terse (less tokens) easy for pattern searches etc and very fast to build, iterate over.
I cannot decide if LLMs would be excellent at writing in pure binary (why waste all that context on superfluous variable names and function symbols) or be absolutely awful at writing pure binary (would get hopelessly lost without the huge diversification of tokens).
We would still need the language to be human readable, but it could be very dense. They could build the ultimate std lib, that goes directly to kernels, so a call like spawn is all the tokens it needs to start a co routine for example.
I'm sure this is impressive, but it's probably not the best test case given how many C compilers there are out there and how they presumably have been featured in the training data.
This is almost like asking me to invent a path finding algorithm when I've been thought Dijkstra's and A*.
It's a bit disappointing that people are still re-hashing the same "it's in the training data" old thing from 3 years ago. It's not like any LLM could 1for1 regurgitate millions of LoC from any training set... This is not how it works.
A pertinent quote from the article (which is a really nice read, I'd recommend reading it fully at least once):
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
In this case it's not reproducing training data verbatim but it probably is using algorithms and data structures that were learned from existing C compilers. On one hand it's good to reuse existing knowledge but such knowledge won't be available if you ask Claude to develop novel software.
They're very good at reiterating, that's true. The issue is that without the people outside of "most humans" there would be no code and no civilization. We'd still be sitting in trees. That is real intelligence.
They couldn't do it because they weren't fine-tuned for multi-agent workflows, which basically means they were constrained by their context window.
How many agents did they use with previous Opus? 3?
You've chosen an argument that works against you, because they actually could do that if they were trained to.
Give them the same post-training (recipes/steering) and the same datasets, and voila, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?
Because for all those projects, the effective solution is to just use the existing implementation and not launder code through an LLM. We would rather see a stab at fixing CVEs or implementing features in open source projects. Like the wifi situation in FreeBSD.
LLMs can regurgitate almost all of the Harry Potter books, among others [0]. Clearly, these models can actually regurgitate large amounts of their training data, and reconstructing any gaps would be a lot less impressive than implementing the project truly from scratch.
(I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)
> So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026
What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.
Do you not see the difference between a toy language and a clean room implementation that can compile Linux, QEMU, Postgres, and sqlite? (No, it doesn't have the assembler and linker.)
No? That was a frontend for a toy language calling using LLVM as the backend. This is a totally self-contained compiler that's capable of compiling the Linux kernel. What's the part that you think is similar?
There's a terrible bug where once it compacts then it sometimes pulls in .o or binary files and immediately fills your entire context. Then it compacts again...10m and your token budget is gone for the 5 hour period. edit: hooks that prevent it from reading binary files can't prevent this.
Can it create employment? How is this making life better.
I understand the achievement but come on, wouldn´t it be something to show if you created employment for 10000 people using your 20000 USD!
Microsoft, OpenAI, Anthropic, XAI, all solving the wrong problems, your problems not the collective ones.
I'm struggling to even parse the syntax of "WHATEVER LEADS TO REWARD COLLECTIVE HUMANS TO SURVIVE", but assuming that you're talking about resource allocation, my answer is UBI or something similar to it. We only need to "reward" for action when the resources are scarce, but when resources are plentiful, there's no particular reason not to just give them out.
I know it's "easier to imagine an end to the world than an end to capitalism", but to quote another dreamer: "Imagine all the people sharing all the world".
Obviously a human in the loop is always needed and this technology that is specifically trained to excel at all cognitive tasks that humans are capable of will lead to infinite new jobs being created. /s
> This was a clean-room implementation (Claude did not have internet access at any point during its development);
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal one man's week of work.
Except the one man might explain such arbitrary and shitty code as this:
Oh god the more i look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot...
I'm trying to recall a quote. Some war where all defeats were censored in the news, possibly Paris was losing to someone. It was something along the lines of "I can't help but notice how our great victories keep getting closer to home".
Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.
So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.
Great. Did your compiler support three different architectures (four, if you include x86 in addition to x86-64) and compile and pass the test suite for all of this software?
> Projects that compile and pass their test suites include PostgreSQL (all 237 regression tests), SQLite, QuickJS, zlib, Lua, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, libuv, Redis, libffi, musl, TCC, and DOOM — all using the fully standalone assembler and linker with no external toolchain. Over 150 additional projects have also been built successfully, including FFmpeg (all 7331 FATE checkasm tests on x86-64 and AArch64), GNU coreutils, Busybox, CPython, QEMU, and LuaJIT.
Writing a C compiler is not that difficult, I agree. Writing a C compiler that can compile a significant amount of real software across multiple architectures? That's significantly more non-trivial.
> I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot
Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.
Generating a 99% compliant C compiler is not a textbook task in any university I've ever heard of. There's a vast difference between a toy compiler and one that can actually compile Linux and Doom.
From a bit of research now, there are only three other compilers that can compile an unmodified Linux kernel: GCC, Clang/LLVM and Intel's oneAPI. I can't find any other compiler implementation that came close.
That's because you need to implement a bunch of gcc-specific behavior that linux relies on.
A 100% standards compliant c23 compiler can't compile linux.
A simple C89 compiler is a textbook task; a GCC-compatible compiler targeting multiple architectures that can pass 99% of the GCC torture test suite is absolutely not.
You could hire a reasonably skilled dev in India for a week for $1k —- or you could pay $20k in LLM tokens, spend 2 hours writing essays to explain what you want, and then get a buggy mess.
This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks):
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test-case as the author says "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly imrpessive IMO.
> This was a clean-room implementation
This is really pushing it, considering it’s trained on… internet, with all available c compilers. The work is already impressive enough, no need for such misleading statements.
The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage, it may remember certain over-represented parts, otherwise it has a knowledge about a lot of things but such knowledge, while about a huge amount of topics, is similar to the way you could remember things you know very well. And, indeed, if you give it access to internet or the source code of GCC and other compilers, it will implement such a project N times faster.
We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.
It is a research topic for heaven's sake:
https://arxiv.org/abs/2504.16046
We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.
The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"
Besides, the fact an LLM may recall parts of certain documents, like I can recall incipits of certain novels, does not mean that when you ask LLM of doing other kind of work, that is not recalling stuff, the LLM will mix such things verbatim. The LLM knows what it is doing in a variety of contexts, and uses the knowledge to produce stuff. The fact that for many people LLMs being able to do things that replace humans is bitter does not mean (and is not true) that this happens mainly using memorization. What coding agents can do today have zero explanation with memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.
While I mostly agree with you, it worth noting modern llms are trained on 10-20-30T of tokens which is quite comparable to their size (especially given how compressible they are)
The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.
The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.
Prove this statement wrong.
> Prove this statement wrong.
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
Perhaps 4.5 could also do it? We don’t know really until we try. I don’t trust the marketing material as much. The fact that the previous version (smaller versions) couldn’t or could do it does not really disprove that claim.
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
This sounds very wrong to me.
Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.
I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.
No one needs to prove you wrong. That’s just personal insecurity trying to justify ones own worth.
The AI guys sure have big brains.
Cool project, but they really could have skipped the mention of clean room. Something trained on every copyrighted thing known to mankind is the opposite of clean room
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.
This is incredible!
But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.
Interesting how the concept of a clean room implementation changes when the agent has been trained on the entire internet already
To the best of my knowledge, there's no Rust-based compiler that comes anywhere close to 99% on the GCC torture test suite, or able to compile Doom. So even if it saw the internals of GCC and a lot of other compilers, the ability to recreate this step-by-step in Rust is extremely impressive to me.
The impressiveness of converting C to Rust by any means is kind of contingent on how much unnecessary unsafe there is in the end result though.
None - all references to 'unsafe' are in comments about the codegen: https://github.com/search?q=repo%3Aanthropics%2Fclaudes-c-co...
Agreed, but the next step is of having an AI agent actually run the business and be able to get the business context it needs as a human would. Obviously we're not quite there, but with the rapid progress on benchmarks like Vending-Bench [0], and especially with this teams approach, it doesn't seem far fetched anymore.
As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.
[0] https://andonlabs.com/evals/vending-bench-2
It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).
First 10 commits, "git log --all --pretty=format:%s --reverse | head",
I would like to see the following published:
- All prompts used
- The structure of the agent team (which agents / which roles)
- Any other material that went into the process
This would be a good source for learning, even though I'm not ready to spend 20k$ just for replicating the experiment.
It's weird to see the expectation that the result should be perfect.
All said and done, that its even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!
A symptom of the increasing backlash against generative AI (both in creative industries and in coding) is that any flaw in the resulting product is predicate to call it AI slop, even if it's very explicitly upfront that it's an experimental demo/proof of concept and not the NEXT BIG THING being hyped by influencers. That nuance is dead even outside of social media.
AI companies set that expectation when their CEOs ran around telling anyone who would listen that their product is a generational paradigm shift that will completely restructure both labor markets and human cognition itself. There is no nuance in their own PR, so why should they benefit from any when their product can't meet those expectations?
Because it leads to poor and nonconstructive discourse that doesn't educate anyone about the implications of the tech, which is expected on social media but has annoyingly leaked to Hacker News.
There's been more than enough drive-by comments from new accounts/green names even in this HN submission alone.
This is like a working version of the Cursor blog. The evidence - it compiling the Linux kernel - is much more impressive than a browser that didn't even compile (until manually intervened)
It certainly slightly spoils what I was planning to be a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics, now it's all this.
I'll still work on it, of course. It just won't be so surprising.
So it copied one of the C compilers? This was always possible but now you need to pay $1000 in API costs to Anthropic
Add a 0 and double it
|Over nearly 2,000 Claude Code sessions and $20,000 in API cost
How much of this result is effectively plagiarized open source compiler code? I don't understand how this is compelling at all: obviously it can regurgitate things that are nearly identical in capability to already existing code it was explicitly trained on...
It's very telling how all these examples are all "look, we made it recreate a shitter version of a thing that already exists in the training set".
What Rust-based compiler is it plagiarising from?
There are many, here's a simple Google search:
https://github.com/jyn514/saltwater
https://github.com/ClementTsang/rustcc
https://github.com/maekawatoshiki/rucc
Did you actually look at these?
> https://github.com/jyn514/saltwater
This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...
> https://github.com/ClementTsang/rustcc
This will compile basically no real-world code. The only supported data type is "int".
> https://github.com/maekawatoshiki/rucc
This is just a frontend. It uses LLVM as the backend.
Look at what those compilers are capable of compiling and to which targets, and compare it to what this compiler can do. Those are wonderful, and I have nothing but respect for them, but they aren't going to be compiling the Linux kernel.
I just did a quick Google search only on GitHub, maybe there are better ones out there on the internet?
Language doesn't really matter, it's not how things are mapped in the latent space. It only needs to know how to do it in one language.
Being written in rust is meaningless IMHO. There is absolutely zero inherent value to something being written in rust. Sometimes it's the right tool for the job, sometimes it isn't.
It means that it's not directly copying existing C compiler code which is overwhelmingly not written in Rust. Even if your argument is that it is plagiarizing C code and doing a direct translation to Rust, that's a pretty interesting capability for it to have.
Surely you agree that directly copying existing code into a different language is still plagiarism?
I completely agree that "reweite this existing codebase into a new language" could be a very powerful tool. But the article is making much bolder claims. And the result was more limited in capability, so you can't even really claim they've achieved the rewrite skill yet.
Please don't open a bridge to the Rust flamewar from the AI flamewar :-)
Hahaha, fair enough, but I refuse to be shy about having this opinion :)
The fact it couldn't actually stick to the 16 bit ABI so it had to cheat and call out to GCC to get the system to boot says a lot.
Without enough examples to copy from (despite CPU manuals being available in the training set) the approach failed. I wonder how well it'll do when you throw it a new/imaginary instruction set/CPU architecture; I bet it'll fail in similar ways.
"Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.
And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.
IMHO a new architecture doesn't really make it any more interesting: there's too many examples of adding new architectures in the existing codebases. Maybe if the new machine had some bizarre novel property, I suppose, but I can't come up with a good example.
If the model were retrained without any of the existing compilers/toolchains in its training set, and it could still do something like this, that would be very compelling to me.
Honestly, probably not a lot. Not that many C compilers are compatible with all of GCC's weird features, and the ones that are, I don't think are written in Rust. Hell, even clang couldn't compile the Linux kernel until ~10 years ago. This is a very impressive project.
At this point, I genuinely don't know what to learn next to not become obsolete when another Opus version gets released
However it was achieved, building a such a complex project like a C compiler on a 20k $ budget in full autonomy is quite impressive.
Imho some commenters focus way too much on the (many, and honestly also shared by the blog post too) cons, that they forget to be genuinely impressed by the steps forward.
> To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.
If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
There is an entire Evaluation section that addresses that criticism (both in agreement and disagreement).
If we're just writing off the billions in up front investment costs, they can just send all that my way while we're at it. No problem. Everybody happy.
How about we get the LLM's to collaborate and design a perfect programming language for LLM coding, it would be terse (less tokens) easy for pattern searches etc and very fast to build, iterate over.
I cannot decide if LLMs would be excellent at writing in pure binary (why waste all that context on superfluous variable names and function symbols) or be absolutely awful at writing pure binary (would get hopelessly lost without the huge diversification of tokens).
Binary is wayyy less information dense than normal code, so it wouldn't work well at all.
We would still need the language to be human readable, but it could be very dense. They could build the ultimate std lib, that goes directly to kernels, so a call like spawn is all the tokens it needs to start a co routine for example.
I'm surprised by the assumption that LLMs would design such a language better than humans. I don't think that's the case.
I think it's funny how me and I assume many others tried to do the same thing and they probably saw it being a popular query or had the same idea.
It can compile the linux kernel, but does it boot?
https://youtu.be/vNeIQS9GsZ8?t=16
They posted this video, looks like they used `qemu-system-riscv64` to test.
https://github.com/anthropics/claudes-c-compiler/blob/main/B... claims to have the first line of dmesg (which is shown using dmesg obviously.)
Nothing in the post about whether the compiled kernel boots.
video does show it booting.
I'm sure this is impressive, but it's probably not the best test case given how many C compilers there are out there and how they presumably have been featured in the training data.
This is almost like asking me to invent a path finding algorithm when I've been thought Dijkstra's and A*.
It's a bit disappointing that people are still re-hashing the same "it's in the training data" old thing from 3 years ago. It's not like any LLM could 1for1 regurgitate millions of LoC from any training set... This is not how it works.
A pertinent quote from the article (which is a really nice read, I'd recommend reading it fully at least once):
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
In this case it's not reproducing training data verbatim but it probably is using algorithms and data structures that were learned from existing C compilers. On one hand it's good to reuse existing knowledge but such knowledge won't be available if you ask Claude to develop novel software.
How often do you need to invent novel algorithms or data structures? Most human written code is just rehashing existing ideas as well.
They're very good at reiterating, that's true. The issue is that without the people outside of "most humans" there would be no code and no civilization. We'd still be sitting in trees. That is real intelligence.
They couldn't do it because they weren't fine-tuned for multi-agent workflows, which basically means they were constrained by their context window.
How many agents did they use with previous Opus? 3?
You've chosen an argument that works against you, because they actually could do that if they were trained to.
Give them the same post-training (recipes/steering) and the same datasets, and voila, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?
They can literally print out entire books line by line.
Because for all those projects, the effective solution is to just use the existing implementation and not launder code through an LLM. We would rather see a stab at fixing CVEs or implementing features in open source projects. Like the wifi situation in FreeBSD.
As you wish: https://www.axios.com/2026/02/05/anthropic-claude-opus-46-so...
They are doing that too. https://red.anthropic.com/2026/zero-days/
> It's a bit disappointing that people are still re-hashing the same "it's in the training data" old thing from 3 years ago.
They only have to keep reiterating this because people are still pretending the training data doesn't contain all the information that it does.
> It's not like any LLM could 1for1 regurgitate millions of LoC from any training set... This is not how it works.
Maybe not any old LLM, but Claude gets really close.
https://arxiv.org/pdf/2601.02671v1
LLMs can regurgitate almost all of the Harry Potter books, among others [0]. Clearly, these models can actually regurgitate large amounts of their training data, and reconstructing any gaps would be a lot less impressive than implementing the project truly from scratch.
(I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)
[0] https://www.theregister.com/2026/01/09/boffins_probe_commerc...
The training data doesn't contain a Rust based C compiler that can build Linux, though.
> So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026
What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.
Do you not see the difference between a toy language and a clean room implementation that can compile Linux, QEMU, Postgres, and sqlite? (No, it doesn't have the assembler and linker.)
That's for $20,000.
people have built compilers for free, with $20000 you can even a couple of devs for a year in low income countries.
No? That was a frontend for a toy language calling using LLVM as the backend. This is a totally self-contained compiler that's capable of compiling the Linux kernel. What's the part that you think is similar?
There's a terrible bug where once it compacts then it sometimes pulls in .o or binary files and immediately fills your entire context. Then it compacts again...10m and your token budget is gone for the 5 hour period. edit: hooks that prevent it from reading binary files can't prevent this.
Please fix.. :)
Can it create employment? How is this making life better. I understand the achievement but come on, wouldn´t it be something to show if you created employment for 10000 people using your 20000 USD!
Microsoft, OpenAI, Anthropic, XAI, all solving the wrong problems, your problems not the collective ones.
"Employment" is not intrinsically valuable. It is an emergent property of one way of thinking about economic systems.
For employment I mean "WHATEVER LEADS TO REWARD COLLECTIVE HUMANS TO SURVIVE".
Call it as you wish, but I am certainly not talking about coding values.
I'm struggling to even parse the syntax of "WHATEVER LEADS TO REWARD COLLECTIVE HUMANS TO SURVIVE", but assuming that you're talking about resource allocation, my answer is UBI or something similar to it. We only need to "reward" for action when the resources are scarce, but when resources are plentiful, there's no particular reason not to just give them out.
I know it's "easier to imagine an end to the world than an end to capitalism", but to quote another dreamer: "Imagine all the people sharing all the world".
Obviously a human in the loop is always needed and this technology that is specifically trained to excel at all cognitive tasks that humans are capable of will lead to infinite new jobs being created. /s
> This was a clean-room implementation (Claude did not have internet access at any point during its development);
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal one man's week of work.
Except the one man might explain such arbitrary and shitty code as this:
https://github.com/anthropics/claudes-c-compiler/blob/main/s...
why x9? who knows?!
Oh god the more i look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot...
I'm trying to recall a quote. Some war where all defeats were censored in the news, possibly Paris was losing to someone. It was something along the lines of "I can't help but notice how our great victories keep getting closer to home".
Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.
So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.
> but I can't help but notice how the battles keep getting closer to home
This has been true for all of (known) human history. I’m gonna go ahead and make another bold prediction: tech will keep getting better.
The issue with this blog post is it’s mostly marketing.
Can one man really make a C compiler in one week that can compile linux, sqlite, etc.?
Maybe I'm underestimating the simplicity of the C language, but that doesn't sound very plausible to me.
yes, if you do not care to optimize, yes. source: done it
I would love to see the commit log on this.
Implementing just enough to conform to a language is not as difficult as it seems. Making it fast is hard.
did this before i knew how to git, back in college. target was ARMv5
Great. Did your compiler support three different architectures (four, if you include x86 in addition to x86-64) and compile and pass the test suite for all of this software?
> Projects that compile and pass their test suites include PostgreSQL (all 237 regression tests), SQLite, QuickJS, zlib, Lua, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, libuv, Redis, libffi, musl, TCC, and DOOM — all using the fully standalone assembler and linker with no external toolchain. Over 150 additional projects have also been built successfully, including FFmpeg (all 7331 FATE checkasm tests on x86-64 and AArch64), GNU coreutils, Busybox, CPython, QEMU, and LuaJIT.
Writing a C compiler is not that difficult, I agree. Writing a C compiler that can compile a significant amount of real software across multiple architectures? That's significantly more non-trivial.
Claude is only a few years old so we should compare it to a 3 year old human's C compiler
Claude contains the entire wisdom of the internet, such as it is.
> I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot
Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.
Rewrite is what I’ve been doing so far in such cases. Takes fewer hours
100.000 lines of code for something that is literally a text book task?
I guess if it only created 1.000 lines it would be easy to see where those lines came from.
> literally a text book task
Generating a 99% compliant C compiler is not a textbook task in any university I've ever heard of. There's a vast difference between a toy compiler and one that can actually compile Linux and Doom.
From a bit of research now, there are only three other compilers that can compile an unmodified Linux kernel: GCC, Clang/LLVM and Intel's oneAPI. I can't find any other compiler implementation that came close.
That's because you need to implement a bunch of gcc-specific behavior that linux relies on. A 100% standards compliant c23 compiler can't compile linux.
A simple C89 compiler is a textbook task; a GCC-compatible compiler targeting multiple architectures that can pass 99% of the GCC torture test suite is absolutely not.
This has multiple backends and a long tail of C extensions that are not in the textbook.
You could hire a reasonably skilled dev in India for a week for $1k —- or you could pay $20k in LLM tokens, spend 2 hours writing essays to explain what you want, and then get a buggy mess.
No human developer, not even Fabrice Bellard, could reproduce this specific result in a week. A subset of it, sure, but not everything this does.
just forked https://github.com/Vexu/arocc and it took me 5 seconds to complete it.