The most interesting thing about this project isn't the GPT itself - it's what it reveals about the gap between "understanding the math" and "understanding the engineering."
I've read the attention paper, watched Karpathy's own YouTube lectures, implemented toy transformers in notebooks. But reading through this C implementation I finally grokked why KV-cache matters at the systems level - because when you see the memory layout spelled out in raw mallocs instead of hidden behind PyTorch abstractions, the O(n^2) vs O(n) tradeoff for cached inference becomes visceral rather than theoretical.
This is the same reason people learn more about operating systems from xv6 than from reading the Linux source. Minimal implementations aren't just educational toys - they're the only way to build intuition about what the abstractions are actually doing. The fact that a full GPT-2 fits in ~1000 lines of C should make every ML engineer uncomfortable about how much complexity their frameworks are hiding from them.
Are you hallucinating or am I? This implementation is 200 lines of Python. Did you mean to link to a C version?
Ya, this reads exactly like how my OpenClaw bot blogs.
It's slop.
Maybe the article originally featured a 1000-line C implementation.
I found reading Linux source more useful than learning about xv6, because I run Linux and reading through the source felt immediately useful, i.e., tracing exactly how a real process I work with every day gets created.
Can you explain this O(n^2) vs O(n) significance better?
Sure - so without a KV cache, every time the model generates a new token it recomputes attention over the entire sequence from scratch. Token 1 looks at token 1. Token 2 looks at tokens 1 and 2. Token 3 looks at 1, 2, 3 - and crucially, each step redoes all the earlier lookups too, so step k costs O(k^2) and generating n tokens costs O(n^3) in attention work.
With a KV cache you store the key/value vectors from previous tokens, so when generating a token you only compute the new token's query against the cached keys. Each step is O(n) instead of O(n^2), total work drops from O(n^3) to O(n^2), and in practice the win is even bigger because you're not redoing matrix multiplies you already did.
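To make the step counts concrete, here is a minimal sketch (not code from the gist - the names and shapes are made up for illustration) that counts query-key dot products with and without a cache for a toy single-head attention:

```python
import numpy as np

# Toy single-head attention to count the work done per generated token
# with and without a KV cache. Shapes and names are illustrative only.
rng = np.random.default_rng(0)
d, n = 8, 6  # head dimension, number of tokens to generate

def attend(q, K):
    """One query against a stack of keys: O(len(K)) dot products."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return w / w.sum()

tokens = rng.normal(size=(n, d))

# Without a cache: step k re-runs every query 1..k against keys 1..k,
# so step k does k^2 query-key dot products -> O(n^3) total.
uncached_dots = sum(k * k for k in range(1, n + 1))

# With a cache: step k computes only the new query against the k stored
# keys, so step k does k dot products -> O(n^2) total.
K_cache = np.empty((0, d))
cached_dots = 0
for k in range(n):
    K_cache = np.vstack([K_cache, tokens[k]])  # append new key to the cache
    attend(tokens[k], K_cache)                 # new query vs. cached keys
    cached_dots += K_cache.shape[0]

print(uncached_dots, cached_dots)  # 91 vs 21 for n=6
```

Even at n=6 the cached version does under a quarter of the dot products, and the gap widens quadratically with sequence length.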
The thing that clicked for me reading the C code was seeing exactly where those cached vectors get stored in memory and how the pointer arithmetic works. In PyTorch it's just `past_key_values` getting passed around and you never think about it. In C you see the actual buffer layout and it becomes obvious why GPU memory is the bottleneck for long sequences.
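The memory-bottleneck point is easy to sanity-check with back-of-the-envelope arithmetic. This sketch uses the published GPT-2 small config (12 layers, d_model = 768, 1024-token context), not values read out of the gist:

```python
# Back-of-the-envelope KV-cache size for GPT-2 small (124M) in fp32.
n_layer, d_model, bytes_per_float = 12, 768, 4
seq_len = 1024  # GPT-2's maximum context length

# Each layer caches one key and one value vector (d_model floats each)
# per token, so the cache grows linearly with sequence length.
bytes_per_token = n_layer * 2 * d_model * bytes_per_float
cache_bytes = bytes_per_token * seq_len

print(bytes_per_token)      # 73728 bytes (~72 KiB) per token
print(cache_bytes / 2**20)  # 72.0 MiB at full context
```

72 MiB is nothing for a 124M model, but the same linear-per-token growth multiplied by a GPT-3-scale layer count and width, a 100k context, and a serving batch is exactly why long sequences eat GPU memory.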
agreed - no one else is saying this.
C++ version - https://github.com/Charbel199/microgpt.cpp?tab=readme-ov-fil...
Rust version - https://github.com/mplekh/rust-microgpt
The real question this raises for me: if GPT-2 fits in ~1000 lines of C, what exactly are the other 999,000 lines in production frameworks doing? I know the answer is "training infrastructure, distributed computing, mixed precision, etc" but this kind of minimal implementation makes you wonder how much of the ML stack is essential complexity vs accumulated complexity. Karpathy has a talent for making you feel slightly embarrassed about your own abstractions.
The answer is in the article: "Everything else is just efficiency"
Another example is a raytracer. You can write one in less than 100 lines of code; it's popular in sizecoding because it's visually impressive. So why are commercial 3D engines so complex?
The thing is, if you ask your toy raytracer to do more than a couple of shiny spheres, or some other mathematically convenient scene, it starts to break down. Real 3D engines used by the game and film industries have all sorts of optimizations so that they can render in reasonable time, look good, and work in a way that fits the artist's workflow. That is where the millions of lines come from.
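For the flavor of what those sub-100-line raytracers look like, here is a deliberately tiny sketch: one hard-coded sphere, one light, Lambertian shading, orthographic camera. The whole scene is made up for illustration:

```python
import math

# A toy raytracer: one sphere, one light, Lambertian shading.
W, H = 32, 32
center, radius = (0.0, 0.0, 3.0), 1.0
light = (0.577, 0.577, -0.577)  # unit vector pointing toward the light

def shade(px, py):
    # Orthographic ray from pixel (px, py), traveling straight down +z.
    ox = (px + 0.5) / W * 2 - 1
    oy = (py + 0.5) / H * 2 - 1
    # Ray-sphere intersection: solve |o + t*d - c|^2 = r^2 with d = (0,0,1).
    cx, cy, cz = center
    dx, dy = ox - cx, oy - cy
    disc = radius * radius - (dx * dx + dy * dy)
    if disc < 0:
        return 0.0  # miss: black background
    t = cz - math.sqrt(disc)  # nearest hit along +z
    # Surface normal at the hit point, then Lambert's cosine law.
    nx, ny, nz = dx / radius, dy / radius, (t - cz) / radius
    lam = nx * light[0] + ny * light[1] + nz * light[2]
    return max(0.0, lam)

# Brightness in [0, 1] per pixel; write it out as a PGM/PNG to view it.
image = [[shade(x, y) for x in range(W)] for y in range(H)]
```

And this is exactly where the breakdown starts: add a second sphere and you need a loop over objects; add shadows and you need secondary rays; add triangles, textures, and motion blur and you need acceleration structures - each feature multiplies the code, which is the commercial-engine story in miniature.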
Where is this 1000 lines of C coming from? This is python.
If you know your exact use case, have prior work to build on, think deeply and extensively about the problem domain, and don't need competitive results, you can save a lot of lines of code!
This could make an interesting language shootout benchmark.
If anyone knows of a way to use this code on a consumer-grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.
Beautiful work
This is like those websites that implement an entire retro console in the browser.
Which license is being used for this?
MIT (https://gist.github.com/karpathy/8627fe009c40f57531cb1836010...)
Thank you
What is the prime use case
It's a great learning tool, and it shows this can be done concisely.
Looks like it's to learn how a GPT operates, with a real example.
Yeah, everyone learns differently, but for me this is a perfect way to better understand how GPTs work.
Karpathy telling you that things you thought were hard in fact fit on a screen.
To confuse people who only think in terms of use cases.
Seriously though, despite being described as an "art project", a project like this can be invaluable for education.
A case study for whenever a new edition of Programming Pearls is released.
“Art project”
If writing is art, then I’ve been amazed at the source code written by this legend
Karpathy with another gem!