It's a nitpick, but backpropagation is getting a bad rap here. These examples are about gradients + gradient descent variants being a leaky abstraction for optimization [1].
Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation-specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backward pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)
[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with using tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization (plus modeling) is the actually hard part, not the way gradients are calculated.
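To make the sigmoid point concrete, here is a minimal NumPy sketch (my own illustration, not from the post): a hand-written backward pass through a stack of sigmoids, where each chain-rule factor is at most 0.25, so the gradient with respect to the input shrinks exponentially with depth.

```python
# Hand-written backward pass through a stack of sigmoids (illustrative).
# Each chain-rule factor s * (1 - s) is at most 0.25, so the gradient
# with respect to the input shrinks exponentially with depth.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 10
acts = [0.5]                       # input value
for _ in range(depth):             # forward pass: remember every activation
    acts.append(sigmoid(acts[-1]))

grad = 1.0                         # backward pass: chain rule by hand
for s in reversed(acts[1:]):
    grad *= s * (1.0 - s)          # local derivative of each sigmoid
    print(f"{grad:.3e}")           # shrinks by roughly 4x or more per layer
```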
I get your point, but I don't think your nit-pick is useful in this case.
The point is that you can't abstract away the details of backpropagation (which involve computing gradients) under some circumstances. For example, when we are using gradient descent. Maybe in other circumstances (global optimization algorithm) it wouldn't be an issue, but the leaky abstraction idea isn't that the abstraction is always an issue.
(Right now, back propagation is virtually the only way to calculate gradients in deep learning)
Yes. No need to be apologetic or timid about it — it’s not a nit to push back against a flawed conceptual framing.
I respect Karpathy’s contributions to the field, but often I find his writing and speaking to be more than imprecise — it is sloppy in the sense that it overreaches and butchers key distinctions. This may sound harsh, but at his level, one is held to a higher standard.
> often I find his writing and speaking to be more than imprecise
I think that's more because he's trying to write to an audience who isn't hardcore deep into ML already, so he simplifies a lot, sometimes to the detriment of accuracy.
At this point I see him more as a "ML educator" than "ML practitioner" or "ML researcher", and as far as I know, he's moving in that direction on purpose, and I have no qualms with it overall, he seems good at educating.
But I think shifting your mindset about the purpose of his writing may help explain why it sometimes feels imprecise.
Whoever chose this topic title perhaps did him a disservice in suggesting he said the problem was backprop itself, since in his blog post he immediately clarifies what he meant by it. It's a nice pithy way of stating the issue though.
Nah, Karpathy's title is "Yes you should understand backprop", and his first highlight is "The problem with Backpropagation is that it is a leaky abstraction." This is his choice as a communicator, not the poster to HN.
And his _examples_ are about gradients, but nowhere does he distinguish between backpropagation, which is (part of) an algorithm for automatic differentiation, and the gradients themselves. None of the issues are due to BP returning incorrect gradients (it totally could, for example, lose too much precision, but it doesn't).
Yeah - he chose it as a pithy/catchy description of the issue, then immediately clarified what he meant by it.
> In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.
Then follows this with multiple clear examples of exactly what he is talking about.
The target audience was people building and training neural networks (such as his CS231n students), so I think it's safe to assume they knew what backprop and gradients are, especially since he made them code gradients by hand, which is what they were complaining about!
But Karpathy is completely right; students who understand and internalize how backprop works, having implemented it rather than treating it as a magic spell cast by TF/PyTorch, will also be able to intuitively understand these problems of vanishing gradients and so on.
Sure, instead of "the problem with backpropagation is that it's a leaky abstraction" he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading for an introductory-level article for an undergraduate audience, and also unnecessary because he already said that in the introduction.
Karpathy's contribution to teaching around deep learning is just immense. He's got a mountain of fantastic material from short articles like this, longer writing like https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (on recurrent neural networks) and all of the stuff on YouTube.
Plus his GitHub. The recently released nanochat https://github.com/karpathy/nanochat is fantastic. Having minimal, understandable and complete examples like that is invaluable for anyone who really wants to understand this stuff.
I was slightly surprised that my colleagues, who are extremely invested in capabilities of LLMs, didn’t show any interest in Karpathy’s communication on the subject when I recommended it to them.
Later I understood that they don’t need to understand LLMs, and they don’t care how they work. Rather they need to believe and buy into them.
They’re more interested in science fiction discussions — how would we organize a society where all work is done by intelligent machines — than what kinds of tasks are LLMs good at today and why.
What's wrong or odd about that? You can like a technology as a user and not want to delve into how it works (sentence written by a human despite use of "delve"). Everyone should have some notions on what LLMs can or cannot do, in order to use them successfully and not be misguided by their limitations, but we don't need everyone to understand what backpropagation is, just as most of us use cars without knowing much about how an internal combustion engine works.
And the issue you mention in the last paragraph is very relevant, since the scenario is plausible, so it is something we definitely should be discussing.
> What's wrong or odd about that? You can like a technology as a user and not want to delve into how it works
The question here is whether the details are important for the major issues, or whether they can be abstracted away with a vague understanding. To what extent abstracting away is okay depends greatly on the individual case. Abstractions can work over a large area or for a long time, but then suddenly collapse and fail.
The calculator, which has always delivered sufficiently accurate results, can produce nonsense when one approaches the limits of its numerical representation or combines numbers with very different levels of precision. This can be seen, for example, when one rearranges commutative operations; due to rounding problems, it suddenly delivers completely different results.
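A quick illustration of the calculator point (a tiny Python snippet of my own, not from the comment): floating-point addition is not associative, so rearranging operations that are mathematically interchangeable can change the result.

```python
# Floating-point addition is not associative: the small number is swallowed
# when it is added to the big one first.
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)   # 0.0 -- the 0.1 is lost against 1e16
print(a + (b + c))   # 0.1 -- mathematically the same sum, different answer
```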
The 2008 financial crisis was based, among other things, on models that treated certain market risks as independent of one another. Risk could then be spread by splitting and recombining portfolios. However, this only worked as long as the interdependence of the different portfolios was actually quite small. An entire industry, with the exception of a few astute individuals, had abstracted away this interdependence, acted on this basis, and ultimately failed.
As individuals, however, we are completely dependent on these abstractions. Our entire lives are permeated by things whose functioning we simply have to rely on without truly understanding them. Ultimately, it is the nature of modern, specialized societies that this process continues and becomes even more differentiated.
But somewhere there should be people who work at the limits of detailed abstractions and are concerned with researching and evaluating the real complexity hidden behind them, and thus correcting the abstraction if necessary, sending this new knowledge upstream.
The role of an expert is to operate with less abstraction and more detail in her or his field of expertise than a non-expert -- and the more so, the better an expert she or he is.
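Because if you don't understand how a tool works, you can't use it to its full potential.
Imagine if you were using single-layer perceptrons without understanding separability and going "just a few more tweaks and it will approximate XOR!"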
I disagree in the case of LLMs, because they really are an accidental side effect of another tool. Not understanding the inner workings will make users attribute false properties to them. Once you understand how they work (how they generate plausible text), you get a far deeper grasp on their capabilities and how to tweak and prompt them.
And in fact this is true of any tool, you don’t have to know exactly how to build them but any craftsman has a good understanding how the tool works internally. LLMs are not a screw or a pen, they are more akin to an engine, you have to know their subtleties if you build a car. And even screws have to be understood structurally in advanced usage. Not understanding the tool is maybe true only for hobbyists.
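You hit the nail on the head, in my opinion.
There are things that you just can't expect from current LLMs that people routinely expect from them.
They start out projects with those expectations. And that's fine. But they don't always learn from the outcomes of those projects.
I don't think that's a good analogy, because if you're trying to train a single-layer perceptron to approximate XOR you're not the end user.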
None of this is about an end user in the sense of the user of an LLM. This is aimed at the prospective user of a training framework which implements backpropagation at a high level of abstraction. As such it draws attention to training problems which arise inside the black box in order to motivate learning what is inside that box. There aren't any ML engineers who shouldn't know all about single layer perceptrons I think, and that makes for a nice analogy to real life issues in using SGD and backpropagation for ML training.
The analogy is: if you don't understand the limitations of the tool, you may try to make it do something it is bad at and never understand why it will never do the thing you want, despite looking like it potentially could.
I think there are a lot of people who just don't care about stuff like nanochat because it's exclusively pedagogical, and a lot of people want to learn by building something cool, not taking a ride on a kiddie bike with training wheels.
That's fine as far as it goes, but there is a middle ground ...
Feynman was right that "If you can't build it, you don't understand it", but of course not everyone needs or wants to fully understand how an LLM works. However, regarding an LLM as a magic black box seems a bit extreme if you are a technologist and hope to understand where the technology is heading.
I guess we are in an era of vibe-coded disposable "fast tech" (cf fast fashion), so maybe it only matters what can it do today, if playing with or applying towards this end it is all you care about, but this seems a rather blinkered view.
If everyone had to understand how carburettors, engines and brake systems work to be able to drive a car - rather than just learn to drive and get from A to B - I'm guessing there would be a lot fewer cars on the road.
(Thinking about it, would that necessarily be a bad thing...)
Not everybody who drives a car (even as a professional driver) knows how to make one.
If you live in a world of horse carriages, you can be thinking about what the world of cars is going to be like, even if you don't fully understand what fuel mix is the most efficient or what material one should use for a piston in a four-stroke.
I'm personally very interested in how LLMs work under the hood, but I don't think everyone who uses them as tools needs that. I don't know the wiring inside my drill, but I know how to put a hole in my wall and not my hand regardless.
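Do you go deep into molecular biology to see how it works? It is much more interesting and important.
But the question is whether you have a better understanding of LLMs from a user's perspective, or they do.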
Which is terrible. That's the root of all the BS around LLMs. People lacking understanding of what they are and ascribing capabilities which LLMs just don't have, by design. Even HN discussions are full of that. Even though this page literally has "hacker" in its name.
I see your point, but on the other hand a lot of conversations go - A: what will we do when AI does all the jobs? B: that's silly, LLMs can't do the jobs. The thing is, A didn't say LLM, they said AI, as in whatever that will be a short while into the future. Which is changing rapidly because thousands of bright people are being paid to change it.
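> a short while into the future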
And what gives you that confidence? A few AI nerds already claimed that in the 80s.
We're currently exploring what LLMs can do. There is no indication that any further fundamental breakthrough is around the corner. Everybody is currently squeezing the same stone.
The trouble is that "AI" is also very much a leaky abstraction, which makes it tempting to see all the "AI" advances of recent years, then correctly predict that these "AI" advances will continue, but then jump to all sorts of wrong conclusions about what those advances will be.
For example, things like "AI" image and video generation are amazing, as are things like AlphaGo and AlphaFold, but none of these have anything to do with LLMs, and the only technology they share with LLMs is machine learning and neural nets. If you lump these together with LLMs, calling them all "AI", then you'll come to the wrong conclusion that all of these non-LLM advances indicate that "AI" is rapidly advancing and therefore LLMs (also being "AI") will do too ...
Even if you leave aside things like AlphaGo, and just focus on LLMs, and other future technology that may take all our jobs, then using terms like "AI" and "AGI" are still confusing and misleading. It's easy to fall into the mindset that "AGI" is just better "AI", and that since LLMs are "AI", AGI is just better LLMs, and is around the corner because "AI" is advancing rapidly ...
In reality LLMs are, like AlphaFold, something highly specific - they are auto-regressive next-word predictor language models (just as a statement of fact, and how they are trained, not a put-down), based on the Transformer architecture.
The technology that could replace humans for most jobs in the future isn't going to be a better language model - a better auto-regressive next-word predictor - but will need to be something much more brain-like. The architecture itself doesn't have to be brain-like, but in order to deliver brain-like functionality it will probably need to include another half-dozen "Transformer-level" architectural/algorithmic breakthroughs, including things like continual learning, which will likely turn the whole current LLM training and deployment paradigm on its head.
Again, just focusing on LLMs and LLM-based agents: if you regard them as a black-box technology, it's easy to be misled into thinking that capabilities are broadly advancing and will lift all boats, when in reality progress is much more narrow. Headlines about LLMs' achievements in math and competitive programming, touted as evidence of reasoning, do NOT imply that LLM reasoning is broadly advancing; you need to get under the hood and understand RL training goals to realize why that is not necessarily the case. The correctness of most business and real-world reasoning is not as easy to check as marking a math problem correct or not, yet that checkability is what RL training depends on.
I could go on .. LLM-based agents are also blurring the lines of what "AI" can do, and again if treated as a black box will also misinform as to what is actually progressing and what is not. Thousands of bright people are indeed working on improving LLM-adjacent low-hanging fruit like this, but it'd be illogical to conclude that this is somehow helping to create next-generation brain-like architectures that will take away our jobs.
I'll give you that algorithmic breakthroughs have been quite slow to come about - I think backpropagation in 1986 and then transformers in 2017. Still, the fact that LLMs can do well in things like the maths olympiad has me thinking there must be some way to tweak this to be more brain-like. I recently read how LLMs work and was surprised how text-focused they are, making word vectors and not physical understanding.
> Still the fact that LLMs can do well in things like the maths olympiad have me thinking there must be some way to tweak this to be more brain like
That's because you, as you admit in the next sentence, have almost no understanding of how they work.
Your reasoning is on the same level as someone in the 1950s thinking ubiquitous flying cars are just a few years away. Or fusion power, for that matter.
In your defense, that seems to be about the average level of engagement with this technology, even on this website.
Yes, it's a bit shocking to realize that all LLMs are doing is predicting next word (token) from samples in the training data, but the Transformer is powerful enough to do a fantastic job of prediction (which you can think of as selecting which training sample(s) to copy from), which is why the LLM - just a dumb function - appears as smart as the human training data it is copying.
The Math Olympiad results are impressive, but at the end of the day it's just this same next-word prediction, in this case fine-tuned by additional LLM training on solutions to math problems, teaching the LLM which next-word predictions (i.e. output) will add up to solution steps that lead to correct problem solutions in the training data. Due to the logical nature of math, the reasoning/solution steps that worked for training-data problems will often work for new problems it is then tested on (Math Olympiad), but most reasoning outside of logical domains like math and programming isn't so clear cut, so this approach of training on reasoning examples isn't necessarily going to help LLMs get better at reasoning on more useful real-world problems.
And to all the LLM heads here, this is his work process:
> Yesterday I was browsing for a Deep Q Learning implementation in TensorFlow (to see how others deal with computing the numpy equivalent of Q[:, a], where a is an integer vector — turns out this trivial operation is not supported in TF). Anyway, I searched “dqn tensorflow”, clicked the first link, and found the core code. Here is an excerpt:
Notice how it's "browse" and "search", not just "I asked ChatGPT". Notice how it made him notice a bug.
I doubt he's letting LLMs creep into his decision-making in 2025, aside from fun side projects (vibes). We don't ever come across Karpathy going to an LLM, or expressing that an LLM helped, in any of his YouTube videos about building LLMs.
He's just test driving LLMs, nothing more.
Nobody's asking this core question in podcasts. "How much and how exactly are you using LLMs in your daily flow?"
I'm guessing it's like actors not wanting to watch their own movies.
You're free to believe whatever fantasy you wish, but as someone who frequently consults an LLM alongside other resources when thinking about complex and abstract problems, there is no way in hell that Karpathy intentionally limits his options by excluding LLMs when seeking knowledge or understanding.
If he did not believe in the capability of these models, he would be doing something else with his time.
One can believe in the capability of a technology but on principle refuse to use implementations of it built on ethically flawed approaches (e.g., violating GPL licenses and/or copyright, thus harming the open source ecosystem).
What you see as copyright violation, I see as liberation. I have open models running locally on my machine that would have felled kingdoms in the past.
> Continuing the journey of optimal LLM-assisted coding experience. In particular, I find that instead of narrowing in on a perfect one thing my usage is increasingly diversifying
> I think congrats again to OpenAI for cooking with GPT-5 Pro. This is the third time I've struggled on something complex/gnarly for an hour on and off with CC, then 5 Pro goes off for 10 minutes and comes back with code that works out of the box. I had CC read the 5 Pro version and it wrote up 2 paragraphs admiring it (very wholesome). If you're not giving it your hardest problems you're probably missing out.
I took a course in my Master's (URV.cat) where we had to do exactly this, implementing backpropagation (fwd and backward passes) from a paper explaining it, using just basic math operations in a language of our choice.
I told everyone this was the best single exercise of the whole year for me. It aligns with the kind of activity that I benefit from immensely but won't do by myself, so this push was just perfect.
If you are teaching, please consider this kind of assignment.
P.S. Just checked now and it's still in the syllabus :)
The difference in understanding (for me and how my brain works) between reading the paper in what appears to be a future or past alien language & doing a minimal paper / code example is massive.
I had a whole course just about how computers do maths. Matrix multiplication, linear fit, finding eigenvectors, multiplication and division, square root, solving linear systems, numerically calculating differential equations, spline interpolation, FEM analysis.
"Computers are good at maths" is normally a pretty obvious statement... but many things we take for granted from analytical mathematics, is quite difficult to actually implement in a computer. So there is a mountain of clever algorithms hiding behind some of the seemingly most obvious library operations.
It seems to me that in 2016 people did (have to) play a lot more tricks with backpropagation than today. Back then it was common to meddle with the gradients in the middle of the backward pass.
For example, Alex Graves's (great! with attention) 2013 paper "Generating Sequences with Recurrent Neural Networks" has this line:
One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
with this footnote:
In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
That said, backpropagation seems important enough to me that I once did a specialized videocourse just about PyTorch (1.x) autograd.
> It seems to me that in 2016 people did (have to) play a lot more tricks with the backpropagation than today
Perhaps, but maybe because there was more experimentation with different neural net architectures and nodes/layers back then?
Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.
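For instance, here is a minimal PyTorch sketch (my own illustration, not from the thread) of framework-level clipping: a tiny model and a synthetic batch, with the gradient norm capped between backward() and the optimizer step.

```python
# Minimal sketch of framework-supported gradient clipping (illustrative only).
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 10)   # synthetic batch
y = torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()           # backprop fills .grad on each parameter
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
optimizer.step()
```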
The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.
I have a naive question about backprop and optimizers.
I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.
But with more advanced optimizers the gradient is not really used directly. It gets per weight normalization, fudged with momentum, clipped, etc.
So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
> how important is computing the exact gradient using calculus
Normally the gradient is computed with a small "minibatch" of examples, meaning that on average over many steps the true gradient is followed, but each step individually never moves exacty along the true gradient. This noisy walk is actually quite beneficial for the final performance of the network https://arxiv.org/abs/2006.15081 , https://arxiv.org/abs/1609.04836 so much so that people started wondering what is the best way to "corrupt" this approximate gradient even more to improve performance https://arxiv.org/abs/2202.02831 (and many other works relating to SGD noise)
> vs just knowing the general direction to step
I can't find relevant papers now, but I seem to recall that the Hessian eigenvalues of the loss function decay rather quickly, which means that taking a step in most directions will not change the loss very much. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have been trying to visualize the loss and trajectory taken during optimization https://arxiv.org/pdf/1712.09913 , https://losslandscape.com/
The first bit is why it is called stochastic gradient descent. You follow the gradient of a randomly chosen minibatch at each step. It basically makes you "vibrate" down along the gradient.
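As a rough sketch of that "vibration" (my own toy example, least squares in NumPy): the minibatch gradient is a noisy estimate of the full-dataset gradient, but it still correlates with the true downhill direction.

```python
# Minibatch gradient vs. full-dataset gradient for least squares (illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)  # current parameters

def grad(Xb, yb, w):
    # gradient of 0.5 * mean((Xb @ w - yb)^2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                           # exact gradient over all data
idx = rng.choice(len(y), size=32, replace=False)
mini = grad(X[idx], y[idx], w)                 # noisy minibatch estimate

print(np.linalg.norm(mini - full) / np.linalg.norm(full))  # nonzero relative noise
print(mini @ full > 0)                         # almost always still points downhill
```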
> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Yes, absolutely -- a lot of ideas inspired by this have been explored in the field of optimization, and also in machine learning. The very idea of "stochastic" gradient descent using mini-batches is basically a cheap (hardware-compatible) approximation to the gradient for each step.
Ben Recht has an interesting survey of how various learning algorithms used in reinforcement learning relate with techniques in optimization (and how they each play with the gradient in different ways): https://people.eecs.berkeley.edu/~brecht/l2c-icml2018/ (there's nothing special about RL... as far as optimization is concerned, the concepts work the same even when all the data is given up front rather than generated on-the-fly based on interactions with the environment)
First of all, gradient computation with back-prop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge-cases that are not relevant here) so it's not about the way of computing the gradient.
What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function. And that in this design what matters for learning is not so much the value of the loss function, but its slopes and curvature (peaks and valleys).
The problematic case being flat valleys, surrounded by straight cliffs, (picture the grand canyon).
Advanced optimizers in deep learning, like Adam, are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient the optimizer has an estimate of the scale sensitivity of each parameter independently. So the cheap thing it can reasonably do is modulate the gradient with this scale.
The length of the gradient vector is often problematic; what optimizers would classically do is something called "line search", which means determining the optimal step size along this direction. But the cost of doing that is usually between 10 and 100 evaluations of the cost function, which is often not worth the effort in the noisy stochastic context, compared to just taking a smaller step multiple times.
Higher-order optimizers require that the loss function be twice differentiable, so non-linearities like ReLU, which are cheap to calculate, can't be used.
Lower-order global optimizers don't even require the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).
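As a concrete (and deliberately simplified) sketch of that per-parameter modulation, here is an Adam-style update with the bias correction left out; the hyperparameters are the usual defaults, but the code is only illustrative.

```python
# Simplified Adam-style update: rescale each coordinate of the gradient by a
# running estimate of its own magnitude (bias correction omitted for brevity).
import numpy as np

def adam_like_step(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # running mean of gradients (momentum)
    v = b2 * v + (1 - b2) * g * g        # running mean of squared gradients (per-parameter scale)
    w = w - lr * m / (np.sqrt(v) + eps)  # each parameter gets its own effective step size
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
g = np.array([1e-4, 1.0, 10.0])          # wildly different gradient scales per parameter
w, m, v = adam_like_step(w, g, m, v)
print(w)                                 # resulting steps are on comparable scales
```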
That's an interesting idea, it sounds similar to the principles behind low precision models like BitNet (where each weight is +-1 or 0).
That said, I know Deepseek use fp32 for their gradient updates even though they use fp8 for inference. And a recent paper shows that RL+LLM training is shakier at bf16 than fp16, which would both imply that numerical precision in gradients still matters.
> But with more advanced optimizers the gradient is not really used directly. It gets per weight normalization, fudged with momentum, clipped, etc.
Why would these things be "fudging"? Vanishing gradients (see the initial batch norm paper) are a real thing, and ensuring that the relative magnitudes are in some sense "smooth" between layers allows for an easier optimization problem.
> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Very. In high dimensional space, small steps can move you extremely far from a proper solution. See adversarial examples.
You don't need exact gradients, since gradient descent is self-correcting (which can make it hard to find gradient calculation bugs!). One approach using inexact gradients is to use predicted "synthetic gradients" which avoids needing to wait for backward pass for weight updates.
It is possible to compute the approximate gradient (direction to step) without using the formulas: we can change the value of each parameter individually, compute the loss, set the values of all parameters in such a way that the loss is minimized, and then repeat. This means, however, that we have to do number-of-parameters forward passes for one optimization step, which is very expensive. With formulas, we can compute all these values in one backward pass.
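A small NumPy sketch of that comparison (my own example, using least squares so the analytic gradient is easy to write down): central finite differences need two forward passes per parameter, while the analytic gradient - what backprop gives you for a neural net - comes in one go.

```python
# Finite-difference gradient (many forward passes) vs. analytic gradient (one pass).
import numpy as np

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.normal(size=100)
w = rng.normal(size=50)

eps = 1e-6
numeric = np.array([
    (loss(w + eps * np.eye(50)[i], X, y) - loss(w - eps * np.eye(50)[i], X, y)) / (2 * eps)
    for i in range(50)                      # two forward passes per parameter
])
analytic = X.T @ (X @ w - y) / len(y)       # the whole gradient in one shot

print(np.max(np.abs(numeric - analytic)))   # tiny discrepancy, at ~100x the cost
```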
All first order methods use the gradient or Jacobian of a function. Calculating the first order derivatives is really cheap.
Non-stochastic gradient descent has to optimize over the full dataset. This doesn't matter for non-machine learning applications, because often there is no such thing as a dataset in the first place and the objective has a small fixed size. The gradient here is exact.
With stochastic gradient descent you're turning gradient descent into an online algorithm, where you process a finite subset of the dataset at a time. Obviously the gradient is no longer exact, you still have to calculate it though.
Seems like "exactness" is not that useful of a property for optimization. Also, I can't stress it enough, but calculating first order derivatives is so cheap there is no need to bother. It's roughly 2x the cost of evaluating the function in the first place.
It's second order derivatives that you want to approximate using first order derivatives. That's how BFGS and Gauss-Newton work.
Agree, better title for this post; the fact that back prop is a leaky abstraction is a reason one should understand it and know how to do the mechanics by hand to truly experience it and develop understanding and intuition. Software / code abstracting away even more of the process leaves it open to magical thinking. I had to do hand calculations and convolutions in my Deep Learning graduate school course.
> “Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”
worries me because it is structured with the same reasoning as "why do we have to demonstrate we understand addition if in the real world we have calculators?"
The counter-argument would be that you can make excellent arguments for why we should understand what the compiler is doing, understand what the transistors are doing (e.g. to understand the limitations and risks of overclocking), understand how sort algorithms work (doing sort algorithms manually was at one time a typical interview question). It's not that these things aren't useful, but whether or not that learning would be useful is not the right question.
Is this _more_ useful than the other things, which I could be learning but won't because I spent the time and effort to learn this instead?
We have a finite capacity for learning, if for no other reason than that we have a finite amount of time in this life and infinite topics to learn (and there are plenty of other constraints besides time). The reason given for learning this topic is that it has hidden failure modes which you will not be on the lookout for if you don't know how it works "under the hood".
Is this a good enough reason to spend time learning this rather than, say, how to model the physics of the system you're training the neural network to deal with? Tough question; maybe, maybe not. If you have time to learn both, do that, but if not, then you will have to choose which is most important. And in our education system, we do things like teach calculus but not intermediate statistics, and it would have been better to do the opposite for something like 90% of the people taking calculus.
That said, I've implemented backpropagation multiple times, it's a good way to evaluate a new language (just complex enough to reveal problems, not so complex that it takes forever).
I have to be contrarian here. The students were right. You didn't need to learn to implement backprop in NumPy. Any leakiness in BackProp is addressed by researchers who introduce new optimizers. As a developer, you just pick the best one and find good hparams for it.
The problem isn't with backprop itself or the optimizer - it's potentially in (the derivatives of) the functions you are building the neural net out of, such as the Sigmoid and ReLU examples that Karpathy gave.
Just because the framework you are using provides things like ReLU doesn't mean you can assume someone else has done all the work and you can just use these and expect them to work all the time. When things go wrong training a neural net you need to know where to look, and what to look for - things like exploding and vanishing gradients.
> Any leakiness in BackProp is addressed by researchers who introduce new optimizers
> As a developer, you just pick the best one and find good hparams for it
It would be more correct to say: "As a developer, (not researcher), whose main goal is to get a good model working — just pick a proven architecture, hyperparameters, and training loop for it."
Because just picking the best optimizer isn't enough. Some of the issues in the article come from the model design, e.g. sigmoids, relu, RNNs. And some of the issues need to be addressed in the training loop, e.g. gradient clipping isn't enabled by default in most DL frameworks.
And it should be noted that the article is addressing people on the academic / research side, who would benefit from a deeper understanding.
It's for a CS course at Stanford not a PyTorch boot camp. It seems reasonable to expect some level of academic rigour and need to learn and demonstrate understanding of the fundamentals. If researchers aren't learning the fundamentals in courses like these where are they learning them?
You've also missed the point of the article, if you're building novel model architectures you can't magic away the leakiness. You need to understand the back prop behaviours of the building blocks you use to achieve a good training run. Ignore these and what could be a good model architecture with some tweaks will either entirely fail to train or produce disappointing results.
Perhaps you're working at a level of bolting pre built models together or training existing architectures on new datasets but this course operates below that level to teach you how things actually work.
Following the same principles that he outlines in this post, the "- 0.5" part is unnecessary since the gradient of 0.5 is 0, therefore -0.5 doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x²+1)
If we don't subtract from the second branch, there will be a discontinuity around x = 1, so the derivative will not be well-defined. Also the value of the loss will jump at this value, which will make it hard to inspect the errors, for one thing.
I did not say there will be a discontinuity in the gradient; I said that the modified loss function will not have a mathematically well-defined derivative because of the discontinuity in the function.
You do that to make things smoother when plotted. You could in theory add some crazy stairstep that adds a hundred to the middle part. It would make your loss curves spike and increase towards convergence, but then those spikes are just visual artifacts from doing weird discontinuous nonsense with your loss.
They are negligible, especially given that the post was written when ops were not fused. The extra memory you need to store the extra tensors when you use the original version is more expensive.
More generally, it's often worth learning and understanding things one step deeper. Having a more fundamental understanding of things explains more of the "why" behind why some things are the way they are, or why we do some things a certain way. There's probably a cutoff point for balancing how much you actually need to know though. You could potentially take things a step further by writing the backwards pass without using matrix multiplication, or spend some time understanding what the numerical value of a gradient means.
Karpathy's work on large datasets for deep neural flow is conceiving of the "backward pass" as the preparation for initializing the mechanics for weight ranges, either as derivatives in -10/+10 statistic deviations.
Karpathy is butchering the metaphor. There is no abstraction here. Backprop is an algorithm. Automatic differentiation is a technique. Neither promises to hide anything.
I agree that understanding them is useful, but they are not abstractions much less leaky abstractions.
Given that we're now in the year 2025 and AI has become ubiquitous, I'd be curious to estimate what percentage of developers now actually understand backprop.
It's a bit snarky of me, but whenever I see some web developer or product person with a strong opinion about AI and its future, I like to ask "but can you at least tell me how gradient descent works?"
I'd like to see a future where more developers have a basic understanding of ML even if they never go on to do much of it. I think we would all benefit from being a bit more ML-literate.
I'm wondering: how can understanding gradient descent help in building AI systems on top of LLMs? To me it feels like the skills of building "AI" are almost orthogonal to the skills of building on top of "AI".
I take your point in that they are mostly orthogonal in practice, but with that being said, I think understanding how these AI's were created is still helpful.
For example, I believe that if we were to ask the average developer why LLMs behave randomly, they would not be able to answer. This to me exposes a fundamental hole in their knowledge of AI. Obviously one shouldn't feel bad about not knowing the answer, but I think we'd benefit from understanding the basic mathematical and statistical underpinnings of these things.
You can still understand that quite well without understanding backprop, though.
All you need is:
- Basic understanding of how a Markov chain can generate text (generating each word using corpus statistics on the previous few words).
- Understanding that you can then replace the Markov chain with a neural model which gives you more context length and more flexibility (words are now in a continuous space so you don't need to find literally the same words, you can exploit synonyms, similarity, etc., plus massive training data also helps).
- Finally, you add the instruction tuning (among all the plausible continuations the model could choose, teach it to prefer the ones human prefer - e.g. answering a question rather than continuing with a list of similar questions. You give the model cookies or slaps so it learns to prefer the answers humans prefer).
- But the core is still like in the Markov chain (generating each word using corpus statistics on the previous words).
I often give dissemination talks on LLMs to the general public and I have the feeling that with this mental model, you can basically know everything a lay user needs to know about how they work (you can explain things like hallucinations, stochastic nature, relevance of training data, relevance of instruction tuning, dispelling myths like "they always choose the most likely word", etc.) without any calculus at all; although of course this is subjective and maybe some people will think that explaining it in this way is heresy.
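To make the first bullet concrete, here is a toy word-level Markov generator (purely illustrative, with a made-up corpus): each next word is sampled from the statistics of what followed the previous word in the training text.

```python
# Toy bigram Markov chain text generator (illustrative only).
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

follows = defaultdict(list)              # word -> observed next words
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

random.seed(0)
word, out = "the", ["the"]
for _ in range(8):
    nexts = follows.get(word)
    if not nexts:                        # dead end: word only seen at the end of the corpus
        break
    word = random.choice(nexts)          # sample the next word from corpus statistics
    out.append(word)
print(" ".join(out))
```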
Sure, but it'd be similar to being a software developer and not understanding roughly what a compiler does. In a world full of neural network based technology, it'd be a bit lame for a technologist not to at least have a rudimentary understanding of how it works.
Nowadays, fine tuning LLMs is becoming quite mainstream, so even if you are not training neural nets of any kind from scratch, if you don't understand how gradients are used in the training (& fine tuning) process, then that is going to limit your ability to fully work with the technology.
Impossible requirement. The inherent quality of abstractions is to allow us to get more done without understanding everything. We don't write raw assembly for the same reason, you don't make fire by rubbing sticks, you don't go hunting for food in the woods, etc.
There is no need for the knowledge that you propose in a world where this is solved, you will achieve more goals by utilizing higher-level tools.
I get your point and this certainly applies to most modern computing where each new layer of abstraction becomes so solid and reliable that devs can usually afford to just build on top of it without worrying about how it works. I don’t believe this applies to modern AI/ML however. Knowing the chain rule, gradient descent and basic statistics IMO is not the same level of solid as other abstractions in computing. We can’t afford to not know these things. (At least not yet!)
I’d say so, the hallmark of being a car guy is understanding the basics, like the difference between a four cylinder and a six cylinder, a turbocharger from a supercharger, the different types of gearboxes (DCT vs AT or a CVT), and so on. They all affect the feel, capabilities, and limitations of the car.
Electric cars have similar complexities and limitations, for example the Bolt I owned could only go ~92MPH due to limitations in the gearing as a result of having a 1 speed gearbox. I would expect someone with a strong opinion of a car to know something as simple as the top speed.
Depends on what exactly the opinion is, but generally I'd say yes. If it's about their looks, maybe not.. but even then understanding the basics that determine things like not needing air intake, exhaust, the placement of batteries, etc. can be helpful.
If it's about the supply chain, understanding at least the requirements for magnets is helpful.
One way to make sure you understand all of these things is to understand the electric motor. But you could learn the separate pieces of knowledge on the fly too.
The more you understand the fundamentals of what you're talking about, the more likely you are to have genuine insight because you can connect more aspects of the problem and understand more of the "why".
> I'd like to see a future where more developers have a basic understanding of ML even if they never go on to do much of it. I think we would all benefit from being a bit more ML-literate.
Why "ML-literate" specifically? Also, there are some people against calculus and statistic in CS curriculum because it's not "useful" or "practical", why does ML get special treatment here?
Plus, I don't think a "gotcha" question like "what is gradient descent" will give you a good signal about someone if it gets popularized. It will probably lead to the present-day OOP cargo cult, where everyone just memorizes whatever their lecturer/bootcamp/etc. told them and repeats it to you without actually understanding what it does, why it's the preferred method over other strategies, etc.
We could also say AI-literate too, I suppose. I guess I just like to focus on ML generally because 1) most modern AI is possible only due to ML and 2) it’s more narrow and emphasizes the low level of how AI works.
Yes. Pretraining and fine-tuning use standard Adam optimizers (usually with weight-decay). Reinforcement learning has been the odd-man out historically, but these days almost all RL algorithms also use backprop and gradient descent.
Are LLMs still trained by (variants of) stochastic GRADIENT descent? AFAIK what used to be called "backprop" is nowadays known as "automatic differentiation". It's widely used in PyTorch, JAX etc
I was happy to see Karpathy writing a new blog post instead of simply Twitter threads, but when I opened the link I was just disappointed to realize it's from 9 years ago…
Its a nit pick, but backpropagation is getting a bad rep here. These examples are about gradients+gradient descent variants being a leaky abstraction for optimization [1].
Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backwards pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)
[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with by with tools like line search (if they are small in all directions) or approximate newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization(+modeling) are the actually hard parts, not the way gradients are calculated.
I get your point, but I don't think your nit-pick is useful in this case.
The point is that you can't abstract away the details of back propagation (which involve computing gradients) under some circumstances. For example, when we are using gradient descend. Maybe in other circumstances (global optimization algorithm) it wouldn't be an issue, but the leaky abstraction idea isn't that the abstraction is always an issue.
(Right now, back propagation is virtually the only way to calculate gradients in deep learning)
Yes. No need to be apologetic or timid about it — it’s not a nit to push back against a flawed conceptual framing.
I respect Karpathy’s contributions to the field, but often I find his writing and speaking to be more than imprecise — it is sloppy in the sense that it overreaches and butchers key distinctions. This may sound harsh, but at his level, one is held to a higher standard.
> often I find his writing and speaking to be more than imprecise
I think that's more because he's trying to write to an audience who isn't hardcore deep into ML already, so he simplifies a lot, sometimes to the detriment of accuracy.
At this point I see him more as a "ML educator" than "ML practitioner" or "ML researcher", and as far as I know, he's moving in that direction on purpose, and I have no qualms with it overall, he seems good at educating.
But I think shifting the mindset of what the purpose of his writings are maybe help understand why sometimes it feels imprecise.
Whoever chose this topic title perhaps did him a disservice in suggesting he said the problem was backprop itself, since in his blog post he immediately clarifies what he meant by it. It's a nice pithy way of stating the issue though.
Nah, Karpathy's title is "Yes you should understand backprop", and his first highlight is "The problem with Backpropagation is that it is a leaky abstraction." This is his choice as a communicator, not the poster to HN.
And his _examples_ are about gradients, but nowhere does he distinguish between backpropagation, a (part of) an algorithm for automatic differentiation and the gradients themselves. None of the issues are due to BP returning incorrect gradients (it totally could, for example, lose too much precision, but it doesn't).
Yeah - he chose it as a pithy/catchy description of the issue, then immediately clarified what he meant by it.
> In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.
Then follows this with multiple clear examples of exactly what he is talking about.
The target audience was people building and training neural networks (such as his CS231n students), so I think it's safe to assume they knew what backprop and gradients are, especially since he made them code gradients by hand, which is what they were complaining about!
But Karpathy is completely right; students who understand and internalize how backprop works, having implemented it rather than treating it as a magic spell cast by TF/PyTorch, will also be able to intuitively understand these problems of vanishing gradients and so on.
Sure, instead of "the problem with backpropagation is that it's a leaky abstraction" he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading for an introductory-level article for an undergraduate audience, and also unnecessary because he already said that in the introduction.
Karpathy's contribution to teaching around deep learning is just immense. He's got a mountain of fantastic material from short articles like this, longer writing like https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (on recurrent neural networks) and all of the stuff on YouTube.
Plus his GitHub. The recently released nanochat https://github.com/karpathy/nanochat is fantastic. Having minimal, understandable and complete examples like that is invaluable for anyone who really wants to understand this stuff.
I was slightly surprised that my colleagues, who are extremely invested in capabilities of LLMs, didn’t show any interest in Karpathy’s communication on the subject when I recommended it to them.
Later I understood that they don’t need to understand LLMs, and they don’t care how they work. Rather they need to believe and buy into them.
They’re more interested in science fiction discussions — how would we organize a society where all work is done by intelligent machines — than what kinds of tasks are LLMs good at today and why.
What's wrong or odd about that? You can like a technology as a user and not want to delve into how it works (sentence written by a human despite use of "delve"). Everyone should have some notions on what LLMs can or cannot do, in order to use them successfully and not be misguided by their limitations, but we don't need everyone to understand what backpropagation is, just as most of us use cars without knowing much about how an internal combustion engine works.
And the issue you mention in the last paragraph is very relevant, since the scenario is plausible, so it is something we definitely should be discussing.
> What's wrong or odd about that? You can like a technology as a user and not want to delve into how it works
The question here is whether the details are important for the major issues, or whether they can be abstracted away with a vague understanding. To what extent abstracting away is okay depends greatly on the individual case. Abstractions can work over a large area or for a long time, but then suddenly collapse and fail.
The calculator, which has always delivered sufficiently accurate results, can produce nonsense when one approaches the limits of its numerical representation or combines numbers with very different levels of precision. This can be seen, for example, when one rearranges commutative operations; due to rounding problems, it suddenly delivers completely different results.
The 2008 financial crisis was based, among other things, on models that treated certain market risks as independent of one another. Risk could then be spread by splitting and recombining portfolios. However, this only worked as long as the interdependence of the different portfolios was actually quite small. An entire industry, with the exception of a few astute individuals, had abstracted away this interdependence, acted on this basis, and ultimately failed.
As individuals, however, we are completely dependent on these abstractions. Our entire lives are permeated by things whose functioning we simply have to rely on without truly understanding them. Ultimately, it is the nature of modern, specialized societies that this process continues and becomes even more differentiated.
But somewhere there should be people who work at the limits of detailed abstractions and are concerned with researching and evaluating the real complexity hidden behind them, and thus correcting the abstraction if necessary, sending this new knowledge upstream.
The role of an expert is to operate with less abstraction and more detail in her oder his field of expertise than a non-expert -- and the more so, the better an expert she or he is.
Because if you don't understand how a tool works you can't use the tool to it's full potential.
Imagine if you were using single layer perceptrons without understanding seperability and going "just a few more tweaks and it will approximate XOR!"
I disagree in the case of LLMs, because they really are an accidental side effect of another tool. Not understanding the inner workings will make users attribute false properties to them. Once you understand how they work (how they generate plausible text), you get a far deeper grasp on their capabilities and how to tweak and prompt them.
And in fact this is true of any tool, you don’t have to know exactly how to build them but any craftsman has a good understanding how the tool works internally. LLMs are not a screw or a pen, they are more akin to an engine, you have to know their subtleties if you build a car. And even screws have to be understood structurally in advanced usage. Not understanding the tool is maybe true only for hobbyists.
You hit the nail on the head, in my opinion.
There are things that you just can’t expect from current LLMs that people routinely expect from them.
They start out projects with those expectations. And that’s fine. But they don’t always learn from the outcomes of those projects.
I don't think that's a good analogy, becuase if you're trying to train a single layer perceptron to approximate XOR you're not the end user.
None of this is about an end user in the sense of the user of an LLM. This is aimed at the prospective user of a training framework which implements backpropagation at a high level of abstraction. As such it draws attention to training problems which arise inside the black box in order to motivate learning what is inside that box. There aren't any ML engineers who shouldn't know all about single layer perceptrons I think, and that makes for a nice analogy to real life issues in using SGD and backpropagation for ML training.
The analogy is if you don't understand the limitations of the tool you may try and make it do something it is bad at and never understand why it will never do the thing you want despite looking like it potentially coild
I think there are a lot of people who just don't care about stuff like nanochat because it's exclusively pedagogical, and a lot of people want to learn by building something cool, not taking a ride on a kiddie bike with training wheels.
That's fine as far as it goes, but there is a middle ground ...
Feynman was right that "If you can't build it, you don't understand it", but of course not everyone needs or wants to fully understand how an LLM works. However, regarding an LLM as a magic black box seems a bit extreme if you are a technologist and hope to understand where the technology is heading.
I guess we are in an era of vibe-coded disposable "fast tech" (cf fast fashion), so maybe it only matters what can it do today, if playing with or applying towards this end it is all you care about, but this seems a rather blinkered view.
If everyone had to understand how carburettors, engines and break systems work; to be able to drive a car - rather than just learn to drive and get from A to B - I'm guessing there would be a lot less cars on the road.
(Thinking about it, would that necessarily be a bad thing...)
Not everybody who drives a car (even as a professional driver) knows how to make one.
If you live in a world of horse carriages, you can be thinking about what the world of cars is going to be like, even if you don't fully understand what fuel mix is the most efficient or what material one should use for a piston in a four-stroke.
I'm personally very interested in how LLMs work under the hood, but I don't think everyone who uses them as tools needs that. I don't know the wiring inside my drill, but I know how to put a hole in my wall and not my hand regardless.
Do you go deep into molecular biology to see how it works? It is much more interesting and important.
But the question is whether you have a better understanding of LLMs from a user's perspective, or they do.
Which is terrible. That's the root of all the BS around LLMs. People lacking understanding of what they are and ascribing capabilities which LLMs just don't have, by design. Even HN discussions are full of that. Even though this page literally has "hacker" in its name.
I see your point, but on the other hand a lot of conversations go: A: "What will we do when AI does all the jobs?" B: "That's silly, LLMs can't do the jobs." The thing is, A didn't say LLM, they said AI, as in whatever that will be a short while into the future. Which is changing rapidly because thousands of bright people are being paid to change it.
> a short while into the future
And what gives you that confidence? A few AI nerds already claimed that in the 80s.
We're currently exploring what LLMs can do. There is no indication that any further fundamental breakthrough is around the corner. Everybody is currently squeezing the same stone.
The trouble is that "AI" is also very much a leaky abstraction, which makes it tempting to see all the "AI" advances of recent years, then correctly predict that these "AI" advances will continue, but then jump to all sorts of wrong conclusions about what those advances will be.
For example, things like "AI" image and video generation are amazing, as are things like AlphaGo and AlphaFold, but none of these have anything to do with LLMs, and the only technology they share with LLMs is machine learning and neural nets. If you lump these together with LLMs, calling them all "AI", then you'll come to the wrong conclusion that all of these non-LLM advances indicate that "AI" is rapidly advancing and therefore LLMs (also being "AI") will do too ...
Even if you leave aside things like AlphaGo, and just focus on LLMs, and other future technology that may take all our jobs, then using terms like "AI" and "AGI" are still confusing and misleading. It's easy to fall into the mindset that "AGI" is just better "AI", and that since LLMs are "AI", AGI is just better LLMs, and is around the corner because "AI" is advancing rapidly ...
In reality LLMs are, like AlphaFold, something highly specific - they are auto-regressive next-word predictor language models (just as a statement of fact, and how they are trained, not a put-down), based on the Transformer architecture.
The technology that could replace humans for most jobs in the future isn't going to be a better language model - a better auto-regressive next-word predictor - but will need to be something much more brain-like. The architecture itself doesn't have to be brain-like, but in order to deliver brain-like functionality it will probably need to include another half-dozen "Transformer-level" architectural/algorithmic breakthroughs, including things like continual learning, which will likely turn the whole current LLM training and deployment paradigm on its head.
Again, just focusing on LLMs and LLM-based agents: if you regard them as a black-box technology, it's easy to be misled into thinking that advances in capability are broad and will lift all ships, when in reality progress is much more narrow. Headlines about LLM achievements in math and competitive programming, touted as evidence of reasoning, do NOT imply that LLM reasoning is broadly advancing; you need to get under the hood and understand RL training goals to realize why that is not necessarily the case. The correctness of most business and real-world reasoning is not as easy to check as marking a math problem correct or not, yet that checkability is what RL training depends on.
I could go on .. LLM-based agents are also blurring the lines of what "AI" can do, and again if treated as a black box will also misinform as to what is actually progressing and what is not. Thousands of bright people are indeed working on improving LLM-adjacent low-hanging fruit like this, but it'd be illogical to conclude that this is somehow helping to create next-generation brain-like architectures that will take away our jobs.
I'll grant you that algorithmic breakthroughs have been quite slow to come about - I think backpropagation in 1986 and then transformers in 2017. Still, the fact that LLMs can do well in things like the maths olympiad has me thinking there must be some way to tweak this to be more brain-like. I recently read how LLMs work and was surprised at how text-focused it is, building word vectors rather than physical understanding.
> Still the fact that LLMs can do well in things like the maths olympiad have me thinking there must be some way to tweak this to be more brain like
That's because you, as you admit in the next sentence, have almost no understanding of how they work.
Your reasoning is on the same level as someone in the 1950s thinking ubiquitous flying cars are just a few years away. Or fusion power, for that matter.
In your defense, that seems to be about the average level of engagement with this technology, even on this website.
Yes, it's a bit shocking to realize that all LLMs are doing is predicting the next word (token) from samples in the training data, but the Transformer is powerful enough to do a fantastic job of prediction (which you can think of as selecting which training sample(s) to copy from), which is why the LLM - just a dumb function - appears as smart as the human training data it is copying.
The Math Olympiad results are impressive, but at the end of the day it is just this same next-word prediction, in this case fine-tuned by additional LLM training on solutions to math problems, teaching the LLM which next-word predictions (i.e. output) will add up to solution steps that lead to correct problem solutions in the training data. Due to the logical nature of math, the reasoning/solution steps that worked for training-data problems will often work for new problems it is then tested on (Math Olympiad), but most reasoning outside of logical domains like math and programming isn't so clear cut, so this approach of training on reasoning examples isn't necessarily going to help LLMs get better at reasoning on more useful real-world problems.
I’m trying not to be disappointed by people, I’d rather understand what’s going on in their minds, and how to navigate that.
Obviously they are more focused on making something that works
Wow. Definitely NOT management material then.
And to all the LLM heads here, this is his work process:
> Yesterday I was browsing for a Deep Q Learning implementation in TensorFlow (to see how others deal with computing the numpy equivalent of Q[:, a], where a is an integer vector — turns out this trivial operation is not supported in TF). Anyway, I searched “dqn tensorflow”, clicked the first link, and found the core code. Here is an excerpt:
Notice how it's "browse" and "search", not just "I asked ChatGPT". Notice how it made him notice a bug.
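For readers wondering what the gotcha was: in a DQN you typically need to pick, for each row of Q, the value of the action actually taken. A rough numpy sketch of that gather (variable names made up), plus the one-hot workaround that was common in early TensorFlow:

    import numpy as np

    # Q: (batch, num_actions) Q-values; a: (batch,) integer actions taken
    Q = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    a = np.array([2, 0])

    # fancy-indexing gather: one Q-value per row
    q_taken = Q[np.arange(len(a)), a]            # -> [3.0, 4.0]

    # a common workaround when the framework lacks this indexing:
    # build a one-hot mask and reduce-sum along the action axis
    one_hot = np.eye(Q.shape[1])[a]              # (batch, num_actions)
    q_taken_workaround = (Q * one_hot).sum(axis=1)

    assert np.allclose(q_taken, q_taken_workaround)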
First of all, this is not a competition between “are LLMs better than search”.
Secondly, the article is from 2016; ChatGPT didn't exist back then.
I doubt he's letting LLMs creep into his decision-making in 2025, aside from fun side projects (vibes). We don't ever come across Karpathy going to an LLM, or expressing that an LLM helped, in any of his YouTube videos about building LLMs.
He's just test driving LLMs, nothing more.
Nobody's asking this core question in podcasts. "How much and how exactly are you using LLMs in your daily flow?"
I'm guessing it's like actors not wanting to watch their own movies.
Karpathy talking for 2 hours about how he uses LLMs:
https://www.youtube.com/watch?v=EWvNQjAaOHw
Vibing, not firing at his ML problems.
He's doing a capability check in this video (for the general audience, which is good of course), not attacking a hard problem in ML domain.
Despite this tweet: https://x.com/karpathy/status/1964020416139448359 , I've never seen him citing an LLM helped him out in ML work.
You're free to believe whatever fantasy you wish, but as someone who frequently consults an LLM alongside other resources when thinking about complex and abstract problems, there is no way in hell that Karpathy intentionally limits his options by excluding LLMs when seeking knowledge or understanding.
If he did not believe in the capability of these models, he would be doing something else with his time.
One can believe in the capability of a technology but on principle refuse to use implementations of it built on ethically flawed approaches (e.g., violating GPL licensing laws and/or copyright, thus harming open source ecosystem).
What you see as copyright violation, I see as liberation. I have open models running locally on my machine that would have felled kingdoms in the past.
AI is more important than copyright law. Any fight between them will not go well for the latter.
Truth be told, a whole lot of things are more important than copyright law.
https://news.ycombinator.com/item?id=45788753
> Continuing the journey of optimal LLM-assisted coding experience. In particular, I find that instead of narrowing in on a perfect one thing my usage is increasingly diversifying
https://x.com/karpathy/status/1959703967694545296
What you did here is called confirmation bias.
> I think congrats again to OpenAI for cooking with GPT-5 Pro. This is the third time I've struggled on something complex/gnarly for an hour on and off with CC, then 5 Pro goes off for 10 minutes and comes back with code that works out of the box. I had CC read the 5 Pro version and it wrote up 2 paragraphs admiring it (very wholesome). If you're not giving it your hardest problems you're probably missing out.
https://x.com/karpathy/status/1964020416139448359
Yes, embedding .py code inside of a speedrun.sh to "simplify the [sic] bash scripts."
Eureka Labs runs LLM101n, which is teaching software for pedagogic symbiosis.
[1]: https://eurekalabs.ai/
I took a course in my Master's (URV.cat) where we had to do exactly this, implementing backpropagation (fwd and backward passes) from a paper explaining it, using just basic math operations in a language of our choice.
I told everyone this was the best single exercise of the whole year for me. It aligns with the kind of activity that I benefit immensely but won't do by myself, so this push was just perfect.
If you are teaching, please consider this kind of assignment.
P.S. Just checked now and it's still in the syllabus :)
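For anyone curious what such an exercise boils down to, here is a minimal sketch (not the course material, just an illustrative toy): a two-layer network fit to sin(x), with the backward pass written out by hand using only basic array operations.

    import numpy as np

    # toy data: learn y = sin(x) on a few points (made-up example)
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(64, 1))
    y = np.sin(X)

    W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)
    W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)
    lr = 0.05

    for step in range(2000):
        # forward pass
        z1 = X @ W1 + b1
        h = np.tanh(z1)
        pred = h @ W2 + b2
        loss = np.mean((pred - y) ** 2)

        # backward pass, written out by hand (chain rule, layer by layer)
        dpred = 2 * (pred - y) / len(X)
        dW2 = h.T @ dpred
        db2 = dpred.sum(0)
        dh = dpred @ W2.T
        dz1 = dh * (1 - h ** 2)     # tanh'(z) = 1 - tanh(z)^2
        dW1 = X.T @ dz1
        db1 = dz1.sum(0)

        # plain gradient descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(round(float(loss), 4))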
The difference in understanding (for me and how my brain works) between reading the paper in what appears to be a future or past alien language and working through a minimal paper/code example is massive.
Same here, even more so if I'm doing it over a few days and from different angles.
I did this in highschool from some online textbook in plain Java. I recall implementing matrix multiplication myself being the hardest part.
I made a UI that showed how the weights and biases changed throughout the training iterations.
I had a whole course just about how computers do maths. Matrix multiplication, linear fit, finding eigenvectors, multiplication and division, square root, solving linear systems, numerically calculating differential equations, spline interpolation, FEM analysis.
"Computers are good at maths" is normally a pretty obvious statement... but many things we take for granted from analytical mathematics, is quite difficult to actually implement in a computer. So there is a mountain of clever algorithms hiding behind some of the seemingly most obvious library operations.
One of the best courses I've ever had.
Would you mind sharing which course it was? Is it available online by any chance?
Unfortunately it was a course at my university, and in Swedish. But it wouldn't surprise me if there are similar courses online.
Is that paper publicly available?
It seems to me that in 2016 people did (have to) play a lot more tricks with backpropagation than today. Back then it was common to meddle with the gradients in the middle of gradient propagation.
For example, Alex Graves's (great! with attention) 2013 paper "Generating Sequences With Recurrent Neural Networks" has this line:
One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
with this footnote:
In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.
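For reference, the clipping Graves describes is roughly one line in a modern framework. A rough sketch assuming PyTorch (the model and numbers are made up; note that Graves clipped the derivatives w.r.t. the LSTM layer inputs, whereas the built-in utilities below clip the parameter gradients, which is the common default today):

    import torch

    # toy model and optimizer, just to show where the clipping call goes
    model = torch.nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(8, 16, 32)
    out, _ = model(x)
    loss = out.pow(2).mean()

    opt.zero_grad()
    loss.backward()
    # framework-supported gradient clipping (global norm here;
    # clip_grad_value_ is closer to the per-element range Graves used)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()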
> It seems to me that in 2016 people did (have to) play a lot more tricks with the backpropagation than today
Perhaps, but maybe because there was more experimentation with different neural net architectures and nodes/layers back then?
Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.
The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.
I have a naive question about backprop and optimizers.
I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.
But with more advanced optimizers the gradient is not really used directly. It gets per-weight normalization, fudged with momentum, clipped, etc.
So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Two thoughts:
> how important is computing the exact gradient using calculus
Normally the gradient is computed with a small "minibatch" of examples, meaning that on average over many steps the true gradient is followed, but each individual step never moves exactly along the true gradient. This noisy walk is actually quite beneficial for the final performance of the network (https://arxiv.org/abs/2006.15081 , https://arxiv.org/abs/1609.04836), so much so that people started wondering what is the best way to "corrupt" this approximate gradient even more to improve performance (https://arxiv.org/abs/2202.02831, and many other works relating to SGD noise).
> vs just knowing the general direction to step
I can't find relevant papers now, but I seem to recall that the Hessian eigenvalues of the loss function decay rather quickly, which means that taking a step in most directions will not change the loss very much. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have been trying to visualize the loss and trajectory taken during optimization https://arxiv.org/pdf/1712.09913 , https://losslandscape.com/
The first bit is why it is called stochastic gradient descent. You follow the gradient of a randomly chosen minibatch at each step. It basically makes you "vibrate" down along the gradient.
> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Yes, absolutely -- a lot of ideas inspired by this have been explored in the field of optimization, and also in machine learning. The very idea of "stochastic" gradient descent using mini-batches is basically a cheap (hardware-compatible) approximation to the gradient at each step.
For a relatively extreme example of how we might circumvent the computational effort of backprop, see Direct Feedback Alignment: https://towardsdatascience.com/feedback-alignment-methods-7e...
Ben Recht has an interesting survey of how various learning algorithms used in reinforcement learning relate with techniques in optimization (and how they each play with the gradient in different ways): https://people.eecs.berkeley.edu/~brecht/l2c-icml2018/ (there's nothing special about RL... as far as optimization is concerned, the concepts work the same even when all the data is given up front rather than generated on-the-fly based on interactions with the environment)
>computing the exact gradient using calculus
First of all, gradient computation with back-prop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge-cases that are not relevant here) so it's not about the way of computing the gradient.
What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function, and that in this design what matters for learning is not so much the value of the loss function as its slopes and curvature (peaks and valleys).
The problematic case being flat valleys, surrounded by straight cliffs, (picture the grand canyon).
Advanced optimizers in deep learning like Adam are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient the optimizer has an estimate of the scale sensitivity of each parameter independently. So the cheap thing it can reasonably do is modulate the gradient with this scale.
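As a rough sketch of what that modulation looks like, here is the textbook Adam update written in plain numpy (not tied to any particular framework; the hyperparameter values are the usual defaults):

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # momentum estimate (first moment) and per-parameter scale estimate (second moment)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)      # bias correction for the first steps
        v_hat = v / (1 - beta2**t)
        # the gradient is modulated per-parameter by the estimated scale
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v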
Because the length of the gradient vector is often problematic, what classical optimizers would usually do is something called "line search": determining the optimal step size along this direction. But the cost of doing that is usually between 10 and 100 evaluations of the cost function, which is often not worth the effort in the noisy stochastic context, compared to just taking a smaller step multiple times.
Higher-order optimizers require the loss function to be twice differentiable, so non-linearities like ReLU, which are cheap to compute, can't be used.
Lower-order global optimizers don't even require the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).
That's an interesting idea, it sounds similar to the principles behind low precision models like BitNet (where each weight is +-1 or 0).
That said, I know Deepseek use fp32 for their gradient updates even though they use fp8 for inference. And a recent paper shows that RL+LLM training is shakier at bf16 than fp16, which would both imply that numerical precision in gradients still matters.
> But with more advanced optimizers the gradient is not really used directly. It gets per weight normalization, fudged with momentum, clipped, etc.
Why would these things be "fudging"? Vanishing gradients (see the initial batch norm paper) are a real thing, and ensuring that the relative magnitudes are in some sense "smooth" between layers allows for an easier optimization problem.
> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Very. In high dimensional space, small steps can move you extremely far from a proper solution. See adversarial examples.
You don't need exact gradients, since gradient descent is self-correcting (which can make it hard to find gradient calculation bugs!). One approach using inexact gradients is to use predicted "synthetic gradients" which avoids needing to wait for backward pass for weight updates.
Calculus isn't that complicated, at least not what's done in backprop.
How do you propose calculating the "general direction" ?
And, an example "advanced optimizer" - AdamW - absolutely uses gradients. It just does more, but not less.
It is possible to compute an approximate gradient (the direction to step) without using the formulas: we can change the value of each parameter individually, recompute the loss, adjust the parameters in whichever direction lowered the loss, and then repeat. This means, however, that we have to do number-of-parameters forward passes for one optimization step, which is very expensive. With the formulas, we can compute all these values in one backward pass.
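A minimal sketch of that idea, using finite differences on a toy loss (names made up): every coordinate needs its own extra forward passes, which is exactly the cost that a single backward pass avoids.

    import numpy as np

    def numerical_gradient(f, theta, eps=1e-6):
        # central differences: two extra forward passes PER PARAMETER
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d.flat[i] = eps
            grad.flat[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
        return grad

    f = lambda th: np.sum(th**2)          # toy loss; true gradient is 2*theta
    theta = np.array([1.0, -2.0, 3.0])
    print(numerical_gradient(f, theta))   # ~ [2., -4., 6.]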
All first order methods use the gradient or Jacobian of a function. Calculating the first order derivatives is really cheap.
Non-stochastic gradient descent has to optimize over the full dataset. This doesn't matter for non-machine learning applications, because often there is no such thing as a dataset in the first place and the objective has a small fixed size. The gradient here is exact.
With stochastic gradient descent you're turning gradient descent into an online algorithm, where you process a small subset of the dataset at a time. Obviously the gradient is no longer exact, but you still have to calculate it.
Seems like "exactness" is not that useful of a property for optimization. Also, I can't stress it enough, but calculating first order derivatives is so cheap there is no need to bother. It's roughly 2x the cost of evaluating the function in the first place.
It's second order derivatives that you want to approximate using first order derivatives. That's how BFGS and Gauss-Newton work.
The original title is "Yes you should understand backprop" - which is good and descriptive.
Agree, that's a better title for this post; the fact that backprop is a leaky abstraction is a reason one should understand it and know how to work through the mechanics by hand, to truly experience it and develop understanding and intuition. Software/code abstracting away even more of the process leaves it open to magical thinking. I had to do hand calculations and convolutions in my Deep Learning graduate school course.
Yep. Also, I don’t find the metaphorical connection to leaky abstractions useful at all. It feels strained.
This comment:
> “Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”
worries me because it is structured with the same reasoning as "why do we have to demonstrate we understand addition if in the real world we have calculators"
The counter-argument would be that you can make excellent arguments for why we should understand what the compiler is doing, understand what the transistors are doing (e.g. to understand the limitations and risks of overclocking), understand how sort algorithms work (doing sort algorithms manually was at one time a typical interview question). It's not that these things aren't useful, but whether or not that learning would be useful is not the right question.
Is this _more_ useful than the other things, which I could be learning but won't because I spent the time and effort to learn this instead?
We have a finite capacity for learning, if for no other reason than that we have a finite amount of time in this life, and infinite topics to learn (and there are plenty of other constraints besides time). The reason given for learning this topic is that it has hidden failure modes which you will not be on the lookout for if you don't know how it works "under the hood".
Is this a good enough reason to spend time learning this rather than, say, how to model the physics of the system you're training the neural network to deal with? Tough question; maybe, maybe not. If you have time to learn both, do that, but if not, then you will have to choose which is most important. And in our education system, we do things like teach calculus but not intermediate statistics, and it would have been better to do the opposite for something like 90% of the people taking calculus.
That said, I've implemented backpropagation multiple times, it's a good way to evaluate a new language (just complex enough to reveal problems, not so complex that it takes forever).
Are dead ReLUs still a problem today? Why or why not?
I have to be contrarian here. The students were right. You didn't need to learn to implement backprop in NumPy. Any leakiness in BackProp is addressed by researchers who introduce new optimizers. As a developer, you just pick the best one and find good hparams for it.
From the perspective of the university, the students are being trained to become researchers, not engineers.
The problem isn't with backprop itself or the optimizer - it's potentially in (the derivatives of) the functions you are building the neural net out of, such as the Sigmoid and ReLU examples that Karpathy gave.
Just because the framework you are using provides things like ReLU doesn't mean you can assume someone else has done all the work and you can just use these and expect them to work all the time. When things go wrong training a neural net you need to know where to look, and what to look for - things like exploding and vanishing gradients.
> Any leakiness in BackProp is addressed by researchers who introduce new optimizers
> As a developer, you just pick the best one and find good hparams for it
It would be more correct to say: "As a developer, (not researcher), whose main goal is to get a good model working — just pick a proven architecture, hyperparameters, and training loop for it."
Because just picking the best optimizer isn't enough. Some of the issues in the article come from the model design, e.g. sigmoids, relu, RNNs. And some of the issues need to be addressed in the training loop, e.g. gradient clipping isn't enabled by default in most DL frameworks.
And it should be noted that the article is addressing people on the academic / research side, who would benefit from a deeper understanding.
It's for a CS course at Stanford not a PyTorch boot camp. It seems reasonable to expect some level of academic rigour and need to learn and demonstrate understanding of the fundamentals. If researchers aren't learning the fundamentals in courses like these where are they learning them?
You've also missed the point of the article, if you're building novel model architectures you can't magic away the leakiness. You need to understand the back prop behaviours of the building blocks you use to achieve a good training run. Ignore these and what could be a good model architecture with some tweaks will either entirely fail to train or produce disappointing results.
Perhaps you're working at a level of bolting pre built models together or training existing architectures on new datasets but this course operates below that level to teach you how things actually work.
The problem with your reasoning is you never tackle your "unknown unknowns". You just assume they are "known unknowns".
Diving through the abstraction reveals some of those.
Karpathy suggests the following error:
Following the same principles that he outlines in this post, the "- 0.5" part is unnecessary, since the gradient of the constant 0.5 is 0 and therefore subtracting it doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x²+1).
If we don't subtract from the second branch, there will be a discontinuity around x = 1, so the derivative will not be well-defined. Also the value of the loss will jump at this value, which will make it hard to inspect the errors, for one thing.
No, that's not how backprop works. There will be no discontinuity in a backpropagated gradient.
I did not say there will be a discontinuity in the gradient; I said that the modified loss function will not have a mathematically well-defined derivative because of the discontinuity in the function.
You do that to make things smoother when plotted. You could in theory add some crazy stairstep that adds a hundred to the middle part. It would make your loss curves spike and increase towards convergence, but then those spikes are just visual artifacts from doing weird discontinuous nonsense with your loss.
square roots are expensive
They are negligible, especially at the time the post was written, when ops were not fused. The extra memory you need to store the extra tensors in the original version is more expensive.
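For concreteness, a small sketch of the three loss variants discussed above (assuming the Huber-style loss with threshold 1; the names are just for illustration):

    import numpy as np

    def huber(x):               # 0.5*x^2 inside |x| <= 1, |x| - 0.5 outside
        return np.where(np.abs(x) <= 1, 0.5 * x**2, np.abs(x) - 0.5)

    def huber_no_offset(x):     # same gradient away from |x| = 1,
        return np.where(np.abs(x) <= 1, 0.5 * x**2, np.abs(x))  # but the value jumps there

    def pseudo_huber(x):        # the sqrt(x^2 + 1) alternative: smooth everywhere
        return np.sqrt(x**2 + 1)

    x = np.array([0.99, 1.01])
    print(huber(x))             # ~[0.49, 0.51]  continuous across |x| = 1
    print(huber_no_offset(x))   # ~[0.49, 1.01]  jumps by ~0.5
    print(pseudo_huber(x))      # smooth; a constant offset wouldn't change its gradient anyway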
More generally, it's often worth learning and understanding things one step deeper. Having a more fundamental understanding of things explains more of the "why" behind why some things are the way they are, or why we do some things a certain way. There's probably a cutoff point for balancing how much you actually need to know though. You could potentially take things a step further by writing the backwards pass without using matrix multiplication, or spend some time understanding what the numerical value of a gradient means.
... (2016)
9 years ago, 365 points, 101 comments
https://news.ycombinator.com/item?id=13215590
Karpathy's work on large datasets for deep neural flow is conceiving of the "backward pass" as the preparation for initializing the mechanics for weight ranges, either as derivatives in -10/+10 statistic deviations.
Karpathy is butchering the metaphor. There is no abstraction here. Backprop is an algorithm. Automatic differentiation is a technique. Neither promises to hide anything.
I agree that understanding them is useful, but they are not abstractions much less leaky abstractions.
Off-topic, but does anybody know what's going on with EurekaLabs? It's been a while since the announcement.
He does have a history of abandoning projects. OpenAI. Tesla. OpenAI again...
Then again, it might have been the corporate stuff that burned him out rather than the engineering.
He gives an update in the Dwarkesh interview:
https://youtu.be/lXUZvyajciY?si=vbqKDOOY7l-491Ka&t=7028
Not too many details on timeline - just that he's working on it.
Given that we're now in the year 2025 and AI has become ubiquitous, I'd be curious to estimate what percentage of developers now actually understand backprop.
It's a bit snarky of me, but whenever I see some web developer or product person with a strong opinion about AI and its future, I like to ask "but can you at least tell me how gradient descent works?"
I'd like to see a future where more developers have a basic understanding of ML even if they never go on to do much of it. I think we would all benefit from being a bit more ML-literate.
I'm wondering: how can understanding gradient descent help in building AI systems on top of LLMs? To me it feels like the skills of building "AI" are almost orthogonal to the skills of building on top of "AI".
I take your point in that they are mostly orthogonal in practice, but with that being said, I think understanding how these AI's were created is still helpful.
For example, I believe that if we were to ask the average developer why LLMs behave randomly, they would not be able to answer. This to me exposes a fundamental hole in their knowledge of AI. Obviously one shouldn't feel bad about not knowing the answer, but I think we'd benefit from understanding the basic mathematical and statistical underpinnings of these things.
You can still understand that quite well without understanding backprop, though.
All you need is:
- Basic understanding of how a Markov chain can generate text (generating each word using corpus statistics on the previous few words); a minimal sketch of this follows at the end of this comment.
- Understanding that you can then replace the Markov chain with a neural model which gives you more context length and more flexibility (words are now in a continuous space so you don't need to find literally the same words, you can exploit synonyms, similarity, etc., plus massive training data also helps).
- Finally, you add the instruction tuning (among all the plausible continuations the model could choose, teach it to prefer the ones humans prefer - e.g. answering a question rather than continuing with a list of similar questions. You give the model cookies or slaps so it learns to prefer the answers humans prefer).
- But the core is still like in the Markov chain (generating each word using corpus statistics on the previous words).
I often give dissemination talks on LLMs to the general public and I have the feeling that with this mental model, you can basically know everything a lay user needs to know about how they work (you can explain things like hallucinations, stochastic nature, relevance of training data, relevance of instruction tuning, dispelling myths like "they always choose the most likely word", etc.) without any calculus at all; although of course this is subjective and maybe some people will think that explaining it in this way is heresy.
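Here is the minimal sketch of the Markov-chain baseline mentioned in the first bullet (a toy bigram model over a made-up corpus; real models condition on much more context and vastly more data):

    import random
    from collections import defaultdict

    # a tiny bigram "language model": generate each word from corpus
    # statistics on the previous word (corpus and seed word are made up)
    corpus = "the cat sat on the mat and the cat slept on the mat".split()

    counts = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev].append(nxt)

    word = "the"
    out = [word]
    for _ in range(8):
        word = random.choice(counts[word])   # sample the next word
        out.append(word)
    print(" ".join(out))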
Sure, but it'd be similar to being a software developer and not understanding roughly what a compiler does. In a world full of neural network based technology, it'd be a bit lame for a technologist not to at least have a rudimentary understanding of how it works.
Nowadays, fine tuning LLMs is becoming quite mainstream, so even if you are not training neural nets of any kind from scratch, if you don't understand how gradients are used in the training (& fine tuning) process, then that is going to limit your ability to fully work with the technology.
Impossible requirement. The inherent quality of abstractions is to allow us to get more done without understanding everything. We don't write raw assembly for the same reason, you don't make fire by rubbing sticks, you don't go hunting for food in the woods, etc.
There is no need for the knowledge that you propose in a world where this is solved, you will achieve more goals by utilizing higher-level tools.
I get your point and this certainly applies to most modern computing where each new layer of abstraction becomes so solid and reliable that devs can usually afford to just build on top of it without worrying about how it works. I don’t believe this applies to modern AI/ML however. Knowing the chain rule, gradient descent and basic statistics IMO is not the same level of solid as other abstractions in computing. We can’t afford to not know these things. (At least not yet!)
So if you want to have a strong opinion on electric cars, you need to be able to explain how an electric motor works, right?
I’d say so, the hallmark of being a car guy is understanding the basics, like the difference between a four cylinder and a six cylinder, a turbocharger from a supercharger, the different types of gearboxes (DCT vs AT or a CVT), and so on. They all affect the feel, capabilities, and limitations of the car.
Electric cars have similar complexities and limitations, for example the Bolt I owned could only go ~92MPH due to limitations in the gearing as a result of having a 1 speed gearbox. I would expect someone with a strong opinion of a car to know something as simple as the top speed.
Depends on what exactly the opinion is, but generally I'd say yes. If it's about their looks, maybe not.. but even then understanding the basics that determine things like not needing air intake, exhaust, the placement of batteries, etc. can be helpful.
If it's about the supply chain, understanding at least the requirements for magnets is helpful.
One way to make sure you understand all of these things is to understand the electric motor. But you could learn the separate pieces of knowledge on the fly too.
The more you understand the fundamentals of what you're talking about, the more likely you are to have genuine insight because you can connect more aspects of the problem and understand more of the "why".
TL;DR it depends, but it almost always helps.
Plus, I don't think a "gotcha" question like "what is gradient descent" will give you a good signal about someone if it gets popularized. It will probably lead to the present-day OOP cargo cult, where everyone just memorizes whatever their lecturer/bootcamp/etc taught them and repeats it to you without actually understanding what it does, why it's the preferred method over other strategies, etc.
> Why "ML-literate" specifically?
We could also say AI-literate too, I suppose. I guess I just like to focus on ML generally because 1) most modern AI is possible only due to ML and 2) it’s more narrow and emphasizes the low level of how AI works.
Do LLMs still use backprop?
Yes. Pretraining and fine-tuning use standard Adam optimizers (usually with weight-decay). Reinforcement learning has been the odd-man out historically, but these days almost all RL algorithms also use backprop and gradient descent.
Are LLMs still trained by (variants of) stochastic GRADIENT descent? AFAIK what used to be called "backprop" is nowadays known as "automatic differentiation". It's widely used in PyTorch, JAX etc
Gradient descent doesn't matter here. Second order and higher methods still use lower order derivatives.
Back propagation is reverse mode auto differentiation. They are the same thing.
And for those who don't understand what back propagation is, it is just an efficient method to calculate the gradient for all parameters.
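In framework terms, a rough PyTorch sketch (not tied to the article): one call to backward() runs reverse-mode autodiff and fills in the gradient for every parameter at once.

    import torch

    w = torch.randn(3, requires_grad=True)    # toy "parameters"
    x = torch.tensor([1.0, 2.0, 3.0])
    loss = torch.sigmoid(w @ x)               # some scalar loss
    loss.backward()                           # reverse-mode AD, i.e. backprop
    print(w.grad)                             # d(loss)/dw for all weights, one pass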
Sidenote: why are people still using Medium?
article is from 2016
I was happy to see Karpathy writing a new blog post instead of simply Twitter threads, but when I opened the link I was disappointed to realize it's from 9 years ago…
I really hate what Twitter did to blogging…
He has a new blog at https://karpathy.bearblog.dev/blog/
Oh, I wasn't aware of it, thank you very much!