I appreciate Andrej’s optimistic spirit, and I am grateful that he dedicates so much of his time to educating the wider public about AI/LLMs. That said, it would be great to hear his perspective on how 2025 changed the concentration of power in the industry, what’s happening with open-source, local inference, hardware constraints, etc. For example, he characterizes Claude Code as “running on your computer”, but no, it’s just the TUI that runs locally, with inference in the cloud. The reader is left to wonder how that might evolve in 2026 and beyond.
The section on Claude Code is written ambiguously; I think he meant that the agent runs on your computer (not inference), in contrast to agents running "on a website" or in the cloud:
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.
The CC point is more about data, environment, and general configuration context, not about compute and where it happens to run today. The cloud setups are clunky because of context and UI/UX user-in-the-loop considerations, not because of compute considerations.
From what I can gather, llama.cpp supports Anthropic's message format now[1], so you can use it with Claude Code[2].
[1]: https://github.com/ggml-org/llama.cpp/pull/17570
[2]: https://news.ycombinator.com/item?id=44654145
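If that PR works as described, the setup would presumably look something like this. Note this is a hedged sketch: the model filename is a placeholder, and the exact flags may differ from what the linked PR ships; `ANTHROPIC_BASE_URL` is the environment variable Claude Code reads to redirect API calls.

```shell
# Serve a local model with llama.cpp's built-in server
# (model path is a placeholder; any instruct-tuned GGUF should do)
llama-server -m ./models/local-coder-q4_k_m.gguf --port 8080 &

# Point Claude Code at the local endpoint instead of Anthropic's API
export ANTHROPIC_BASE_URL="http://127.0.0.1:8080"
claude
```

In other words, the "ghost on your computer" framing could become literal, with both the agent loop and inference on localhost.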
The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR is the mental model I didn't know I needed to explain the current state of jagged intelligence. It perfectly articulates why trust in benchmarks is collapsing; we aren't creating generally adaptive survivors, but rather over-optimizing specific pockets of the embedding space against verifiable rewards.
I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.
The idea of jaggedicity seems useful for advancing epistemology. If we could identify the domains that have useful data we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your own blind spots. But now we have an alien intelligence with an outside perspective; while we make AI less jagged, it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.
Notable omission: 2025 is also when the ghosts started haunting the training data. Half of X replies are now LLMs responding to LLMs. The call is coming from inside the dataset.
Any tips to spot this? I want to avoid arguing with an X bot.
> In this world view, nano banana is a first early hint of what that might look like.
What is he referring to here? Is nano banana not just an image gen model? Is it because it's an LLM-based one, and not diffusion?
I think he is referring to capability, not architecture, and is saying that NB is at the point where it suggests the near-future capability of GenAI models creating their own UI as needed.
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.
I think one of the things missing from this post is an attempt to answer: what are the highest-priority AI-related problems the industry should seek to tackle?
Karpathy hints at one major capability unlock being UI generation: instead of interacting through text, the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you to figure out what to do. But for standard problems it should function reliably 100% of the time.
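The "reach out when unsure, act when confident" behavior described above can be sketched as a simple dispatch pattern. Everything here is hypothetical illustration: the function names, the confidence score, and the 0.8 threshold are assumptions, not any real deployment's API.

```python
def handle(task, classify, resolve, ask_human, threshold=0.8):
    """Route a task: act autonomously when confident, escalate otherwise.

    classify  -> (label, confidence) for the task (hypothetical model call)
    resolve   -> automated handling for a recognized, standard task
    ask_human -> escalation path for unexpected or low-confidence scenarios
    """
    label, confidence = classify(task)
    if confidence < threshold:
        # Unexpected scenario: defer to a person rather than hallucinate
        return ask_human(task)
    # Standard problem: handle it reliably without human involvement
    return resolve(label, task)
```

The hard part in practice is not this loop but getting calibrated confidence out of the model in the first place; without that, the threshold is meaningless.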
xposted to https://x.com/karpathy/status/2002118205729562949
Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% have to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.
I’ve found incredible productivity gains writing (vibe coding) tools for myself that will never need to be “productionised” or even used by another person. Heck, even I will probably never use the latest log retrieval tool, which exists purely for Claude Code to invoke. There is a ton of useful software yet to be written for which there _is_ no “last 20%”.