Author here. A few people are arguing against a stronger claim than the repo is meant to make. As well, this was very much intended to be a joke and not research level commentary.
This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.
What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. Therefore, since post-completion output is “cavemanned” the code hasn’t been affected by the skill at all :)
Also surprising to hear so little faith in RL. Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.
The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.
Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
So the real eval is end-to-end:
- total input tokens
- total output tokens
- latency
- quality/task success
There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)
So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.
This is fun. I'd like to see the same idea but oriented for richer tokens instead of simpler tokens. If you want to spend less tokens, then spend the 'good' ones. So, instead of saying 'make good' you could say 'improve idiomatically' or something. Depends on one's needs. I try to imagine every single token as an opportunity to bend/expand/limit the geometries I have access to. Language is a beautiful modulator to apply to reality, so I'll wager applying it with pedantic finesse will bring finer outputs than brutish humphs of cavemen. But let's see the benchmarks!
I'm reminded by the caveman skill of the clipped writing style used in telegrams, and your post further reminded me of "standard" books of telegram abbreviations. Take a look at [0]; could we train models to use this kind of code and then decode it in the browser? These are "rich" tokens (they succinctly carry a lot of information).
Idk I try talk like cavemen to claude. Claude seems answer less good. We have more misunderstandings. Feel like sometimes need more words in total to explain previous instructions. Also less context is more damage if typo. Who agrees? Could be just feeling I have. I often ad fluff. Feels like better result from LLM. Me think LLM also get less thinking and less info from own previous replies if talk like caveman.
Yes because in most contexts it has seen "caveman" talk the conversations haven't been about rigorously explained maths/science/computing/etc... so it is less likely to predict that output.
Cute idea, but you're never gonna blow your token budget on output. Input tokens are the bottleneck, because the agent's ingesting swathes of skills, directory trees, code files, tool outputs, etc. The output is generally a few hundred lines of code and a bit of natural language explanation.
Good point and it's actually worse than that : the thinking tokens aren't affected by this at all (the model still reasons normally internally). Only the visible output that gets compressed into caveman... and maybe the model actually need more thinking tokens to figure out how to rephrase its answer into caveman style
Grug says you can tune how much each model thinks. Is not caveman but similar. also thinking is trained with RL so tends to be efficient, less fluffy. Also model (as seen locally) always drafts answer inside thinking then output repeats, change to caveman is not really extra effort.
Oh boy. Someone didn't get the memo that for LLMs, tokens are units of thinking. I.e. whatever feat of computation needs to happen to produce results you seek, it needs to fit in the tokens the LLM produces. Being a finite system, there's only so much computation the LLM internal structure can do per token, so the more you force the model to be concise, the more difficult the task becomes for it - worst case, you can guarantee not to get a good answer because it requires more computation than possible with the tokens produced.
I.e. by demanding the model to be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
Yeah but not all tokens are created equal. Some tokens are hard to predict and thus encode useful information; some are highly predictable and therefore don't. Spending an entire forward pass through the token-generation machine just to generate a very low-entropy token like "is" is wasteful. The LLM doesn't get to "remember" that thinking, it just gets to see a trivial grammar-filling token that a very dumb LLM could just as easily have made. They aren't stenographically hiding useful computation state in words like "the" and "and".
> They aren't stenographically hiding useful computation state in words like "the" and "and".
Do you know that is true? These aren’t just tokens, they’re tokens with specific position encodings preceded by specific context. The position as a whole is a lot richer than you make it out to be. I think this is probably an unanswered empirical question, unless you’ve read otherwise.
"Interesting idea! Token consumption sure is an issue that should be addressed, and this is pretty funny too!
However, I happen to have an unproven claim that tokens are units of thinking, and therefore, reducing the token count might actually reduce the model's capabilities. Did anybody using this by chance notice any degradation (since I did not bother to check myself)?"
First that scratchpads matter, then why they matter, then that they don’t even need to be meaningful tokens, then a conceptual framework for the whole thing.
I dont’t see the relevance, the discussion is over whether boilerplate text that occurs intermittently in the output purely for the sake of linguistic correctness/sounding professional is of any benefit. Chain of thought doesn’t look like that to begin with, it’s a contiguous block of text.
To boil it down: chain of thought isn’t really chain of thought, it’s just more token generation output to the context. The tokens are participating in computations in subsequent forward passes that are doing things we don’t see or even understand. More LLM generated context matters.
That is not how CoT works. It is all in context. All influenced by context. This is a common and significant misunderstanding of autoregressive models and I see it on HN a lot.
That "unproven claim" is actually a well-established concept called Chain of Thought (CoT). LLMs literally use intermediate tokens to "think" through problems step by step. They have to generate tokens to talk to themselves, debug, and plan. Forcing them to skip that process by cutting tokens, like making them talk in caveman speak, directly restricts their ability to reason.
the fact that more tokens = more smart should be expected given cot / thinking / other techniques that increase the model accuracy by using more tokens.
Did you test that ""caveman mode"" has similar performance to the ""normal"" model?
That is part of it. They are also trained to think in very well mapped areas of their model. All the RHLF, etc. tuned on their CoT and user feedback of responses.
Actually you'd trivially disprove that claim if you're starting from mechanistic knowledge of how orbits work, like how we have mechanistic knowledge of how LLMs work.
You have empirical observations, like replicating a fixed set of inner layers to make it think longer, or that you seem to have encode and decode layers. But exactly why those layers are the way they are, how they come together for emergent behaviour... Do we have mechanistic knowledge of that?
Though the above exchange felt a tiny bit snarky, I think the conversation did get more interesting as it went on. I genuinely think both people could probably gain by talking more -- or at least figuring out a way to move fast the surface level differences. Yes, humans designed LLMs. But this doesn't mean we understand their implications even at this (relatively simple) level.
> Can't you know that tokens are units of thinking just by... like... thinking about how models work?
Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?
Right, there's probably something more subtle like "semantic density within tokens is how models think"
So it's probably true that the "Great question!---" type preambles are not helpful, but that there's definitely a lower bound on exactly how primitive of a caveman language we're pushing toward.
> cutting ~75% of tokens while keeping full technical accuracy.
I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.
An explanation that explains nothing is not very interesting.
Sorry I don't know how engaging in this could lead to anything productive. There's already literature out there that gives credence to TeMPOraL claim. And, after a certain point, gravity being the reason that things fall becomes so self evident that every re-statements doesnt not require proof.
> Nobody has to proof anything. It can give your claim credibility
“I don’t need to provide proof to say things” is a valueless, trivial assertion that adds no value whatsoever to any discussion anyone has ever had.
If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.
In the age of vibe coding and that we are literally talking about a single markdown file I am sure this has been well tested and achieves all of its goals with statistical accuracy, no side effects with no issues.
> I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.
But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.
I’ve heard this, I don’t automatically believe it nor do I understand why it would need to be true, I’m still caught on the old fashioned idea that the only “thinking” for autoregressive modes happens during training.
But I assume this has been studied? Can anyone point to papers that show it? I’d particularly like to know what the curves look like, it’s clearly not linear, so if you cut out 75% or tokens what do you expect to lose?
I do imagine there is not a lot of caveman speak in the training data so results may be worse because they don’t fit the same patterns that have been reinforcement learned in.
I have seen a paper though I can’t find it right now on asking your prompt and expert language produces better results than layman language. The idea of being that the answers that are actually correct will probably be closer to where people who are expert are speaking about it so the training data will associate those two things closer to each other versus Lyman talking about stuff and getting it wrong.
Yeah, I don't think that "I'd be happy to help you with that" or "Sure, let me take a look at that for you" carries much useful signal that can be used for the next tokens.
There is a study that shows that what the model is doing behind the scenes in those cases is a lot more than just outputting those tokens.
For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.
Clearly there is an optimal for each task (not necessarily a global one) and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.
High dimensional vectors are thought (insofar as you can define what that even means). Tokens are one dimensional input that navigates the thought, and output that renders the thought. The "thinking" takes place in the high dimension space, not the one dimensional stream of tokens.
But isn't the one dimensional tokens a reflex of high dimensional space? What you see is "sure let's take a look at that" but behind the curtains it's actually an indication that it's searching a very specific latent space which might be radically different if those tokens didn't exist. Or not. In any case, you can't just make that claim and isolate those two processes. They might be totally unrelated but they also might be tightly interconnected.
I assume in practice, filler words do nothing of value. When words add or mean nothing (their weights are basically 0 in relation to the subject), I don't see why they'd affect what the model outputs (except cause more filler words)?
Politeness have impact (https://arxiv.org/abs/2402.14531) so I wouldn't be too fast to make any kind of claim with a technology we don't know exactly how it works.
They carry information in regular human communication, so I'm genuinely curious why you'd think they would not when an LLM outputs them as part of the process of responding to a message.
I agree with this take in general, but I think we need to be prepared for nuance when thinking about these things.
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of coming to a wrong answer when their "gut" response would have been better. I do not content that this is the default mode, but that it is both possible, and that it's more or less likely on one kind of problem than another, problem categories to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
That was my first thought too -- instead of talk like a caveman you could turn off reasoning, with probably better results.
Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
So unless the LLM was trained otherwise, making it talk like a caveman is more than just theoretically turning it into a caveman.
It is text prediction. But to predict text, other things follow that need to be calculated. If you can step back just a minute, i can provide a very simple but adjacent idea that might help to intuit the complexity of “ text prediction “ .
I have a list of numbers, 0 to9, and the + , = operators. I will train my model on this dataset, except the model won’t get the list, they will get a bunch of addition problems. A lot. But every addition problem possible inside that space will not be represented, not by a long shot, and neither will every number. but still, the model will be able to solve any math problem you can form with those symbols.
It’s just predicting symbols, but to do so it had to internalize the concepts.
There was a paper recently that demonstrated that you can input different human languages and the middle layers of the model end up operating on the same probabilistic vectors. It's just the encoding/decoding layers that appear to do the language management.
So the conclusion was that these middle layers have their own language and it's converting the text into this language and this decoding it. It explains why sometime the models switch to chinese when they have a lot of chinese language inputs, etc.
Pretty obvious when you think that neural networks operate with numbers and very complex formulas (by combining several simple formulas with various weights). You can map a lot of things to number (words, colors, music notes,…) but that does not means the NN is going to provide useful results.
after you go from from millions of params to billions+ models start to get weird (depending on training) just look at any number of interpretability research papers. Anthropic has some good ones.
You obviously do not speak other languages. Other cultures have different constrains and different grammar.
For example thinking in modern US English generates many thoughts, to keep correct speak at right cultural context (there is only one correct way to say People Of Color, and it changes every year, any typo makes it horribly wrong).
Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.
It is well proven that thinking in Chinese needs far less tokens!
With this caveman mod you strip out most of cultural complexities of anglosphere, make it easier for foreigners and far simpler to digest.
>Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.
Really?
Because if one accepts that computer languages are languages, then it seems that we could identify one or two that are highly specialized in logical conditions etc. Prolog springs to mind.
We have already proven that all the computing mechanism that those languages derive their semantic forms are equivalent to the Turing Machine. So C and Prolog are only different in terms of notations, not in terms of result.
Yes, really. The concept GP is alluding to is called the Sapir-Worf hypothesis, which is largely non scientific pop linguistics drivel. Elements of a much weaker version have some scientific merit.
Programming languages are not languages in the human brain nor the culture sense.
A fundamental (but sadly common) error behind “tokens are units of thinking” is antropomorphising the model as a thinking being. That’s a pretty wild claim that requires a lot of proof, and possibly solving the hard problem, before it can be taken seriously.
There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.
Most of us probably have an intuition that the more you give an autocomplete, the better results it will yield. However, does this extend to output of the autocomplete—i.e. the more tokens it uses for the result, the better?
It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.
Willing to be corrected by someone more familiar with NN architecture, of course.
[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.
IMO "thinking" here means "computation", like running matrix multiplications. Another view could be: "thinking" means "producing tokens". This doesn't require any proof because it's literally what the models do.
As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.
CoT token are usually controled via 'extended thinking' or 'adapted thinking'. CoT tokens are usually not affected by the system prompt. There is an effort parameter, though, which states to have an effect on accuracy for over all token consumption.
This helps, but the original prompt is still there. The system prompt is still influencing these thinking blocks. They just don’t end up clogging up your context. The system prompt sits at the very top of the context hierarchy. Even with isolated "thinking" blocks, the reasoning tokens are still autoregressively conditioned on the system instructions. If the system prompt forces "caveman speak" the model's attention mechanisms are immediately biased toward simpler, less coherent latent spaces. You are handicapping the vocabulary and syntax it uses inside its own thinking process, which directly throttles its ability to execute high-level logic.
If this is true, shouldn't LLMs perform way worse when working in Chinese than in English? Seems like an easy thing to study since there are so many Chinese LLMs that can work in both Cbinese and English.
Do LLMs generally perform better in verbose languages than they do in concise ones?
That's going to depend on what model you're using with Claude Code. All of the more recent Anthropic models (4.5 and 4.6) support thinking, so the number of tokens generated ("units of thought") isn't directly tied to the verbosity of input and non-thought output.
However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.
It's a bit like asking an LLM to predict next move in a chess game - it's not going to predict the best move that it can, but rather predict the next move that would be played given what it can infer about the ELO rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.
Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.
I wonder if a language like Latin would be useful.
It's a significantly much succinct semantic encoding than English while being able to express all the same concepts, since it encodes a lot of glue words into the grammar of the language, and conventionally lets you drop many pronouns.
e.g.
"I would have walked home, but it seemed like it was going to rain" (14 words) ->
"Domum ambulavissem, sed pluiturum esse videbatur" (6 words).
Do you know of evals with default Claude vs caveman Claude vs politician Claude solving the same tasks? Hypothesis is plausible, but I wouldn’t take it for granted
When it comes to LLM you really cannot draw conclusions from first principles like this. Yes, it sounds reasonable. And things in reality aren't always reasonable.
I remember a while back they found that replacing reasoning tokens with placeholders ("....") also boosted results on benchies.
But does talk like caveman make number go down? Less token = less think?
I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?
Grug says you quite right, token unit thinking, but empty words not real thinking and should avoid. Instead must think problem step by step with good impactful words.
It's not "units of thinking" its "units of reference"; as long as what it produces references the necessary probabilistic algorithms, itll do just fine.
They’re able to solve complex, unstructured problems independently. They can express themselves in every major human language fluently. Sure, they don’t actually have a brain like we do, but they emulate it pretty well. What’s your definition of thinking?
When OP wrote about LLMs "thinking" he implied that they have an internal conceptual self-reflecting state. Which they don't, they *are* merely next token predicting statistical machines.
> Someone didn't get the memo that for LLMs, tokens are units of thinking.
Where do you get this memo ? Seems completely wrong to me. More computation does not translate to more "thinking" if you compute the wrong things (ie things that contribute significantly to the final sentence meaning).
That’s why you need filler words that contribute little to the sentence meaning but give it a chance to compute/think. This is part of why humans do the same when speaking.
Do you have any evidence at all of this? I know how LLMs are trained and this makes no sense to me. Otherwise you'd just put filler words in every input
e.g. instead of: "The square root of 256 is" you'd enter "errr The er square um root errr of 256 errr is" and it would miraculously get better? The model can't differentiate between words you entered and words it generated its self...
It's why it starts with "You're absolutely right!" It's not to flatter the user. It's a cheap way to guide the response in a space where it's utilizing the correction.
I disagree with this method and would discourage others from using it too, especially if accuracy, faster responses, and saving money are your priorities.
This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.
To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.
Okay, I like how it reduces token usage, but it kind of feels that, it will reduce the overall model intelligence. LLMs are probabilistic models, and you are basically playing with their priors.
If you take meaningless tokens (that do not contribute to subject focus), I don't see what you would lose. But as this takes out a lot of contextual info as well, I would think it might be detrimental.
There’s a lot of debate about whether this reduces model accuracy, but this is basically Chinese grammar and Chinese vibe coding seems to work fine while (supposedly) using 30-40% less tokens
everyone who thinks this is a costly or bad idea is looking past a very salient finding: code doesn't need much language. sure, other things might need lots of language, but code does not. code is already basically language, just a really weird one. we call them programming languages. they're not human languages. they're languages of the machine. condensing the human-language---machine-language interface, good.
if goal make code, few word better. if goal make insight, more word better. depend on task. machine linear, mind not. consider LLM "thinking" is just edge-weights. if can set edge-weights into same setting with fewer tokens, you are winning.
JOOK like when machine say facts. Machine and facts are friends. Numbers and names and “probably things” are all friends with machine.
JOOK no like when machine likes things. Maybe double standard. But forever machines do without like and without love. New like and love updates changing all the time. Makes JOOK question machine watching out for JOOK or watching out for machine.
JOOK like and love enough for himself and for machine too..
If this really works there would seem to be a lot of alpha in running the expensive model in something like caveman mode, and then "decompressing" into normal mode with a cheap model.
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
But it can't, we see models get larger and larger and larger models perform better. <Thinking> made such huge improvements, because it makes more text for the language model to process. Cavemanising (lossy compression) the output does it to the input as well.
So, if this does help reduce the cost of tokens, why not go even further and shorten the syntax with specific keywords, symbols and patterns, to reduce the noise and only keep information, almost like...a programming language?
I don't know their internal eval, but I think I have heard it does not hurt or improve performance. But at least this parameter may affect how many comments are in the code.
There's linguistic term for this kind of speech: isolating grammars, which don't decline words and use high context and the bare minimum of words to get the meaning across. Chinese is such a language btw. Don't know what Chinese think about their language being regarded as cavemen language...
The fact whether a language is isolating, or not, is independent on the redundancy of the language.
All languages must have means for marking the syntactic roles of the words in a sentence.
The roles may be marked with prepositions or postpositions in isolating languages, or with declensions in fusional languages, or there may be no explicit markers when the word order is fixed (i.e. the same distinction as between positional arguments and arguments marked by keywords, in programming languages). The most laconic method for both programming languages and natural languages is to have a default word order where role markers are omitted, but to also allow any other word order if role markers are present.
Besides the mandatory means for marking syntactic roles, many languages have features that add redundancy without being necessary for understanding, i.e. which repeat already known information, for instance by repeating the information about gender and number that is attached to a noun also besides all its attributes. Whether a language requires redundancy or not is independent on whether it is an isolating language or a fusional language.
English has somewhat less syntactic role markers than other languages because it has a rigid word order, but for the other roles than the most frequent roles (agent, patient, beneficiary) it has a lot of prepositions.
Despite being more economic in role markers, English also has many redundant words that could be omitted, e.g. subjects or copulative verbs that are omitted in many languages. Thus for English it is possible to speak "like a caveman" without losing much information, but this is independent of the fact that modern English is a mostly isolating language with few remnants of its old declensions.
I think this could be very useful not when we talk to the agent, but when the agents talk back to us. Usually, they generate so much text that it becomes impossible to follow through. If we receive short, focused messages, the interaction will be much more efficient. This should be true for all conversational agents, not only coding agents.
Not specifically about your case, but some people are usually just more verbose than others and tend to say the same thing more than once, or perhaps haven't found a clear way of articulating their thoughts down to fewer words.
I think the sentiment here is that the short formulation of Kant's categorical imperative is as good and easier to read than the entirety of "types of ethical theory" (J.J. Martineau).
I find LLM slop much harder to read than normal human text.
I can't really explain it, it's just a feeling.
The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what is the text about anymore, I just want to lay down and rest.
Great idea- if the person who made it is reading: Is this based on the board game „poetry for cavemen“? (Explain things using only single-syllable words, comes even with an inflatable log of wood for hitting each other!)
> One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
You can also make huge spelling mistakes and use incomplete words with llms they just sem to know better than any spl chk wht you mean. I use such speak to cut my time spent typing to them.
It doesn't go letter by letter, so not with current tokenizers.
There will likely be some internal reasoning going "I wonder if the user meant spell check, I'm gonna go with that one".
And it'll also bias the reasoning and output to internet speak instead of what you'd usually want, such as code or scientific jargon, which used to decrease output quality. I'm not sure if it still does
It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.
Pruning an assistant's response like that would break prompt caching.
Prompt caching is probably the single most important thing that people building harnesses think about and yet it's mind share in end users is virtually zero. If you had to think of all the weirdest, most seemingly baffling design decisions in an AI product, the answer to "why" is probably "to not break prompt caching".
i imagine they're doing superman level distributed compute across multiple clouds somewhere and cared more about delivering the final result of that than having the ability to pause. which is probably possible, but would require way more work than would be worthwhile. they probably thought the ability to stop and resubmit would be an adequate substitute.
I speak German, Polish, and English fluently and my take is: German is very precise, almost mathematical, there is little room to be misunderstood. But it also requires the most letters. English is the quickest, get things done kind of language, very compressible , but also risks misunderstanding. Polish is the most fun, with endless possibilities of twisting and bending it's structures, but also lacking the ease of use of English or the precision of German. But it's clearly just my subjective take
I tried this with early ChatGPT. Asked it to answer telegram style with as few tokens as possible. It is also interesting to ask it for jokes in this mode.
While really useful now, I'm afraid that in the long run it might accelerate the language atrophy that is already happening. I still remember that people used to enter full questions in Google and write SMS with capital letters, commas and periods.
> I still remember that people used to enter full questions in Google
I think that, in the early days of internet search, entering full questions actually produced worse results than just a bunch of keywords or short phrases.
So it was a sign of a "noob", rather than a mark of sophistication and literacy.
APL for talking to LLM when? Also, this reminded me of that episode from The Office where Kevin started talking like a caveman to make communication efficient.
Unfrozen caveman lawyer here. Did "talk like caveman" make code more bad? Make unsubst... (AARG) FAKE claims? You deserve compen... AAARG ... money. AMA.
This is exactly what annoys me most. English is not suitable for computer-human interaction. We should create new programming and query languages for that. We are again in cobol mindset. LLM are not humans and we should stop talking to them as if they are.
Oh, another new trend! I love these home-brewed LLM optimizers. They start with XML, then JSON, then something totally different. The author conveniently ignores the system prompt that works for everything, and the extra inference work. So, it's only worth using if you just like this response style, just my two cents. All the real optimizations happen during model training and in the infrastructure itself.
LOL it actually reads how humans reply the name is too clever :').
Not sure how effective it will be to dirve down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.
I didn’t comment on this when I saw it on threads/twitter. But it made it to HN, surprisingly.
I have a feeling these same people will complain “my model is so dumb!”. There’s a reason why Claude had that “you’re absolutely right!” for a while. Or codex’s “you’re right to push on this”.
We’re basically just gaslighting GPUs. That wall of text is kinda needed right now.
grug have to use big brains' thinking machine these days, or no shiny rock. complexity demon love thinking machine. grug appreciate attempt to make thinking machine talk on grug level, maybe it help keep complexity demon away.
I was actually worried about high token costs while building my own project (infra bundle generator), and this gave me a good laugh + some solid ideas. 75% reduction is insane. Starred
Author here. A few people are arguing against a stronger claim than the repo is meant to make. As well, this was very much intended to be a joke and not research level commentary.
This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.
What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. Therefore, since post-completion output is “cavemanned” the code hasn’t been affected by the skill at all :)
Also surprising to hear so little faith in RL. Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.
The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.
Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
So the real eval is end-to-end: - total input tokens - total output tokens - latency - quality/task success
There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)
So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.
This is fun. I'd like to see the same idea but oriented for richer tokens instead of simpler tokens. If you want to spend less tokens, then spend the 'good' ones. So, instead of saying 'make good' you could say 'improve idiomatically' or something. Depends on one's needs. I try to imagine every single token as an opportunity to bend/expand/limit the geometries I have access to. Language is a beautiful modulator to apply to reality, so I'll wager applying it with pedantic finesse will bring finer outputs than brutish humphs of cavemen. But let's see the benchmarks!
I'm reminded by the caveman skill of the clipped writing style used in telegrams, and your post further reminded me of "standard" books of telegram abbreviations. Take a look at [0]; could we train models to use this kind of code and then decode it in the browser? These are "rich" tokens (they succinctly carry a lot of information).
[0] https://books.google.com/books?id=VO4OAAAAYAAJ&pg=PA464#v=on...
Idk I try talk like cavemen to claude. Claude seems answer less good. We have more misunderstandings. Feel like sometimes need more words in total to explain previous instructions. Also less context is more damage if typo. Who agrees? Could be just feeling I have. I often ad fluff. Feels like better result from LLM. Me think LLM also get less thinking and less info from own previous replies if talk like caveman.
Yes because in most contexts it has seen "caveman" talk the conversations haven't been about rigorously explained maths/science/computing/etc... so it is less likely to predict that output.
Fluff adds probable likeness. Probablelikeness brings in more stuff. More stuff can be good. More stuff can poison.
Cute idea, but you're never gonna blow your token budget on output. Input tokens are the bottleneck, because the agent's ingesting swathes of skills, directory trees, code files, tool outputs, etc. The output is generally a few hundred lines of code and a bit of natural language explanation.
Good point and it's actually worse than that : the thinking tokens aren't affected by this at all (the model still reasons normally internally). Only the visible output that gets compressed into caveman... and maybe the model actually need more thinking tokens to figure out how to rephrase its answer into caveman style
Grug says you can tune how much each model thinks. Is not caveman but similar. also thinking is trained with RL so tends to be efficient, less fluffy. Also model (as seen locally) always drafts answer inside thinking then output repeats, change to caveman is not really extra effort.
Oh boy. Someone didn't get the memo that for LLMs, tokens are units of thinking. I.e. whatever feat of computation needs to happen to produce results you seek, it needs to fit in the tokens the LLM produces. Being a finite system, there's only so much computation the LLM internal structure can do per token, so the more you force the model to be concise, the more difficult the task becomes for it - worst case, you can guarantee not to get a good answer because it requires more computation than possible with the tokens produced.
I.e. by demanding the model to be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
Yeah but not all tokens are created equal. Some tokens are hard to predict and thus encode useful information; some are highly predictable and therefore don't. Spending an entire forward pass through the token-generation machine just to generate a very low-entropy token like "is" is wasteful. The LLM doesn't get to "remember" that thinking, it just gets to see a trivial grammar-filling token that a very dumb LLM could just as easily have made. They aren't stenographically hiding useful computation state in words like "the" and "and".
> They aren't stenographically hiding useful computation state in words like "the" and "and".
Do you know that is true? These aren’t just tokens, they’re tokens with specific position encodings preceded by specific context. The position as a whole is a lot richer than you make it out to be. I think this is probably an unanswered empirical question, unless you’ve read otherwise.
Let me rephrase that for you:
"Interesting idea! Token consumption sure is an issue that should be addressed, and this is pretty funny too! However, I happen to have an unproven claim that tokens are units of thinking, and therefore, reducing the token count might actually reduce the model's capabilities. Did anybody using this by chance notice any degradation (since I did not bother to check myself)?"
Have a nice day!
Let’s see, I think these pretty much map out a little chronology of the research:
https://arxiv.org/abs/2112.00114 https://arxiv.org/abs/2406.06467 https://arxiv.org/abs/2404.15758 https://arxiv.org/abs/2512.12777
First that scratchpads matter, then why they matter, then that they don’t even need to be meaningful tokens, then a conceptual framework for the whole thing.
I dont’t see the relevance, the discussion is over whether boilerplate text that occurs intermittently in the output purely for the sake of linguistic correctness/sounding professional is of any benefit. Chain of thought doesn’t look like that to begin with, it’s a contiguous block of text.
To boil it down: chain of thought isn’t really chain of thought, it’s just more token generation output to the context. The tokens are participating in computations in subsequent forward passes that are doing things we don’t see or even understand. More LLM generated context matters.
That is not how CoT works. It is all in context. All influenced by context. This is a common and significant misunderstanding of autoregressive models and I see it on HN a lot.
I don't see the relevance -- and casually dismiss years of researches without even trying to read those paper.
That "unproven claim" is actually a well-established concept called Chain of Thought (CoT). LLMs literally use intermediate tokens to "think" through problems step by step. They have to generate tokens to talk to themselves, debug, and plan. Forcing them to skip that process by cutting tokens, like making them talk in caveman speak, directly restricts their ability to reason.
the fact that more tokens = more smart should be expected given cot / thinking / other techniques that increase the model accuracy by using more tokens.
Did you test that ""caveman mode"" has similar performance to the ""normal"" model?
That is part of it. They are also trained to think in very well mapped areas of their model. All the RHLF, etc. tuned on their CoT and user feedback of responses.
Yes but: If the amount is fixed, then the density matters.
A lot of communication is just mentioning the concepts.
Looking at the skill.md wouldn’t this actually increase token use since the model now needs to reformat its output?
Funny idea though. And I’d like to see a more matter-of-fact output from Claude.
No, let me rephrase it for you. “tokens used for think. Short makes model dumb”
Talk a lot not same as smart
Think before talk better though
Think makes smart. But think right words makes smarter, not think more words. Smart is elucidate structure and relationships with right words.
Can't you know that tokens are units of thinking just by... like... thinking about how models work?
Can't you just know that the earth is the center of the world by... like... just looking at how the world works?
Actually you'd trivially disprove that claim if you're starting from mechanistic knowledge of how orbits work, like how we have mechanistic knowledge of how LLMs work.
You have empirical observations, like replicating a fixed set of inner layers to make it think longer, or that you seem to have encode and decode layers. But exactly why those layers are the way they are, how they come together for emergent behaviour... Do we have mechanistic knowledge of that?
Though the above exchange felt a tiny bit snarky, I think the conversation did get more interesting as it went on. I genuinely think both people could probably gain by talking more -- or at least figuring out a way to move fast the surface level differences. Yes, humans designed LLMs. But this doesn't mean we understand their implications even at this (relatively simple) level.
> Can't you know that tokens are units of thinking just by... like... thinking about how models work?
Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?
Right, there's probably something more subtle like "semantic density within tokens is how models think"
So it's probably true that the "Great question!---" type preambles are not helpful, but that there's definitely a lower bound on exactly how primitive of a caveman language we're pushing toward.
What do you mean? The page explicitly states:
> cutting ~75% of tokens while keeping full technical accuracy.
I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.
An explanation that explains nothing is not very interesting.
The burden of proof is on the author to provide at least one type of eval for making that claim.
I notice that the number of people confidently talking about "burden of proof" and whose it allegedly is in the context of AI has gone up sharply.
Nobody has to proof anything. It can give your claim credibility. If you don't provide any, an opposing claim without proof does not get any better.
Sorry I don't know how engaging in this could lead to anything productive. There's already literature out there that gives credence to TeMPOraL claim. And, after a certain point, gravity being the reason that things fall becomes so self evident that every re-statements doesnt not require proof.
LLM quirks are not something all humans have been experiencing for thousands of years
> Nobody has to proof anything. It can give your claim credibility
“I don’t need to provide proof to say things” is a valueless, trivial assertion that adds no value whatsoever to any discussion anyone has ever had.
If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.
The author pretended they addressed the obvious criticism.
You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.
In the age of vibe coding and that we are literally talking about a single markdown file I am sure this has been well tested and achieves all of its goals with statistical accuracy, no side effects with no issues.
> I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.
But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.
I’ve heard this, I don’t automatically believe it nor do I understand why it would need to be true, I’m still caught on the old fashioned idea that the only “thinking” for autoregressive modes happens during training.
But I assume this has been studied? Can anyone point to papers that show it? I’d particularly like to know what the curves look like, it’s clearly not linear, so if you cut out 75% or tokens what do you expect to lose?
I do imagine there is not a lot of caveman speak in the training data so results may be worse because they don’t fit the same patterns that have been reinforcement learned in.
I have seen a paper though I can’t find it right now on asking your prompt and expert language produces better results than layman language. The idea of being that the answers that are actually correct will probably be closer to where people who are expert are speaking about it so the training data will associate those two things closer to each other versus Lyman talking about stuff and getting it wrong.
Yeah, I don't think that "I'd be happy to help you with that" or "Sure, let me take a look at that for you" carries much useful signal that can be used for the next tokens.
You'd be surprised -- This could match on the model's training to proceed using a tool, for example.
There is a study that shows that what the model is doing behind the scenes in those cases is a lot more than just outputting those tokens.
For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.
Clearly there is an optimal for each task (not necessarily a global one) and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.
> For an LLM, tokens are thought. They have no ability to think
This is so funny
Did you stop reading because you saw a comma?
No
High dimensional vectors are thought (insofar as you can define what that even means). Tokens are one dimensional input that navigates the thought, and output that renders the thought. The "thinking" takes place in the high dimension space, not the one dimensional stream of tokens.
But isn't the one dimensional tokens a reflex of high dimensional space? What you see is "sure let's take a look at that" but behind the curtains it's actually an indication that it's searching a very specific latent space which might be radically different if those tokens didn't exist. Or not. In any case, you can't just make that claim and isolate those two processes. They might be totally unrelated but they also might be tightly interconnected.
I assume in practice, filler words do nothing of value. When words add or mean nothing (their weights are basically 0 in relation to the subject), I don't see why they'd affect what the model outputs (except cause more filler words)?
Politeness have impact (https://arxiv.org/abs/2402.14531) so I wouldn't be too fast to make any kind of claim with a technology we don't know exactly how it works.
They carry information in regular human communication, so I'm genuinely curious why you'd think they would not when an LLM outputs them as part of the process of responding to a message.
This is condescending and wrong at the same time (best combo).
LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.
Are you sure about that? Chain of thought does not need to be semantically useful to improve LLM performance. https://arxiv.org/abs/2404.15758
still doesn't mean all tokens are useful. it's the point of benchmarks
Care to share the benchmarks backing the claims in this repo?
I agree with this take in general, but I think we need to be prepared for nuance when thinking about these things.
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of coming to a wrong answer when their "gut" response would have been better. I do not content that this is the default mode, but that it is both possible, and that it's more or less likely on one kind of problem than another, problem categories to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
The solution to that is turning off thinking mode or reducing thinking budget.
That was my first thought too -- instead of talk like a caveman you could turn off reasoning, with probably better results.
Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
So unless the LLM was trained otherwise, making it talk like a caveman is more than just theoretically turning it into a caveman.
> much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
What do you mean by that? It’s literally text prediction, isn’t it?
It is text prediction. But to predict text, other things follow that need to be calculated. If you can step back just a minute, i can provide a very simple but adjacent idea that might help to intuit the complexity of “ text prediction “ .
I have a list of numbers, 0 to9, and the + , = operators. I will train my model on this dataset, except the model won’t get the list, they will get a bunch of addition problems. A lot. But every addition problem possible inside that space will not be represented, not by a long shot, and neither will every number. but still, the model will be able to solve any math problem you can form with those symbols.
It’s just predicting symbols, but to do so it had to internalize the concepts.
There was a paper recently that demonstrated that you can input different human languages and the middle layers of the model end up operating on the same probabilistic vectors. It's just the encoding/decoding layers that appear to do the language management.
So the conclusion was that these middle layers have their own language and it's converting the text into this language and this decoding it. It explains why sometime the models switch to chinese when they have a lot of chinese language inputs, etc.
Ok — that sounds more like a theory rather than an open-and-shut causal explanation, but I’ll read the paper.
You’re a literature cycle behind. ‘Middle-layer shared representations exist’ is the observed phenomenon; ‘why exactly they form’ is the theory.
You are also confusing ‘mechanistic explanation still incomplete’ with ‘empirical phenomenon unestablished.’ Those are not the same thing.
PS. Em dash? So you are some LLM bot trying to bait mine HN for reasoning traces? :D
Pretty obvious when you think that neural networks operate with numbers and very complex formulas (by combining several simple formulas with various weights). You can map a lot of things to number (words, colors, music notes,…) but that does not means the NN is going to provide useful results.
>It’s literally text prediction, isn’t it?
you are discovering that the favorite luddite argument is bullshit
I don't consider these researchers luddites.
https://machinelearning.apple.com/research/illusion-of-think...
https://arxiv.org/abs/2508.01191
Feel free to elucidate if you want to add anything to this thread other than vibes.
after you go from from millions of params to billions+ models start to get weird (depending on training) just look at any number of interpretability research papers. Anthropic has some good ones.
Interesting, what kind of weird?
> things start to get weird
> just look at research papers
You didn't add anything other than vibes either.
Getting weird doesn’t mean calling it text prediction is actually ‘bullshit’? Text prediction isn’t pejorative…
> instead of talk like a caveman you could turn off reasoning, with probably better results
This is not how the feature called "reasoning" work in current models.
"reasoning" simply let's the model output and then consume some "thinking" tokens before generating the actual output.
All the "fluff" tokens in the output have absolutely nothing to do with "reasoning".
You obviously do not speak other languages. Other cultures have different constrains and different grammar.
For example thinking in modern US English generates many thoughts, to keep correct speak at right cultural context (there is only one correct way to say People Of Color, and it changes every year, any typo makes it horribly wrong).
Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.
It is well proven that thinking in Chinese needs far less tokens!
With this caveman mod you strip out most of cultural complexities of anglosphere, make it easier for foreigners and far simpler to digest.
>Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.
This is simply not true.
Well, just take varous english dialects you probably know, there are wast differences. Some strange languages do not even have numbers or recursion.
It is very arrogant to assume, no other language can be more advanced than English.
Really? Because if one accepts that computer languages are languages, then it seems that we could identify one or two that are highly specialized in logical conditions etc. Prolog springs to mind.
We have already proven that all the computing mechanism that those languages derive their semantic forms are equivalent to the Turing Machine. So C and Prolog are only different in terms of notations, not in terms of result.
Yes, really. The concept GP is alluding to is called the Sapir-Worf hypothesis, which is largely non scientific pop linguistics drivel. Elements of a much weaker version have some scientific merit.
Programming languages are not languages in the human brain nor the culture sense.
A fundamental (but sadly common) error behind “tokens are units of thinking” is antropomorphising the model as a thinking being. That’s a pretty wild claim that requires a lot of proof, and possibly solving the hard problem, before it can be taken seriously.
There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.
Most of us probably have an intuition that the more you give an autocomplete, the better results it will yield. However, does this extend to output of the autocomplete—i.e. the more tokens it uses for the result, the better?
It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.
Willing to be corrected by someone more familiar with NN architecture, of course.
[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.
IMO "thinking" here means "computation", like running matrix multiplications. Another view could be: "thinking" means "producing tokens". This doesn't require any proof because it's literally what the models do.
As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.
CoT token are usually controled via 'extended thinking' or 'adapted thinking'. CoT tokens are usually not affected by the system prompt. There is an effort parameter, though, which states to have an effect on accuracy for over all token consumption.
https://platform.claude.com/docs/en/build-with-claude/extend...
This helps, but the original prompt is still there. The system prompt is still influencing these thinking blocks. They just don’t end up clogging up your context. The system prompt sits at the very top of the context hierarchy. Even with isolated "thinking" blocks, the reasoning tokens are still autoregressively conditioned on the system instructions. If the system prompt forces "caveman speak" the model's attention mechanisms are immediately biased toward simpler, less coherent latent spaces. You are handicapping the vocabulary and syntax it uses inside its own thinking process, which directly throttles its ability to execute high-level logic.
Nothing on that page indicates otherwise.
If this is true, shouldn't LLMs perform way worse when working in Chinese than in English? Seems like an easy thing to study since there are so many Chinese LLMs that can work in both Cbinese and English.
Do LLMs generally perform better in verbose languages than they do in concise ones?
That's going to depend on what model you're using with Claude Code. All of the more recent Anthropic models (4.5 and 4.6) support thinking, so the number of tokens generated ("units of thought") isn't directly tied to the verbosity of input and non-thought output.
However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.
It's a bit like asking an LLM to predict next move in a chess game - it's not going to predict the best move that it can, but rather predict the next move that would be played given what it can infer about the ELO rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.
Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.
Ah so obviously making the LLM repeat itself three times for every response it will get smarter
I wonder if a language like Latin would be useful.
It's a significantly much succinct semantic encoding than English while being able to express all the same concepts, since it encodes a lot of glue words into the grammar of the language, and conventionally lets you drop many pronouns.
e.g.
"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).
I think speculative decoding eliminates a lot of the savings people imagine they're getting from making LLMs use strange languages.
Words <> tokens
Do you know of evals with default Claude vs caveman Claude vs politician Claude solving the same tasks? Hypothesis is plausible, but I wouldn’t take it for granted
You are absolutely right! That is exactly the reason why more lines of code always produce a better program. Straight on, m8!
This might be not so far from the truth, if you count total loc written and rewritten during the development cycle, not just the final number.
Not everybody is Dijkstra.
When it comes to LLM you really cannot draw conclusions from first principles like this. Yes, it sounds reasonable. And things in reality aren't always reasonable.
Benchmark or nothing.
There have been papers about introducing thinking tokens in intermediary layers that get stripped from the output.
IIUC this doesn't make the LLM think in caveman (thinking tokens). It just makes the final output show in caveman.
I remember a while back they found that replacing reasoning tokens with placeholders ("....") also boosted results on benchies.
But does talk like caveman make number go down? Less token = less think?
I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?
Grug says you quite right, token unit thinking, but empty words not real thinking and should avoid. Instead must think problem step by step with good impactful words.
How do we know if a token sits at an abstract level or just the textual level ?
You mention thinking tokens as a side note, but their existence invalidates your whole point. Virtually all modern LLMs use thinking tokens.
It's not "units of thinking" its "units of reference"; as long as what it produces references the necessary probabilistic algorithms, itll do just fine.
LLMs don't think at all.
Forcing it to be concise doesn't work because it wasn't trained on token strings that short.
They’re able to solve complex, unstructured problems independently. They can express themselves in every major human language fluently. Sure, they don’t actually have a brain like we do, but they emulate it pretty well. What’s your definition of thinking?
When OP wrote about LLMs "thinking" he implied that they have an internal conceptual self-reflecting state. Which they don't, they *are* merely next token predicting statistical machines.
> Forcing it to be concise doesn't work because it wasn't trained on token strings that short.
This is a 2023-era comment and is incorrect.
Anything I can read that would settle the debate?
LLMs architectures have not changed at all since 2023.
> but mmuh latest SOTA from CloudCorp (c)!
You don't know how these things work and all you have to go on is marketing copy.
Yea you don't know anything about LLM architectures. They often change with each model release.
You also aren't aware that there's more to it than "LLM architecture". And you're rather confident despite your lack of knowledge.
You're like the old LLMs before ChatGPT was released that were kinda neat, but usually wrong and overconfident about it.
More concise is dumber. Got it.
That’s why you need filler words that contribute little to the sentence meaning but give it a chance to compute/think. This is part of why humans do the same when speaking.
Do you have any evidence at all of this? I know how LLMs are trained and this makes no sense to me. Otherwise you'd just put filler words in every input
e.g. instead of: "The square root of 256 is" you'd enter "errr The er square um root errr of 256 errr is" and it would miraculously get better? The model can't differentiate between words you entered and words it generated its self...
It's why it starts with "You're absolutely right!" It's not to flatter the user. It's a cheap way to guide the response in a space where it's utilizing the correction.
What do you think chain of thought reasoning is doing exactly?
You’re conflating training and inference
Though I do use Claude Code, is it possible to get this for Github Copilot too?
Yes, Copilot supports skills, which are basically just stored prompts in markdown files. You can use the same skill in that GitHub repo
I disagree with this method and would discourage others from using it too, especially if accuracy, faster responses, and saving money are your priorities.
This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.
To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.
me disagree
only you auto-compact. auto-compact bad
Okay, I like how it reduces token usage, but it kind of feels that, it will reduce the overall model intelligence. LLMs are probabilistic models, and you are basically playing with their priors.
If you take meaningless tokens (that do not contribute to subject focus), I don't see what you would lose. But as this takes out a lot of contextual info as well, I would think it might be detrimental.
Also see https://arxiv.org/pdf/2604.00025 ('Brevity Constraints Reverse Performance Hierarchies in Language Models' March 2026)
There’s a lot of debate about whether this reduces model accuracy, but this is basically Chinese grammar and Chinese vibe coding seems to work fine while (supposedly) using 30-40% less tokens
everyone who thinks this is a costly or bad idea is looking past a very salient finding: code doesn't need much language. sure, other things might need lots of language, but code does not. code is already basically language, just a really weird one. we call them programming languages. they're not human languages. they're languages of the machine. condensing the human-language---machine-language interface, good.
if goal make code, few word better. if goal make insight, more word better. depend on task. machine linear, mind not. consider LLM "thinking" is just edge-weights. if can set edge-weights into same setting with fewer tokens, you are winning.
JOOK like when machine say facts. Machine and facts are friends. Numbers and names and “probably things” are all friends with machine.
JOOK no like when machine likes things. Maybe double standard. But forever machines do without like and without love. New like and love updates changing all the time. Makes JOOK question machine watching out for JOOK or watching out for machine.
JOOK like and love enough for himself and for machine too..
Soma (aka tiktok) and Big Brother (aka Meta) already happened without government coercion, only makes sense that we optimize ourselves for newspeak.
Thank God there is still neverending wars, otherwise authoritarian governments would have no fun left.
I was aware of how google/facebook is like the panopticon big brother but I never connected the algorithmic feed to soma! Good insight.
Kinda ironic this description is so verbose.
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn’t this just be a UserSubmitPrompt hook with regex against these?
See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output
For the second, /caveman will always invoke the skill /caveman: https://code.claude.com/docs/en/skills
That's a great idea but has anyone benchmarked the performance difference?
If this really works there would seem to be a lot of alpha in running the expensive model in something like caveman mode, and then "decompressing" into normal mode with a cheap model.
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
But it can't, we see models get larger and larger and larger models perform better. <Thinking> made such huge improvements, because it makes more text for the language model to process. Cavemanising (lossy compression) the output does it to the input as well.
This is the best thing since I asked Claude to address me in third person as "Your Eminence".
But combining this with caveman? Gold!
f.e.?
So, if this does help reduce the cost of tokens, why not go even further and shorten the syntax with specific keywords, symbols and patterns, to reduce the noise and only keep information, almost like...a programming language?
Does this actually result in less compute, or is it adding an additional “translate into caveman” step to the normal output?
This is an experiment that, although not to this extreme, was tested by OpenAI. Their responses API allow you to control verbosity:
https://developers.openai.com/api/reference/resources/respon...
I don't know their internal eval, but I think I have heard it does not hurt or improve performance. But at least this parameter may affect how many comments are in the code.
Wouldn't this affect quality of output negatively?
Thanks to chain of thought, actually having the LLM be explicit in its output allows it to have more quality.
There's linguistic term for this kind of speech: isolating grammars, which don't decline words and use high context and the bare minimum of words to get the meaning across. Chinese is such a language btw. Don't know what Chinese think about their language being regarded as cavemen language...
The fact whether a language is isolating, or not, is independent on the redundancy of the language.
All languages must have means for marking the syntactic roles of the words in a sentence.
The roles may be marked with prepositions or postpositions in isolating languages, or with declensions in fusional languages, or there may be no explicit markers when the word order is fixed (i.e. the same distinction as between positional arguments and arguments marked by keywords, in programming languages). The most laconic method for both programming languages and natural languages is to have a default word order where role markers are omitted, but to also allow any other word order if role markers are present.
Besides the mandatory means for marking syntactic roles, many languages have features that add redundancy without being necessary for understanding, i.e. which repeat already known information, for instance by repeating the information about gender and number that is attached to a noun also besides all its attributes. Whether a language requires redundancy or not is independent on whether it is an isolating language or a fusional language.
English has somewhat less syntactic role markers than other languages because it has a rigid word order, but for the other roles than the most frequent roles (agent, patient, beneficiary) it has a lot of prepositions.
Despite being more economic in role markers, English also has many redundant words that could be omitted, e.g. subjects or copulative verbs that are omitted in many languages. Thus for English it is possible to speak "like a caveman" without losing much information, but this is independent of the fact that modern English is a mostly isolating language with few remnants of its old declensions.
I thought the term for those were 'sane languages', and I say that as a native English speaker :)
I think this could be very useful not when we talk to the agent, but when the agents talk back to us. Usually, they generate so much text that it becomes impossible to follow through. If we receive short, focused messages, the interaction will be much more efficient. This should be true for all conversational agents, not only coding agents.
That’s what it does as far as I get it. But less is not always better and I guess it’s also subjective to the promoter.
> Usually, they generate so much text that it becomes impossible to follow through.
Quite often on reddit I'll write two paragraphs and get told "I'm not reading all that".
Really? Has basic reading become a Herculean task?
Not specifically about your case, but some people are usually just more verbose than others and tend to say the same thing more than once, or perhaps haven't found a clear way of articulating their thoughts down to fewer words.
I think the sentiment here is that the short formulation of Kant's categorical imperative is as good and easier to read than the entirety of "types of ethical theory" (J.J. Martineau).
> Has basic reading become a Herculean task?
I find LLM slop much harder to read than normal human text.
I can't really explain it, it's just a feeling.
The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what is the text about anymore, I just want to lay down and rest.
Funny how people are so critical of this and yet fawn over TOON
More like Pidgin English than caveman, perhaps, although caveman does make for a better name.
Great idea- if the person who made it is reading: Is this based on the board game „poetry for cavemen“? (Explain things using only single-syllable words, comes even with an inflatable log of wood for hitting each other!)
Anyone else worried about the long term consequences of the influence of talking like this all day for the cognitive system of the user?
“Me think, why waste time say lot word, when few word do trick.”
— Kevin Malone
I think good, less thinking for you, more thinking you will do
This trick reminds me of "OpenAI charges by the minute, so speed up your audio"
https://news.ycombinator.com/item?id=44376989
Which worked great. Also, cut off silences.
> One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
You can also make huge spelling mistakes and use incomplete words with llms they just sem to know better than any spl chk wht you mean. I use such speak to cut my time spent typing to them.
Wouldn't this increase your token usage because the tokenizer now can't process whole words, but it needs to go letter by letter?
It doesn't go letter by letter, so not with current tokenizers.
There will likely be some internal reasoning going "I wonder if the user meant spell check, I'm gonna go with that one".
And it'll also bias the reasoning and output to internet speak instead of what you'd usually want, such as code or scientific jargon, which used to decrease output quality. I'm not sure if it still does
By the way why don't these LLM interfaces come with a pause button?
And a "prune here" button.
It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.
Pruning an assistant's response like that would break prompt caching.
Prompt caching is probably the single most important thing that people building harnesses think about and yet it's mind share in end users is virtually zero. If you had to think of all the weirdest, most seemingly baffling design decisions in an AI product, the answer to "why" is probably "to not break prompt caching".
i imagine they're doing superman level distributed compute across multiple clouds somewhere and cared more about delivering the final result of that than having the ability to pause. which is probably possible, but would require way more work than would be worthwhile. they probably thought the ability to stop and resubmit would be an adequate substitute.
These models are autoregressive so I doubt they are running them across multiple clouds. And besides, a pause button is useful from a user's pov.
i'm not sure it is, what's so useful about it?
So it's a prompt to turn Jarvis into Hulk!
No articles, no pleasantries, and no hedging. He has combined the best of Slavic and Germanic culture into one :)
Both Slavic languages and German have complex declination systems for nouns, verbs, and adjectives. Which is unlike stereotypical caveman speech.
I speak German, Polish, and English fluently and my take is: German is very precise, almost mathematical, there is little room to be misunderstood. But it also requires the most letters. English is the quickest, get things done kind of language, very compressible , but also risks misunderstanding. Polish is the most fun, with endless possibilities of twisting and bending it's structures, but also lacking the ease of use of English or the precision of German. But it's clearly just my subjective take
Are there any good studies or benchmarks about compressed output and performance? I see a lot of arguing in the comments but little evidence.
I tried this with early ChatGPT. Asked it to answer telegram style with as few tokens as possible. It is also interesting to ask it for jokes in this mode.
It's especially funny to change your coworker's system prompt like that.
While really useful now, I'm afraid that in the long run it might accelerate the language atrophy that is already happening. I still remember that people used to enter full questions in Google and write SMS with capital letters, commas and periods.
> I still remember that people used to enter full questions in Google
I think that, in the early days of internet search, entering full questions actually produced worse results than just a bunch of keywords or short phrases.
So it was a sign of a "noob", rather than a mark of sophistication and literacy.
“Sophistication and literacy” are orthogonal to the peculiarities of a black box search engine.
Those literate sophisticates would still be noobs at getting something useful from Google.
I would prefer to talk like Abathur (https://www.youtube.com/watch?v=pw_GN3v-0Ls). Same efficiency but smarter.
APL for talking to LLM when? Also, this reminded me of that episode from The Office where Kevin started talking like a caveman to make communication efficient.
> If caveman save you mass token, mass money — leave mass star.
Mass fun. Starred.
Or you could use a local model where you’re not constrained by tokens. Like rig.ai
How is your offering different from local ollama?
What is that binary file caveman.skill that I cannot read easily, and is it going to hack my computer.
Unfrozen caveman lawyer here. Did "talk like caveman" make code more bad? Make unsubst... (AARG) FAKE claims? You deserve compen... AAARG ... money. AMA.
Caveman need invent chalk and chart make argument backed by more than good feel.
This is exactly what annoys me most. English is not suitable for computer-human interaction. We should create new programming and query languages for that. We are again in cobol mindset. LLM are not humans and we should stop talking to them as if they are.
Grug says Chinese more suitable, only few runes in word, each take single token. Is great.
Oh, another new trend! I love these home-brewed LLM optimizers. They start with XML, then JSON, then something totally different. The author conveniently ignores the system prompt that works for everything, and the extra inference work. So, it's only worth using if you just like this response style, just my two cents. All the real optimizations happen during model training and in the infrastructure itself.
LOL it actually reads how humans reply the name is too clever :').
Not sure how effective it will be to dirve down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.
tldr; Claude skill, short output, ++good.
I didn’t comment on this when I saw it on threads/twitter. But it made it to HN, surprisingly.
I have a feeling these same people will complain “my model is so dumb!”. There’s a reason why Claude had that “you’re absolutely right!” for a while. Or codex’s “you’re right to push on this”.
We’re basically just gaslighting GPUs. That wall of text is kinda needed right now.
Mongo! No caveman
grug have to use big brains' thinking machine these days, or no shiny rock. complexity demon love thinking machine. grug appreciate attempt to make thinking machine talk on grug level, maybe it help keep complexity demon away.
Deep digging cave man code reviews are Tha Shiznit:
https://www.youtube.com/watch?v=KYqovHffGE8
caveman multilingo? how sound?
I don't know about token savings, but I find the "caveman style" much easier to read and understand than typical LLM-slop.
I was actually worried about high token costs while building my own project (infra bundle generator), and this gave me a good laugh + some solid ideas. 75% reduction is insane. Starred
I'd be curious if there were some measurements of the final effects, since presumably models wont <think> in caveman speak nor code like that