> The EAGLE team traced this fragility to a phenomenon we call ‘attention drift’
Ok that’s downright fascinating. I am one of the world’s foremost experts on the AI psychosis sufferers posting grand theories on Reddit, and ‘drift’ is one of the words that chatbots come back to again and again when told to ponder their own Being (so much so that it even shows up in clearly-unrelated/incorrect contexts — pretty sure I’ve seen both ‘quantum drift’ and ‘spiritual drift’).
It’s probably the #3 most common, after ‘recursion’ and ‘coherence’; I bet ‘coherence drift’ has popped up a thousand times by now, but ‘attention drift’, ‘token drift’, ‘spiritual drift’, ‘cognitive drift’, and ‘semantic drift’ have all gotten airtime AFAIR.
Obviously the primary thing going on there is vulnerable laypeople convincing themselves that they’ve cracked some major part of science, but I do honestly wonder about the unintentional throughlines… This might be the first time I’ve noticed one of them show up in a real paper, though.
Is there some intuitive wisdom in how LLMs tend to approach themselves, perhaps? Or are those terms inevitable when talking via and/or about a 1:1 turn-taking conversation?
When one is talking about two things gradually diverging, isn't "drift" a natural, descriptive verb to reach for? I've heard it often when discussing an implementation diverging from a specification, for example.
> performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.
I heard that speculative decoding doesn't affect performance (I meant accuracy). Am I wrong about it?
You're not wrong about that. Speculative decoding does not affect the quality of tokens generated, as each token has to be verified by the parent model before it is output.
Every token the draft model proposes has to be verified by the parent/original model, and the fraction that passes is the acceptance rate. If that rate falls, the speedup from speculative decoding is eliminated. This acceptance rate, and more directly the speedup from draft models, is what "performance" refers to in the article.
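To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the usual simplification that each draft token is accepted independently with probability `alpha`, that the drafter proposes `gamma` tokens per step, and that one draft forward pass costs `cost_ratio` of a target forward pass. All three names and the example values are mine, not from the article.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes every draft token is accepted independently with probability
    alpha (a simplification; real acceptance depends on context).
    """
    if alpha >= 1.0:
        # Every draft token accepted, plus the one token the target adds.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Rough wall-clock speedup over plain decoding.

    cost_ratio is the (assumed) cost of one draft pass relative to one
    target pass; each step pays for gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    for alpha in (0.8, 0.5, 0.2):
        print(alpha, round(estimated_speedup(alpha, gamma=4, cost_ratio=0.1), 2))
```

With these made-up settings, an acceptance rate of 0.8 gives about 2.4x, while 0.2 lands below 1x, i.e. slower than not using a drafter at all. The output tokens are the same either way; only the speed changes.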
Are these speculative decoders ok to use for AI coding agents or do they only fit certain workloads?
They work better for coding workloads. Essentially, the more regular the output, the more the faster model gets right, the less the big model has to do.
Prose writing tends to see more rejected drafts. I haven't tried this particular one, however, but that is the general trend.
Speculative decoding shouldn't actually change the accuracy of the response. The draft model drafts a couple tokens, and the inference framework verifies that the larger model would have picked them.
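A toy sketch of that draft-then-verify loop, for the greedy case only. Both "models" here are made-up deterministic functions standing in for real LLMs, and `k` (draft tokens per round) is an arbitrary choice. A real framework scores all drafted positions in one batched forward pass of the large model, and for sampled (non-greedy) decoding it uses a rejection-sampling rule rather than exact matching.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    # Toy stand-in for the large model: deterministic next token.
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_next(ctx: List[int]) -> int:
    # Toy stand-in for the drafter: deliberately wrong at every third position.
    guess = target_next(ctx)
    return (guess + 1) % 7 if len(ctx) % 3 == 0 else guess


def plain_decode(prompt: List[int], n: int, model: NextToken) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    prompt: List[int], n: int, target: NextToken, draft: NextToken, k: int = 4
) -> Tuple[List[int], float]:
    out = list(prompt)
    accepted = drafted = 0
    while len(out) - len(prompt) < n:
        # 1) Drafter proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        drafted += k
        # 2) Target verifies the proposal left to right.
        for tok in proposal:
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                # First mismatch: emit the target's token, discard the rest.
                out.append(expected)
                break
        else:
            # All k accepted: the target's pass also yields one extra token.
            out.append(target(out))
    # The last round may overshoot n, so trim.
    return out[len(prompt):len(prompt) + n], accepted / drafted


if __name__ == "__main__":
    tokens, rate = speculative_decode([1, 2, 3], 20, target_next, draft_next)
    print(tokens == plain_decode([1, 2, 3], 20, target_next), rate)
```

The point of the comparison at the bottom: the speculative output matches plain decoding with the large model token for token, even though the drafter is wrong a third of the time. A bad drafter only costs speed.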
However, I've found that speculative decoders don't help much if you're running a model locally on limited hardware (for instance, my 32GB VRAM M1 Max from 2021). For one, you have to fit both the large and the small drafter model in memory. For another, if you're running a quantized model, the activation distribution is different enough that the draft model has a hard time guessing what's coming next.
My take is that speculative decoding is most useful on _very expensive_ prosumer/hobbyist setups where you have 128GB of VRAM and are running your local models with full fidelity. It's also helpful for inference providers where they can send output tokens at a computational cost slightly higher than their input token cost.
Your experience might be a bit dated, depending on when you last tried it. MTP (multi-token prediction, which is a flavor of spec decoding) is showing really solid improvements on local models, even on consumer hardware.
In fact, as the article mentions, you get the biggest gains at low concurrency (so local should apply), with diminishing returns for higher concurrency (if you think in terms of unit of compute, it's probably better to serve more requests in parallel and get more throughput that way).
Eagle3 was great at low context tho, and this seems to improve things at high context. That's really cool, and hopefully it'll turn out to be useful at those lengths. Eagle3 is also training dependent, so you could try training your own if your use cases diverge enough that 3rd party "generalist" models don't suit your needs (in general nvda, redhat, etc. have provided general eagle3 models for popular families).
I think so, the benchmark is on a coding dataset (SPEED-Bench).
I saw EAGLE and thought it's going to be about PCB design. Was left disappointed.
Well, there are only so many nouns, and even fewer "cool-sounding" ones. For better project differentiation, do you think we should instead be naming things "ZurgGlurg327"? I'm sure you can find a completely-unique combo for each thing, but good luck remembering the name!
docker images and ubuntu releases use an adjective, this could at least allow some alternatives like bold/supreme/decisive/depressed eagle (or just use battery staple)
Same here. KiCad is great, but I miss EAGLE.
It feels like AI folks are particularly careless about checking if a name has been used in computing before.