Despite the shortage, RAM is still cheaper than mathematicians.
This is one of the basic avenues for advancement.
Compute, bytes of ram used, bytes in model, bytes accessed per iteration, bytes of data used for training.
You can trade the balance if you can find another way to do things; extreme quantisation is but one direction to try. KANs were aiming for more compute and fewer parameters. Recent optimisation projects have been pushing on these various properties. Sometimes a gain in one comes at the cost of another, but that needn't always be the case.
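To make the trade concrete, here is a minimal sketch of per-tensor int8 quantisation: bytes in the model drop 4x, at the cost of an extra multiply (and some rounding error) at inference time. The function names are illustrative, not from any particular library.

```python
import numpy as np

def quantise_int8(w: np.ndarray):
    """Map fp32 weights to int8 plus one scale: 4x fewer bytes in the model."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    # The extra multiply here is the compute we pay to save memory.
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantise_int8(w)

print(w.nbytes // q.nbytes)  # model bytes shrink 4x
print(np.abs(dequantise(q, s) - w).max() < s)  # error bounded by one quantisation step
```

Extreme quantisation (4-bit, 2-bit, ternary) pushes the same dial further, usually with per-group scales instead of a single per-tensor one.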
The same could be said about other IT domains... When you see single web pages that weigh tens of MB, you wonder how we got here.
We will not see memory demand decrease, because efficiency gains will simply allow AI companies to run more instances. For now they want an effectively infinite amount of memory, no matter how much AI improves.
Jevons paradox https://en.wikipedia.org/wiki/Jevons_paradox
I disagree. I think a sharp drop in memory requirements of at least an order of magnitude will cause demand to adjust accordingly.
If models become more efficient we will move more of the work to local devices instead of using SaaS models. We’re still in the mainframe era of LLM.
The mainframe analogy is close but I think the key difference is that mainframe->PC was driven by hardware getting cheap, while LLM efficiency needs algorithmic breakthroughs which are way less predictable. My bet is we get a split: anything latency-sensitive (code completion, local assistants) goes to edge as soon as models fit on consumer hardware, because you can't cheat physics on network round trips. But training and heavy reasoning stays centralized -- the data gravity just gets worse as models improve. Also I keep going back and forth on whether stuff like MoE and speculative decoding is "better math" or just "better engineering." Feels like an important distinction since they have very different cost curves.
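To illustrate the speculative-decoding half of that question, here is a toy sketch of the mechanism: a cheap "draft" model proposes a run of tokens, and the expensive "target" model verifies them, keeping the longest agreeing prefix plus one token of its own. Both "models" below are stand-in functions invented for this sketch, not a real LLM API.

```python
def draft_next(seq):
    # Cheap draft model: a trivial successor rule (purely illustrative).
    return (seq[-1] + 1) % 50

def target_next(seq):
    # "Expensive" target model: same rule, except it diverges at even sums.
    return (seq[-1] + 1) % 50 if sum(seq) % 2 else 0

def speculative_step(seq, k=4):
    # Draft proposes k tokens autoregressively.
    proposed, cur = [], list(seq)
    for _ in range(k):
        t = draft_next(cur)
        proposed.append(t)
        cur.append(t)
    # Target verifies: accept the longest prefix where it agrees, then emit
    # one token of its own, so every step is guaranteed to make progress.
    accepted, cur = [], list(seq)
    for t in proposed:
        if target_next(cur) != t:
            break
        accepted.append(t)
        cur.append(t)
    accepted.append(target_next(cur))
    return seq + accepted
```

The output is identical to what greedy decoding with the target alone would produce; the win is that verifying k proposals can be done in one batched target pass. That's why it can be read as "better engineering": the math of the target model is unchanged, only the cost of evaluating it is restructured.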
I don't think we are there yet. Models running in data centers will still be noticeably better as efficiency will allow them to build and run better models.
Not many people today would settle for models comparable to what was SOTA 2 years ago.
To run models locally and have results as good as the models running in data centers we need both efficiency and to hit a wall in AI improvement.
Neither of those conditions seems likely to hold in the near future.