Great write-up. We use vLLM's KV cache and continuous batching as the foundation for request handling in ScalarLM, and we add further batching optimizations through a centralized queue and explicit batching support in our client.
https://www.scalarlm.com
There is more performance you can squeeze out of vLLM.
Thanks for writing this up! I learnt a bunch from it. I noticed it didn't discuss additional layers of caching - I can see how that would fit in, but is prompt caching out of scope for this system?
Hi, I'm the author of this post. Writing it was a great learning experience. I gained a lot of insight into vLLM. If you have any feedback or questions, feel free to drop a comment below!
Thanks for writing the article!
I didn't quite get this part:
Note that during the prefill phase, all prompt tokens from a request can be processed in one batch. This is possible because the query (Q) tensors, calculated from the tokens immediately before them, are available for each prompt token position.
I know that in practice prefill is much faster than inference. Would watching the 2h video from Karpathy help me understand why?
That snippet is saying that you can compute K and V for all the input tokens at once; you don't need to loop over them, since they are all available up front.
For decode, by contrast, you have to generate each token sequentially.
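If it helps, here's a minimal sketch of the difference in plain PyTorch (toy shapes and random weights, a single attention head, not vLLM's actual code): prefill computes K and V for every prompt position in one batched matmul, while decode has to loop, producing one position's Q/K/V at a time and appending to the cache.

    import torch

    d = 64                          # head dimension (assumed toy size)
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # toy projections

    # Prefill: all prompt tokens are known up front, so K and V for every
    # position come out of one batched matmul -- no loop over positions.
    prompt = torch.randn(8, d)      # 8 prompt-token embeddings
    K_cache = prompt @ Wk           # (8, d)
    V_cache = prompt @ Wv           # (8, d)

    # Decode: each new token depends on the previous output, so we generate
    # one token at a time, appending its K/V to the cache and re-reading it.
    x = prompt[-1:]                 # stand-in for the latest hidden state
    for _ in range(4):              # generate 4 output tokens
        q = x @ Wq                  # query for the new position only
        k, v = x @ Wk, x @ Wv
        K_cache = torch.cat([K_cache, k])
        V_cache = torch.cat([V_cache, v])
        attn = torch.softmax(q @ K_cache.T / d ** 0.5, dim=-1) @ V_cache
        x = attn                    # toy "next hidden state" (no real LM head)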
And on the topic of prefill: do you know what the role of GPUs is there vs. in inference?
Prefill is part of inference. It's the first major step, where you calculate all the keys and values for the input tokens.
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have slightly different workloads:
1. Prefill does relatively little I/O from GPU memory (HBM) and is heavier on compute.
2. Decode is light on compute but has to do I/O for the keys and values computed in the prefill stage for every output token.
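To put rough numbers on point 2 (these are assumptions, sized like a ~7B model in fp16, not measurements): every output token has to read the entire KV cache back from HBM, which is why decode tends to be bound by memory bandwidth rather than compute.

    # Back-of-the-envelope: bytes of KV cache read per generated token.
    num_layers   = 32
    num_kv_heads = 32
    head_dim     = 128
    bytes_per_el = 2                # fp16
    context_len  = 4096             # tokens already in the cache

    # keys + values, across all layers and heads, for the whole context
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el * context_len
    print(f"~{kv_bytes / 1e9:.2f} GB read from HBM per output token")   # ~2.15 GB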