For the particular case of the 5 delimiters '\n', '.', '?', '!', and ';', it just so happens that you can do this as a single shuffle instruction, replacing the explicit lookup table.
You can do this whenever `c & 0x0F` is unique for the set of characters you're looking for.
See https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottlenec... for details.
Hey! Author of the blog here.
This is pretty cool~ Thanks for suggesting this, I will read this in detail and add it to the next (0.5.0) release of memchunk.
Why does your title not have any context?
Note your compiler might turn that _mm256_set_epi64x into a load from memory, so there might still be memory accesses you don't expect.
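One way to keep that cost out of the hot loop is to build the splats once, before the scan, so any constant materialisation happens once per call rather than once per block. A rough sketch of that pattern (AVX2, x86_64; the names and structure are illustrative, not memchunk's internals, and tail handling is omitted):

```rust
// Sketch: hoist the delimiter broadcasts out of the per-block loop so they
// stay in ymm registers for the whole scan. Only two delimiters for brevity.
// Callers should gate on is_x86_feature_detected!("avx2").
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn delimiter_positions(haystack: &[u8]) -> Vec<usize> {
    use std::arch::x86_64::*;
    let nl = _mm256_set1_epi8(b'\n' as i8); // built once, up front
    let dot = _mm256_set1_epi8(b'.' as i8);
    let mut out = Vec::new();
    for (i, block) in haystack.chunks_exact(32).enumerate() {
        let v = _mm256_loadu_si256(block.as_ptr() as *const __m256i);
        let hit = _mm256_or_si256(_mm256_cmpeq_epi8(v, nl), _mm256_cmpeq_epi8(v, dot));
        let mut mask = _mm256_movemask_epi8(hit) as u32;
        while mask != 0 {
            out.push(i * 32 + mask.trailing_zeros() as usize);
            mask &= mask - 1; // clear the lowest set bit
        }
    }
    out
}
```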
This is a really neat technique, well explained at your link.
Now that I understand it, I'd describe it as: For each byte, based on its bottom 4 bits, map it to either the unique "target" value that you're looking for that has those bottom 4 bits, or if there is no such target value, to any value that is different from what it is right now. Then simply check whether each resulting byte is equal to its corresponding original byte!
Not sure if the above will help people understand it, but after you understand it, I think you'll agree with the above description :)
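A minimal sketch of that mapping with SSE intrinsics, in case it helps (SSSE3, x86_64; the function name and exact table layout are my own illustration, not memchunk's code):

```rust
// Map each byte through a 16-entry table indexed by its low nibble, then
// compare against the original byte. A match means "this is a delimiter".
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn delimiter_mask(block: &[u8; 16]) -> u16 {
    use std::arch::x86_64::*;
    // table[n] = the unique delimiter whose low nibble is n, or a filler byte
    // whose low nibble differs from n (so it can never equal the input byte).
    let table = _mm_setr_epi8(
        1,    // 0x?0 -> filler
        0x21, // 0x?1 -> '!'
        0, 0, 0, 0, 0, 0, 0, 0, // 0x?2..0x?9 -> filler
        0x0A, // 0x?A -> '\n'
        0x3B, // 0x?B -> ';'
        0, 0, // 0x?C, 0x?D -> filler
        0x2E, // 0x?E -> '.'
        0x3F, // 0x?F -> '?'
    );
    let input = _mm_loadu_si128(block.as_ptr() as *const __m128i);
    // pshufb indexes `table` by each input byte's low nibble; bytes >= 0x80
    // come out as 0, which also never matches.
    let mapped = _mm_shuffle_epi8(table, input);
    let eq = _mm_cmpeq_epi8(mapped, input);
    _mm_movemask_epi8(eq) as u16 // bit i set <=> block[i] is a delimiter
}
```

Any filler works for the unused slots as long as its low nibble differs from its slot index, since such a value can never equal an input byte that indexes that slot.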
We're the maintainers of Chonkie, a chunking library for RAG pipelines.
Recently, we've been using Chonkie to build deep research agents that watch topics for new developments and automatically update their reports. This requires chunking a large amount of data constantly.
While building this, we noticed Chonkie felt slow. We started wondering: what's the theoretical limit here? How fast can text chunking actually get if we throw out all the abstractions and go straight to the metal?
This post is about that rabbit hole and how it led us to build memchunk - the fastest chunking library, capable of chunking text at 1TB/s.
Blog: https://minha.sh/posts/so,-you-want-to-chunk-really-fast
GitHub: https://github.com/chonkie-inc/memchunk
Happy to answer any questions!
English word, clause, sentence, and paragraph boundaries do not always line up with single delimiter characters.
How does the software handle these:
Mrs. Blue went to the sea shore with Mr. Black.
"What's for dinner?" Mrs. Blue asked.
Some notes:
1. Nice and tight article, good work.
2. Shipped a piece of code, always props to that.
3. For has_zero_byte, it would be nice to actually do the math in the example; as is, the example doesn't really show anything (a worked version is sketched below). It also should say "its" instead of "it's".
4. The work done per chunk shouldn't include the broadcasts. Those should be done at the start of the search and the values kept in registers, no?
5. Aren't AVX and SSE also SWAR? They're just wider registers.
6. I think a graph showing the cost of the lookup table vs. n needles would be cool to see.
Overall nice work
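For note 3, the math worked through, assuming the usual (v - LO) & !v & HI formulation of has_zero_byte (the example value is mine, not taken from the article):

```rust
// SWAR zero-byte test, with the intermediate values written out.
const LO: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

fn has_zero_byte(v: u64) -> bool {
    (v.wrapping_sub(LO) & !v & HI) != 0
}

fn main() {
    // Word with a zero in byte lane 4: 0x41_42_43_00_45_46_47_48
    let v: u64 = 0x4142_4300_4546_4748;
    // v - LO = 0x4041_41FF_4445_4647  (the zero lane underflows to 0xFF)
    // & !v    keeps the flipped high bit only where v's high bit was clear
    // & HI   = 0x0000_0080_0000_0000  -> non-zero, so v contains a zero byte
    assert!(has_zero_byte(v));
    // No zero byte anywhere: every lane stays below 0x80 after the subtract.
    assert!(!has_zero_byte(0x4142_4344_4546_4748));
}
```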
I've been seeing a bunch of LLM-adjacent articles recently that are focusing on being fast - and they leave me a bit stumped.
While latency _can_ be a problem, reliability and accuracy are almost always my bottlenecks (to user value). Especially with chunking. Chunking is generally a one-time process where users aren't latency sensitive.
If you have reliability and accuracy (big if) then the practical usability and cost become performance problems.
And this is a bit of a sliding scale. Of course users want the best possible answer. However, if they can get 80% (magic hand-wavey fakie number) of the best answer in one second instead of 20, that may be a worthwhile tradeoff.
> Chunking is generally a one-time process where users aren't latency sensitive.
This is not necessarily true. For example, in our use case we are constantly monitoring websites, blogs, and other sources for changes. When a new page is added, we need to chunk and embed it fast so it's searchable immediately. Chunking speed matters for us.
When you're processing changes constantly, chunking is in the hot path. I think as LLMs get used more in real time workflows, every part of the stack will start facing latency pressure.
How much compute do your systems expend on chunking vs. the embedding itself?
Don't get me wrong, it's fun to see performance optimizations like this.
But I'd expect that a naive implementation of the same strategy would already take like 0.1% of the time needed to actually generate embeddings for your chunks. So practically, is it really worth the effort of writing a bunch of non-trivial SIMD code to reduce that overhead from 0.1% to 0.001%?
Agreed. For any code written, there is a sort of return on time expended. Optimisations are really only required when demanded.
From the author:
> at some point we started benchmarking on wikipedia-scale datasets.
> that’s when things started feeling… slow.
So they're talking about this becoming an issue when chunking TBs of data (I assume), not your 1kb random string...
But the bottleneck is generating embeddings either way.
memchunk has a throughput of 164 GB/s. A really fast embedder can deliver maybe 16k embeddings/sec, or ~1.6 MB/s (if you assume 100-char sentences).
That's five orders of magnitude of difference. Chunking is not the bottleneck.
It might be an architectural issue - you stuff chunks into an MQ and want full visibility into queue size ASAP - but otherwise it doesn't matter how fast you chunk, your embedder will slow you down.
It's still a neat exercise on principle, though :)
Even if A takes much more time than B, optimising B still saves resources and time if B is large enough in absolute terms. Also, you seem to assume that every chunk will get embedded - they may be revisiting pages whose chunks are already present in the database.
Amdahl's law still holds, though. If A and B differ in execution times by orders of magnitude, optimising B yields minimal returns (assuming streaming, vs fully serial processing)
And sure, you can reject chunks, but a) the rejection isn't free, and b) you're still bound by embedding speed.
As for resource savings... not in the Wikipedia data range. If you scale up massively and go to a PB of data, going from kiru to memchunk saves you ~25 CPU days. But you also suddenly need to move from bog-standard high-CPU machines to machines supporting 164 GB/s memory throughput, likely bare metal with 8 memory channels. I'm too lazy to do the math, but it's going to be a mild difference, on the order of $100.
Again, I'm not arguing this isn't a cool achievement. But it's very much engineering fun, not "crucial optimization".
So, the whole English Wikipedia in <1 second (~20 GB compressed)?
Or is the bottleneck now a lack of proper pipelining, where you first load, then decompress, then chunk, then write?
Add a nice strong linear model on top, like Vowpal Wabbit, and chunk any language of your choice at 100 GB/s.
What I really want to see from this article is a curve showing the tradeoff between speed and embedded-text quality. There's the preamble that just going by character count has quality problems, but I don't think delimiters are necessarily the best either, vs. being able to find paragraph or even chapter boundaries.
How much of a problem is it that ~1 sentence per chunk gets corrupted in the by-character solution? What level of sentence corruption is left after switching to these delimiters? What level of paragraph/idea corruption is left with each? Chapter/argument level?
Can someone clarify this part of the article for me?
"if you search forward, you need to scan through the entire window to find where to split. you’d find a delimiter at byte 50, but you can’t stop there — there might be a better split point closer to your target size. so you keep searching, tracking the last delimiter you saw, until you finally cross the chunk boundary. that’s potentially thousands of matches and index updates."
So I understand that this is optimal if you want to make your chunks as large as possible for a given chunk size.
What I don't understand is why it is desirable to grab the largest chunk possible for a given chunk limit.
Or have I misunderstood this part of the article?
You have the right understanding.
We've found that maximizing chunk size gives the best retrieval performance and is easier to maintain since you don't have to customize chunking strategy per document type.
The upper limit for chunk size is set by your embedding model. After a certain size, encoding becomes too lossy and performance degrades.
There is a downside: blindly splitting into large chunks may cut a sentence or word off mid-way. We handle this by splitting at delimiters and adding overlap to cover abbreviations and other edge cases.
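As a rough illustration of that strategy (a generic sketch of delimiter-bounded chunks with overlap, not memchunk's implementation; the function and parameter names are made up):

```rust
// Cut at the last delimiter before the size limit, then pull each chunk's
// start back by `overlap` bytes so material near a cut appears in both
// neighbouring chunks. Returns (start, end) byte spans into `text`.
fn chunk_with_overlap(text: &[u8], max_size: usize, overlap: usize) -> Vec<(usize, usize)> {
    const DELIMS: &[u8] = b"\n.?!;";
    assert!(max_size > 0);
    let mut spans = Vec::new();
    let mut start = 0usize;
    while start < text.len() {
        let hard_end = (start + max_size).min(text.len());
        // prefer the last delimiter before the size limit; otherwise hard-split
        let cut = text[start..hard_end]
            .iter()
            .rposition(|&b| DELIMS.contains(&b))
            .map(|i| start + i + 1) // keep the delimiter inside the chunk
            .unwrap_or(hard_end);
        spans.push((start.saturating_sub(overlap), cut));
        start = cut;
    }
    spans
}
```

For example, chunk_with_overlap(doc.as_bytes(), 4096, 64) yields spans whose starts reach 64 bytes back into the previous chunk, so a sentence clipped at a boundary still shows up whole in one of the two chunks (assuming the overlap is at least as long as the clipped piece).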
While this article is about perf — and trading off semantic precision by design — there is a Unicode standard for sentence boundaries, may be interesting: https://www.unicode.org/reports/tr29/#Sentence_Boundaries
I implemented the sentence boundaries, but also thought that the notion of a “phrase” might be useful for such applications: https://github.com/clipperhouse/uax29/tree/master/phrases
> you have a massive pile of text, and you need to split it into smaller pieces that fit into embedding models or context windows.
I think the recently posted Recursive Language Models paper approaches this in a far more compelling way. They put the long context into the environment and make the LLM write and iterate python code to query against it in a recursive loop. Fig. 2 & 4 are most relevant here.
https://news.ycombinator.com/item?id=46475395
https://arxiv.org/abs/2512.24601
I really like this because it is in The Bitter Lesson genre of solutions. Make the model learn the best way to retrieve info from a massive prompt on disk given the domain and any human feedback (explicit and otherwise).
The bigger the prompt.txt, the less relevant the LLM's raw context capabilities are. Context scaling is quadratic in cost. It's a very expensive rabbit to chase. Recursively invoking the same agent with decomposed problem bits is more of a logarithmic scaling thing. You could hypothetically manage a 1 gigabyte prompt with a relatively minuscule context window under a recursive scheme using nothing other than a shell/python interpreter.
4/5 of today's top CNN articles have words with periods in them: "Mr.", "Dr.", "No.", "John D. Smith", "Rep."
The last one also has periods within quotations, so period chunking would cut off the quote.
This gets those cases right.
https://github.com/KnowSeams/KnowSeams
(On a beefy machine) It gets 1 TB/s throughput including all IO and position mapping back to the original text location. I used it to split Project Gutenberg novels. It does 20k+ novels in about 7 seconds.
Note it keeps all dialog together, which may not be what others want, but was what I wanted.
A big chunk size with overlap solves this. Chunks don't have to be "perfectly" split in order to work well.
True, but you don’t need 150GB/s delimiter scanning in that case either.
As the other comment said, it's an exercise in good-enough chunk quality. We focus on big chunks (the largest we can make without hurting embedding quality) as fast as possible. In our experience, retrieval accuracy is mostly driven by embedding quality, so perfect splits don't move the needle much.
But as the number of files to ingest grows, chunking speed does become a bottleneck. We want faster everything (chunking, embedding, retrieval) but chunking was the first piece we tackled. Memchunk is the fastest we could build.
I suspect chunking is an exercise in "good enough"
Does this even work if you're incredulous enough???
Nice! Thanks for sharing. Does it do overlap?
Do you see this project merging with Chonkie at some point? Or do you intend to keep it separate?
Memchunk is already in Chonkie as the `FastChunker`
To install: pip install chonkie[fast]
```
from chonkie import FastChunker

chunker = FastChunker(chunk_size=4096)
chunks = chunker(huge_document)
```
This warms my heart
.NET's string.Split implementation is very close to what the article showcases, even 3-character limit is there: https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
Not all languages have such well-defined and commonly used delimiters. Is this "English only"?
Which language are you thinking of? Ideally, how would you identify split points in this language?
I suppose we've only tested this with languages that do have delimiters - Hindi, English, Spanish, and French
There are two ways to control the splitting point. First is through delimiters, and the second is by setting chunk size. If you're parsing a language where chunks can't be described by either of those params, then I suppose memchunk wouldn't work. I'd be curious to see what does work though!
There are certainly cases of Greek/Latin without any punctuation at all, typically in a historical context. Chinese & Japanese historically did not have any punctuation whatsoever.
Do the delimiters have to be single bytes? E.g., the Japanese full stop (IDEOGRAPHIC FULL STOP) is 3 bytes in UTF-8.
No, delimiters can be multiple bytes. They have to be passed as a pattern.
```
// With a multi-byte pattern (the Japanese full stop "。" is 3 bytes in UTF-8)
let pattern = "。".as_bytes();
let chunks: Vec<&[u8]> = chunk(text).pattern(pattern).prefix().collect();
```