You can go even faster, about 8ns (almost an additional 10x improvement), by using software perf events. PERF_COUNT_SW_TASK_CLOCK is thread CPU time and can be read through a shared page (so no syscall, see perf_event_mmap_page); you then add the delta since the last context switch with a single rdtsc call within a seqlock.
This is not well documented unfortunately, and I'm not aware of open-source implementations of this.
EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However, this definitely works for overall thread CPU time.
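For anyone who wants to experiment, here is a minimal sketch of the setup half only, assuming Linux with the perf_event_open syscall available. The `task_clock_*` names are hypothetical helpers made up for illustration; only the syscalls, constants and struct fields are real. It opens a PERF_COUNT_SW_TASK_CLOCK counter for the calling thread, maps the perf_event_mmap_page, and reads the counter with a plain read(2) as a slow baseline. It does not implement the rdtsc + seqlock fast path described above.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper type: one counter per thread, set up lazily. */
struct task_clock {
    int fd;                             /* perf event fd, negative if unavailable */
    struct perf_event_mmap_page *page;  /* shared metadata page, NULL if not mapped */
};

/* Open a task-clock counter for the calling thread (pid 0) on any CPU (-1).
 * May fail, e.g. under a restrictive perf_event_paranoid setting or seccomp. */
static int task_clock_open(struct task_clock *tc)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_SW_TASK_CLOCK;

    tc->page = NULL;
    tc->fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (tc->fd < 0)
        return -1;

    /* Map only the first (metadata) page; no ring buffer is needed here. */
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ,
                   MAP_SHARED, tc->fd, 0);
    if (p != MAP_FAILED)
        tc->page = p;
    return 0;
}

/* Slow baseline: one read(2) syscall returns the counter value in nanoseconds. */
static int task_clock_read_ns(const struct task_clock *tc, uint64_t *ns)
{
    uint64_t value;
    if (read(tc->fd, &value, sizeof value) != (ssize_t)sizeof value)
        return -1;
    *ns = value;
    return 0;
}

/* 1 if the kernel advertises user-space time conversion (cap_user_time),
 * which the syscall-free read depends on; 0 otherwise. */
static int task_clock_has_user_time(const struct task_clock *tc)
{
    return tc->page != NULL && tc->page->cap_user_time;
}

static void task_clock_close(struct task_clock *tc)
{
    if (tc->page != NULL)
        munmap(tc->page, (size_t)sysconf(_SC_PAGESIZE));
    if (tc->fd >= 0)
        close(tc->fd);
    tc->page = NULL;
    tc->fd = -1;
}
```

The syscall-free read would replace `task_clock_read_ns` with a retry loop over `page->lock`. The comments in linux/perf_event.h describe that loop and the `time_mult`/`time_shift`/`time_offset` conversion; I have left it out here rather than guess at the exact formula for this event type.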
That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!
Yes you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)
This is a great example of how a small change in the right place can outweigh years of incremental tuning.
Author here. After my last post about kernel bugs, I spent some time looking at how the JVM reports its own thread activity. It turns out that "What is the CPU time of this thread?" is/was a much more expensive question than it should be.
I don't think it is possible to talk about fractions of nanoseconds without having an extremely good idea of the stability and accuracy of your clock. At best I think you could claim there is some kind of reduction but it is super hard to make such claims in the absolute without doing a massive amount of prep work to ensure that the measured times themselves are indeed accurate. You could be off by a large fraction and never know the difference. So unless there is a hidden atomic clock involved somewhere in these measurements I think they should be qualified somehow.
Stability and accuracy, when applied to clocks, are generally about dynamic range, i.e. how good is the scale with which you are measuring time. So if you're talking about nanoseconds across a long time period, seconds or longer, then yeah, you probably should care about your clock. But when you're measuring nanoseconds out of a millisecond or microsecond, it really doesn't matter that much and you're going to be OK with the average crystal oscillator in a PC. (and if you're measuring a 10% difference like in the article, you're going to be fine with a mechanical clock as your reference if you can do the operation a billion times in a row).
This setup is a user-space program on a machine that is not exclusively dedicated to the test, with all kinds of interrupts (and other tasks) running left, right and center through the software under test.
Did you look into the large spread on your distributions? Some of these span multiple orders of magnitude which is interesting
Fair point. These were run on a standard dev workstation under load, which may account for the noise. I haven't done a deep dive into the outliers yet, but the distribution definitely warrants a more isolated look.
Very thankful for the 1liner tldr
Edit: I had an afterthought about this, because it ended up being a low-quality comment.
A TLDR like this adds a lot of value to reading content, especially on HN, as it provides way more inertia and lets you focus.
Reading this short form felt like that cool friend who gave you a heads-up.
I was unsure whether to post it or not so I am glad you found it useful!
I have that 10-30s time window to fill when Claude might be loading some stuff; the one-liner is exactly what fits in that window. It makes me wonder about the original idea of Twitter, now that I think of it, but since it's not the same kind of content I don't bother with it. It really feels like "here is the stuff, here's more about it if you want to". I really appreciate that form and will definitely use the same format myself.
> Flame graph image
> Click to zoom, open in a new tab for interactivity
I admit I did not expect "Open Image in New Tab" to do what it said on the tin. I guess I was aware that it was possible with SVG but I don't think I've ever seen it done and was really not expecting it.
Courtesy of Brendan Gregg and his flamegraph.pl scripts: https://github.com/brendangregg/FlameGraph
Normally, I use the generator included in async-profiler. It produces interactive HTML. But for this post, I used Brendan’s tool specifically to have a single, interactive SVG.
The QuestDB team are among the best doing it.
Love the people and their software.
Great blog Jaromir!
This is such a great writeup
clock_gettime() goes through vDSO, avoiding a context switch. It shows up on the flamegraph as well.
Only for some clocks (CLOCK_MONOTONIC, etc) and some clock sources. For VIRT/SCHED, the vDSO shim still has to invoke the actual syscall. You can't avoid the kernel transition when you need per-thread accounting.
Thanks, I really should've looked deeper than that.
No problem at all, I was confused too when I saw the profile for the first time.
If you look below the vDSO frame, there is still a syscall. I think that the vDSO implementation is missing a fast path for this particular clock id (it could be implemented though).
Exactly this.
edit: agh, no. CLOCK_THREAD_CPUTIME_ID falls through the vdso to the kernel which makes sense as it would likely need to look at the task struct.
here it gets the task struct: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... and here https://elixir.bootlin.com/linux/v6.18.5/source/kernel/time/... to here where it actually pulls the value out: https://elixir.bootlin.com/linux/v6.18.5/source/kernel/sched...
Here is the vDSO clock pick logic: https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get... and here is the fallback to the syscall if it's not a vDSO clock: https://elixir.bootlin.com/linux/v6.18.5/source/lib/vdso/get...
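If you want to see the vDSO fast path versus the syscall fallback on your own machine, a rough sketch like the following may help. It assumes a POSIX system with clock_gettime; `avg_call_ns` and `to_ns` are made-up helper names. It times many back-to-back calls and reports the average per call, so it says nothing about the tail of the distribution, and the numbers will vary with kernel, clock source and system load.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Convert a timespec to nanoseconds. */
static uint64_t to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ull + (uint64_t)ts->tv_nsec;
}

/* Average wall-clock cost, in nanoseconds, of one clock_gettime(id) call,
 * measured over `iters` back-to-back calls. Returns -1.0 on error.
 * Averaging over many calls keeps timer resolution and the overhead of
 * the two bracketing calls from dominating the result. */
static double avg_call_ns(clockid_t id, unsigned long iters)
{
    struct timespec start, end, scratch;

    if (iters == 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;
    for (unsigned long i = 0; i < iters; i++) {
        if (clock_gettime(id, &scratch) != 0)
            return -1.0;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    return (double)(to_ns(&end) - to_ns(&start)) / (double)iters;
}
```

Calling `avg_call_ns(CLOCK_MONOTONIC, 1000000)` and `avg_call_ns(CLOCK_THREAD_CPUTIME_ID, 1000000)` from the same thread and comparing the two averages should show the gap discussed above, if the second clock does fall through to the kernel on your system.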