I'm not quite seeing the real benefit of this. Is the idea that warps will now be able to do work-stealing and continuation-stealing when running heterogeneous parallel workloads? But that requires keeping the async function's state in GPU-wide shared memory, which is generally a scarce resource.
A ton of GPU workloads require leaving large amounts of data resident in GPU RAM and running computation as new data arrives from the CPU.
Yes, that's the idea.
GPU-wide memory is not quite as scarce on datacenter cards or systems with unified memory. One could also have local executors with local futures that are `!Send` and place them in a faster address space.
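To make the `!Send` idea concrete, here is a minimal CPU-side sketch (this is not the project's API; the `futures` crate's `LocalPool` stands in for whatever a block- or warp-local GPU executor would look like). The `Rc` held across an await point makes the future `!Send`, so only a local executor can drive it, and that executor could in principle keep its state in a faster, local address space.

```rust
// CPU-side sketch only: a `!Send` future driven by a local, single-threaded
// executor. A work-stealing executor would reject this future; a local one
// accepts it, which is the property the comment is pointing at.
use std::rc::Rc;

use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;

async fn local_work(shared: Rc<Vec<f32>>) -> f32 {
    // Holding an Rc across an await point makes this future `!Send`.
    futures::future::ready(()).await;
    shared.iter().sum()
}

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();
    let data = Rc::new(vec![1.0, 2.0, 3.0]);

    // `spawn_local` accepts futures that are not `Send`.
    spawner
        .spawn_local(async move {
            let total = local_work(data).await;
            println!("sum = {total}");
        })
        .expect("spawn failed");

    pool.run();
}
```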
Really cool experiment (the whole company).
Training pipelines are full of data-preparation steps that are first written on the CPU and then moved to the GPU, and you're constantly thinking about what to keep on the CPU and what to put on the GPU, when it's worth creating a tensor, or whether it should be tiled instead. I guess your company is betting on solving problems like this (and async/await is needed for serving inference requests directly on the GPU, for example).
My question is a little bit different: how do you want to handle the SIMD question: should a Rust function run on the warp as a single machine with 32-wide arrays as its data types, or should we always "hope" for autovectorization to work (especially with Rust's iterator helpers)?
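For reference, here is a rough illustration of the two styles the question contrasts, written as plain CPU Rust with an assumed warp width of 32 (neither is how the project actually handles it):

```rust
// Rough CPU-side illustration of the two styles; warp width of 32 is an
// assumption, and none of this is the project's API.

/// Style 1: the "warp as one machine" view -- the data type is explicitly
/// 32 lanes wide and one logical function operates on all lanes at once.
#[derive(Clone, Copy)]
struct WarpF32([f32; 32]);

fn axpy_warp(a: f32, x: WarpF32, y: WarpF32) -> WarpF32 {
    let mut out = [0.0f32; 32];
    for lane in 0..32 {
        out[lane] = a * x.0[lane] + y.0[lane];
    }
    WarpF32(out)
}

/// Style 2: scalar-looking iterator code, hoping the compiler autovectorizes
/// (or, on a GPU, that each element maps onto one SIMT lane).
fn axpy_iter(a: f32, x: &[f32], y: &[f32], out: &mut [f32]) {
    for ((o, &xi), &yi) in out.iter_mut().zip(x).zip(y) {
        *o = a * xi + yi;
    }
}

fn main() {
    let x = WarpF32([1.0; 32]);
    let y = WarpF32([2.0; 32]);
    let z = axpy_warp(3.0, x, y);
    assert!(z.0.iter().all(|&v| (v - 5.0).abs() < 1e-6));

    let xs = vec![1.0f32; 32];
    let ys = vec![2.0f32; 32];
    let mut zs = vec![0.0f32; 32];
    axpy_iter(3.0, &xs, &ys, &mut zs);
    assert!(zs.iter().all(|&v| (v - 5.0).abs() < 1e-6));
}
```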
Very cool to see this; it's something I've been curious about myself and have been exploring as well. I'd be curious about the parallels and differences between this and NVIDIA's stdexec (outside of it being in Rust and using Future, which is also cool).
What's the performance like? What would the benefits be of converting a streaming multiprocessor programming model to this?
We aren't focused on performance yet (it is often workload- and executor-dependent, and as the post says we currently do some inefficient polling), but Rust futures compile down to state machines, so they are a zero-cost abstraction.
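To illustrate the "futures are state machines" point, here is a hand-sketched, simplified equivalent of what the compiler generates for a small async fn (the real lowering also handles pinning and self-references, which are omitted here). This uses the `futures` crate's `block_on` only to drive the example.

```rust
// Hand-sketched illustration: the compiler turns an async fn into an enum-like
// state machine whose variants correspond to its suspension points.
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

async fn double_after(step: impl Future<Output = i32>) -> i32 {
    let v = step.await;
    v * 2
}

// A hand-written equivalent of the state machine above.
enum DoubleAfter<F> {
    Awaiting(F),
    Done,
}

impl<F: Future<Output = i32> + Unpin> Future for DoubleAfter<F> {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        match &mut *self {
            DoubleAfter::Awaiting(step) => match Pin::new(step).poll(cx) {
                Poll::Ready(v) => {
                    *self = DoubleAfter::Done;
                    Poll::Ready(v * 2)
                }
                Poll::Pending => Poll::Pending,
            },
            DoubleAfter::Done => panic!("polled after completion"),
        }
    }
}

fn main() {
    // Both versions resolve to the same value; the async fn's state lives
    // inline in the state machine, with no heap allocation required by the
    // abstraction itself.
    assert_eq!(futures::executor::block_on(double_after(async { 21 })), 42);
    assert_eq!(
        futures::executor::block_on(DoubleAfter::Awaiting(Box::pin(async { 21 }))),
        42
    );
}
```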
The anticipated benefits are similar to the benefits of async/await on CPU: better ergonomics for the developer writing concurrent code, better utilization of shared/limited resources, fewer concurrency bugs.
Warp divergence is expensive - essentially the hardware runs 'don't run this code' masking to maintain SIMT.
GPUs are still not practically Turing-complete in the sense that there are strict restrictions on loops/goto/IO/waiting (there are a bunch of band-aids to make them pretend it's not a functional programming model).
So I am not sure retrofitting a Ferrari to cosplay as an Amazon delivery van is useful for anything other than a tech showcase?
Good tech showcase though :)
Very cool!
Is the goal with this project (generally, not specifically async) to have an equivalent to e.g. CUDA, but in Rust? Or is there another intended use-case that I'm missing?
Et tu, GPU?
I am, bluntly, sick of async taking over Rust ecosystems. Embedded and web/HTTP have already fallen. I'm optimistic this won't take hold on the GPU; we'll see. Async splits the ecosystem. I see it as the biggest threat to Rust staying a useful tool.
I use Rust on the GPU for the following: 3D graphics via WGPU, cuFFT via FFI, custom kernels via Cudarc, and ML via Burn and Candle. Thankfully these are all async-free.
Is this Nvidia-only or does it work on other architectures?
Currently NVIDIA-only; we're cooking up some Vulkan stuff in rust-gpu, though.
I don't have anything to offer but my encouragement, but there are _dozens_ of ROCm enjoyers out there.
In years prior I wouldn't have even bothered, but it's 2026 and AMD's drivers actually come with a recent version of torch that 'just works' on Windows. Anything is possible :)
Does the lack of forward-progress guarantees (i.e., no equivalent of NVIDIA's independent thread scheduling, ITS) on other architectures pose challenges for async/await?