I'm quite interested in how they dealt with Rust's memory model, which might not map neatly onto CUDA's semantics. Curious what the differences are compared to CUDA C++, and whether Rust's type system can actually bring more safety to CUDA. (I do think writing GPU kernels is inherently unsafe: it's just too hard to create a safe language because of how the hardware works, and because you're hyper-optimizing all the time.)
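To make the type-system question a bit more concrete, here is a CPU-only sketch. It is not cuda-oxide's API and not real GPU code; `launch_scale` and `block_size` are hypothetical names made up for illustration. It only shows the general idea that if each "block" is handed an exclusive `&mut` chunk of the output, the borrow checker rules out two blocks writing the same element. Whether a real kernel API can keep that guarantee once shared memory and hand-tuned indexing enter the picture is exactly the open question.

```rust
use std::thread;

/// CPU stand-in for a kernel launch (hypothetical, for illustration only).
/// Each "block" receives an exclusive mutable chunk of the output, so the
/// borrow checker guarantees no two blocks write the same element.
fn launch_scale(input: &[f32], output: &mut [f32], factor: f32, block_size: usize) {
    assert_eq!(input.len(), output.len());
    assert!(block_size > 0);
    thread::scope(|s| {
        // chunks_mut yields non-overlapping &mut slices, one per "block".
        for (in_chunk, out_chunk) in input
            .chunks(block_size)
            .zip(output.chunks_mut(block_size))
        {
            s.spawn(move || {
                for (o, i) in out_chunk.iter_mut().zip(in_chunk) {
                    *o = *i * factor;
                }
            });
        }
    });
}

fn main() {
    let input = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    let mut output = [0.0f32; 5];
    launch_scale(&input, &mut output, 2.0, 2);
    println!("{:?}", output);
}
```

The limitation is obvious: this only covers access patterns that split cleanly into disjoint slices. Anything like a scatter or a reduction would presumably need atomics or `unsafe`, which is where the "inherently unsafe" argument comes back in.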
This is amazing. I've been working with custom CUDA kernels and https://crates.io/crates/cudarc for a long time, and this honestly looks like it could be a near drop-in replacement.
I'm especially curious how build times would compare. Most Rust CUDA crates rely on calling CMake or nvcc, which can make compilation painfully slow. Coincidentally, just last week I was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts, but you still end up paying for expensive custom nvcc invocations (e.g. candle by Hugging Face calls a custom nvcc command in its kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...
Here's the repo link: https://github.com/NVlabs/cuda-oxide
Personally I really don't want new GPU languages that do not have AD as a first-class citizen. I mean, Rust is an improvement over CUDA C++, but still.
Sorry, what is AD in this context?
edit: oh, automatic differentiation?
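For anyone else wondering, here is a minimal sketch of what automatic differentiation means, using forward mode with dual numbers. This is hand-rolled for illustration: the `Dual` type and the function `f` are made up here, and it is not how any particular library or compiler feature implements AD. Each value carries its derivative along with it, and the arithmetic operators apply the sum and product rules.

```rust
use std::ops::{Add, Mul};

/// A dual number: a value paired with its derivative w.r.t. one input.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Dual {
    val: f64,
    der: f64,
}

impl Dual {
    /// The input variable itself: dx/dx = 1.
    fn var(x: f64) -> Self {
        Dual { val: x, der: 1.0 }
    }

    /// A constant: its derivative is 0.
    fn constant(c: f64) -> Self {
        Dual { val: c, der: 0.0 }
    }
}

impl Add for Dual {
    type Output = Dual;
    /// Sum rule: (u + v)' = u' + v'.
    fn add(self, rhs: Dual) -> Dual {
        Dual { val: self.val + rhs.val, der: self.der + rhs.der }
    }
}

impl Mul for Dual {
    type Output = Dual;
    /// Product rule: (u * v)' = u' * v + u * v'.
    fn mul(self, rhs: Dual) -> Dual {
        Dual {
            val: self.val * rhs.val,
            der: self.der * rhs.val + self.val * rhs.der,
        }
    }
}

/// f(x) = x^2 + 3x, written with ordinary operators.
fn f(x: Dual) -> Dual {
    x * x + Dual::constant(3.0) * x
}

fn main() {
    let y = f(Dual::var(2.0));
    // f(2) = 10 and f'(2) = 2*2 + 3 = 7
    println!("f(2) = {}, f'(2) = {}", y.val, y.der);
}
```

"First class" AD usually means the compiler does this transformation for you on ordinary code (and supports reverse mode for gradients of many inputs), rather than making you thread a wrapper type through everything.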
There is actually work on adding autodiff to Rust, maybe not really as a first-class citizen, but at least built in: https://doc.rust-lang.org/std/autodiff/index.html (it is still at a pre-RFC stage, so it is not something that will be added soon)
Incredible, I had never heard of std::autodiff before. Isn't it rare for a programming language to provide AD within the standard library? Even Julia doesn't have it. I wouldn't have expected Rust, of all languages, to experiment with it in std.
Really hard to find alternatives to Julia for AD as a first class citizen
I think the parent is mostly referring to solutions like Slang.D
This is somewhat good news for Rust if you want to use the language with CUDA. The problem is, it still doesn't really move the needle if you really don't like running closed source drivers and runtime binaries and care about open source.
Continuing from this discussion [0], this only makes it a Rust or a CUDA problem rather than a Python, CUDA and PyTorch one if there's a bug in one of them.
Yet at the end of the day, it still uses Nvidia's closed source CUDA compiler 'nvcc', which they will never open source. At least Mojo promises to open source their own compiler, which compiles to different accelerators with multiple backend support.
Unlike this...but uses Rust.
[0] https://news.ycombinator.com/item?id=48067228
It remains to be seen whether Mojo is another Swift for TensorFlow; apparently 1.0 won't even support Windows properly.
who the fuck uses windows
The majority of computer owners on planet Earth
But also the majority of programmers?
In AI-focused fields like business analytics and data science, yeah.
> it still doesn't really move the needle if you really don't like running closed source drivers and runtime binaries
Those people probably did not buy an Nvidia GPU for themselves. It should be common knowledge that the "Open" Nvidia drivers still run gigantic firmware blobs to dispatch complex workloads. And Nouveau is close to useless for GPGPU compute.
Why do we bother with programming languages today? Why not have the LLMs just write assembly code and skip the human readable part? We are not reviewing it anymore anyway.
A lot of really good reasons:
1) Higher level code is easier for LLMs to review and iterate upon. The more the intent is clear from the code, the easier it is for humans and LLMs to work with.
2) LLMs get stuck or fail to solve a problem sometimes. It is preferable to have artifacts that humans can grok without the massive extra effort of parsing out assembly code.
3) Assembly code varies massively across targets. We want provable, deterministic transformation from the intent (specified in a higher level language) to the target assembly language. LLMs can't reliably output many artifacts for different platforms that behave the same.
4) Hopefully, we are still reviewing the code output by LLMs to some extent.
Is this a serious question or are you just trolling?
I get what you mean but I think if anything AI pairs extremely well with strongly typed languages that are at times cumbersome for humans, but decrease the latency at which AI can get feedback on its code. In my (very) limited experience Rust is an excellent target for AI codegen.
This is a Rust to CUDA converter, so I guess it is for code where the programmer wants it to function properly (Rust) and have good performance (CUDA).
It’s just a matter of different workflows for different users and applications.
I mean, AI is not good at writing x86-64 assembly code. Last time I tried (with both Claude and ChatGPT), the AI failed to even create basic programs other than Hello World.
Because when this idiotic hypemachinery finally dies an agonising, painful death, some of us still want to work with computers