This needs an index and introduction. It's also not super interesting to people in industry? Like yeah, it'd be nice if bindless textures were part of the API so you didn't need to create that global descriptor set. It'd be nice if you just sample from pointers to textures similar to how dereferencing buffer pointers works.
This is a fantastic article that demonstrates how many parts of vulkan and DX12 are no longer needed.
I hope the IHVs have a look at it because current DX12 seems semi abandoned, with it not supporting buffer pointers even when every gpu made on the last 10 (or more!) years can do pointers just fine, and while Vulkan doesnt do a 2.0 release that cleans things, so it carries a lot of baggage, and specially, tons of drivers that dont implement the extensions that really improve things.
If this api existed, you could emulate openGL on top of this faster than current opengl to vulkan layers, and something like SDL3 gpu would get a 3x/4x boost too.
DirectX documentation is on a bad state currently, you have the Frank Lunas's books, which don't cover the latest improvements, and then is hunting through Learn, Github samples and reference docs.
Vulkan is another mess, even if there was a 2.0, how are devs supposed to actually use it, especially on Android, the biggest consumer Vulkan platform?
Isn't this all because PCI resizable BAR is not required to run any GPU besides Intel Arc? As in, maybe it's mostly down to Microsoft/Intel mandating reBAR in UEFI so we can start using stuff like bindless textures without thousands of support tickets and negative reviews.
I think this puts a floor on supported hardware though, like Nvidia 30xx and Radeon 5xxx. And of course motherboard support is a crapshoot until 2020 or so.
This is not really directly about resizable BAR. you could do mostly the same api without it. Resizable bar simplifies it a little bit because you skip manual transfer operations, but its not completely required as you can write things to a cpu-writeable buffer and then begin your frame with a transfer command.
Bindless textures never needed any kind of resizable BAR, you have been able to use them since early 2010s on opengl through an extension. Buffer pointers also have never needed it.
The article is missing this motivation paragraph, taken from the blog index:
> Graphics APIs and shader languages have significantly increased in complexity over the past decade. It’s time to start discussing how to strip down the abstractions to simplify development, improve performance, and prepare for future GPU workloads.
Thanks, I had trouble figuring out what the article was about, lost in all the "here's how I used AI and had the article screened by industry insiders".
Meaning ... SSDs initially reused IDE/SATA interfaces, which had inherent bottlenecks because those standards were designed for spinning disks.
To fully realize SSD performance, a new transport had to be built from the ground up, one that eliminated those legacy assumptions, constraints and complexities.
I have followed Sebastian Aaltonen's work for quite a while now, so maybe I am a bit biased, this is however a great article.
I also think that the way forward is to go back to software rendering, however this time around those algorithms and data structures are actually hardware accelerated as he points out.
Note that this is an ongoing trend on VFX industry already, about 5 years ago OTOY ported their OctaneRender into CUDA as the main rendering API.
There are tons of places within the GPU where dedicated fixed function hardware provides massive speedups within the relevant pipelines (rasterization, raytracing). The different shader types are designed to fit inbetween those stages. Abandoning this hardware would lead to a massive performance regression.
Offtop, but sorry, I can't resist. "Inbetween" is not a word. I started seeing many people having trouble with prepositions lately, for some unknown reason.
> “Inbetween” is never written as one word. If you have seen it written in this way before, it is a simple typo or misspelling. You should not use it in this way because it is not grammatically correct as the noun phrase or the adjective form.
https://grammarhow.com/in-between-in-between-or-inbetween/
Isn't this already happening to some degree? E.g. UE's Nanite uses a software rasterizer for small triangles, albeit running on the GPU via a compute shader.
Things are kind of heading in two opposite directions at the moment. Early GPU rasterization was all done in fixed-function hardware, but then we got programmable shading, and then we started using compute shaders to feed the HW rasterizer, and then we started replacing the HW rasterizer itself with more compute (as in Nanite). The flexibility of doing whatever you want in software has gradually displaced the inflexible hardware units.
Meanwhile GPU raytracing was a purely software affair until quite recently when fixed-function raytracing hardware arrived. It's fast but also opaque and inflexible, only exposed through high-level driver interfaces which hide most of the details, so you have to let Jensen take the wheel. There's nothing stopping someone from going back to software RT of course but the performance of hardware RT is hard to pass up for now, so that's mostly the way things are going even if it does have annoying limitations.
Why do you say 'albeit'? I think it's established that 'software rendering' can mean running on the GPU. That's what Octane is doing with CUDA in the comment you are replying to. But good callout on Nanite.
Impressive post, so many details. I could only understand some parts of it, but I think this article will probably be a reference for future graphics API.
I think it's fair to say that for most gamers, Vulkan/DX12 hasn't really been a net positive, the PSO problem affected many popular games and while Vulkan has been trying to improve, WebGPU is tricky as it has is roots on the first versions of Vulkan.
Perhaps it was a bad idea to go all in to a low level API that exposes many details when the hardware underneath is evolving so fast. Maybe CUDA, as the post says in some places, with its more generic computing support is the right way after all.
I miss Mantle. It had its quirks but you felt as if you were literally programming hardware using a pretty straight forward API. The most fun I’ve had programming was for the Xbox 360.
You know what else is good like that? The Switch graphics API - designed by Nvidia and Nintendo. Easily the most straightforward of the console graphics APIs
Great post, it brings back a lot of memories. Two additional factors that designers of these APIs consider are:
* GPU virtualization (e.g., the D3D residency APIs), to allow many applications to share GPU resources (e.g., HBM).
* Undefined behavior: how easy is it for applications to accidentally or intentionally take a dependency on undefined behavior? This can make it harder to translate this new API to an even newer API in the future.
And the GPU API cycle of life and death continues!
I was an only-half-joking champion of ditching vertex attrib bindings when we were drafting WebGPU and WGSL, because it's a really nice simplification, but it was felt that would be too much of a departure from existing APIs. (Spending too many of our "Innovation Tokens" on something that would cause dev friction in the beginning)
In WGSL we tried (for a while?) to build language features as "sugar" when we could. You don't have to guess what order or scope a `for` loop uses when we just spec how it desugars into a simpler, more explicit (but more verbose) core form/dialect of the language.
That said, this powerpoint-driven-development flex knocks this back a whole seriousness and earnestness tier and a half:
> My prototype API fits in one screen: 150 lines of code. The blog post is titled “No Graphics API”. That’s obviously an impossible goal today, but we got close enough. WebGPU has a smaller feature set and features a ~2700 line API (Emscripten C header).
Try to zoom out on the API and fit those *160* lines on one screen! My browser gives up at 30%, and I am still only seeing 127. This is just dishonesty, and we do not need more of this kind of puffery in the world.
And yeah, it's shorter because it is a toy PoC, even if one I enjoyed seeing someone else's take on it. Among other things, the author pretty dishonestly elides the number of lines the enums would take up. (A texture/data format enum on one line? That's one whole additional Pinocchio right there!)
I took WebGPU.webidl and did a quick pass through removing some of the biggest misses of this API (queries, timers, device loss, errors in general, shader introspection, feature detection) and some of the irrelevant parts (anything touching canvas, external textures), and immediately got it down to 241 declarations.
This kind of dishonest puffery holds back an otherwise interesting article.
I disagree that ray tracing and mesh shaders are largely ignored - at least within AAA game engines they are leaned on quite a lot. Particularly ray tracing.
Yes, the centralization of engines to Unreal, Unity, etc makes it so there’s less interest in pushing the boundaries, they are still pushed just on the GPU side.
From a CPU API perspective, it’s very close to just plain old buffer mapping and go. We would need a hardware shift that would add something more to the pipeline than what we currently do. Like when tesselation shaders came about from geometry shader practices.
The frontier of graphics APIs might be the consoles and they don't get a bump until the hardware gets a bump and the console hardware is a little bit behind.
Valve seems to be substantially responsible for the mess that is Vulkan. They were one of its pioneers from what I heard when chatting with Vulkan people.
There's plenty of blame to go around, but if any one faction is responsible for the Vulkan mess it's the mobile GPU vendors and Khronos' willingness to compromise for their sake at every turn. Huge amounts of API surface was dedicated to accommodating limitations that only existed on mobile architectures, and earlier versions of Vulkan insisted on doing things the mobile way even if you knew your software was only ever going to run on desktop.
Thankfully later versions have added escape hatches which bypass much of that unnecessary bureaucracy, but it was grim for a while, and all that early API cruft is still there to confuse newcomers.
"Pipeline State Objects" (immutable state objects which define most of the rendering state needed for a draw/dispatch call). Tbf, it's a very common term in rendering since around 2015 when the modern 3D APIs showed up.
There is no implementation of it but this is how i see it, at least comparing with how things with fully extensioned vulkan work, which uses a few similar mechanics.
Per-drawcall cost goes to nanosecond scale. Assuming you do drawcalls of course, this makes bindless and indirect rendering a bit easier so you could drop CPU cost to near-0 in a renderer.
It would also highly mitigate shader compiler hitches due to having a split pipeline instead of a monolythic one.
The simplification on barriers could improve performance a significant amount because currently, most engines that deal with Vulkan and DX12 need to keep track of individual texture layouts and transitions, and this completely removes such a thing.
Getting rid of cruft isn't really a goal in and of itself, it's a goal in service of other goals. If it's not about performance, what else would be accomplished?
A simplified API means higher programmer productivity, higher robustness, simplified debugging and testing, and also less internal complexity in the driver. All this together may also result in slightly higher performance, but it's not the main goal. You might gain a couple hundred microseconds per frame as a side effect of the simpler code, but if your use case already perfectly fits the 'modern subset' of Vulkan or D3D12, the performance gains will be deep in 'diminishing returns area' and hardly noticeable in the frame rate. It's mostly about secondary effects by making the programmer's life easier on both sides of the API.
The cost/compromise is dropping support for outdated GPUs.
It would likely reduce or eliminate the "compiling shaders" step many games now have on first run after an update, and the stutters many games have as new objects or effects come on screen for the first time.
But then game/engine devs want to use the vertex shader producing a uv coordinate and a normal together with a pixel shader that only reads the uv coordinate (or neither for shadow mapping) and don't want to pay for the bandwidth of the unused vertex outputs (or the cost of calculating them).
Or they want to be able to randomly enable any other pipeline stage like tessellation or geometry and the same shader should just work without any performance overhead.
This needs an index and introduction. It's also not super interesting to people in industry? Like yeah, it'd be nice if bindless textures were part of the API so you didn't need to create that global descriptor set. It'd be nice if you just sample from pointers to textures similar to how dereferencing buffer pointers works.
This is a fantastic article that demonstrates how many parts of vulkan and DX12 are no longer needed.
I hope the IHVs have a look at it because current DX12 seems semi abandoned, with it not supporting buffer pointers even when every gpu made on the last 10 (or more!) years can do pointers just fine, and while Vulkan doesnt do a 2.0 release that cleans things, so it carries a lot of baggage, and specially, tons of drivers that dont implement the extensions that really improve things.
If this api existed, you could emulate openGL on top of this faster than current opengl to vulkan layers, and something like SDL3 gpu would get a 3x/4x boost too.
No longer needed is a strong statement given how recent the GPU support is. It's unlikely anything could accept those minimum requirements today.
But soon? Hopefully
I'm surprised he made no mention of the SDL3 GPU API since his proposed API has pretty significant overlap with it.
DirectX documentation is on a bad state currently, you have the Frank Lunas's books, which don't cover the latest improvements, and then is hunting through Learn, Github samples and reference docs.
Vulkan is another mess, even if there was a 2.0, how are devs supposed to actually use it, especially on Android, the biggest consumer Vulkan platform?
Isn't this all because PCI resizable BAR is not required to run any GPU besides Intel Arc? As in, maybe it's mostly down to Microsoft/Intel mandating reBAR in UEFI so we can start using stuff like bindless textures without thousands of support tickets and negative reviews.
I think this puts a floor on supported hardware though, like Nvidia 30xx and Radeon 5xxx. And of course motherboard support is a crapshoot until 2020 or so.
This is not really directly about resizable BAR. you could do mostly the same api without it. Resizable bar simplifies it a little bit because you skip manual transfer operations, but its not completely required as you can write things to a cpu-writeable buffer and then begin your frame with a transfer command.
Bindless textures never needed any kind of resizable BAR, you have been able to use them since early 2010s on opengl through an extension. Buffer pointers also have never needed it.
The article is missing this motivation paragraph, taken from the blog index:
> Graphics APIs and shader languages have significantly increased in complexity over the past decade. It’s time to start discussing how to strip down the abstractions to simplify development, improve performance, and prepare for future GPU workloads.
Thanks, I had trouble figuring out what the article was about, lost in all the "here's how I used AI and had the article screened by industry insiders".
Would this be analogous to NVMe?
Meaning ... SSDs initially reused IDE/SATA interfaces, which had inherent bottlenecks because those standards were designed for spinning disks.
To fully realize SSD performance, a new transport had to be built from the ground up, one that eliminated those legacy assumptions, constraints and complexities.
...and introduced new ones.
I have followed Sebastian Aaltonen's work for quite a while now, so maybe I am a bit biased, this is however a great article.
I also think that the way forward is to go back to software rendering, however this time around those algorithms and data structures are actually hardware accelerated as he points out.
Note that this is an ongoing trend on VFX industry already, about 5 years ago OTOY ported their OctaneRender into CUDA as the main rendering API.
There are tons of places within the GPU where dedicated fixed function hardware provides massive speedups within the relevant pipelines (rasterization, raytracing). The different shader types are designed to fit inbetween those stages. Abandoning this hardware would lead to a massive performance regression.
Just consider the sheer number of computations offloaded to TMUs. Shaders would already do nothing but interpolate texels if you removed them.
Offtop, but sorry, I can't resist. "Inbetween" is not a word. I started seeing many people having trouble with prepositions lately, for some unknown reason.
> “Inbetween” is never written as one word. If you have seen it written in this way before, it is a simple typo or misspelling. You should not use it in this way because it is not grammatically correct as the noun phrase or the adjective form. https://grammarhow.com/in-between-in-between-or-inbetween/
If enough people use it, it will become correct. This is how language evolves. BTW, there is no "official English language specification".
And linguists think it would be a bad idea to have one:
https://archive.nytimes.com/opinionator.blogs.nytimes.com/20...
Isn't this already happening to some degree? E.g. UE's Nanite uses a software rasterizer for small triangles, albeit running on the GPU via a compute shader.
Things are kind of heading in two opposite directions at the moment. Early GPU rasterization was all done in fixed-function hardware, but then we got programmable shading, and then we started using compute shaders to feed the HW rasterizer, and then we started replacing the HW rasterizer itself with more compute (as in Nanite). The flexibility of doing whatever you want in software has gradually displaced the inflexible hardware units.
Meanwhile GPU raytracing was a purely software affair until quite recently when fixed-function raytracing hardware arrived. It's fast but also opaque and inflexible, only exposed through high-level driver interfaces which hide most of the details, so you have to let Jensen take the wheel. There's nothing stopping someone from going back to software RT of course but the performance of hardware RT is hard to pass up for now, so that's mostly the way things are going even if it does have annoying limitations.
Why do you say 'albeit'? I think it's established that 'software rendering' can mean running on the GPU. That's what Octane is doing with CUDA in the comment you are replying to. But good callout on Nanite.
Impressive post, so many details. I could only understand some parts of it, but I think this article will probably be a reference for future graphics API.
I think it's fair to say that for most gamers, Vulkan/DX12 hasn't really been a net positive, the PSO problem affected many popular games and while Vulkan has been trying to improve, WebGPU is tricky as it has is roots on the first versions of Vulkan.
Perhaps it was a bad idea to go all in to a low level API that exposes many details when the hardware underneath is evolving so fast. Maybe CUDA, as the post says in some places, with its more generic computing support is the right way after all.
Very well written but I can't understand much of this article.
What would be one good primer to be able to comprehend all the design issues raised?
I miss Mantle. It had its quirks but you felt as if you were literally programming hardware using a pretty straight forward API. The most fun I’ve had programming was for the Xbox 360.
You know what else is good like that? The Switch graphics API - designed by Nvidia and Nintendo. Easily the most straightforward of the console graphics APIs
Yes but it’s so underpowered. I want RTX 5090 performance with 16 cores.
NVIDIA's NVRHI has been my favorite abstraction layer over the complexity that modern APIs bring.
In particular, this fork: https://github.com/RobertBeckebans/nvrhi which adds some niceties and quality of life improvements.
After reading this article, I feel like I've witnessed a historic moment.
Great post, it brings back a lot of memories. Two additional factors that designers of these APIs consider are:
* GPU virtualization (e.g., the D3D residency APIs), to allow many applications to share GPU resources (e.g., HBM).
* Undefined behavior: how easy is it for applications to accidentally or intentionally take a dependency on undefined behavior? This can make it harder to translate this new API to an even newer API in the future.
ironically, explaining that "we need a simpler API" takes a dense 69-page technical missive that would make the Kronos Vulkan tutorial blush.
I don't understand why you think this is ironic
And the GPU API cycle of life and death continues!
I was an only-half-joking champion of ditching vertex attrib bindings when we were drafting WebGPU and WGSL, because it's a really nice simplification, but it was felt that would be too much of a departure from existing APIs. (Spending too many of our "Innovation Tokens" on something that would cause dev friction in the beginning)
In WGSL we tried (for a while?) to build language features as "sugar" when we could. You don't have to guess what order or scope a `for` loop uses when we just spec how it desugars into a simpler, more explicit (but more verbose) core form/dialect of the language.
That said, this powerpoint-driven-development flex knocks this back a whole seriousness and earnestness tier and a half: > My prototype API fits in one screen: 150 lines of code. The blog post is titled “No Graphics API”. That’s obviously an impossible goal today, but we got close enough. WebGPU has a smaller feature set and features a ~2700 line API (Emscripten C header).
Try to zoom out on the API and fit those *160* lines on one screen! My browser gives up at 30%, and I am still only seeing 127. This is just dishonesty, and we do not need more of this kind of puffery in the world.
And yeah, it's shorter because it is a toy PoC, even if one I enjoyed seeing someone else's take on it. Among other things, the author pretty dishonestly elides the number of lines the enums would take up. (A texture/data format enum on one line? That's one whole additional Pinocchio right there!)
I took WebGPU.webidl and did a quick pass through removing some of the biggest misses of this API (queries, timers, device loss, errors in general, shader introspection, feature detection) and some of the irrelevant parts (anything touching canvas, external textures), and immediately got it down to 241 declarations.
This kind of dishonest puffery holds back an otherwise interesting article.
Who cares about dev friction in the beginning? That was a bad choice.
This seems tangentially related?
https://github.com/google/toucan
I wonder why M$ stopped putting out new Direct X? Direct X Ultimate or 12.1 or 12.2 is largely the same as Direct X 12.
Or has the use of Middleware like Unreal Engine largely made them irrelevant? Or should EPIC put out a new Graphics API proposal?
That has always been the case, it is mostly FOSS circles that argue about APIs.
Game developers create a RHI (rendering hardware interface) like discussed on the article, and go on with game development.
Because the greatest innovation thus far has been ray tracing and mesh shaders, and still they are largely ignored, so why keep on pushing forward?
I disagree that ray tracing and mesh shaders are largely ignored - at least within AAA game engines they are leaned on quite a lot. Particularly ray tracing.
Both-ish.
Yes, the centralization of engines to Unreal, Unity, etc makes it so there’s less interest in pushing the boundaries, they are still pushed just on the GPU side.
From a CPU API perspective, it’s very close to just plain old buffer mapping and go. We would need a hardware shift that would add something more to the pipeline than what we currently do. Like when tesselation shaders came about from geometry shader practices.
The frontier of graphics APIs might be the consoles and they don't get a bump until the hardware gets a bump and the console hardware is a little bit behind.
This looks very similar to the SDL3 GPU API and other RHI libraries that have been created at first glance.
I wonder if Valve might put out their own graphics API for SteamOS.
Valve seems to be substantially responsible for the mess that is Vulkan. They were one of its pioneers from what I heard when chatting with Vulkan people.
There's plenty of blame to go around, but if any one faction is responsible for the Vulkan mess it's the mobile GPU vendors and Khronos' willingness to compromise for their sake at every turn. Huge amounts of API surface was dedicated to accommodating limitations that only existed on mobile architectures, and earlier versions of Vulkan insisted on doing things the mobile way even if you knew your software was only ever going to run on desktop.
Thankfully later versions have added escape hatches which bypass much of that unnecessary bureaucracy, but it was grim for a while, and all that early API cruft is still there to confuse newcomers.
Samsung and Google also have their share, see who does most of Vulkanised talks.
the article talks a lot about PSOs but never defines the term
"Pipeline State Objects" (immutable state objects which define most of the rendering state needed for a draw/dispatch call). Tbf, it's a very common term in rendering since around 2015 when the modern 3D APIs showed up.
PSOs are Pipeline State Objects, they encapsulate the entire state of the rendering pipeline.
what level of performance improvements would this represent?
There is no implementation of it but this is how i see it, at least comparing with how things with fully extensioned vulkan work, which uses a few similar mechanics.
Per-drawcall cost goes to nanosecond scale. Assuming you do drawcalls of course, this makes bindless and indirect rendering a bit easier so you could drop CPU cost to near-0 in a renderer.
It would also highly mitigate shader compiler hitches due to having a split pipeline instead of a monolythic one.
The simplification on barriers could improve performance a significant amount because currently, most engines that deal with Vulkan and DX12 need to keep track of individual texture layouts and transitions, and this completely removes such a thing.
It's mostly not about performance, but about getting rid of legacy cruft that still exists in modern 3D APIs to support older GPU architectures.
Getting rid of cruft isn't really a goal in and of itself, it's a goal in service of other goals. If it's not about performance, what else would be accomplished?
A simplified API means higher programmer productivity, higher robustness, simplified debugging and testing, and also less internal complexity in the driver. All this together may also result in slightly higher performance, but it's not the main goal. You might gain a couple hundred microseconds per frame as a side effect of the simpler code, but if your use case already perfectly fits the 'modern subset' of Vulkan or D3D12, the performance gains will be deep in 'diminishing returns area' and hardly noticeable in the frame rate. It's mostly about secondary effects by making the programmer's life easier on both sides of the API.
The cost/compromise is dropping support for outdated GPUs.
It would likely reduce or eliminate the "compiling shaders" step many games now have on first run after an update, and the stutters many games have as new objects or effects come on screen for the first time.
Probably mostly about quality of life. Legacy graphics APIs like Vulkan have abysmal developer UX for no good reason.
I mean sure, this should be nice and easy.
But then game/engine devs want to use the vertex shader producing a uv coordinate and a normal together with a pixel shader that only reads the uv coordinate (or neither for shadow mapping) and don't want to pay for the bandwidth of the unused vertex outputs (or the cost of calculating them).
Or they want to be able to randomly enable any other pipeline stage like tessellation or geometry and the same shader should just work without any performance overhead.