The extra cache doesn't do a damn thing (maybe +2%)
The lower leakage currents at lower voltages allowed them to implement a far more aggressive clock curve from the factory. That's where the higher all-core clock comes from (+30W TDP)
I'm not complaining at all, I think this is an excellent way to leverage binning to sell leftover cache.
Though if I may complain: Ars used to actually write about such things in their articles instead of speculating in a way that suspiciously resembles what an AI would write.
> The extra cache doesn't do a damn thing (maybe +2%)
It depends on the task. For some memory-bound tasks the extra cache is very helpful. For CFD and other simulation workloads the benefits are huge.
For other tasks it doesn't help at all.
If someone wants a simple gaming CPU or general-purpose CPU, they don't need to spend the money on this; they don't need the 16-core CPU at all. The 9850X3D is a better buy for most users who aren't frequently doing a lot of highly parallel work.
It's very workload dependent. It certainly does more than 2% on many workloads.
See https://www.phoronix.com/review/amd-ryzen-9-9950x3d-linux/10
> Here is the side-by-side of the Ryzen 9 9950X vs. 9950X3D for showing the areas where 3D V-Cache really is helpful:
Most are greater than 2%. The biggest speedup is 58.1%, and that's with 3D V-Cache on only half the chip.
Probably fun for those who already bought DDR5 memory... still kicking myself for not just pulling the trigger on that 128GB dual stick kit I looked at for $600 back in September. Now it's listed at $4k...
Meanwhile I hope my AM4 will chug along a few more years.
I really want an X3D because a game I play is heavily single-threaded. I have the income and the financial stability, but I can't in good conscience upgrade to AM5 with these RAM prices. It's insane.
9950X3D2? AMD, who is making you name your products like this? At some point just give up and name the chip a UUID already.
I actually don't mind this one: 9950 is the actual chip, X3D is the (larger) cache, and the 2 stands for it being on both chiplets.
Crazy to think that my first personal computer's entire storage (160MB, IIRC) could fit into the L3 of a single consumer CPU!
It's probably not possible architecturally, but it would be amusing to see an entire early 90's OS running entirely in the CPU's cache.
My first PC had a 20MB HDD and 512KB of RAM. So yeah, that could fit into the cache 10 times over now.
https://github.com/coreboot/coreboot/blob/main/src/soc/intel...
Context: Early in the firmware boot process the memory controller isn't configured yet so the firmware uses the cache as RAM. In this mode cache lines are never evicted since there's no memory to evict them to.
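That no-eviction property can be sketched in a few lines. A toy "cache-as-RAM" store (names and sizes made up for illustration, not any real firmware API) simply fails an allocation past capacity, because a real cache in this mode has no DRAM to write a victim line back to:

```python
# Toy model of cache-as-RAM: with no backing memory behind the cache,
# a line can never be evicted, so allocating past capacity is a hard
# failure rather than a writeback.
class CacheAsRam:
    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = {}                    # line index -> data

    def write(self, index, data):
        if index not in self.lines and len(self.lines) == self.capacity:
            # A normal cache would evict a victim line and write it
            # back to DRAM here; in CAR mode there is nowhere to put it.
            raise MemoryError("cache full and nothing to evict to")
        self.lines[index] = data

car = CacheAsRam(capacity_lines=2)
car.write(0, b"early-boot stack")
car.write(1, b"early-boot heap")
try:
    car.write(2, b"one line too many")
except MemoryError as e:
    print(e)  # -> cache full and nothing to evict to
```

This is why CAR-phase firmware is written against a fixed, small memory budget: exceeding it isn't slow, it's fatal.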
In my case it began with 16K (yes, 16×1024 bytes) of RAM and 90K (yes, 90×1024 bytes) 5.25" floppy disks (although the floppies came a few months after the computer). Eventually upgraded to 48K RAM and 180K double-density floppy disks. The computer: Atari 800.
I'll see your Atari 800 and raise you my Atari 2600 with its whopping 128 bytes of RAM. Bytes with a B. I can kinda sorta call it a computer because you could buy a BASIC cartridge for it (I didn't and stand by that decision - it was pretty bad).
You had ~160,000 times more storage than I did for my first personal computer.
Maybe in 50 years the cache of CPUs and GPUs will be 1TB. Enough to run multiple LLMs (a model entirely run for each task). Having robots like in the movies would need LLMs much much faster than what we see today.
KolibriOS would fit in there, even with the data in memory. You cannot load it into the cache directly, but when the cache capacity is larger than all the data you read, there should be no cache eviction, and the OS and all data should end up in the cache more or less entirely. In other words it should be really, really fast, which KolibriOS already is to begin with.
Unless you lay everything out contiguously in memory, you'll still get cache evictions due to associativity, depending on the CPU's eviction strategy. But certainly DOS or even early Windows 95 could conceivably run entirely out of the cache.
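The associativity point is easy to demonstrate with a toy model: even when the working set is smaller than total cache capacity, a badly strided layout can map every line into the same set and force conflict evictions. The geometry below (4 sets, 2 ways, LRU) is purely illustrative, not any real CPU's:

```python
# Toy set-associative cache with LRU eviction. Parameters are
# illustrative, not any real CPU's geometry.
LINE = 64          # bytes per cache line
SETS = 4           # number of sets
WAYS = 2           # associativity -> capacity = 4 sets * 2 ways * 64B = 512B

def simulate(addresses):
    """Return the number of evictions when touching each address once."""
    sets = {s: [] for s in range(SETS)}   # per-set LRU list of tags
    evictions = 0
    for addr in addresses:
        s = (addr // LINE) % SETS         # which set this line maps to
        tag = addr // (LINE * SETS)
        if tag in sets[s]:
            sets[s].remove(tag)           # hit: refresh LRU position
        elif len(sets[s]) == WAYS:
            sets[s].pop(0)                # miss in a full set: evict LRU
            evictions += 1
        sets[s].append(tag)
    return evictions

# Contiguous 512B working set fills the cache exactly: no evictions.
contiguous = [i * LINE for i in range(8)]
# Same 8 lines strided so they all land in set 0: six conflict evictions,
# even though the working set still equals total capacity.
strided = [i * LINE * SETS for i in range(8)]
print(simulate(contiguous), simulate(strided))  # -> 0 6
```

Same data volume, same cache size; only the layout differs.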
Windows 95 only needed 4MB RAM and 50 MB disk, so that's certainly doable. The trick is to have a hypervisor spread that allocation across cache lines.
Yeah, cache eviction is the reason I was assuming it is "probably not possible architecturally", but I also figured there could be features beyond my knowledge that might make it possible.
Edit: Also this 192MB of L3 is spread across two Zen CCDs, so it's not as simple as "throw it all in L3" either, because any given core would only have access to half of that.
Well, yeah, reality strikes again. All you need is a microcode exploit to gain access to AMD's equivalent of the ME, and then you could just map the cache as memory directly. Maybe. Can microcode do this, or is there still hardware that cannot be overcome by the black magic of CPU microcode?
IIRC some relatively strange CPUs could run with unbacked cache.
Intel's platforms, at the very least, use cache-as-RAM during the boot phase, before the DDR interface can be trained and brought up. https://github.com/coreboot/coreboot/blob/main/src/soc/intel...
I wonder how much faster dos would boot, especially with floppy seek times...
Instantly.
If you run a VM on a CPU like this, using a baremetal hypervisor, you can get very close to "everything in cache".
Given that the dies still have L3 on them does this count as L4 or does the hardware treat it as a single pool of L3?
Would be neat to have an additional cache layer of ~1 GB of HBM on the package but I guess there's no way that happens in the consumer space any time soon.
Per compute die it functions as one 96MB L3 with uniform latency, at 4 cycles more latency than the configuration with the smaller 32MB L3. But there are two compute dies, each with its own L3, and as on the 9950X, coherency between the two L3 caches is maintained over the global memory interconnect through the third (I/O) die.
Can someone explain whether the 3D V-Cache dies are stacked on top of each other or placed side by side?
If they are stacked, then why not a 9800X3D2?
The 99xx chips have two CPU dies, and one cache die is on each CPU die.
The 3D V-Cache sits underneath only one of the CCDs. See https://en.wikipedia.org/wiki/Ryzen#Ryzen_9000.
That's what's different about this one. "Enter the Ryzen 9 9950X3D2 Dual Edition, a mouthful of a chip that includes 64MB of 3D V-Cache on both processor dies, without the hybrid arrangement that has defined the other chips up until now."
Did you forget which thread we are on?
Breakdown of the (semi-clickbait) 208MB cache: 16MB L2 (8MB per die?) + 32MB L3 × 2 dies + 64MB stacked 3D V-Cache × 2
For comparison, the 9950X3D has a total cache of 144MB.
> 16MB L2 (8MB per die?)
It is indeed 8MB per compute die, but really 1MB per core, not shared across the entire CCD.
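A quick sanity check of that breakdown, using the figures from the comments above:

```python
# Cache totals for the 9950X3D2, per the breakdown in the thread.
l2 = 1 * 16            # 1MB of private L2 per core, 16 cores
l3_base = 32 * 2       # 32MB of base L3 on each of the two CCDs
vcache = 64 * 2        # 64MB of stacked 3D V-Cache on each CCD
print(l2 + l3_base + vcache)  # -> 208
```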
I wouldn’t be caught dead with less than 200MB of cache in my desktop in 2026.
It's disappointing that they had this for years but didn't release it until now.
I think it’s mostly that they had leftover cache.
I have a gigabyte of cache on my 9684x at home!