One thing I don't understand about Nvidia’s valuation: right now a small number of algorithms have 'won' (Transformers, above all), and the data matters more than the code. Compare that to the past, when customized code was much more common (modeling code, HPC): back then the ecosystem was critical, and it was practically impossible for a competitor to reimplement all of CUDA and its related libraries.
Competitors now only need to optimize for a narrow set of algorithms. If a vendor can run vLLM and Transformers efficiently, a massive market becomes available. Consequently, companies like AMD or Huawei should be able to catch up easily. What, then, is Nvidia’s moat? Is InfiniBand enough?
InfiniBand is being replaced by UEC (and it isn't needed for inference anyway). For inference there is no moat, and smart players are buying/renting AMD or Google TPUs.
Do you have evidence for this? I don’t think Nvidia is switching to Ultra Ethernet, just adding it to the product line-up.
I didn't know you could buy Google TPUs now?
The thing the "just optimize AI" crowd misses is that this isn't like optimizing a programming language implementation, where even the worst implementation is likely only 100x slower than a good implementation.
AI is millions of times slower than optimal algorithms for most things.
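A back-of-the-envelope sketch of the gap being claimed here (all numbers are rough assumptions, not measurements): a dense transformer does on the order of 2N FLOPs per generated token for N parameters, while the "optimal algorithm" for something like adding two numbers is a single instruction.

```python
def llm_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token for a
    dense transformer: roughly 2 * parameter_count (a common
    rule of thumb, not an exact figure)."""
    return 2.0 * n_params

# Adding two numbers directly costs on the order of 1 FLOP.
direct_flops = 1.0

# A hypothetical 7B-parameter model answering "what is 2+3?" in one token:
ratio = llm_flops_per_token(7e9) / direct_flops
print(f"~{ratio:.1e}x more work than the direct computation")
```

Under these assumptions the model spends roughly ten orders of magnitude more compute than the direct calculation, which is the sense in which "millions of times slower" is, if anything, an understatement for trivial tasks.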
The vast number of CUDA libraries for anything you can think of. I think that's where they have the biggest leverage.
To rephrase the OP's point: transformers et al. are worth trillions. All the other CUDA uses are worth tens or hundreds of billions. They've totally got that locked up, but research is a smaller market than video games.
AI is going to be so ubiquitous that something principled and open is going to supersede CUDA at some point, as HTML5 did for Flash. CUDA isn't like an x86 vs ARM situation where they can use hardware dominance for decades; it's a higher-level language, and being compatible with a wide range of systems benefits NVIDIA and their competitors alike. They're riding out their relative superiority for now, but we're going to see a standards and interoperability correction sometime soon, imo. NVIDIA will drive it, and it will gain them a few more years of dominance, but afaik nothing in their hardware IP means CUDA compatibility sacrifices performance or efficiency. They're also going to want to compete in the Chinese market, so being flexible about interoperability with Chinese systems gains them market access that might otherwise be lost.
There's a ton of pressure on the market to decouple Nvidia's proprietary software from literally everything important to AI, and they will either gracefully transition and control it, or it will reach a breaking point and someone else will do it for (and to) them. I'm sure they've got finance nerds and quants informing and minmaxing their strategy, so they probably know to the quarter when they'll pivot and launch their FOSS, industry-leading-standards narrative (or whatever the strategy is).
You'd think AMD would swing in on something like this and fund it with the money needed to succeed. I have no knowledge of it, but my guess is no: AMD never misses an opportunity to miss an opportunity when it comes to GPUs and AI.
AMD pays the bare minimum in software to get a product out the door. The company does not even have working performance testing and regressions routinely get shipped to customers. Benchmarks the executives see are ad hoc and not meaningful.
HipKittens is an improvement but AMD does not have the ability to understand or track kernel performance so it'll be ignored.
This isn't fixable overnight. Company-wide DevOps and infrastructure is outsourced to TCS in India who have no idea what they're doing. Teams with good leadership maintain their own shadow IT teams. ROCm didn't have such a team until hyperscalers lost their shit over our visibly poor development practices.
Even if AMD did extend an offer to hire all the people in the article, it would be below-market since the company benchmarks against Qualcomm, Broadcom, and Walmart, instead of Google, Nvidia, or Meta.
We haven't had a fully funded bonus in the past 4+ years.
The MBAs are in charge, and now AMD is the new Intel?
It's not only not fixable overnight, but it's not fixable at all if the leadership thinks they can coast on simply being not as bad as Intel, and Intel has a helluva lot of inertia and ability to simply sell OEM units on autopilot.
Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.
The MBAs have always been in charge to an extent.
But the real issue is we don't want to invest in beating Nvidia on quality. Otherwise we wouldn't be doing stock buybacks and instead use the money on poaching engineers.
The mindset is that we maintain a comfortable second place by creating a shittier but cheaper product. That is how AMD has operated since 1959 as a second source to Fairchild Semiconductor and Intel. It's going to remain the strategy of the company indefinitely with Nvidia. Attempting to become better would cost too much.
> Sounds like the AMD board needs to get their heads out of their asses and shake up leadership.
Knocking out Lisa Su would be stupid, since she has the loyalty of the whole company and is generally competent.
What they should do is bump TC by 60-70% and simultaneously lay off 50% of the engineers. Or phase in the same over a longer period of time. The company is full of people that do nothing because we've paid under market for so long. That's fine when competing against Intel, it's not acceptable when competing against Microsoft, Amazon, OpenAI, Google, and Nvidia.
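A quick sanity check on the trade-off being proposed (hypothetical numbers, just working through the arithmetic): raising total compensation while halving headcount can still come out cheaper than the status quo.

```python
def payroll_after(tc_bump: float, headcount_cut: float) -> float:
    """Relative payroll after raising per-person TC by `tc_bump`
    (e.g. 0.65 = +65%) and cutting `headcount_cut` of the staff
    (e.g. 0.50 = -50%). 1.0 means 'same cost as today'."""
    return (1.0 + tc_bump) * (1.0 - headcount_cut)

# A 65% TC bump combined with a 50% reduction in engineers:
print(round(payroll_after(0.65, 0.50), 3))  # 0.825
```

Under those assumed figures the restructured payroll is about 17.5% cheaper than today's, with each remaining engineer paid closer to the market the commenter says AMD is actually competing in.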
Lisa Su is the only CEO in the S&P500 who can get away with mass layoffs and still have the loyalty of the rest of the employees.
> AMD never misses an opportunity to miss an opportunity
Well said. Their Instinct parts are actually, at a hardware level, very capable pieces of kit that (ignoring the software/dev ecosystem) are very competitive with Nvidia.
Problem is, AMD has a terrible history of supporting its hardware: either outright lack of support (cough, Radeon VII) or constantly scrapping things and starting over, so the ecosystem never matured. That leaves it at a massive deficit behind the CUDA ecosystem, meaning a lot of that hardware's potential is squandered by the lack of compatibility with CUDA and the lack of investment in a comparable alternative. Those factors have given Nvidia the momentum it has, because most orgs/devs will look at the support/ecosystem delta and ask themselves why they'd expend the resources reinventing the CUDA wheel to leverage AMD hardware when they can just spend that money/time investing in CUDA and Nvidia instead.
To their credit, it seems AMD has learned its lesson: they're actually investing in ROCm and their Instinct ecosystem, they're sticking to their guns on it, and we're starting to see people pick it up. But they're still far behind Nvidia and CUDA.
One key area that Nvidia is far ahead of AMD on in the hardware space is networking.
From the performance comparison table, basically AMD could be NVIDIA right now, but they aren’t because… software?
That’s a complete institutional and leadership failure.
Ironically, building chips is the actual _hard_ part. The software and the compilers are not trivial but the iteration speed is almost infinite by comparison.
It goes to show that some companies just don’t “get” software. Not even AMD!
I'd go so far as to say it's the exact opposite. It's faster and easier to change the hardware than the software.
CUDA was started in 2004. AMD was basically broke until they hit a home run with Ryzen in 2017.
It is now funded and working.
First rule of AMD stock is nobody understands AMD stock. I guess it’s also the same for AMD’s software endeavors.
Ahh, composable-kernel. The worst offender in my list of software that has produced unrecoverable OOMs on my Gentoo system (it’s actually Clang while compiling CK, which uses upwards of 2.5 GB per thread).
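One way to keep a build like that from OOMing is to cap the parallel job count by available RAM rather than core count (on Gentoo, that would mean picking the `-jN` value in MAKEOPTS accordingly). A minimal sketch, assuming the ~2.5 GB/thread figure above; the function name and defaults are mine, not from any tool:

```python
import os

def safe_compile_jobs(total_ram_gb: float, per_job_gb: float = 2.5,
                      cpus: int = 0) -> int:
    """Cap parallel compile jobs so the per-job memory footprint
    fits in RAM. per_job_gb defaults to the ~2.5 GB/thread Clang
    reportedly needs for composable-kernel (an assumption, not a
    measurement). Returns at least 1."""
    cpus = cpus or os.cpu_count() or 1
    return max(1, min(cpus, int(total_ram_gb // per_job_gb)))

# e.g. a 16 GB machine: at most 6 parallel jobs, regardless of cores
print(safe_compile_jobs(16, cpus=32))  # 6
```

The same idea shows why a many-core box with modest RAM is exactly where CK builds fall over: a default `-j$(nproc)` can demand far more memory than the machine has.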