There seems to be some confusion here about what PTX is. It does not bypass the CUDA platform at all, nor does it diminish NVIDIA's monopoly. CUDA is a programming environment for NVIDIA GPUs, but many people say "CUDA" when they mean the C/C++ extension within it (in that sense CUDA can be thought of as a C/C++ dialect). PTX is NVIDIA-specific and sits at a level similar to LLVM's IR. If anything, DeepSeek is more dependent on NVIDIA than everyone else, since PTX is tightly tied to NVIDIA's GPUs. Things like ZLUDA (an effort to run CUDA code on AMD GPUs) won't work. This is not a feel-good story.
This specific tech is, yes, NVIDIA-dependent. The game changer is that a team was able to beat the big players with less than 10 million dollars. They did it by operating at a low level of NVIDIA's stack, practically machine code. What this team has done, another could do. Building for AMD's GPU ISA would be tough but not impossible.
I don't think anyone is saying CUDA as in the platform, but as in the API for higher level languages like C and C++.
PTX is a close-to-metal ISA that exposes the GPU as a data-parallel computing device and, therefore, allows fine-grained optimizations, such as register allocation and thread/warp-level adjustments, something that CUDA C/C++ and other languages cannot enable.
Even if they get banned, any startup could replicate their work if it is truly open source. The best thing about their solution is that it breaks the CUDA monopoly that NVDA has enjoyed. Buy your puts when NVDA bounces because that stock is GOING DOWN. There’s no world where a company that makes GPUs is worth more than both Apple and Microsoft. It’s inevitable.
This sounds like good engineering, but surely there's not a big gap with their competitors. They are spending tens of millions on hardware and energy, and this is something a handful of (very good) programmers should be able to pull off.
Unless I'm missing something, it's the sort of thing that's done all the time in console games.
I think it's more like it was done all the time for console games. These days that doesn't happen as much anymore, as far as I know. But I think this shows that either CUDA is not a good enough abstraction for modern GPUs or the compilers are not as good as expected. There should be no way they got that much optimization out of hand-written/optimized code these days.
Eh, even for many console games it's not optimised that much.
Check out Kaze Emanuar's (& co) rewrite of the N64's Super Mario 64 engine. He's now building an entirely new game on top of that engine, and it looks considerably better than SM64 did and runs at twice the FPS on original hardware.
But you're probably right that today it happens even less than before.
Part of this was an optimization that was necessary due to their resource restrictions. Chinese firms can only purchase H800 GPUs instead of H200 or H100. These have much slower inter-GPU communication (less than half the bandwidth!) as a result of export bans by the US government, so this optimization was done to try and alleviate some of that bottleneck. It's unclear to me if this type of optimization would make as big of a difference for a lab using H100s/H200s; my guess is that it probably matters less.
The big win I see here is the amount of optimisation they achieved by moving from the high-level CUDA to lower-level PTX. This suggests that developing these models going forward can be made a lot more energy-efficient, something I hope can be extended to their execution as well. As it stands currently, "AI" (read: LLMs and image generation models) consumes way too many resources to be sustainable.
What I'm curious to see is how well these types of modifications scale with compute. DeepSeek is restricted to H800s instead of H100s or H200s. These are gimped cards to get around export controls, and accordingly they have lower memory bandwidth (~2 vs ~3 TB/s) and, most notably, much slower GPU-to-GPU communication (something like 400 GB/s vs 900 GB/s). The specific reason they used PTX in this application was to help alleviate some of the bottlenecks due to the limited inter-GPU bandwidth, so I wonder if that would still improve performance on H100 and H200 GPUs, where bandwidth is much higher.
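To put the inter-GPU figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes the quoted ~400 GB/s and ~900 GB/s link speeds, a made-up 1 GB payload, and an idealized link with no latency or protocol overhead, so treat the output as an illustration rather than a benchmark.

```python
# Illustrative only: ideal time to move a payload between GPUs at the
# link speeds quoted in the comment above (~400 GB/s vs ~900 GB/s).

def transfer_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and overhead."""
    return payload_gb / bandwidth_gb_per_s * 1000.0

PAYLOAD_GB = 1.0   # hypothetical payload size, chosen only for illustration
H800_GBPS = 400.0  # inter-GPU figure quoted above for the H800
H100_GBPS = 900.0  # inter-GPU figure quoted above for the H100

h800_ms = transfer_ms(PAYLOAD_GB, H800_GBPS)
h100_ms = transfer_ms(PAYLOAD_GB, H100_GBPS)

print(f"H800: {h800_ms:.2f} ms, H100: {h100_ms:.2f} ms, "
      f"ratio: {h800_ms / h100_ms:.2f}x")
```

Under those assumptions the same transfer takes about 2.25 times as long on the slower link, which is why shaving communication overhead would matter more on H800s than on cards with the full-speed interconnect.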
Kind of the opposite, actually. PTX is in essence NVIDIA-specific assembly, just like how ARM or x86_64 assembly are tied to ARM and x86_64.
At least with CUDA there are efforts like ZLUDA. CUDA is more like Objective-C was on the Mac: basically tied to the platform, but at least you could write a compiler for another target in theory.
Reminds me of Bitcoin mining and how ASIC miners overtook graphics card mining practically overnight. It would not surprise me if this goes the same way.
It's already happening. This article takes a long look at many of the rising threats to nvidia. Some highlights:
Google has been running on their own homemade TPUs (tensor processing units) for years, and say they are on the 6th generation of those.
Some AI researchers are building an entirely AMD-based stack from scratch, essentially writing their own drivers and utilities to make it happen.
Cerebras.ai is creating their own AI chips using a unique whole-die system. They make an AI chip nearly the size of an entire silicon wafer (a square cut from a 30 cm wafer) with 900,000 micro-cores.
So yeah, it's not just "China AI bad" but that the entire market is catching up and innovating around nvidia's monopoly.