The story with Intel in those days was usually that AMD or Cyrix or ARM or Apple or someone else would come along with a new architecture that was a clear generational jump past Intel's and, most importantly, seemed to break the thermal and power ceilings of the Intel generation (at which point Intel typically fired its chip design group, hired everyone from AMD or whoever, and came out with Core or whatever). Nvidia effectively has no competition, or hasn't had any: nobody has actually broken the CUDA moat, so neither Intel nor AMD nor anyone else is really competing for the datacenter space, and Nvidia hasn't faced any real competitive pressure over things like power draws in the multi-kilowatt range for the Blackwells.
The reason this matters is that LLMs are incredibly nifty, often useful tools that are not AGI and also seem to be hitting a scaling wall, and the only way to make the economics of, e.g., a Blackwell-powered datacenter work is to assume that the entire economy is going to run on it, as opposed to some useful tools and some improved interfaces. Otherwise the investment numbers just don't add up: the gap between how LLMs are actually used on the ground, the real but limited value they add, and the full cost of providing that service with a brand-new single-purpose "AI datacenter" is just too great.
So this is a press release, but any time I see something that looks like an actual new hardware architecture for inference, and especially one that doesn't require building a new building or solving nuclear fusion, I'll take it as a good sign. I like LLMs, I've gotten a lot of value out of them, but nothing about the industry's finances adds up right now.
Nvidia = flexible, general-purpose GPUs that excel at training and mixed workloads. Furiosa = purpose-built inference ASICs that trade flexibility for much better cost, power efficiency, and predictable latency at scale.
You can see them admit that RNGD will be slower than a setup with H100 SXM cards, but at the same time the tokens per second per watt is way better!
Actually, I wonder how different that is from Cerebras chips, since they're very much optimized for speed and one would think that'd also affect the efficiency a whole bunch: https://www.cerebras.ai/
The reason this almost never works is usually one of the following:
- They assume they can move hardware complexity (scheduling, access patterns, etc.) into software. The magic compiler/runtime never arrives.
- They assume their hard-to-program but faster architecture will get figured out by devs. It won't.
- They assume a certain workload. The workload changes, and their arch is no longer optimal or possibly even workable.
- But most importantly, they don't understand the fundamental bottleneck, which is usually memory bandwidth. Even if you increase the paper specs, total FLOPS, FLOPS/W, etc., you're usually limited by how much you can read from memory, which is exactly as much as your competitors can. The way you overcome this is with cleverness and complexity (caching, smarter algorithms, acceleration structures, etc.), but all of those require a complex processor to run on, with coherent cache hierarchies, branching, synchronization logic, and so on. Which is why folks like NVIDIA keep going despite facing this constant barrage of would-be disruptors. (A rough sketch of the bandwidth math is below.)
In fact this continues to become more and more true: memory bandwidth relies on transceivers at the chip edge, and if die sizes don't grow, bandwidth doesn't automatically increase on newer process nodes. Latency doesn't improve at all. But you do get more transistors to play with, which you can use to run your workload more cleverly.
In fact, I don't rule out the possibility of CPU-based massively parallel compute making a comeback.
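To make the bandwidth point concrete, here's a rough back-of-the-envelope sketch in Python (all numbers are made-up illustrative assumptions, not vendor specs): for batch-1 decoding, every generated token has to stream roughly all of the model weights from memory, so bandwidth alone sets a hard ceiling on tokens/s, whatever the headline FLOPS figure says.

```python
# Back-of-the-envelope roofline for batch-1 LLM decoding.
# All numbers are illustrative assumptions, not vendor specs.

def decode_ceiling_tok_s(params_billion: float, bytes_per_weight: float,
                         mem_bw_gb_s: float) -> float:
    """Upper bound on tokens/s when each token must stream all weights."""
    bytes_per_token = params_billion * 1e9 * bytes_per_weight
    return mem_bw_gb_s * 1e9 / bytes_per_token

# Hypothetical 70B-parameter model with 8-bit weights (~70 GB read per token).
for name, bw_gb_s in [("chip A, 3.3 TB/s HBM", 3300), ("chip B, 1.5 TB/s HBM", 1500)]:
    print(f"{name}: <= {decode_ceiling_tok_s(70, 1.0, bw_gb_s):.1f} tok/s per stream")
```

Batching, KV-cache tricks, and speculative decoding raise the effective ceiling, but the point stands: two chips with the same memory bandwidth land in roughly the same place regardless of their FLOPS numbers.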
They're trying to compare at iso-power? I just want to see their box vs. a box of 8 H100s, because that's what people would buy instead; they can divide tokens by watts if that's the pitch.
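For what it's worth, the two framings (per-box throughput vs. efficiency) are easy to put side by side; here's a tiny sketch with placeholder numbers, not the vendor's figures:

```python
# Compare two setups both per box and per watt.
# All numbers are placeholders for illustration, not measurements.

setups = {
    "8x H100 box (hypothetical)": {"tok_s": 12000.0, "watts": 10200.0},
    "RNGD box (hypothetical)":    {"tok_s":  9000.0, "watts":  3000.0},
}

for name, s in setups.items():
    print(f"{name}: {s['tok_s']:.0f} tok/s per box, {s['tok_s'] / s['watts']:.2f} tok/s/W")
```

Which column matters more depends on whether the binding constraint is rack space and software, or power and cooling.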
Seems like it would obviously be in TSMC's interest to give preferential tape-out and capacity allocation to Nvidia's competitors; they benefit from having a less consolidated customer base bidding up their prices.
Targeting power, cooling, and TCO limits for inference is real, especially in air-cooled data centers.
But the benchmarks shown are narrow, and it’s unclear how well this generalizes across models and mixed production workloads. GPUs are inefficient here, but their flexibility still matters.
Also, there is no mention of the latest-gen NVDA chips: 5 RNGD servers generate tokens at 3.5x the rate of a single H100 SXM at 15 kW. This is reduced to 1.5x if you instead use 3 H100 PCIe servers as the benchmark.
Edit: from comments and reading the one page that loads, this is still the 5nm tech they announced in 2024, hence the H100 comparison, which feels dated given the availability of GB300.
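Working through the quoted ratios (assuming "a single H100 SXM" means one SXM-based server and that both comparisons count whole servers, which may not be the right reading of the announcement):

```python
# Relative throughput implied by the quoted ratios.
# Interpretation assumptions: "a single H100 SXM" = one SXM server,
# and both comparisons are per whole server.

S = 1.0                  # throughput of one H100 SXM server (normalized)
R = 3.5 * S / 5          # 5 RNGD servers = 3.5x one SXM server
P = 5 * R / (1.5 * 3)    # 5 RNGD servers = 1.5x three PCIe servers

print(f"one RNGD server ~= {R:.2f}x an H100 SXM server")
print(f"one RNGD server ~= {R / P:.2f}x an H100 PCIe server")
print(f"one H100 SXM server ~= {S / P:.2f}x an H100 PCIe server (implied)")
```

The power side is harder to pin down from the 15 kW figure alone, so watts are left out of this.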
After I read the article :) The improvements in FuriosaAI's NXT RNGD Server are primarily driven by hardware innovations, not software or code changes.
Maybe they are cheap.