I know they claim they work, but that's only on their happy path with their very specific AMIs and the nightmare that is the Neuron SDK. Try to do any real work with them using your own dependencies and things fall apart immediately.
It was only in the past couple of years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of Google's huge investment in software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.
AWS seems to be using this heavily internally, which makes sense, but I'm not seeing it get traction outside of that. Glad to see Amazon investing there, though.
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/abou...
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/...
Eight NeuronCore-v4 cores that collectively deliver:
2,517 MXFP8/MXFP4 TFLOPS
671 BF16/FP16/TF32 TFLOPS
2,517 FP16/BF16/TF32 sparse TFLOPS
183 FP32 TFLOPS
HBM: 144 GiB @ 4.9 TB/sec (4 stacks)
SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM, which is not really general-purpose nor DMA-able)
Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)
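Quick back-of-the-envelope on those numbers, a sketch using only the spec figures above (assumes the dense peaks and peak HBM bandwidth are actually achievable):

    # Roofline crossover from the spec figures above (my arithmetic).
    # Assumption: dense BF16 peak of 671 TFLOP/s, HBM at 4.9 TB/s.
    bf16_flops = 671e12   # FLOP/s
    hbm_bw = 4.9e12       # bytes/s

    # Arithmetic intensity (FLOP/byte) a kernel needs before HBM
    # stops being the bottleneck.
    print(f"BF16 compute-bound above ~{bf16_flops / hbm_bw:.0f} FLOP/byte")    # ~137

    # Same calc at the MXFP8 peak.
    mxfp8_flops = 2517e12
    print(f"MXFP8 compute-bound above ~{mxfp8_flops / hbm_bw:.0f} FLOP/byte")  # ~514

If I have B200's ~8 TB/s HBM right, its BF16 crossover is around 280 FLOP/byte, so this part is at least comparatively bandwidth-rich per FLOP.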
----
At a 3nm process node, the FLOP/s is _way_ lower than the competition. Compare to B200, which does 2250 BF16, x2 for FP8, x4 for FP4. TPU7x does 2307 BF16, x2 for FP8 (no native FP4). HBM also lags behind (vs ~192 GiB in 6 stacks for both TPU7x and B200).
The main redeeming qualities seem to be: software-managed SRAM size (double that of TPU7x; GPUs have L2, so not directly comparable) and on-paper raw interconnect BW (double that of TPU7x and more than B200).
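Putting ratios on the dense-BF16 gap, using only the peak numbers cited above:

    # Dense BF16 peak TFLOPS, per the figures in this thread.
    neuron_v4_x8 = 671   # the eight NeuronCore-v4 cores, collectively
    b200 = 2250
    tpu7x = 2307

    print(f"B200  / this chip: {b200 / neuron_v4_x8:.1f}x")   # ~3.4x
    print(f"TPU7x / this chip: {tpu7x / neuron_v4_x8:.1f}x")  # ~3.4x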
Pretty accurate in my experience, especially re: the Neuron SDK. Do not use.
AMD felt like they were so close to nabbing the accelerator-interconnect future back in the HyperTransport days. But its successor, Infinity Fabric, is all internal.
There's Ultra Accelerator Link (UALink) getting some steam. Hypothetically, CXL should be good for uses like this: it uses the PCIe PHY but is lower latency and lighter weight, close to RAM latency, not bad! But it's still mere PCIe speed, not nearly enough, with PCIe 6.0 only just barely emerging now. Ideally, IMO, we'd also see more chips ship with integrated networking: it was amazing when Intel Xeons had 100Gb Omni-Path for barely any price bump. Ultra Ethernet feels like it should be on-die, gratis.
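To put numbers on "still a mere PCIe speed" (rough sketch: raw per-direction bandwidth of a x16 link, ignoring encoding/FLIT overhead):

    # Raw per-direction bandwidth of a x16 PCIe link, in GB/s.
    # 32 GT/s per lane (Gen 5), 64 GT/s per lane (Gen 6), 16 lanes, 8 bits/byte.
    gen5_x16 = 32 * 16 / 8   # 64 GB/s
    gen6_x16 = 64 * 16 / 8   # 128 GB/s

    # Versus the 2560 GB/s accelerator-interconnect figure quoted upthread.
    print(f"PCIe 6.0 x16: {gen6_x16:.0f} GB/s, {gen6_x16 / 2560:.0%} of 2560 GB/s")

So even a full Gen 6 x16 link covers about 5% of the on-paper accelerator fabric; you'd need a lot of lanes to get anywhere close.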
The sole reason Amazon is throwing any money at this is that they think they can do to AI what they did to logistics and shipping, in an effort to slash costs heading into a recession (we can't fire anyone else). The hubris is monumental, to say the least.
But their total confidence is very low... so "Nvidia friendly" is face-saving to ensure no bridges they currently cross for AWS profit get burned.