I know they claim they work, but that's only on their happy path with their very specific AMIs and the nightmare that is the Neuron SDK. Try to do any real work with them using your own dependencies and things fall apart immediately.
It was only in the past couple of years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of Google's huge investment in software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.
AWS seems to be using this heavily internally, which makes sense, but I'm not seeing it get traction outside of that. Glad to see Amazon investing there, though.
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/abou...
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/...
Eight NeuronCore-v4 cores that collectively deliver:
2,517 MXFP8/MXFP4 TFLOPS
671 BF16/FP16/TF32 TFLOPS
2,517 FP16/BF16/TF32 sparse TFLOPS
183 FP32 TFLOPS
HBM: 144 GiB @ 4.9 TB/sec (4 stacks)
SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM, which is not really general-purpose nor DMA-able)
Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)
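Quick back-of-the-envelope on those numbers, a sketch using only the spec figures above (assumes the dense peaks and peak HBM bandwidth are actually achievable):

    # Roofline crossover from the spec figures above (my arithmetic).
    # Assumption: dense BF16 peak of 671 TFLOP/s, HBM at 4.9 TB/s.
    bf16_flops = 671e12   # FLOP/s
    hbm_bw = 4.9e12       # bytes/s

    # Arithmetic intensity (FLOP/byte) a kernel needs before HBM
    # stops being the bottleneck.
    print(f"BF16 compute-bound above ~{bf16_flops / hbm_bw:.0f} FLOP/byte")    # ~137

    # Same calc at the MXFP8 peak.
    mxfp8_flops = 2517e12
    print(f"MXFP8 compute-bound above ~{mxfp8_flops / hbm_bw:.0f} FLOP/byte")  # ~514

If I have B200's ~8 TB/s HBM right, its BF16 crossover is around 280 FLOP/byte, so this part is at least comparatively bandwidth-rich per FLOP.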
----
At a 3nm process node, the FLOP/s is _way_ lower than the competition. Compare to B200, which does 2250 BF16, x2 for FP8, x4 for FP4. TPU7x does 2307 BF16, x2 for FP8 (no native FP4). HBM also lags behind (vs ~192 GiB in 6 stacks for both TPU7x and B200).
The main redeeming qualities seem to be: software-managed SRAM size (double that of TPU7x; GPUs have L2, so not directly comparable) and on-paper raw interconnect BW (double that of TPU7x and more than B200).
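Putting ratios on the dense-BF16 gap, using only the peak numbers cited above:

    # Dense BF16 peak TFLOPS, per the figures in this thread.
    neuron_v4_x8 = 671   # the eight NeuronCore-v4 cores, collectively
    b200 = 2250
    tpu7x = 2307

    print(f"B200  / this chip: {b200 / neuron_v4_x8:.1f}x")   # ~3.4x
    print(f"TPU7x / this chip: {tpu7x / neuron_v4_x8:.1f}x")  # ~3.4x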
Pretty accurate in my experience, especially re: the Neuron SDK. Do not use.
AMD felt like they were so close to nabbing the accelerator-interconnect future back in the HyperTransport days. But its successor, Infinity Fabric, is all internal.
There's Ultra Accelerator Link (UALink) getting some steam. Hypothetically, CXL should be good for uses like this: it uses the PCIe PHY but is lower latency and lighter weight, close to RAM latency, not bad! But it's still mere PCIe speed, not nearly enough, with PCIe 6.0 only just barely emerging now. Ideally, IMO, we'd also see more chips ship with integrated networking: it was amazing when Intel Xeons had 100Gb Omni-Path for barely any price bump. Ultra Ethernet feels like it should be on-die, gratis.
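To put numbers on "still a mere PCIe speed" (rough sketch: raw per-direction bandwidth of a x16 link, ignoring encoding/FLIT overhead):

    # Raw per-direction bandwidth of a x16 PCIe link, in GB/s.
    # 32 GT/s per lane (Gen 5), 64 GT/s per lane (Gen 6), 16 lanes, 8 bits/byte.
    gen5_x16 = 32 * 16 / 8   # 64 GB/s
    gen6_x16 = 64 * 16 / 8   # 128 GB/s

    # Versus the 2560 GB/s accelerator-interconnect figure quoted upthread.
    print(f"PCIe 6.0 x16: {gen6_x16:.0f} GB/s, {gen6_x16 / 2560:.0%} of 2560 GB/s")

So even a full Gen 6 x16 link covers about 5% of the on-paper accelerator fabric; you'd need a lot of lanes to get anywhere close.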
The sole reason Amazon is throwing any money at this is that they think they can do to AI what they did to logistics and shipping, in an effort to slash costs heading into a recession (we can't fire anyone else). The hubris is monumental, to say the least.
But their total confidence is very low... so "Nvidia friendly" is face-saving to ensure no bridges they currently cross for AWS profit get burned.