by kilotaras
5 subcomments
- Alibaba Cloud claims to reduce the number of Nvidia GPUs used for serving unpopular models by 82% (emphasis mine)
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1,192 GPUs, they now use 213 to serve those requests.
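A quick sanity check of the quoted figures (a sketch; the numbers come straight from the comment above):

```python
# Sanity-check the quoted numbers: 1,192 GPUs before, 213 after.
before, after = 1192, 213
reduction = 1 - after / before
print(f"{reduction:.1%}")  # 82.1% -- consistent with the claimed 82%

# The imbalance behind it: 17.7% of GPUs served 1.35% of requests,
# i.e. roughly a 13x over-allocation relative to traffic share.
print(17.7 / 1.35)  # ~13.1
```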
- Key paragraph:
> However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.
- Better link: https://www.tomshardware.com/tech-industry/semiconductors/al...
Paper: https://dl.acm.org/doi/10.1145/3731569.3764815
by hunglee2
17 subcomments
- The US attempt to slow down China's technological development succeeds in preventing China from directly following the same path, but it may backfire by forcing Chinese innovation in a different direction. The overall outcome for us all may be increased efficiency as a result of this forced innovation, especially if Chinese companies continue to open-source their advances. In the end, we may have reason to thank the US for its civilisational gatekeeping.
- Does someone know if there's some equivalent of those engineering/research blogs for Chinese companies?
I used to follow the ones from Western companies but, honestly, at some point I'd like to see case studies from what I consider a good engineering benchmark for everyone who doesn't work in FAANG.
- Does anyone know how their KV cache sync mechanism compares to newer P2P communication layers like NIXL, UCCL P2P, etc.?
The authors mention that NCCL and Ray initialization were too slow (see quote below), but from the description it sounds like they've reimplemented a layer that's increasingly being standardized by frameworks like NIXL and UCCL (a toy sketch follows the quote).
> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
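For context, here is a minimal sketch of what a tensor-level point-to-point KV-cache handoff looks like, using plain torch.distributed rather than the paper's mechanism or NIXL/UCCL (the script name, shape, and rank assignment are illustrative assumptions):

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 kv_handoff.py
# A bare-bones point-to-point KV-cache handoff from a prefill rank (0)
# to a decode rank (1). Real transfer layers (NIXL, UCCL P2P, or the
# paper's mechanism) add memory registration, RDMA, and async progress
# on top of this basic idea.
dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
rank = dist.get_rank()

# Hypothetical KV-cache shape: (layers, k/v, seq_len, heads, head_dim)
kv_shape = (32, 2, 1024, 8, 128)

if rank == 0:
    kv_cache = torch.randn(kv_shape)
    dist.send(kv_cache, dst=1)   # blocking point-to-point send
else:
    kv_cache = torch.empty(kv_shape)
    dist.recv(kv_cache, src=0)   # receive into preallocated buffer

dist.destroy_process_group()
```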
by checker659
1 subcomment
- They are working with tiny models. Not sure how well it'd scale to bigger models (if at all).
by jeffybefffy519
0 subcomments
- I still think Nvidia has the most to lose in the AI race; optimisations like this will continue, coupled with better ASICs.
- Sounds like they stopped doing something stupid.
- Sounds like this virtual GPU is a separate scheduler. I wonder what kind of latency is introduced by marshaling all that data around.
- Would this make cloud providers running low volume fine-tuned models more economically viable?
- Lots of shareholders here, move along, there is nothing to read
by throwaway48476
0 subcomments
- It's easy enough for a well-resourced entity to take a pre-trained model and deploy it on new hardware to save on the NVDA tax. It's far less likely for research and model training to happen outside the mature NVDA ecosystem.
- To what extent is this technique applicable to other workloads?
by nickysielicki
0 subcomments
- > Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
I mean, it really shouldn't take tens of seconds for those initializations to occur. There's no good fundamental reason for them to take that long. It's just bloat.
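The claim is easy to test; a minimal timing sketch, assuming a torchrun launch (the script name is hypothetical):

```python
import time
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 init_timing.py
# Times one process-group initialization. A warm pool of workers that
# initializes once and is reused across model loads pays this cost at
# startup rather than on every request.
start = time.perf_counter()
dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
elapsed = time.perf_counter() - start
print(f"rank {dist.get_rank()}: init took {elapsed:.2f}s")
dist.destroy_process_group()
```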
- How feasible is it that, on a horizon of five years, new optimized "equations" will cut the need for more GPUs?
- Is this another nail in the GPU/AI stock market bubble coffin?