by kilotaras
5 subcomments
- Alibaba Cloud claims to reduce the number of Nvidia GPUs used for serving unpopular models by 82% (emphasis mine)
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1,192 GPUs, they now use 213 to serve those requests.
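A quick sanity check of the quoted figures (a sketch; the numbers come straight from the comment above):

```python
# Sanity-check the quoted numbers: 1,192 GPUs before, 213 after.
before, after = 1192, 213
reduction = 1 - after / before
print(f"{reduction:.1%}")  # 82.1% -- consistent with the claimed 82%

# The imbalance behind it: 17.7% of GPUs served 1.35% of requests,
# i.e. roughly a 13x over-allocation relative to traffic share.
print(17.7 / 1.35)  # ~13.1
```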
- Key paragraph:
> However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.
- Better link: https://www.tomshardware.com/tech-industry/semiconductors/al...
Paper: https://dl.acm.org/doi/10.1145/3731569.3764815
by hunglee2
17 subcomments
- The US attempt to slow down China's technological development succeeds in preventing China from directly following the same path, but it may backfire by forcing Chinese innovation in a different direction. The overall outcome for us all may be increased efficiency as a result of this forced innovation, especially if Chinese companies continue to open-source their advances. In the end, we may have reason to thank the US for its civilisational gatekeeping.
- Does someone know if there's some equivalent of those engineering/research blogs for Chinese companies?
I used to follow the ones from Western companies but, honestly, at some point I'd like to see case studies from what I consider a good engineering benchmark for everyone who doesn't work in FAANG.
- Does anyone know how their KV cache sync mechanism compares to newer P2P communication layers like NIXL, UCCL P2P, etc.?
The authors mention that NCCL and Ray initialization were too slow (see quote below), but from the description it sounds like they've reimplemented a layer that's increasingly being standardized by frameworks like NIXL and UCCL (a toy sketch follows the quote).
> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
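For context, here is a minimal sketch of what a tensor-level point-to-point KV-cache handoff looks like, using plain torch.distributed rather than the paper's mechanism or NIXL/UCCL (the script name, shape, and rank assignment are illustrative assumptions):

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 kv_handoff.py
# A bare-bones point-to-point KV-cache handoff from a prefill rank (0)
# to a decode rank (1). Real transfer layers (NIXL, UCCL P2P, or the
# paper's mechanism) add memory registration, RDMA, and async progress
# on top of this basic idea.
dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
rank = dist.get_rank()

# Hypothetical KV-cache shape: (layers, k/v, seq_len, heads, head_dim)
kv_shape = (32, 2, 1024, 8, 128)

if rank == 0:
    kv_cache = torch.randn(kv_shape)
    dist.send(kv_cache, dst=1)   # blocking point-to-point send
else:
    kv_cache = torch.empty(kv_shape)
    dist.recv(kv_cache, src=0)   # receive into preallocated buffer

dist.destroy_process_group()
```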
by checker659
1 subcomment
- They are working with tiny models. Not sure how well it'd scale to bigger models (if at all).
by jeffybefffy519
0 subcomments
- I still think Nvidia has the most to lose in the AI race; optimisations like this will continue, coupled with better ASICs.
- Sounds like they stopped doing something stupid.
- Sounds like this virtual GPU is a separate scheduler. I wonder what kind of latency is introduced by marshaling all that data around.
- Would this make cloud providers running low volume fine-tuned models more economically viable?
- Lots of shareholders here, move along, there is nothing to read
by throwaway48476
0 subcomments
- It's easy enough for a well-resourced entity to take a pre-trained model and deploy it on new hardware to save on the NVDA tax. It's far less likely for research and model training to happen outside the mature NVDA ecosystem.
- To what extent is this technique applicable to other workloads?
by nickysielicki
0 subcomments
- > Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
I mean, it really shouldn't take tens of seconds for those initializations to occur. There's no good fundamental reason for them to take that long. It's just bloat.
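The claim is easy to test; a minimal timing sketch, assuming a torchrun launch (the script name is hypothetical):

```python
import time
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=2 init_timing.py
# Times one process-group initialization. A warm pool of workers that
# initializes once and is reused across model loads pays this cost at
# startup rather than on every request.
start = time.perf_counter()
dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
elapsed = time.perf_counter() - start
print(f"rank {dist.get_rank()}: init took {elapsed:.2f}s")
dist.destroy_process_group()
```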
- How feasible is it that, on a horizon of five years, new optimized "equations" will cut the need for more GPUs?
- Is this another nail in the GPU/AI stock market bubble coffin?