by 7777777phil
0 subcomments
- Running a 32B model in 19.3GB is really cool imo, and it matters: memory and cold start are what gate production deployments.
I did a piece (1) a while ago on how Netflix and Spotify worked this out: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it (a rough sketch of that routing follows below the link).
(1) https://philippdubach.com/posts/bandits-and-agents-netflix-a...
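A minimal Python sketch of that routing pattern; every name in it (classical_recommend, llm_recommend, recommend) is a placeholder, not something from the linked piece or from Netflix/Spotify internals, and the thresholding only illustrates the "cheap path handles most traffic" idea:

    # Hypothetical "cheap path first" router: classical recommender handles the
    # bulk of requests, the LLM is only called when the cheap path is unsure.

    def classical_recommend(user_id: str) -> tuple[list[str], float]:
        # Cheap collaborative-filtering / bandit candidates plus a confidence score,
        # e.g. looked up from a precomputed store.
        return ["item-a", "item-b"], 0.92

    def llm_recommend(user_id: str) -> list[str]:
        # Expensive LLM call, only reached when the cheap path is not confident enough.
        return ["item-c"]

    def recommend(user_id: str, confidence_threshold: float = 0.8) -> list[str]:
        items, confidence = classical_recommend(user_id)
        if confidence >= confidence_threshold:
            return items                   # most traffic stays on the cheap path
        return llm_recommend(user_id)      # LLM only when the payoff justifies the cost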
by reconnecting
1 subcomment
- Discussion on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1rewis9/removed...
by cipher-108
0 subcomments
- This seems excellent, if not revolutionary, and just what I've been looking for, but GPU support didn't work on my M1 and M1 Max. Is there a way to support Apple M-series processors? That would be greatly appreciated. I don't have experience with this kind of programming and didn't get very far with ChatGPT.
On the M1 Max it reports 14.8 GB free / 32.0 GB total, but says "No GPU detected", and "What Can You Run? (ZSE Ultra Mode)" only lists "7B GPU + CPU Hybrid", nothing else.
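For context, a minimal Python sketch of how Apple M-series GPU availability is often detected via PyTorch's MPS backend; the tool discussed here may probe hardware differently, so this is only illustrative:

    # Illustrative check for an Apple Silicon GPU from Python via PyTorch's MPS
    # backend; not the detection logic of the tool being discussed.
    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")   # Apple M-series GPU via Metal
    else:
        device = torch.device("cpu")   # fall back to CPU
    print(f"Using device: {device}")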
- If you don't mind a stupid question, is this essentially dynamic quantization? I'm trying to understand how this is different from using a regular quantized model to squeeze more parameters into less RAM.
by medi_naseri
0 subcomments
- This is so freaking awesome. I am working on a project trying to run 10 models on two GPUs, and loading/offloading is the only solution I have in mind (see the sketch after this comment).
Will try getting this deployed.
Are the advertised cold start timings for a setup where no other model is loaded on the GPUs?
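A rough Python sketch of that load/offload rotation, assuming PyTorch models and a single CUDA device; ModelPool and its eviction policy are illustrative only, not the mechanism of the project being discussed:

    # Keep at most `max_resident` models on the GPU; park the rest in CPU RAM and
    # swap them in on demand (the swap-in is the "cold start" being asked about).
    import torch

    class ModelPool:
        def __init__(self, models: dict[str, torch.nn.Module], max_resident: int = 2):
            self.models = {name: m.to("cpu") for name, m in models.items()}
            self.resident: list[str] = []      # names currently on the GPU, oldest first
            self.max_resident = max_resident

        def get(self, name: str) -> torch.nn.Module:
            if name not in self.resident:
                if len(self.resident) >= self.max_resident:
                    evicted = self.resident.pop(0)   # evict the oldest resident model
                    self.models[evicted].to("cpu")   # offload it back to CPU RAM
                    torch.cuda.empty_cache()
                self.models[name].to("cuda")         # load the requested model onto the GPU
                self.resident.append(name)
            return self.models[name]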
- Are you using model GPU memory snapshotting for this?