by 7777777phil
0 subcomments
- Running a 32B model in 19.3GB is really cool imo, and it matters: memory and cold start are what gate production deployments.
I did a piece (1) a while ago on how Netflix and Spotify worked this out: cheap classical methods handle 90%+ of their recommendation requests, and LLMs only get called when the payoff justifies it (a rough sketch of that routing follows below the link).
(1) https://philippdubach.com/posts/bandits-and-agents-netflix-a...
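A minimal Python sketch of that routing pattern; every name in it (classical_recommend, llm_recommend, recommend) is a placeholder, not something from the linked piece or from Netflix/Spotify internals, and the thresholding only illustrates the "cheap path handles most traffic" idea:

    # Hypothetical "cheap path first" router: classical recommender handles the
    # bulk of requests, the LLM is only called when the cheap path is unsure.

    def classical_recommend(user_id: str) -> tuple[list[str], float]:
        # Cheap collaborative-filtering / bandit candidates plus a confidence score,
        # e.g. looked up from a precomputed store.
        return ["item-a", "item-b"], 0.92

    def llm_recommend(user_id: str) -> list[str]:
        # Expensive LLM call, only reached when the cheap path is not confident enough.
        return ["item-c"]

    def recommend(user_id: str, confidence_threshold: float = 0.8) -> list[str]:
        items, confidence = classical_recommend(user_id)
        if confidence >= confidence_threshold:
            return items                   # most traffic stays on the cheap path
        return llm_recommend(user_id)      # LLM only when the payoff justifies the cost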
by reconnecting
1 subcomment
- Discussion on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1rewis9/removed...
by cipher-108
0 subcomments
- This seems excellent, if not revolutionary, and just what I've been looking for, but GPU support didn't work on my M1 and M1 Max. Is there a way to support Apple M-series processors? That would be greatly appreciated. I don't have experience with this kind of programming and didn't get very far with ChatGPT.
On the M1 Max it reports 14.8 GB free / 32.0 GB total, but says "No GPU detected", and "What Can You Run? (ZSE Ultra Mode)" only lists "7B GPU + CPU Hybrid", nothing else.
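For context, a minimal Python sketch of how Apple M-series GPU availability is often detected via PyTorch's MPS backend; the tool discussed here may probe hardware differently, so this is only illustrative:

    # Illustrative check for an Apple Silicon GPU from Python via PyTorch's MPS
    # backend; not the detection logic of the tool being discussed.
    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")   # Apple M-series GPU via Metal
    else:
        device = torch.device("cpu")   # fall back to CPU
    print(f"Using device: {device}")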
- If you don't mind a stupid question, is this essentially dynamic quantization? I'm trying to understand how this is different from using a regular quantized model to squeeze more parameters into less RAM.
by medi_naseri
0 subcomments
- This is so freaking awesome. I am working on a project trying to run 10 models on two GPUs, and loading/offloading is the only solution I have in mind (see the sketch after this comment).
Will try getting this deployed.
Are the advertised cold start timings for a setup where no other model is loaded on the GPUs?
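A rough Python sketch of that load/offload rotation, assuming PyTorch models and a single CUDA device; ModelPool and its eviction policy are illustrative only, not the mechanism of the project being discussed:

    # Keep at most `max_resident` models on the GPU; park the rest in CPU RAM and
    # swap them in on demand (the swap-in is the "cold start" being asked about).
    import torch

    class ModelPool:
        def __init__(self, models: dict[str, torch.nn.Module], max_resident: int = 2):
            self.models = {name: m.to("cpu") for name, m in models.items()}
            self.resident: list[str] = []      # names currently on the GPU, oldest first
            self.max_resident = max_resident

        def get(self, name: str) -> torch.nn.Module:
            if name not in self.resident:
                if len(self.resident) >= self.max_resident:
                    evicted = self.resident.pop(0)   # evict the oldest resident model
                    self.models[evicted].to("cpu")   # offload it back to CPU RAM
                    torch.cuda.empty_cache()
                self.models[name].to("cuda")         # load the requested model onto the GPU
                self.resident.append(name)
            return self.models[name]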
- Are you using model GPU memory snapshotting for this?