- Why do they need to run benchmarks to confirm parity? Can't they run a set of example prompts and verify they get exactly the same output token probabilities for each one? The fact that they aren't doing this makes me suspicious that they're not, in fact, doing the exact same thing as vLLM.
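A minimal sketch of such a check, assuming both engines expose an OpenAI-compatible /v1/completions endpoint; the URLs and model name below are placeholders. Note that bitwise-identical probabilities aren't guaranteed even between two builds of the same engine, because kernel choice and batching introduce floating-point nondeterminism, so a small tolerance is used:

```python
# Query two OpenAI-compatible servers with temperature=0 and compare
# per-token logprobs. URLs and model name are placeholders.
import requests

PROMPTS = ["The capital of France is", "def fib(n):"]
MODEL = "Qwen/Qwen3-8B"  # placeholder model name

def token_logprobs(base_url: str, prompt: str):
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": MODEL,
            "prompt": prompt,
            "max_tokens": 32,
            "temperature": 0,
            "logprobs": 1,  # return the logprob of each sampled token
        },
        timeout=60,
    )
    resp.raise_for_status()
    choice = resp.json()["choices"][0]
    return list(zip(choice["logprobs"]["tokens"],
                    choice["logprobs"]["token_logprobs"]))

for prompt in PROMPTS:
    a = token_logprobs("http://localhost:8000", prompt)  # e.g. vLLM
    b = token_logprobs("http://localhost:8001", prompt)  # e.g. new engine
    for (tok_a, lp_a), (tok_b, lp_b) in zip(a, b):
        assert tok_a == tok_b, f"token mismatch: {tok_a!r} vs {tok_b!r}"
        assert abs(lp_a - lp_b) < 1e-3, f"logprob drift on {tok_a!r}"
```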
It is also a bit weird that they are not incorporating speculative decoding; that seems like a critical performance optimization, especially for decode-heavy workloads.
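For context, here is a toy sketch of the speculative decoding accept/reject loop (Leviathan et al., 2023), with stand-in "models" that just return fixed distributions over a tiny vocabulary. Real engines batch the target model's verification of all k draft tokens into a single forward pass, which is where the decode-time speedup comes from:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def draft_dist(ctx):   # cheap model: placeholder distribution
    p = np.ones(VOCAB); p[ctx[-1] % VOCAB] += 2.0
    return p / p.sum()

def target_dist(ctx):  # expensive model: a slightly different distribution
    p = np.ones(VOCAB); p[(ctx[-1] + 1) % VOCAB] += 2.0
    return p / p.sum()

def speculative_step(ctx, k=4):
    # 1) the draft model proposes k tokens autoregressively
    proposal, d_probs = [], []
    c = list(ctx)
    for _ in range(k):
        q = draft_dist(c)
        t = rng.choice(VOCAB, p=q)
        proposal.append(t); d_probs.append(q); c.append(t)
    # 2) the target model scores every prefix (one batched pass
    #    in a real engine) and accepts/rejects each draft token
    accepted = []
    for i, t in enumerate(proposal):
        p = target_dist(list(ctx) + accepted)
        q = d_probs[i]
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)            # accept the draft token
        else:
            resid = np.maximum(p - q, 0)  # resample from the residual
            accepted.append(rng.choice(VOCAB, p=resid / resid.sum()))
            return ctx + accepted         # stop at first rejection
    # (the full algorithm also samples one bonus token from the target here)
    return ctx + accepted

print(speculative_step([3]))
```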
by 2001zhaozhao
1 subcomment
- Every example like this makes it obvious that you can now use ML-style optimization approaches on well-specified, very well-tested software problems with a clear optimization goal: keep a change if it improves the objective while maintaining correctness, discard it if it doesn't (a toy version of that loop is sketched below). AI-descent strikes again.
Maybe I should learn more about ML to get better instincts for optimization methods in general, so I can actually build AI optimizers like these.
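A minimal sketch of that accept/reject loop, with stand-in functions for the test suite, benchmark objective, and mutation step; nothing here is any specific system's optimizer:

```python
import random

def passes_tests(candidate):   # stand-in for a real test suite
    return all(x >= 0 for x in candidate)

def objective(candidate):      # stand-in for a benchmark score (higher is better)
    return -sum((x - 3.0) ** 2 for x in candidate)

def mutate(candidate):         # stand-in for an AI-proposed code change
    new = list(candidate)
    new[random.randrange(len(new))] += random.uniform(-0.5, 0.5)
    return new

best = [1.0, 5.0, 2.0]
best_score = objective(best)
for _ in range(10_000):
    cand = mutate(best)
    # keep only if correctness holds AND the objective improves
    if passes_tests(cand) and objective(cand) > best_score:
        best, best_score = cand, objective(cand)

print(best, best_score)
```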
- Does it support paged attention like vLLM, though? Without that they will run into memory fragmentation quickly.
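For anyone unfamiliar, a minimal sketch of the paged-KV-cache bookkeeping from the PagedAttention paper: the cache is carved into fixed-size blocks, and each sequence holds a list of block indices instead of one contiguous region, so freed blocks are immediately reusable. Block and pool sizes here are arbitrary illustration values:

```python
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free physical blocks
        self.table = {}                      # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        blocks = self.table.setdefault(seq_id, [])
        if n == len(blocks) * self.block_size:   # last block is full
            if not self.free:
                raise MemoryError("cache exhausted: preempt or swap")
            blocks.append(self.free.pop())       # grab any free block
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        # blocks return to the pool individually, so there is no
        # fragmentation even though they were never contiguous
        self.free.extend(self.table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token(seq_id=0)   # 3 tokens occupy 2 blocks
cache.free_sequence(0)             # both blocks immediately reusable
```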
- OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark setup, quantization, and so on.
- What's the jitter? What's the std? What about 1:1 output equality?
What's the POST-request latency here? What's the TTFT?
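For reference, a minimal sketch of how one might measure those numbers against an OpenAI-compatible endpoint; the URL, model name, and payload are placeholders, and measuring TTFT would additionally require a streaming request:

```python
import statistics, time
import requests

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
PAYLOAD = {"model": "Qwen/Qwen3-8B", "prompt": "Hello",
           "max_tokens": 64, "temperature": 0}

latencies = []
for _ in range(50):
    t0 = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60).raise_for_status()
    latencies.append(time.perf_counter() - t0)

latencies.sort()
# crude percentile: with 50 samples, p99 is effectively the max
print(f"mean {statistics.mean(latencies)*1e3:.1f} ms  "
      f"std {statistics.stdev(latencies)*1e3:.1f} ms  "
      f"p99 {latencies[int(0.99 * len(latencies))]*1e3:.1f} ms")
```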
by ismailmaj
1 subcomment
- Any place we can find the code?
- Luke: Do you have benchmarks for BF16?
by cermicelli
0 subcomments
- [flagged]