Hacker News

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

194 points by yu3zhou4

by yu3zhou4

2 subcomments

The README is, in my opinion (author here), the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.

by cookiengineer

0 subcomment

Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/

by dwa3592

1 subcomments

Very nice job on the README.
>>Physically, LLM is a file which contains a lot of float numbers.
aka atoms of the LLM.

by xuanlin314

0 subcomment

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

by GoldenJade

0 subcomment

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.

by tom-wal

0 subcomment

I feel like I learned twice as much in 10 minutes of reading this as I did from reading LLM for Dummies. Thank you

by nazgulsenpai

0 subcomment

I love the documentation formatted in lessons. I can't wait to read through it.

by juancn

0 subcomment

Looks interesting, it reminds me of the first llama.cpp, but better documented.

by einpoklum

0 subcomment

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(

by sylware

1 subcomments

I am looking for a plain and simple LLM inference implementation in C, and/or in x86_64 assembly, and/or in AMD GPU RDNA assembly.
Anybody?

by smy_smy

0 subcomment

interesting!
