Hacker News

Show HN: Librarian – Cut token costs by up to 85% for LangGraph and OpenClaw

8 points by Pinkert

by Pinkert

One architectural tradeoff we are actively working on is the latency of the "Select" step for shorter conversations.
Currently, the open-source version of Librarian uses a general-purpose model to read the summary index and select the relevant messages. It works great for accuracy and drastically cuts token costs, but it does introduce a latency penalty for shorter conversations, because it requires an extra LLM inference step before your actual agent can respond.
To solve this, we are training a heavily quantized, fine-tuned model optimized solely for this context-selection task. The goal is to push the selection latency below 1 second so the entire pipeline feels completely transparent. (We have a waitlist up for this hosted version on the site.)
If anyone here has experience fine-tuning smaller models (like Llama 3 or Mistral) strictly for high-speed classification/routing over context indexes, I'd love to hear what pitfalls we should watch out for.
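
For readers unfamiliar with the pattern, here is a rough sketch of what a "Select" step over a summary index could look like. This is not Librarian's actual code: the `Message` type, the function names, and `keyword_selector` (a trivial keyword match standing in for the LLM call) are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    idx: int
    text: str      # full message, expensive to resend every turn
    summary: str   # short summary, cheap to scan


def build_index(messages):
    """One short line per message; this is what the selector reads."""
    return "\n".join(f"{m.idx}: {m.summary}" for m in messages)


def keyword_selector(index, query):
    """Hypothetical stand-in for the LLM call: keep index entries
    whose summary shares at least one word with the query."""
    words = set(query.lower().split())
    picked = []
    for line in index.splitlines():
        idx, summary = line.split(": ", 1)
        if words & set(summary.lower().split()):
            picked.append(int(idx))
    return picked


def build_context(messages, picked):
    """Only the selected full messages are forwarded to the agent."""
    by_idx = {m.idx: m for m in messages}
    return [by_idx[i].text for i in picked]


history = [
    Message(0, "We deploy on Kubernetes with three replicas per service.",
            "deployment setup on kubernetes"),
    Message(1, "The billing job failed last night because of a timeout.",
            "billing job timeout failure"),
    Message(2, "Lunch options near the office are limited.",
            "lunch options"),
]

index = build_index(history)
picked = keyword_selector(index, "why did billing fail")
context = build_context(history, picked)
```

In a real setup the selector would be a model call, which is where the extra latency described above comes from: the agent cannot start until that call returns.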

by findjashua

Won't this essentially disable the prompt caching that you get from a standard append-only chat history?
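
To make the concern concrete: provider-side prompt caching generally only reuses work for a prefix that matches the previous request exactly. A toy sketch, assuming each list element stands for one cached chunk of the prompt (the message labels and the `shared_prefix` helper are hypothetical, and this measures nothing about Librarian itself):

```python
def shared_prefix(prev, curr):
    """Count leading chunks that are identical in both prompts.
    Only this leading run could be served from a prefix cache."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


system = "SYSTEM"
msgs = ["m0", "m1", "m2", "m3"]

# Append-only history: each turn extends the previous prompt.
turn1 = [system] + msgs[:3]
turn2 = [system] + msgs[:4]
append_only_hit = shared_prefix(turn1, turn2)

# Selected context: each turn may pick a different subset.
sel1 = [system, "m0", "m2"]
sel2 = [system, "m1", "m3"]
selected_hit = shared_prefix(sel1, sel2)
```

With append-only history the whole previous prompt is a reusable prefix (4 chunks here), while the two selected prompts only share the system chunk. Whether the token savings outweigh the lost cache hits would depend on the provider's cache pricing and how much the selection changes between turns.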