1. Separate your query analysis from retrieval. A single LLM call can classify the query type, decide whether to use hybrid search, and pick search parameters all at once. Merging those sub-tasks into one call saves round-trips versus running them sequentially.
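A minimal sketch of that merged call, assuming the model is prompted to emit JSON. The schema and the `llm` callable here are hypothetical stand-ins (in practice your serving layer would fill that role), not a prescribed interface:

```python
import json

def analyze_query(query: str, llm) -> dict:
    """One LLM call that routes, analyzes, and rewrites the query together.
    `llm` is any callable prompt -> str; the JSON schema below is illustrative."""
    prompt = (
        "Analyze the query and reply with ONLY a JSON object with keys: "
        "query_type ('exact_match'|'factual'|'conceptual'), "
        "use_hybrid (bool), bm25_weight (0.0-1.0), rewritten_query (str).\n"
        f"Query: {query}"
    )
    return json.loads(llm(prompt))

# Stub LLM so the sketch runs without a GPU; a real model would produce this JSON.
def stub_llm(prompt: str) -> str:
    return json.dumps({
        "query_type": "exact_match",
        "use_hybrid": True,
        "bm25_weight": 0.7,
        "rewritten_query": "Dune novel author",
    })

plan = analyze_query("who wrote Dune", stub_llm)
```

With a constrained-decoding or JSON-mode option on the server, the `json.loads` step is much less likely to fail than with free-form generation.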
2. If you add BM25 alongside vector search, the right blend ratio varies a lot by query type. Exact-match queries need heavy keyword weighting, while conceptual questions need more embedding weight. A static 50/50 split leaves performance on the table.
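One common way to do the blend is min-max normalization of each retriever's scores followed by a weighted sum, with the weight chosen per query type. A sketch (the weight values are illustrative, not tuned):

```python
def fuse_scores(bm25_scores: dict, dense_scores: dict, bm25_weight: float) -> dict:
    """Blend BM25 and dense-retrieval scores per doc id.
    Each input maps doc_id -> raw score; scores are min-max normalized
    per retriever before mixing, since the raw scales differ."""
    def norm(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    b, d = norm(bm25_scores), norm(dense_scores)
    return {
        doc_id: bm25_weight * b.get(doc_id, 0.0) + (1 - bm25_weight) * d.get(doc_id, 0.0)
        for doc_id in set(b) | set(d)
    }

# Illustrative per-query-type weights -- tune these on your own eval set.
BM25_WEIGHTS = {"exact_match": 0.7, "factual": 0.4, "conceptual": 0.2}

fused = fuse_scores({"a": 2.0, "b": 1.0}, {"b": 0.9, "c": 0.1},
                    BM25_WEIGHTS["exact_match"])
```

Reciprocal rank fusion is a reasonable alternative if you'd rather avoid score normalization entirely, though it gives up the per-query-type weighting knob.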
3. On your evaluator and generator being the same model — one practical workaround is to skip LLM-as-judge evaluation entirely and use a small cross-encoder reranker between retrieval and generation instead. It catches the cases where vector similarity returns chunks that are semantically related but not actually useful, and it gives you a relevance score you can threshold on without needing a separate evaluation model.
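The threshold-and-filter step might look like this. The scorer is injected so the sketch runs without model weights; in practice `CrossEncoder("BAAI/bge-reranker-v2-m3").predict` from sentence-transformers fits the same callable shape. The threshold value is an assumption to tune, not a recommendation:

```python
def filter_chunks(query: str, chunks: list, scorer, threshold: float = 0.3) -> list:
    """Score (query, chunk) pairs with a cross-encoder and keep only chunks
    above `threshold`, highest first. `scorer` takes a list of (query, chunk)
    pairs and returns one relevance score per pair."""
    scores = scorer([(query, chunk) for chunk in chunks])
    kept = [(chunk, score) for chunk, score in zip(chunks, scores) if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept

# Stub scorer standing in for the cross-encoder.
stub_scores = lambda pairs: [0.9, 0.1, 0.5]
kept = filter_chunks("q", ["good chunk", "weak chunk", "ok chunk"], stub_scores)
```

An empty `kept` list is then your "retrieval failed" signal — the place where the old GOOD/UNSURE/BAD LLM evaluator would have said BAD.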
4. Consider a two-level cache: exact match (hash the query, short TTL) plus a semantic cache (cosine similarity threshold on the query embedding, longer TTL). The semantic layer catches "how do I X" vs "what's the way to X" without hitting the retriever again.
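A sketch of the two-level cache, assuming an injected `embed` callable (BGE in practice) — the TTLs and the 0.92 similarity threshold are illustrative defaults, and the semantic layer is a linear scan rather than an ANN index for clarity:

```python
import hashlib
import time

class TwoLevelCache:
    """Level 1: exact match on a hash of the query, short TTL.
    Level 2: cosine similarity on the query embedding, longer TTL."""

    def __init__(self, embed, exact_ttl=300, semantic_ttl=3600, sim_threshold=0.92):
        self.embed = embed                # callable: query -> list[float]
        self.exact_ttl = exact_ttl
        self.semantic_ttl = semantic_ttl
        self.sim_threshold = sim_threshold
        self.exact = {}                   # sha256 hex -> (expires_at, answer)
        self.semantic = []                # [(expires_at, embedding, answer)]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    def get(self, query):
        now = time.time()
        key = hashlib.sha256(query.encode()).hexdigest()
        hit = self.exact.get(key)
        if hit and hit[0] > now:          # level 1: exact hash hit
            return hit[1]
        emb = self.embed(query)           # level 2: semantic hit
        for expires_at, cached_emb, answer in self.semantic:
            if expires_at > now and self._cosine(emb, cached_emb) >= self.sim_threshold:
                return answer
        return None

    def put(self, query, answer):
        now = time.time()
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = (now + self.exact_ttl, answer)
        self.semantic.append((now + self.semantic_ttl, self.embed(query), answer))
```

At scale you'd replace the linear scan with the FAISS index you already run, and evict expired semantic entries on write.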
What model are you using for generation on the 8GB? That constraint probably shapes a lot of the architecture choices downstream.
You’re right about my query flow: I’m still doing separate LLM calls for the router, analyzer, and rewriter. Merging that into one should cut latency a lot, especially since Qwen2.5-7B-AWQ on an RTX 4000 Ada only gives me ~15–25 tok/s.
The BM25 point is spot-on too. I’ve been running pure vector search (BGE-base-en-v1.5 + FAISS, reranked with bge-reranker-v2-m3). Adding BM25 with dynamic weighting — especially for exact-match queries like titles/authors — is something I really shouldn’t keep putting off.
Using the cross-encoder as the evaluator is probably the easiest fix. My current GOOD/UNSURE/BAD scoring uses the same Qwen model, which is the circular issue I mentioned. Since I’m already running the cross-encoder, letting it handle the thresholding would let me drop the LLM evaluator entirely.
No caching yet, but I’ll start with exact-match hashing and layer semantic caching later.
Model-wise: Qwen2.5-7B-AWQ on GPU, with Qwen2.5-14B on CPU as a slow fallback. AWQ is what makes the 8GB VRAM setup workable.
Really appreciate you taking the time — I’ll open issues for hybrid search + caching this week.