- My advice for building something like this: don't get hung up on needing vector databases and embeddings.
Full-text search, or even grep/rg, is a lot faster and cheaper to work with - no vector database index to maintain - and turns out to work really well if you put it in some kind of agentic tool loop.
The big benefit of semantic search was that it could handle fuzzy searching - returning results that mention dogs if someone searches for canines, for example.
Give a good LLM a search tool and it can come up with searches like "dog OR canine" on its own - and refine those queries over multiple rounds of searches.
Plus it means you don't have to solve the chunking problem!
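The loop described above can be sketched with plain SQLite FTS5; the table, fields, and documents here are made up for illustration:

```python
# Toy sketch of the "search tool in an agentic loop" idea: a plain SQLite
# FTS5 index that an LLM can query with boolean syntax like "dog OR canine".
import sqlite3

conn = sqlite3.connect(":memory:")
# The porter tokenizer stems words, so "canine" also matches "canines".
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body, tokenize='porter')")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("pets", "My dog loves long walks."),
        ("wildlife", "Wild canines hunt in packs."),
        ("cats", "Cats sleep most of the day."),
    ],
)

def search(query: str, limit: int = 5) -> list[tuple[str, str]]:
    """The tool handed to the LLM; it can refine `query` over several rounds."""
    return conn.execute(
        "SELECT title, body FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()

# A model asked about "canines" can broaden the query on its own:
print(search("dog OR canine"))  # hits both the "pets" and "wildlife" docs
```

Everything beyond the tool call - deciding to OR in synonyms, narrowing or widening the query - is left to the model across multiple rounds.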
by mips_avatar
1 subcomment
- One thing I didn’t see here that might be hurting your performance is a lack of semantic chunking. It sounds like you’re embedding entire docs, which breaks down when a doc contains multiple concepts. A better approach for recall is to use a chunking tool to produce semantic chunks (I like spaCy, though it takes some configuration). Then, before embedding each chunk, append context describing how it relates to the rest of the doc. I have found Anthropic's approach to contextual retrieval (https://www.anthropic.com/engineering/contextual-retrieval) very effective in my RAG systems; you can use gpt-oss-20b as the model for generating the context.
Unless I’ve misunderstood your post and you're already doing some form of this in your pipeline, you should see a dramatic improvement in performance once you implement it.
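A minimal sketch of that chunk-plus-context pipeline. The splitter is deliberately naive, and `describe_chunk` is a stand-in for a real LLM call (e.g. to gpt-oss-20b); all names and the sample doc are illustrative:

```python
# Split a doc into chunks, then prepend a short blurb situating each chunk
# in the whole doc before embedding, per the contextual-retrieval approach.
def split_into_chunks(doc: str, max_chars: int = 200) -> list[str]:
    """Naive paragraph-based splitter; a real pipeline would use spaCy
    sentence boundaries or similar to get true semantic chunks."""
    chunks, current = [], ""
    for para in (p.strip() for p in doc.split("\n\n") if p.strip()):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def describe_chunk(doc: str, chunk: str) -> str:
    # Placeholder: in practice, prompt a small LLM with the full doc and the
    # chunk, asking for 1-2 sentences on how the chunk fits the document.
    return f"[Context: part of a document that begins '{doc[:30]}...']"

def contextualize(doc: str, max_chars: int = 200) -> list[str]:
    # These contextualized strings are what get embedded, not the raw chunks.
    return [f"{describe_chunk(doc, c)}\n{c}"
            for c in split_into_chunks(doc, max_chars)]

doc = "Intro paragraph about RAG.\n\nDetails about chunking.\n\nNotes on embeddings."
for entry in contextualize(doc, max_chars=40):
    print(entry)
```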
by abhashanand1501
0 subcomments
- My advice: apply the same rigor to a RAG application as to any other software development. Have a test suite (of, say, 100 cases) that specifies the correct response for each question. Use an LLM judge to score each output of the RAG system, then iterate until you hit a score of around 85. Every change to prompts or strategy triggers this check, ensuring the score of 85 is always maintained.
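A sketch of that regression harness. The suite, `rag_answer`, and `llm_judge` are placeholders; a real judge would prompt a model to return a 0-100 score rather than the containment check used here:

```python
# Fixed suite of (question, expected) cases, each output scored by a judge,
# with the whole suite gated on an aggregate threshold - run it in CI on
# every prompt or strategy change.
TEST_SUITE = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
]

def rag_answer(question: str) -> str:
    # Placeholder for the RAG system under test.
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }
    return canned[question]

def llm_judge(expected: str, actual: str) -> int:
    # Placeholder: a real judge prompts an LLM to score factual agreement
    # from 0 to 100. Here we just check containment.
    return 100 if expected.lower() in actual.lower() else 0

def run_suite(threshold: int = 85) -> bool:
    scores = [llm_judge(c["expected"], rag_answer(c["question"]))
              for c in TEST_SUITE]
    return sum(scores) / len(scores) >= threshold

print(run_suite())  # gate merges on this boolean
```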
- Why is it implicit that semantic search will outperform lexical search?
Back in 2023 when I compared semantic search to lexical search (tantivy; BM25), I found the search results to be marginally different.
Even if semantic search has slightly more recall, does the problem of context warrant this multi-component, homebrew search engine approach?
By what important measure does it outperform a lexical search engine? Is the engineering time worth it?
- While learning about LLMs and llama.cpp, I did an experiment: creating a Lua extension for the llama.cpp API to enhance LLMs with agent/RAG capabilities, written in simple Lua code to learn the basics. After more than 5 hours chatting with https://aistudio.google.com/prompts/new_chat?model=gemini-3-... (see the scraped output of the whole session attached), I got quite far in learning how to use an LLM to help develop, debug, and learn about a topic (in this case, agent/RAG with the llama.cpp API using Lua).
I'm posting it here in case it helps others to see and comment on or improve it (it was using around 100K tokens by the end and got noticeably slow, but was still very helpful).
You can see the scraped text of the whole session here:
https://github.com/ggml-org/llama.cpp/discussions/17600
- What we use:
- https://github.com/ggozad/haiku.rag
Why?
- developer oriented (easy to read Python and uses pydantic-ai)
- benchmarks available
- docling with advanced citations (on branch)
- supports deep research agent
- real open source from a long-term committed developer, not fly-by-night
- I'd like to have a local, fully offline and open-source software into which I can dump all our Emails, Slack, Gdrive contents, Code, and Wiki, and then query it with free form questions such as "with which customers did we discuss feature X?", producing references to the original sources.
What are my options?
I want to avoid building my own or customising a lot. Ideally it would also recommend which models work well and have good defaults for those.
- > we use Sentence Transformers (all-MiniLM-L6-v2) as our default (solid all-around performer for speed and retrieval, English-only).
Huh, interesting. I might be building a German-language RAG at some point in my future and I never even considered that some models might not support German at all. Does anyone have any experience here? Do many models underperform or not support non-English languages?
- When I started playing with this stuff in the GPT-4 days (8K context!), I wrote a script that would search for a relevant passage in a book, by shoving the whole book into GPT-4, in roughly context sized chunks.
I think it was like a dollar per search or something in those days. We've come a long way!
Anthropic, in their RAG article, actually say that if your thing fits in context, you should probably just put it there instead of using RAG.
I don't know where the optimal cutoff is though, since quality does suffer with long contexts. (Not to mention price and speed.)
https://www.anthropic.com/engineering/contextual-retrieval
Context sizes and pricing have come so far! Now the whole book fits in context, and it's like 1 cent to put the whole thing in.
(Well, a little more with Anthropic's models ;)
- The hardest part of RAG is document parsing. If you only consider text it should be OK, but once you have tables (including tables spanning multiple pages), charts, TOCs to skip, footnotes, etc., that part becomes really hard, and accuracy in retrieving the right context suffers regardless of what chunking you use.
There are some patterns that help, such as RAPTOR, where you make ingestion content-aware: instead of just ingesting content, you use LLMs to question and summarize the content and save that to the vector database.
But the reality is that a one-size-fits-all approach to RAG is not an easy task.
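A toy illustration of the RAPTOR-style idea, not the actual algorithm (real RAPTOR clusters chunks and summarizes recursively; `summarize` here stands in for an LLM call, and the sample chunks are made up):

```python
# Index both the raw chunks and LLM-written summaries of groups of chunks,
# so retrieval can match at either level of abstraction.
def summarize(chunks: list[str]) -> str:
    # Placeholder for an LLM summarization call over a group of chunks.
    return "Summary of: " + " / ".join(c[:20] for c in chunks)

def build_index_entries(chunks: list[str], group_size: int = 2) -> list[str]:
    entries = list(chunks)  # leaf level: the raw chunks
    for i in range(0, len(chunks), group_size):
        entries.append(summarize(chunks[i:i + group_size]))  # summary level
    return entries  # every entry gets embedded into the vector database

chunks = ["Table of quarterly revenue...",
          "Footnote on methodology...",
          "Chart showing growth..."]
entries = build_index_entries(chunks)
print(len(entries))  # 3 raw chunks + 2 group summaries = 5
```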
by urbandw311er
1 subcomment
- When it comes to the evals for this kind of thing, is there a standard set of test data out there that one can work with to benchmark against? ie a collection of documents with questions that should result in particular documents or chunks being cited as the most relevant match.
by into_the_void
0 subcomments
- Interesting perspective on the use of full-text search over vector databases for RAG. I appreciate the insights on agentic tool loops and handling fuzzy searching.
- You can get local RAG with AnythingLLM with minimal effort too, FWIW. Pretty much plug and play. I used it for simple testing of an idea before getting into the weeds of LangChain and agentic RAG.
by JKCalhoun
2 subcomments
- I kinda do want to build a local RAG? I want some significant subset of Wikipedia (I assume most people know about these) on a dedicated machine with a RAG front-end. I would have then an offline Wikipedia "librarian" I could query.
But I'm lazy and assumed that someone has already built such a thing. I'm just not aware of this "Wikipedia-RAG-in-a-box".
- I'm interested in the embedding models suggested. I had some good results with nomic in a small embedding-based tool I built. I also heard a few good things about qwen3-embedding, though the latency wasn't great for my use case, so I didn't pursue it much further.
Similarly, I used sqlite-vec, and was very happy with it. (if I were already using postgres I'd have gone with that, but this was more of a cli tool).
If the author is here: did you try any of those models? How would you compare the ones you did use?
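For readers unfamiliar with what sqlite-vec provides, here is a toy stand-in (not the sqlite-vec API itself) for its core operation: k-nearest-neighbor search over stored embeddings by cosine similarity, with made-up 3-dimensional vectors:

```python
# Brute-force KNN over an in-memory "vector store" - the operation a vector
# DB like sqlite-vec does efficiently at scale. Doc names and vectors are
# illustrative; real embeddings would come from a model like all-MiniLM-L6-v2.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

store = {
    "doc-about-dogs": [0.9, 0.1, 0.0],
    "doc-about-cats": [0.1, 0.9, 0.0],
    "doc-about-code": [0.0, 0.1, 0.9],
}

def knn(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k doc ids most similar to the query embedding."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, store[d]),
                    reverse=True)
    return ranked[:k]

print(knn([0.8, 0.2, 0.0], k=1))  # nearest neighbor: ['doc-about-dogs']
```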
- > What that means is that when you're looking to build a fully local RAG setup, you'll need to substitute whatever SaaS providers you're using for a local option for each of those components.
Even starting with "just" the documents and the vector DB locally is a huge first step and much more doable than going fully local with the LLM at the same time. I don't know anyone, or any org, that has the resources to run their own LLM at scale.
- I built this for local RAG: https://github.com/kbrisso/byte-vision. It uses llama.cpp and Elasticsearch. On a laptop with an 8 GB GPU it can handle a 30K-token context and summarize a fairly large PDF.
- For an open source, local (or cloud) vector DB, I would also recommend checking out Chroma (https://trychroma.com). It also supports full text search. Disclaimer: I work on Chroma cloud.
- If you end up using any of the frontier models, don't forget to protect private information in your prompts - https://github.com/deepanwadhwa/zink
- Interesting stack. I’ve been working on doing something like this with Apple-specific tech. SwiftData is not easy to work with.
by throwaway19343
0 subcomments
- No
by ElasticBottle
0 subcomments
- How does this compare with orama?
- Glad to see all the interest in the local RAG space, it's been something I've been pushing for a while.
I just put this example together today: https://gist.github.com/davidmezzetti/d2854ed82f2d0665ec7efd...
by spacecadet
1 subcomment
- You can vibe code a local RAG with or without vectors in 5 minutes. Like another commenter pointed out, unless your corpus is huge, you do not need vectors, but hey using vectors is fun so why not.
For what it's worth, I run a local-first, small-model, private RAG that uses LangGraph and Neo4j knowledge graphs, and I swap the models around constantly. It mostly just gets called by agent tools now.
by adastra22
1 subcomment
- Rust API?