Another requirement was keeping latency as low as possible (we managed to get under 5 seconds with 85%+ accuracy). Their approach seems to have very unpredictable latency, sometimes up to thousands of seconds (which may be fine for background tasks), and it scales poorly with corpus size.
Interesting research anyway, but I'd still stick with embedding/reranker-based retrieval (+BM25 for hybrid search), because you don't waste time wandering around blindly each time, trying to find the minimal starting context that an index could have surfaced immediately. Another issue is that research papers often implement subpar baselines for the approaches they compare against. When I was implementing retrieval, the straightforward implementation gave me 40% accuracy, and various tricks and parameter tuning pushed it to 85%+ without changing the overall architecture (it took about a month of experimentation).
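For concreteness, here's a minimal sketch of the kind of hybrid setup I mean, fusing BM25 and embedding rankings with reciprocal rank fusion. The library choices (rank_bm25, sentence-transformers), the model name, and the RRF constant are my picks for the example, not anything from the paper:

```python
# Minimal hybrid retrieval sketch: BM25 + dense embeddings, fused with
# reciprocal rank fusion (RRF). All specifics below are assumptions.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "How to configure the retry policy for the ingestion pipeline",
    "BM25 is a lexical ranking function based on term frequency",
    "Dense embeddings capture semantic similarity between texts",
]

# Lexical index over whitespace tokens
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Dense index; normalized vectors so dot product == cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 3, rrf_k: int = 60):
    lex_scores = bm25.get_scores(query.lower().split())
    sem_scores = doc_vecs @ model.encode(query, normalize_embeddings=True)
    fused = np.zeros(len(docs))
    for scores in (lex_scores, sem_scores):
        # rank 0 = best; RRF rewards docs ranked highly by either system
        for rank, idx in enumerate(np.argsort(scores)[::-1]):
            fused[idx] += 1.0 / (rrf_k + rank + 1)
    return [(docs[i], fused[i]) for i in np.argsort(fused)[::-1][:k]]

for doc, score in hybrid_search("semantic search with embeddings"):
    print(f"{score:.4f}  {doc}")
```

The tuning that got me from 40% to 85%+ was mostly in details like this: chunking, score fusion weights, and what the reranker sees.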
In many cases cheap methods like grepping and BM25 just aren't going to work well, so semantic similarity is the best initial retriever/filter, followed by LLM-as-judge as a second filter/reranker if you need the precision.
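Something like this for the second stage, using the OpenAI SDK purely for illustration (the model name is an assumption; any small, fast chat model works as the judge):

```python
# Sketch of the two-stage filter: embedding similarity as the cheap first
# pass, LLM-as-judge as the precision filter. Client and model name are
# assumptions for the example.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Query: {query}\n\nPassage: {passage}\n\n"
    "Does the passage help answer the query? Reply with only YES or NO."
)

def llm_filter(query: str, candidates: list[str]) -> list[str]:
    """Keep only the candidates the judge marks relevant."""
    kept = []
    for passage in candidates:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed; swap in any small/fast model
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(query=query, passage=passage),
            }],
            temperature=0,
        )
        if resp.choices[0].message.content.strip().upper().startswith("YES"):
            kept.append(passage)
    return kept

# Usage: run the judge only on the short list from the first-stage
# retriever, which is what keeps latency bounded.
# top_k = hybrid_search("my query", k=20)
# precise = llm_filter("my query", [doc for doc, _ in top_k])
```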
Is anyone using small, low-latency, fast LLMs to implement stuff like search as a RAG alternative? Could be the perfect use case for that Llama3 8B ASIC some company showed off a few months ago.
But current IR methods, both lexical and semantic retrieval, definitely have bottlenecks, as pointed out in the obliq-bench paper (https://arxiv.org/abs/2605.06235).
But it still has to enumerate synonyms to find things.
I would assume it's very domain-dependent: code or technical docs have more precise terminology that suits fixed-string search, while medical or legal text can have many, many ways to say the same thing.
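A toy example of the synonym problem, with a made-up medical sentence:

```python
# Lexical search only hits if the right surface form is enumerated; a
# single embedding query would cover all of these phrasings at once.
# The passage and synonym list are invented for illustration.
import re

passage = "The patient reported a cephalalgia persisting for three days."

for term in ["headache", "migraine", "head pain", "cephalalgia"]:
    if re.search(term, passage, re.IGNORECASE):
        print("lexical hit:", term)  # only "cephalalgia" matches

# Semantic: one query embedding for "headache" would score this passage
# highly (see the sentence-transformers sketch above), no synonym list.
```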
The constraint I, and I bet many here, have is just how much data there is. 3GB, like in the 2014 article, is one .pdf.
Enterprise-level data stores are measured in hundreds of GB for a single customer, and you'll get murdered on data-egress costs if you try to search an entire corpus, assuming you can even get through it all before the request times out or the customer decides after 5 minutes that enough is enough.
You'd need a true distributed filesystem to even start attempting what the authors suggest at any scale outside of your local machine.
Rant off. Not really related to the article.