- This has nothing to do with superintelligence; the people who were working on the paper prior to the re-org just happened to publish after the name change.
Though it is notable that, contrary to many predictions (on HN and Twitter) that Meta would stop publishing papers and become like other AI labs (e.g. OpenAI), they've continued their rapid pace of releasing papers AND open-source models.
- It's kinda funny: Meta has long had some of the best in the field, but left them untapped. I really think if they just took a step back, stopped being so metric-focused, and let their people freely explore, then they'd be winning the AI race. But with this new team, I feel like Meta mostly hired the people who are really good at gaming the system, the people who care more about the money than the research.
A bit of this is true at every major lab. There's tons of untapped potential. But these organizations are very risk averse. I mean, why not continue with the strategy that got us to this point in the first place? Labs used to hire researchers and give them a lot of free rein. But those times ended, and AI progress also slowed down. Maybe if you want to get ahead you gotta stop thinking like everyone else.
Well Meta... you can "hold me hostage" for a lot cheaper than those guys. I'm sure this is true for hundreds of passionate ML researchers. I'd take a huge pay cut to have autonomy and resources. I know for a fact there are many working at Meta right now who would do the same. So maybe if you're going to throw money at the problem, diversify a bit and look back at what made SV what it is today and what made AI take leaps forward.
by mark_l_watson
0 subcomment
- A great idea, bypassing as much conversion as possible between vector space and natural language tokens. Reminds me of a discussion of having AIs “talk” to each other using vector space.
There was an interesting quote, “plain old BM25 from 1994 outperforms vector search on recall”, that is super relevant to what I did yesterday. I am trying to use small local models more often, and yesterday I wrote Common Lisp code that uses a large corpus of text and a user query or prompt to construct a fairly concise one-shot prompt with select context from the text corpus. This is RAG, and I used both BM25 and vector embedding matching. I added the code and an example as a new chapter in my CL book yesterday afternoon (link directly to the new material: https://leanpub.com/lovinglisp/read#leanpub-auto-autocontext...). BM25 is fast. This is new code, and I will certainly be experimenting more with it, but as-is it is useful when working with small local LLMs.
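For reference, here is a minimal BM25 scoring sketch in Python rather than the book's Common Lisp; the toy corpus, query, and the k1/b parameters are illustrative assumptions, not values from the book or the paper:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

# Toy corpus; in the RAG setup described above, the top-scoring chunks
# would be pasted into the one-shot prompt as context.
docs = [t.lower().split() for t in [
    "BM25 is a classic lexical ranking function",
    "vector embeddings capture semantic similarity",
    "Common Lisp code for retrieval augmented generation",
]]
print(bm25_scores("bm25 ranking".split(), docs))
```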
by schmorptron
5 subcomments
- One thing I don't get about the ever-recurring RAG discussions and the hype men proclaiming "RAG is dead" is that people seem to be talking about wholly different things.
My mental model is that what is called RAG can either be:
- a predefined document store / document chunk store where every chunk gets a vector embedding, and a lookup decides what gets pulled into context, so you don't have to pull in whole classes of documents and fill it up (a minimal sketch of this flavor is below)
- the web-search-like features in LLM chat interfaces, where they do keyword search and pull relevant documents into context, but only ephemerally, with the full documents not taking up context later in the thread (unsure about this, did I understand it right?).
With the new models with million-plus-token context windows, some were arguing that we can just throw whole books into the context non-ephemerally, but doesn't that significantly reduce the diversity of possible sources we can include at once if we hard-commit to everything staying in context forever? I guess it might help with consistency? But is the mechanism by which we decide what to keep in context not still some kind of RAG, just with larger chunks or whole documents instead of only parts?
I'd be ecstatic if someone who really knows their stuff could clear this up for me.
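As a reference point for the first flavor, a hypothetical sketch: embed the chunks once, then at query time pick the top-k by cosine similarity and paste only those into the prompt. The embed() function here is a made-up stand-in for whatever embedding model is actually used.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a sentence transformer); returns a unit vector."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)  # toy, text-seeded vector
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

# Index once: every chunk gets a vector embedding.
chunks = ["chunk about billing", "chunk about the API", "chunk about deployment"]
index = np.stack([embed(c) for c in chunks])

# At query time: a lookup decides which chunks get pulled into context.
query = "how do I call the API?"
scores = index @ embed(query)              # cosine similarity, since all vectors are unit-norm
top_k = np.argsort(scores)[::-1][:2]       # keep only the best chunks, not whole documents
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Context:\n{context}\n\nQuestion: {query}"
print(prompt)
```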
- this was really weird to read:
> But RAG is a very real world, practical topic for something as significant as a new lab’s first paper.
I would expect exactly the opposite - that a new lab would put out a few random papers that happen to be in areas their researchers were interested in and already working on, and once people had been working together a while and developed some synergy they would maybe come out with something really groundbreaking.
do people really view a "first paper" as something deeply significant and weighty? because that just seems like a good way to get bogged down in trying to second guess whether any given paper was good enough to be your all-important debut!
- Can we have a more informative, less clickbaity, title?
- Interesting. All developers I know who tinkered around with embeddings and vector similarity scoring were instantly hooked. The efficiency of computing the embeddings once and then reusing as many times as needed, comparing the vectors with a cheap <30-line function is extremely appealing. Not to mention the indexing capabilities to make it work at scale.
IMO vector embedding is the most important innovation in computing of the last decade. There's something magical about it. These people deserve some kind of prize. The idea that you can reduce almost any intricate concept including whole paragraphs to a fixed-size vector which encapsulates its meaning and proximity to other concepts across a large number of dimensions is pure genius.
- I'm curious whether this is work that was specifically begun under the "superintelligence" umbrella, or if it's just that the people who were working on it had been shifted to the Superintelligence team by the time they wrote the paper. I would guess the former?
- https://github.com/simulanics/REFRAG
- The observation about the "block-diagonal patterns" in RAG isn't new and has been exploited / explored before:
- https://arxiv.org/abs/2410.07590 (literally titled "Block-Attention for Efficient RAG")
- https://arxiv.org/abs/2409.15355v3
- https://arxiv.org/abs/2212.10947
The REFRAG paper does not cite any of these.
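For anyone unfamiliar with the term, a rough illustration of what "block-diagonal" means here (chunk sizes and the masking scheme below are illustrative, not taken from any of these papers): retrieved passages mostly attend within themselves, so the cross-chunk attention blocks can be masked out.

```python
import numpy as np

def block_diagonal_mask(chunk_lens, query_len):
    """Boolean attention mask where each retrieved chunk attends only to itself
    and the query tokens attend to everything (one common way to sparsify RAG attention)."""
    total = sum(chunk_lens) + query_len
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for ln in chunk_lens:                       # one independent block per retrieved chunk
        mask[start:start + ln, start:start + ln] = True
        start += ln
    mask[start:, :] = True                      # query attends to all chunks and to itself
    return mask

print(block_diagonal_mask([3, 2], query_len=2).astype(int))
```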
- Here is a video I made diving into the paper, hopefully helpful!
https://www.youtube.com/watch?v=Ek0tZootK00
by mountainriver
0 subcomment
- This was a very obvious next step; I played around with implementing something similar at one point.
In general we need to make it simpler for LLMs to take in different forms of embeddings, or at least build frameworks that simplify it.
- I am not surprised, because the culture at Meta is not at all, even in the slightest, to focus on science for the sake of it. It’s actively purged out of you. The focus is on metrics and how the bottom line is impacted. So this is in line with that.
- This is not work by any of the high profile new hires, in case folks are confused.
- Seems very incremental and very far from the pompous 'superintelligence' goal.
- I am not sure if I understand things correctly.
I came to believe that LLMs work with token embeddings. Is REFRAG then only "something" in front of the LLM, with the decoder being the RL policy that expands only some chunk embeddings into token embeddings feedable to the LLM? Or does REFRAG need you to 'tune' the LLM so it can work with both token embeddings and chunk embeddings?
- So this looks essentially like continuous prompting (see prefix tuning) with RL-driven selection of what to present as tokens and what as continuous inputs (embeddings).
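A hypothetical sketch of that mixed-input idea (the dimensions, the projection layer, and the use of `inputs_embeds` are my assumptions, not details from the paper): some positions carry ordinary token embeddings while others carry one precomputed chunk embedding each, projected into the model's embedding space.

```python
import torch
import torch.nn as nn

d_model, d_chunk, vocab = 64, 32, 1000            # toy sizes
tok_emb = nn.Embedding(vocab, d_model)            # the LLM's usual token embedding table
chunk_proj = nn.Linear(d_chunk, d_model)          # maps retriever chunk embeddings into LLM space

prompt_ids = torch.tensor([[1, 15, 42, 7]])       # the question, kept as plain tokens
chunk_embs = torch.randn(1, 3, d_chunk)           # 3 retrieved chunks, one vector each

# Mixed sequence: 3 chunk positions + 4 token positions. In a Hugging Face style decoder,
# a tensor like this could be passed via `inputs_embeds`, so each chunk costs one position
# instead of hundreds of tokens.
mixed = torch.cat([chunk_proj(chunk_embs), tok_emb(prompt_ids)], dim=1)
print(mixed.shape)  # torch.Size([1, 7, 64])
```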
by bigcat12345678
0 subcomment
- https://docs.lamini.ai/memory_rag/
Similar approaches have been tried before.
by aurohacker
0 subcomment
- Figure 1 in the paper is all about the encoder and how the context and query are packaged and sent to the decoder. I wish it were more complete...
- I couldn't immediately see in their graphs/tables any comparison against simple lexical/statistical context compression, such as candidate selection of chunks using TF-IDF, word overlap, etc. For most of us in the industry, we need to find these quick wins that give us performance equivalent to sending huge amounts of information to the LLM, while compressing by 10x.
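For illustration, a minimal sketch of that kind of lexical baseline (the corpus, query, and the keep=2 cutoff are made-up examples): rank chunks by TF-IDF similarity to the query and send only the top few to the LLM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_chunks_tfidf(query, chunks, keep=2):
    """Cheap lexical baseline: rank chunks by TF-IDF cosine similarity to the query
    and keep only the top few, compressing what gets sent to the LLM."""
    vec = TfidfVectorizer().fit(chunks + [query])
    scores = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:keep]]

chunks = [
    "invoice and billing policy for enterprise customers",
    "API authentication uses bearer tokens",
    "deployment instructions for kubernetes clusters",
]
print(select_chunks_tfidf("how do I authenticate against the API?", chunks))
```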
- > the core insight here is actually: if embeddings are generated by layers within the LLM, it makes no sense to convert them back to natural language, just for another LLM to compress those tokens back to embeddings.
Doesn't this tie the two layers together in a way that they can't evolve separately?
- This was inevitable. You can't keep training LLMs and expect that's the answer to the evolution of AI. Yes, we'll keep creating newer, more refined, and bigger models, but it's like DNA, or the cortex of the brain: after that you need systems that essentially "live" for years digesting information and develop a more refined way to process, store, and retrieve it.
Compression of RAG was also inevitable; it's like the B-tree index of a database. The thing is, we're probably one or two iterations away from being good enough on the RAG pipeline, and then we'll need to focus more on the other pieces of sensory input that need to be connected and processed at higher throughput. Right now it's not fast or efficient enough.
This is where the likes of Google will shine. They are probably two decades ahead of everyone on internal technology, and there is some team with the breakthrough that just hasn't seen the light of day yet. What's coming out of DeepMind is really a forced effort at productization and publication of work in a consumable format, but internally they are likely way ahead. I don't have as much faith in Meta's efforts despite seeing things like this. Quite frankly, the people doing the work should move to more honourable companies, not feed crack addiction in the form of Meta's universe.
- Did a "superintelligence" lab publish a superintelligence related paper with no results for intelligence? What measured improvements did this proposal make in their LLM's intelligence?
by mikepalmer
0 subcomment
- I hate articles that don't define their acronyms! Lazy? Intentionally exclusive?
So that others don't also have to look it up, it's Retrieval-Augmented Generation (RAG).
They even say it's "a topic that we didn’t expect"... so... perhaps many people wouldn't have heard of it?
- > Long awaited first paper from Meta Superintelligence Labs is not a model layer innovation. What does this mean?
It means you're reading into it too much and need to be let down, gently, from the hype train.
- Refreshing (and slightly unexpected) to see Meta Superintelligence start with something this practical instead of a headline-grabbing new model
by singularity2001
0 subcomment
- somewhere in my hacker news comment history I presented this very idea
- So, show me the model weights, please.
- Can we please get rid of the clickbait titles?
- I find it absurd that, compared to the past, large companies now have higher stock prices and more cash than ever before, yet nearly every AI lab in these companies is facing greater pressure than ever, being asked to generate short-term profits. In the midst of AI's unprecedented boom, the research environment and atmosphere in the industry seem to have worsened.
- At first I thought the superintelligence wrote a novel scientific paper.
- A great post; it starts with this:
TL;DR
• MSI’s first paper, REFRAG, is about a new way to do RAG.
• This slightly modified LLM converts most retrieved document chunks into compact, LLM-aligned chunk embeddings that the LLM can consume directly.
• A lightweight policy (trained with RL) decides which chunk embeddings should be expanded back into full tokens under a budget; the LLM runs normally on this mixed input.
• The net effect is far less KV cache and attention cost, much lower first-byte latency, and higher throughput, while preserving perplexity and task accuracy in benchmarks.
I wish more long posts followed this model of a scientific paper.
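To make the third bullet point concrete, a hypothetical stand-in for the selection step (the greedy scorer and the 300-token budget are illustrative; per the TL;DR, the actual policy is trained with RL):

```python
import numpy as np

def choose_expansions(chunk_scores, chunk_token_lens, token_budget):
    """Greedy stand-in for the learned policy: expand the highest-scoring chunks back
    into full tokens until the budget is spent; the rest stay as single chunk embeddings."""
    order = np.argsort(chunk_scores)[::-1]
    expanded, used = [], 0
    for i in order:
        if used + chunk_token_lens[i] <= token_budget:
            expanded.append(int(i))
            used += chunk_token_lens[i]
    return sorted(expanded)

# 5 retrieved chunks with relevance scores and token lengths, 300-token expansion budget
print(choose_expansions(np.array([0.9, 0.2, 0.7, 0.4, 0.1]), [120, 200, 150, 80, 60], 300))
```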
- Working in big tech, it's pretty wild to see how integral AI has become to our work internally, vs. the public perception of it. People are NOT prepared.