I also have my gripes about the way 2-hop questions are presented here, with figure 3 being the canonical example of what I would consider too trivial/misleading: the exact text "Eric Watts" appears in both the question and the context. That leads to the natural question of how this compares to an LLM with a grep tool.
What I would consider more interesting is practical synthesis over a context that large, where you can't just string-lookup the answers. For example, dump all of Intel's x86 manuals into context and then ask the LLM to write some assembly.
I also think some of the benchmarks are misleading. Running a RAG system on an attention benchmark and then comparing it against a model without RAG just isn't fair: of course it does better, but it's not apples to apples. Some of the benchmarks do compare against model+RAG, and there the performance delta is much smaller.