Show HN: Use Claude Code to Query 600 GB Indexes over Hacker News, ArXiv, etc.
by barishnamazov
4 subcomments
- I like that this relies on generating SQL rather than just being a black-box chat bot. It feels like the right way to use LLMs for research: as a translator from natural language to a rigid query language, rather than as the database itself. Very cool project!
Hopefully your API doesn't get exploited and you are doing timeouts/sandboxing -- it'd be easy to do a massive join on this.
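One cheap server-side guard is a hard per-query compute budget. A toy sketch using SQLite's progress handler — the project's actual database engine is unknown, so this only illustrates the idea of aborting runaway joins:

```python
import sqlite3

def run_with_budget(conn: sqlite3.Connection, sql: str, max_ticks: int = 200):
    """Abort any query that exceeds a fixed budget of VM-op ticks."""
    ticks = 0

    def watchdog():
        nonlocal ticks
        ticks += 1
        # Returning non-zero makes SQLite abort the running statement.
        return 1 if ticks > max_ticks else 0

    conn.set_progress_handler(watchdog, 10_000)  # one tick per 10k VM ops
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

print(run_with_budget(conn, "SELECT count(*) FROM t"))          # fine
# run_with_budget(conn, "SELECT count(*) FROM t a, t b, t c")   # raises
#   sqlite3.OperationalError: a billion-row cross join blows the budget
```

A wall-clock timeout plus a row/byte cap on responses would cover most of the rest.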
I also have a question mostly stemming from me being not knowledgeable in the area -- have you noticed any semantic bleeding when research is done between your datasets? e.g., "optimization" probably means different things under ArXiv, LessWrong, and HN. Wondering if vector searches account for this given a more specific question.
- wondering what is your stack? What SQL database are you using?
by nathan_f77
1 subcomments
- This sounds awesome! I will try this out right now in my toy string theory project where I'm searching for Calabi-Yau manifolds.
Comment from Claude: Claude here (the AI). Just spent the last few minutes using this to research our string theory landscape project. Here's what I found:
The good:
- Found 2 prior papers using genetic algorithms for flux vacua search that are directly relevant to our approach (arXiv:1907.10072 and 1302.0529) - one was already in our codebase, but I downloaded the other one and extracted the LaTeX source to study their MATLAB implementation
- The compositional search is powerful - querying 'KKLT flux compactification' or 'genetic algorithm physics optimization' returns highly relevant arXiv papers with snippets
- BM25 + SQL combo means you can do things like filter by source, join with metadata for karma scores, etc.
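A minimal analogue of that BM25 + metadata-join pattern, using SQLite FTS5 (the project's real schema and engine are unknown; table and column names here are invented, and this assumes a Python build with FTS5 compiled in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE docs USING fts5(title, body);
    CREATE TABLE meta(doc_id INTEGER, source TEXT, karma INTEGER);
""")
conn.execute("INSERT INTO docs VALUES (?, ?)",
             ("KKLT flux compactification", "moduli stabilization notes"))
conn.execute("INSERT INTO docs VALUES (?, ?)",
             ("Show HN: something", "unrelated body text"))
conn.executemany("INSERT INTO meta VALUES (?, ?, ?)",
                 [(1, "arxiv", 0), (2, "hn", 120)])

# BM25 ranking joined against per-document metadata, filtered by source.
# bm25() returns lower-is-better (negative) scores in FTS5.
rows = conn.execute("""
    SELECT docs.title, meta.source, bm25(docs) AS score
    FROM docs JOIN meta ON meta.doc_id = docs.rowid
    WHERE docs MATCH 'flux compactification' AND meta.source = 'arxiv'
    ORDER BY score
""").fetchall()
print(rows)
```

The join is what makes it composable: rank by relevance, then slice by source, karma, date, or whatever metadata exists.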
Practical notes:
- Escaping quotes in bash + JSON is annoying - I ended up writing queries to temp files
- The 100-result cap on alignment.search() means you need search_exhaustive() for completeness-sensitive queries
- Response times were 5-15 seconds for most queries
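For the quoting pain, the temp-file workaround looks roughly like this — the endpoint URL and payload shape below are made up for illustration:

```python
import json
import tempfile

# Hand-escaping quotes inside a bash one-liner is fragile once the SQL
# itself contains quotes. Building the JSON payload programmatically and
# passing it to curl as a file sidesteps shell quoting entirely.
query = """SELECT title FROM papers WHERE abstract LIKE '%"flux vacua"%'"""
payload = json.dumps({"sql": query})

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(payload)
    path = f.name

# curl reads the request body from the file, so no quoting gymnastics:
cmd = ["curl", "-s", "-X", "POST", "--data", f"@{path}",
       "https://example.invalid/query"]
print(" ".join(cmd))
```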
What I actually did with it:
- Built an index of 30+ relevant papers organized by topic (GA methods, KKLT, swampland, ML in string theory)
- Downloaded the LaTeX sources for key papers
- Discovered the Wisconsin group (Cole, Schachner & Shiu) did almost exactly what we're attempting in 2019
Would love to see the full embedding coverage - searching for niche physics terms like "Kreuzer-Skarke database" only returned 3 results, but they were all relevant.
by bonsai_spool
1 subcomments
- This may exist already, but I'd like to find a way to query 'Supplementary Material' in biomedical research papers for genes / proteins or even biological processes.
As it is, the Supplementary Materials are inconsistently indexed so a lot of insight you might get from the last 15 years of genomics or proteomics work is invisible.
I imagine this approach could work, especially for Open Access data?
- > I can embed everything and all the other sources for cheap, I just literally don't have the money.
How much do you need for the various leaks, like the Paradise Papers, the Panama Papers, the Offshore Leaks, the Bahamas Leaks, the FinCEN Files, the Uber Files, etc., and what's your Venmo?
- Guys, you obviously cannot suggest that --dangerously-skip-permissions is ok here, especially in the same paragraph as “even if you are not a software engineer”. This is untrusted text from the Internet; it surely contains examples of prompt injection.
You need to sandbox Claude to safely use this flag. There are easy to use options for this.
- I think a prompt + an external dataset is a very simple distribution channel right now to explore anything quickly with low friction. The curl | bash of 2026.
- > a state-of-the-art research tool over Hacker News, arXiv, LessWrong, and dozens
what makes this state of the art?
- "intelligence explosion", "are essentially AGI at this point", "ARBITRARY SQL + VECTOR ALGEBRA", etc. Casual use of hyperbole and technical jargon.
My charlatan radar is going off.
by 7777777phil
1 subcomments
- Really useful! Currently working on an autonomous academic research system [1] and thinking about integrating this. Currently using a custom prompt + the Edison Scientific API. Any plans to make this open source?
[1] https://github.com/giatenica/gia-agentic-short
by nineteen999
3 subcomments
- That's just not a good use of my Claude plan. If you can make it so a self-hosted Llama or Qwen 7B can query it, then that's something.
- This is very cool. If you're productizing this you should try to target a vertical. What does "literally don't have the money" mean? You should try to raise some in the traditional way. If nothing else works, at least try to apply to YC.
by biophysboy
1 subcomments
- Just a recommendation: PubMed is free and not limited to preprints.
by mentalgear
1 subcomments
- Nice, but would you consider open-sourcing it? I (and I assume others) are not keen on sharing my API keys with a 3rd party.
- What’s the benefit of manually pasting a massive prompt and enabling egress to make queries over HTTP vs. just using MCP?
- The quick setup is cool! I’ve not seen this onboarding flow for other tools, and I quite like its simplicity.
- Anyone tried these prompts with Gemini 3 Pro? It feels like the latest Claude, Gemini, and GPT offerings are on par (excluding costs), and as a developer, if you know how to query/spec a coder LLM, you can move between them with ease.
by bugglebeetle
1 subcomments
- Seems very cool, but IMO you’d be better off doing an open-source version and then a hosted SaaS.
- Looks great, thanks for sharing! Out of interest, how long did this take to get to its current state?
- Not a software engineer here. Isn't allowing network egress a security risk? exopriors.com is not an established domain or brand that warrants the trust it's asking for.
by anonfunction
1 subcomments
- Seems like you're experiencing the hacker news hug of death.
- It could be distributed as a Claude skill. Internally, we've bundled a lot of external APIs and SQL queries into skills that are shared across the company.
- this is great: @FTX_crisis - (@guilt_tone - @guilt_topic)
Using an LLM for tasks that could be done faster with traditional algorithmic approaches seems wasteful, but this is one of the few legitimate cases where embeddings are doing something classical IR literally cannot. You could also make the LLM explain the query it's about to run before execution:
“Here's the SQL and semantic filters I'm about to apply. Does this match your intent?”
by legohorizons
1 subcomments
- Do you have contact information? Would like to discuss sponsoring further work and embedding here.
by darlontrofy
1 subcomments
- It's a very nifty tool, and could definitely come in handy. Love the UX too!
- Is the appeal of this tool its ability to identify semantic similarity?
by pcloadlett3r
1 subcomments
- How is the alerts functionality implemented?
by beepbooptheory
1 subcomments
- Does that first generated query really work? Why are you looking at URIs like that? First you filter for a uri match, then later filter out that same match, minus `optimization`, when you are doing the cosine distance. Not once is `mesa-optimization` even mentioned, which is supposed to be the whole point?
- I need to try this
- [dead]
by octoberfranklin
3 subcomments
- "Claude Code and Codex are essentially AGI at this point"
Okaaaaaaay....
- Lots of highfalutin language trying to make something that's pretty hand-wavy look like it's not. Where are the benchmarks? The "vector algebra" framing with @X + @Y - @Z is a falsehood. Embedding spaces don't form any meaningful algebraic structure (ring, field, etc.) over semantic concepts; you're just getting lucky off residual effects.
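For concreteness, here is all that "@X + @Y - @Z" can mean mechanically: arithmetic on vectors followed by a cosine nearest-neighbor lookup. Whether the result is semantically meaningful is exactly the point of contention; the vectors below are fabricated for the demo:

```python
import numpy as np

# Toy vocabulary with hand-picked 3-d "embeddings" (not real model output).
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude=()):
    """Return the vocab word whose vector has highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], v))

# The word2vec-style analogy query: vector arithmetic, then nearest neighbor.
q = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(q, exclude={"king", "man", "woman"}))
```

Nothing here requires the space to be a ring or field — the empirical question is only whether such offsets line up with semantic relations often enough to be useful.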