Under the hood, this model resembles LDA, but replaces its Dirichlet priors with Pitman–Yor Processes (PYPs), which better capture the power-law behavior of word distributions. It also supports arbitrary hierarchical priors, allowing metadata-aware modeling.
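To make the power-law point concrete, here's a toy sketch (plain Python, not the product's actual code) of the Pitman–Yor Chinese Restaurant Process: with a positive discount, the cluster sizes it produces are heavy-tailed and roughly power-law, while a discount of zero recovers the ordinary Dirichlet/CRP behavior.

```python
# Toy Pitman-Yor CRP sampler (illustrative only, not the library's API).
# discount > 0 gives power-law cluster sizes; discount = 0 is the plain CRP.
import random


def pitman_yor_crp(n_tokens, discount=0.5, strength=1.0, seed=0):
    """Return cluster ("table") sizes after seating n_tokens customers."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_tokens):
        # Probability of opening a new table grows with the number of
        # existing tables when discount > 0, producing a heavy tail.
        p_new = (strength + discount * len(tables)) / (strength + n)
        if not tables or rng.random() < p_new:
            tables.append(1)
        else:
            # Join an existing table with probability proportional to
            # (size - discount): popular tables stay popular.
            weights = [size - discount for size in tables]
            r = rng.random() * sum(weights)
            for k, w in enumerate(weights):
                r -= w
                if r <= 0:
                    tables[k] += 1
                    break
            else:
                tables[-1] += 1  # guard against floating-point round-off
    return sorted(tables, reverse=True)


if __name__ == "__main__":
    pyp = pitman_yor_crp(10_000, discount=0.7)   # heavy-tailed sizes
    crp = pitman_yor_crp(10_000, discount=0.0)   # plain CRP for comparison
    print("PYP:", len(pyp), "clusters, top sizes", pyp[:5])
    print("CRP:", len(crp), "clusters, top sizes", crp[:5])
```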
For example, in an earnings-transcript corpus, a typical LDA might have a flat structure: Prior → Document
Our model instead uses a hierarchical graph: Uniform Prior → Global Topics → Ticker → Quarter → Paragraph
This hierarchical structure, combined with the PYP statistics, consistently yields more coherent and fine-grained topic structures than standard LDA does. There's also a "fast mode" that collapses some hierarchy levels for quicker runs; it's a handy option if you're curious to see the impact the hierarchy has on the results, or if you're simply in a rush.
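The key mechanic behind a hierarchy like this is back-off: each node's word distribution is smoothed toward its parent's, so a paragraph backs off to its quarter, then its ticker, then the global topics, and finally a uniform prior. Below is a minimal, generic hierarchical-PYP back-off sketch written under that assumption; the class names and the count-propagation shortcut are illustrative simplifications, not the library's implementation.

```python
# Generic hierarchical PYP back-off sketch (hypothetical names, not the product's code).
from collections import Counter


class PYPNode:
    """One node in a prior hierarchy; its word distribution backs off to its parent."""

    def __init__(self, parent=None, discount=0.5, strength=1.0, vocab_size=50_000):
        self.parent = parent
        self.d = discount          # PYP discount (0 recovers Dirichlet-style smoothing)
        self.theta = strength      # PYP strength / concentration
        self.counts = Counter()    # word -> count observed at this node
        self.vocab_size = vocab_size

    def observe(self, word):
        # Simplification: push every count up the hierarchy. A real PYP sampler
        # only promotes a fraction of counts ("tables") to the parent.
        self.counts[word] += 1
        if self.parent is not None:
            self.parent.observe(word)

    def prob(self, word):
        total = sum(self.counts.values())
        types = len(self.counts)   # distinct words seen here (~ number of tables)
        backoff = (self.parent.prob(word) if self.parent
                   else 1.0 / self.vocab_size)   # uniform prior at the root
        if total == 0:
            return backoff
        # PYP predictive form: discounted local count plus back-off mass
        # routed to the parent distribution.
        discounted = max(self.counts[word] - self.d, 0.0) / (self.theta + total)
        backoff_mass = (self.theta + self.d * types) / (self.theta + total)
        return discounted + backoff_mass * backoff


# Hypothetical hierarchy mirroring the post:
# Uniform -> Global Topics -> Ticker -> Quarter -> Paragraph
global_topics = PYPNode()
ticker = PYPNode(parent=global_topics)
quarter = PYPNode(parent=ticker)
paragraph = PYPNode(parent=quarter)

for w in "revenue guidance margin revenue buyback".split():
    paragraph.observe(w)

print(paragraph.prob("revenue"))   # boosted by local counts
print(paragraph.prob("ebitda"))    # falls back through the hierarchy to uniform
```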
I did a Google search for "camping with dogs" and it organized the results into a set of roughly 30 spanning everything I'd want to know on the topic, from safety and policies to products and travel logistics.
Does this work on any type of data?
https://sturdystatistics.com/deepdive?fast=0&q=reinforcement...
I think only about 1 in 10 of the articles is really on topic.
BTW, the circular graphics of the results are really cool! How did you do this?