Hacker News

Language Support for Marginalia Search

176 points by Bogdanp

by ofalkaed

1 subcomments

Surprisingly informative for what is pretty much a press release, learned a good deal about search engines.

by mariusor

1 subcomments

Off topic, but would there be a way to integrate Marginalia with a specific website, similar to how people use Google search for their forums or how HN uses Algolia?
I'm asking because one of my projects is a link aggregator similar to old Reddit (and HN to some extent), and I would like to present users with a search box without having to implement document indexing and search. (I assume ad principio that the website is already aligned ethically and technologically with what Marginalia stands for :D)

by smoghat

3 subcomments

I’m a little confused by Marginalia. I looked to find out what its purpose was, but couldn’t find it. My bad, I guess, but then again I’m not a search engine. It is pretty cool for a DIY project, but the results were really off, especially for searches for individuals. Take Ezra Klein as an example. Sure, there is a link to his show from castbox, a service I have never heard of, and then a bunch of anti-Ezra Klein articles. Wikipedia shows up, and the last link on the first page is to Abundance. But no NYT? That seems like a big problem. I thought I’d look up Daring Fireball, and the only link to his site was a ways down and pointed to a list of links from 2008. These are just two random searches. I did others, starting with myself, and my results were similar.
Likely I am totally not understanding what this search engine is for. I see this a lot on submissions here. I find something interesting sounding but I don’t understand the context. Maybe it’s just me, but it’s confusing.

by atombender

1 subcomments

> Thankfully the BM-25 model used in ranking is robust to this, as it relies on live data from the index itself.
I'm confused by this. TF-IDF incorporates the document frequency of each term (the IDF part), which search engines precompute for the index as a whole. But so does BM25; its IDF formula is slightly different, but it also relies on document frequencies. What's the difference?
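To make the comparison concrete, here is a minimal sketch of the two IDF formulas and a BM25 score over a toy in-memory corpus. The corpus, the function names, and the parameter values k1=1.2 and b=0.75 are illustrative assumptions, and the BM25 IDF shown is the common non-negative variant; none of this is taken from Marginalia's code.

```python
import math
from collections import Counter

def tfidf_idf(n_docs, doc_freq):
    # Classic inverse document frequency: log(N / df).
    return math.log(n_docs / doc_freq)

def bm25_idf(n_docs, doc_freq):
    # A common BM25 IDF variant; the +1 inside the log keeps it non-negative.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    # corpus is a list of tokenized documents; doc is one tokenized document.
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        # Document frequency is computed from the corpus at query time here.
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        f = tf[term]
        # Term-frequency saturation (k1) and length normalization (b).
        norm = f + k1 * (1 - b + b * len(doc) / avg_len)
        score += bm25_idf(n_docs, df) * f * (k1 + 1) / norm
    return score

# Hypothetical toy corpus, already tokenized.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]
```

Both formulas take the same two inputs, the number of documents and the number of documents containing the term, so in this sketch the differences are only in the shape of the IDF curve and in BM25's saturation and length normalization.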

by vintermann

1 subcomments

This is never going to work. The author is apparently against AI in search in favor of "simplicity", but this sort of thing
> Sentences are stemmed and POS-tagged. Sentences, with stemming and POS-tag data is fed into keyword extraction algorithms
IS AI, it's just old-fashioned and bad AI. What he's trying will never work well, for the same reason rule-based machine translation never worked well: there are just too many rules and exceptions. Simplicity is great when you can have it, but with human language, simplicity was never on the table.
He's going to have to bite the bullet and use document embedding models sooner or later.

by reedf1

1 subcomments

Took me too long to realize this wasn't a tool to search for marginalia in scanned manuscripts.

by internet_points

1 subcomments

What tools/data do you use for POS tagging? I'm guessing it has to be fast, to run without a Google data center :)

by juliend2

0 subcomment

I remember asking you for this, so thank you so much! It works quite well from what I can see.
Small UI issue: on desktop, the left sidebar should be scrollable, because right now on Firefox I can't reach the "Language" menu item in the search results view unless I zoom out.