FRESH

Hacker News

Home

Waiting for dawn in search: Search index, Google rulings and impact on Kagi

480 points by josephwegner

by ghm2199

10 subcomments

> Building a comparable one from scratch is like building a parallel national railroad..
Not too be pedantic here but I do have a noob question or two here:
1. One is building the index, which is a lot harder without a google offering its own API to boot. If other tech companies really wanted to break this monopoly, why can't they just do it — like they did with LLM training for base models with the infamous "pile" dataset — because the upshot of offering this index for public good would break not just google's own monopoly but also other monopolies like android, which will introduce a breath of fresh air into a myriad of UX(mobile devices, browsers, maps, security). So, why don't they just do this already?
2. The other question is about "control", which the DoJ has provided guidance for but not yet enforced. IANAL, but why can't a state's attorney general enforce this?

by WhyNotHugo

4 subcomments

The statistics in this article sound like garbage to me.
Google used by 90% or the world?
~20% of the human population lives in countries where Google is blocked.
OTOH, Baidu is the #1 search engine in China, which has over 15% of the world’s population… but doesn’t reach 1%?
These stats are made measuring US-based traffic, rather than “worldwide” as they claim.

by pfist

1 subcomments

I am rooting for Kagi here, and I applaud their transparency on such matters. It is quite enlightening for someone like me who understands technology but knows little about the inner workings of search.
It remains to be seen how or if the remedies will be enforced, and, of course, how Google will choose to comply with them. I am not optimistic, but at least there is some hope.
As an aside: The 1998 white paper by Brin and Page is remarkable to read knowing what Google has become.

by whs

0 subcomment

>Google: Google does not offer a public search API. The only available path is an ad-syndication bundle with no changes to result presentation - the model Startpage uses. Ad syndication is a non-starter for Kagi’s ad-free subscription model.[^1]
>Because direct licensing isn’t available to us on compatible terms, we - like many others - use third-party API providers for SERP-style results (SERP meaning search engine results page). These providers serve major enterprises (according to their websites) including Nvidia, Adobe, Samsung, Stanford, DeepMind, Uber, and the United Nations.
The customer list matches what is listed on SerpAPI's page (interestingly, DeepMind is on Kagi's list while they're a Google company...). I suppose Kagi needs to pen this because if SerpAPI shuts down they may lose access to Google, but they may already have utilize multiple providers. In the past, Kagi employees have said that they have access to Google API, but it seems that it was not the case?
As a customer, the major implication of this is that even if Kagi's privacy policy says they try to not log your queries, it is sent to Google and still subject to Google's consumer privacy policy. Even if it is anonymized, your queries can still end up contributing to Google Trends.

by xnx

9 subcomments

> Because direct licensing isn’t available to us on compatible terms, we - like many others - use third-party API providers for SERP-style results
Crazy for a company to admit: "Google won't let us whitelabel their core product so we steal it and resell it."

by ajdude

13 subcomments

Does anyone else use the phrase "I'm going to google XYZ" while referring to actually searching it up on Kagi, DDG, or another search engine?

by sabslikesobs

0 subcomment

I like that there's a list of primary sources at the bottom.
Kagi's AI assistant has been satisfying compared to Claude and ChatGPT, both of which insisted on having a personality no matter what my instructions said. Trying to do well-sourced research always pissed me off. With Kagi it gives me a summary of sources it's found and that's it!

by 1vuio0pswjnm7

0 subcomment

Google has appealed and moved for a partial stay re: the remedies discussed in this blog post
https://storage.courtlistener.com/recap/gov.uscourts.dcd.223...
Will Kagi file an amicus brief in support of the plaintiffs
Perhaps Google will fund amici in support of their position as they did in the Epic appeal
https://www.law.com/nationallawjournal/2025/01/10/fight-over...

by ApolloFortyNine

1 subcomments

With Google's search engine making almost $200 billion a year in revenue, I'm not sure Kagi could afford what market rates would be here. They also spent billions developing the technology to crawl, index, and rank billions of pages, factoring that in, again I don't think a good price can be put on it.
What even is market rate? Kagi themselves admits there's no market, the one competitor quit providing the service.
Obviously Google doesn't want to become an index provider.

by thisislife2

4 subcomments

> Layer 3: Paid, subscription-based search
Should actually be - Layer 3: Paid, ad-free, subscription-based search. (It's a subtle omission that indicates the direction Kagi search will eventually take).

by Nevermark

1 subcomments

> A government-backed, ad-free, intermediary-free, taxpayer-funded search service providing baseline, non-discriminatory access to information. Imagine search.org.
There is no way the government provides a search engine that doesn’t become a political football or weapon.
Maybe in a different age.
I completely agree that monopoly remedies, such as fair open paid licensing, are needed. I prefer that to breakups, when this kind of cooperative/competitive leveling works.

by cush

0 subcomment

The idea of a search index being a public utility is an interesting idea but I’m not sure what it would do for trust. Governance is the biggest question mark, and with the current administration I’d say let Google run it and have less restrictive access to the index. My Google search usage has dropped probably 99% over the last two years.
My hope is that the powers that be figure out how to monetize these products with dollars instead of attention. Google’s ad-driven business model ruined the internet - we don’t need that in our AI products too.

by adsharma

0 subcomment

Why didn't I see anything about common crawl?
Exa, Parallel and a whole bunch of companies doing information retrieval under the "agent memory" category belong to this discussion.

by direwolf20

3 subcomments

I hope they cache search results to further reduce the number of calls to Google.
And Marginalia Search was not mentioned? Marginalia Search says they are licensing their index to Kagi. Perhaps it's counted under "Our own small-web index" which is highly misleading if true.

by jeffbee

0 subcomment

"We will simply access the index" has always struck me as wild hand-waving that would instantly crumble at first contact with technical reality. "At marginal cost" is doing a huge amount of work in this article.

by stacktraceyo

1 subcomments

Is there a crowd indexed style search index? Like instead of relying on the crawling completely you rely on a maybe like an extension in your browser that indexes as people are using their browser. Or maybe indexing your site to this index instead of waiting to be crawled.

by senko

2 subcomments

A full up-to-date index of the searchable web should be a public commons good.
This would not only allow better competition in search, but fix the "AI scrapers" problem: No need to scrape if the data has already been scraped.
Crawling is technically a solved problem, as witnessed by everyone and their dog seemingly crawling everything. If pooled together, it would be cheaper and less resource intensive.
The secret sauce is in what happens afterwards, anyway.
Here's the idea in more detail: https://senkorasic.com/articles/ai-scraper-tragedy-commons
I'm under no illusion something like that will happen .. but it could.

by luk4

0 subcomment

I think it's worth mentioning the Open Web Search initiative [1] and the Open Web Index [2] specifically.
> 14 renowned European research and computing centers have joined forces to develop an open European infrastructure for web search. The initiative is contributing to Europe’s digital sovereignty as well as promoting an open human-centered search engine market. [1]
> The Open Web Index (OWI) is a European open source web index pilot that is currently in Beta testing phase. The idea: Collaboratively and transparently secure safe, sovereign and open access to the internet for European organisations and civil society. The index stores well structured open web data, making it available for search applications and LLMs. [3]
[1] https://openwebsearch.eu/
[2] https://openwebindex.eu/
[3] https://openwebsearch.eu/open-webindex/

by stephen_cagle

1 subcomments

One interesting point was the original PageRank algorithm greatly benefited from the fact that we kinda only had "text matching" search before Google (my memory was AltaVista at the time).
Because text matching was so difficult to search with, whenever you went to a site, it would often have a "web of trust" at the bottom where an actual human being had curated a list of other sites that you might like if you liked this site.
So you would often search with keywords (often literals), then find the first site, then recursively explore the web of trust links to find the best site.
My suspicion has always been that Google (PageRank) benefited greatly from the human curated "web of trust" at the bottom of pages. But once Google came out, search was much better, and so human beings stopped creating "web of trust" type things on their site.
I am making the point that Google effectively benefited from the large amount of human labor put into connecting sites via WOT, while simultaneously (inadvertently) destroying the benefit of curating a WOT. This means that by succeeding at what they did, they made it much more difficult for a Google#2 to come around and run the exact same game plan with even the exact same algorithm.
tldr; Google harvested the links that were originally curated by human labor, the incentive to create those links are gone now, so the only remaining "links" between things are now in the Google Index.
Addendum: I asked claude to help me think of a metaphor, and I really liked this one as it is so similar.
``` "The railroad and the wagon trails"
Before railroads, collective human use created and maintained wagon trails through difficult terrain. The railroad company could survey these trails to find optimal routes. Once the railroad exists, the wagon trails fall into disuse and the pathfinding knowledge atrophies. A second railroad can't follow trails that are now overgrown. ```

by user3939382

1 subcomments

For anyone not acquainted Kagi is excellent and the people who work there strike me as nice and competent. I’m a harsh critic usually. Highly recommended.

by zvqcMMV6Zcr

0 subcomment

Recently I encounter "no results" screen when using Google that I am starting to suspect the problem will solve itself. And by solve I mean open parts of internet will die off completely, and only owners of silos like Facebook will be able to provide data for search indexes.

by keeda

2 subcomments

Google's advantage is not just in its index and algorithms, it is that it has built a self-reinforcing flywheel that data mines human attention at massive scale to improve their search results.
This comment (https://news.ycombinator.com/item?id=46709957) points out that Google got its start via PageRank, which essentially ranked sites based on links created by humans. As such, its primary heuristic was what humans thought was good content. Turns out, this is still how they operate.
Basically, as people search and navigate the results, Google harvests their clicks, hovers, dwell-time and other browsing behavior -- i.e. tracking what they pay attention to -- to extract critical signals to "learn" which pages the users actually found useful for the given query. This helps it rank results better and improve search overall, which keeps people coming back, which in turns gives them more queries and data, which improves their results... a never-ending flywheel.
And competitors have no hope of matching this, because if you look at the infrastructure Google has built to harvest this data, it is so much bigger than the massive index! They harvest data through Chrome, ad tracking, Android, Google Analytics, cookies (for which they built Gmail!), YouTube, Maps, and so much more. So to compete with Google Search, you don't need just a massive index, you also need the extensive web infra footprint to harvest user interactions at massive scale, meaning the most popular and widely deployed browser, mobile OS, ad footprint, analytics, email provider, maps...
This also explains why Google spends so many billions in "traffic acquisition costs" (i.e. payments for being the Search default) every year, because that is a direct driver to both, 1) ad revenue, and 2) maintaining its search quality.
This wasn't really a secret, but it turned out to be a major point in the recent Antitrust trial, which is why the proposed remedies (as TFA mentions) include the sharing of search index and "interaction data."
We all knew "if you're not paying for it, you're the product" but the fascinating thing with Google is:
- They charge advertisers to monetize our attention;
- They harvest our attention to better rank results;
- They provide better results, which keeps us coming back, and giving them even more of our attention!
Attention is all you need, indeed.

by jxmesth

3 subcomments

Honestly, would be very cool if someone could make a search engine of only human-produced content. I know it's going to be hard and compute intensive but I don't think it's impossible. In fact, Google could do it. A paid service for only human made content. Obviously there would be a margin of error as we can never be 100% sure if something really is AI written.

by zhfanlqeo

1 subcomments

I used kagi for a while but got lazy with updating the subscription when moving and needing to change credit cards so I went back to DDG/Google and having to go back to having to skip the first result or first few results shows you just how obnoxious this practice is. When I have a few moments I'll resubscribe to kagi...

by yomismoaqui

1 subcomments

One thing I have discovered after using AI chats that include a websearch tool is that I don't want to delve on diferent blogs, Medium posts, Stack overflow threads with passive-aggresive mod comments, dismissing cookie banners... Sorry I just want the info I'm looking for, I don't care for your personal expression or need to monetize your content.
There are other times (usually not work related) when I want to explore the web and discovering some nice little blog or special corner on the net. This is what my RSS feed reader is for.

by weisnobody

0 subcomment

I think the crawled data should have to be shared, but I'm not convinced that Google should have to share their index.
It may be impracticable to share the crawled data, but from the stand point of content providers, having a single entity collecting the information (rather than a bunch of people doing) would seem to be better for everyone. Likely need to have some form of robots.txt which would allow the content provider to indicate how their content could be used (i.e research, web search, AI, etc.).
The people accessing the crawled data would end up paying (reasonable) fees to access the level of data they want, and some portion of that fee would go to the content provider (30% to the crawler and 70% to the crawler? :P maybe).
Maybe even go so far as to allow the Paywalled content providers to set a price on accessing their data for the different purposes. Should they be allowed to pick and choose who within those types should be allowed (or have it be based on violations of the terms of access)
It seems in part the content providers have the following complaints:
```
  * Too many crawlers (see note below re crawlers)
  * Crawlers not being friendly
  * Improper use of the crawled data
  * Not getting compensated for their content
```
Why not the index? The index, to me, is where a bunch of the "magic" happens and where individual companies could differentiate themselves from everyone else.
Why can't Microsoft retain Bing traffic when it's the default on stock Windows installs?
```
  * Do they not have enough crawled data?  
  * Their index isn't very good?
  * Their searching their index isn't good
  * The way they present the data is bad?
  * Google is too entrenched?
  * Combination of the above?
```
There are several entities intending to crawl all / large portions of the Internet: Baidu, Bing, Brave, Google, DuckDuckGo, Gigablast, Mojeek, Sogou and Yandex [1]. That does not include any of the smaller entities, research projects, etc.
[1] https://en.wikipedia.org/wiki/Search_engine#2000s–present:_P... (2019)

by hsuduebc2

1 subcomments

It is even worse that the Google search become shit in last years. So they gate keep only relevant information for themselves and not using them with intent to improve search quality. As always if you have no competition your innovation goes only towards cost reduction. Not product improvement.

by jiehong

1 subcomments

I think one side problem is that part of the web is not even searchable with a search engine.
Here are some examples:
- Discord
- WeChat (is it the web?)
- Rednote
- TikTok (partially)
- X (partially)
- JSTOR (it finds daily, but you find more stuff on the website directly)
- any stuff with a login, obviously.

by sharpshadow

1 subcomments

If Google provides a Search Index it will be the censored version therefore still politically acceptable. The “Layer 1” idea will not happen.

by WhereIsTheTruth

5 subcomments

Kagi's "waiting for dawn" is just waiting for Google to legitimize their reseller business
Meanwhile, users pay a premium to pretend they're not using Google
Fascinating delusion

by 1970-01-01

0 subcomment

>The problem: A search monopoly
...
>We tried to do it the right way
This sign-up to retrieve better information idea will never take-off the way they think it will. A white label search will get you nowhere. They are silently failing because they're just too stubborn to do it the hard way. Kagi needs to pivot and succeed on useful and interesting edge cases first. Build us out a subject-relevant search, such as displaying vetted content from forums when searching a product/service, and then tying it into Facebook Marketplace for local items or services and Amazon for new. That is called building a product for yourself that others will use. Now you have your very own cashflow for clicks; use that cashflow to buy more corporate access, thereby proving you can succeed without any other search business propping you up and into relevancy. You don't need to start with the giants either. Start with something that works on local hunting, fishing, shooting, and knitting forums. When grandmothers need high quality green yarn today, make their muscle memory point to Kagi local, not Google.

by maelito

1 subcomments

We need a european Kagi.

by grayhatter

0 subcomment

Kagi uses Brave search index? huh, TIL... that's very disappointing. And it's the kinda thing that would prevent me from ever paying for Kagi. Brave's crawler, is agressive, dumb (it doesn't appear to back off if it hits a number of 503s), and critically, it ignores robots.txt. They even admit they choose to ignore it. To top that off their crawler doesn't identify itself, instead masquerading as a real browser. I've had to ban the entire Hetzner ASN from my site to get them to stop.
On one hand, I really want Kagi to succeed. They very often, do seem to care about the parts of the world and internet that I care about. But on the other... to me, willingly associating, and financing a company that willingly brags about ignoring consent, is a non-starter for me.

by nige123

0 subcomment

The user data (anonymised) and analytics also needs to be shared.

by echelon

0 subcomment

If there are any Kagi folks here, I've come up with a new angle to attack Google's anti-competitive position that could be incredibly effective:
https://news.ycombinator.com/item?id=46681985
https://news.ycombinator.com/item?id=44546519
I'm going to send this idea to my legislators, the EU, Sam Altman, Tim Sweeny, and Elon Musk, et al., I just haven't had time to put this together yet.
Google is a monopolist scourge and needs to be knocked down a peg or two.
This should also apply to the iPhone and Android app stores.

by ares623

2 subcomments

Kagi should start building an index of sites that are trying to escape the current slop internet. It’s know they have the Small Web thing. But I’d like to see an index of a “neo internet” that blocks Google et al.

by OGEnthusiast

2 subcomments

Sounds like we need a nationalized search engine company then?

by bilekas

0 subcomment

I've tried Kagi and while it is better than google these days, to be fair that's not hard with the enshitification slop that's out there.
But Kagi funds Yandex which fund the RU government, and I think it should be known to anyone looking to use it.
https://ounapuu.ee/posts/2025/07/17/kagi/
https://kagifeedback.org/d/5445-reconsider-yandex-integratio...

by ssoid

0 subcomment

[dead]

by the_arun

6 subcomments

If google is serving 90% traffic & others are unable to enter - Doesn't that mean google is doing something right for the customer and others are unable to outcompete it? Isn't this how life works?