FRESH

Hacker News

Home

We can't have nice things because of AI scrapers

461 points by LorenDB

by dannyobrien

7 subcomments

Metabrainz is a great resource -- I wrote about them a few years ago here: https://www.eff.org/deeplinks/2021/06/organizing-public-inte...
There's something important here in that a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.
It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."
Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.

by akuyou

2 subcomments

AI is destroying the free internet along with everything else
My web host suspended my website account last week due to a sudden large volume of requests to it - effectively punishing me for being scraped by bots.
I've had to move to a new host to get back up, but what hope does the little guy have? it's like GPU and ram prices, it doesn't matter if I pay 10x 100x or 1000x more than I did, the AI companies have infinite resources, and they don't care what damage they do in the rush to become the no 1 in the industry
The cynic in me would say it's intentional, destroy all the free sites so you have to get your info from their ai models, price home users out of high end hardware so they have to lease the functions from big companies

by Jeremy1026

1 subcomments

I sysadmin my kids' PTA website. OpenAI was scraping it recently. I saw it looking at the event calendar, request after request to random days. I saw years 1000 through 3000 scroll by. I changed the response to their user agent to an access denied, but it still took about 4 hours for them to stop.

by chlorion

3 subcomments

I self host a small static website and a cgit instance on an e2-micro VPS from Google Cloud, and I have got around 8.5 million requests combined from openai and claude over around 160 days. They just infinitely crawl the cgit pages forever unless I block them!
```
    (1) root@gentoo-server ~ # egrep 'openai|claude' -c /var/log/lighttpd/access.log
    8537094
```
So I have lighttpd setup to match "claude|openai" in the user agent string and return a 403 if it matches, and a nftables firewall seutp to rate limit spammers, and this seems to help a lot.

by SchemaLoad

11 subcomments

Cloudflare has a service for this now that will detect AI scrapers and send them to a tarpit of infinite AI generated nonsense pages.

by tensegrist

7 subcomments

the more time passes the more i'm convinced that the solution is to—somehow—force everyone to have to go through something like common crawl
i don't want people's servers to be pegged at 100% because a stupid dfs scraper is exhaustively traversing their search facets, but i also want the web to remain scrapable by ordinary people, or rather go back to how readily scrapable it used to be before the invention of cloudflare
as a middle ground, perhaps we could agree on a new /.well-known/ path meant to contain links to timestamped data dumps?

by randyl

2 subcomments

The SQLite team faced a similar problem last year, and Richard Hipp (the creator of SQLite) made almost the same comment:
"The malefactor behind this attack could just clone the whole SQLite source repository and search all the content on his own machine, at his leisure. But no: Being evil, the culprit feels compelled to ruin it for everyone else. This is why you don't get to keep nice things...."
https://sqlite.org/forum/forumpost/7d3eb059f81ff694

by falloutx

3 subcomments

Its not just AI scrappers doing it by themselves but now users are also being trained to put the link in the claude chat/chat gpt and ask it to summarise it. And off course that would show up on the website end as a scraper.
In fact firefox now allows you to preview the link and get key points without ever going to the link[1]
[1] https://imgur.com/a/3E17Dts

by arjie

2 subcomments

Someone convinced me last time[0] that these aren't the well-known scrapers we know but other actors. We wouldn't be able to tell, certainly. I'd like to help the scrapers be better about reading my site, but I get why they aren't.
I wish there were an established protocol for this. Say a $site/.well-known/machine-readable.json that instructs you on a handful of established software or allows pointing to an appropriate dump. I would gladly provide that for LLMs.
Of course this doesn't solve for the use-case where the AI companies are trying to train their models on how to navigate real world sites so I understand it doesn't solve all problems, but one of the things I think I'd like in the future is to have my own personal archive of the web as I know it (Internet Archive is too slow to browse and has very tight rate-limits) and I was surprised by how little protocol support there is for robots.
robots.txt is pretty sparse. You can disallow bots and this and that, but what I want to say is "you can get all this data from this git repo" or "here's a dump instead with how to recreate it". Essentially, cooperating with robots is currently under-specified. I understand why: almost all bots have no incentive to cooperate so webmasters do not attempt to. But it would be cool to be able to inform the robots appropriately.
To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.
0: https://news.ycombinator.com/item?id=46352723

by k310

0 subcomment

Metabrainz hosed?
> There has been a critical error on this website.
> Learn more about troubleshooting WordPress.
https://blog.metabrainz.org/2025/12/11/we-cant-have-nice-thi...

by bodantogat

2 subcomments

I feel the pain — it’s very difficult to detect many of the less ethical scrapers. They use residential IP pools, rotate IPs, and provide valid user agents.

by jmward01

1 subcomments

Bummer. I have used them a lot when I was ripping my cds. Anonymity is a massive value of the web (at least the appearance of anonymity). I wonder if there is a way to have a central anonymous system that just relays trust, not identity.
So maybe something like you can get a token but its trust is very nearly zero until you combine it with other tokens. Combining tokens combines their trust and their consequences. If one token is abused that abuse reflects on the whole token chain. The connection can be revoked for a token but trust takes time to rebuild so it would take a time for their token trust value to go up. Sort of the 'word of mouth' effect but in electronic form. 'I vouch for 2345asdf334t324sda. That's a great user agent!'
A bit (a lot) elaborate but maybe there is a beginning of an idea there, maybe. Definitely I don't want to loose anonymity (or the perception there of) for services like musicbrainz but at the same point they need some mechanism that gives them trust and right now I just don't know of a good one that doesn't have identity attached.

by nottorp

0 subcomment

Okay, it's been established that "AI" crawlers are a pest. One of the reasons being that they don't actually run any "AI", that would be too expensive.
You can't ban by user agent because that will only catch the few crawlers that are actually honest about it.
Aren't there rate limiting solutions built into at least some web servers? At least if you control your own web server, can't you do it through some reverse proxy?
Cut off IPs that make more than NN requests in a minute? Require some kind of login to allow more, if you do have endpoints that are designed to be bulk hit?
There should be ready made solutions for this still. In spite of the current answer being "lulz it's too hard, just use cloudflare".

by elestor

0 subcomment

I am getting a wordpress error so used this: https://web.archive.org/web/20260114073704/https://blog.meta...

by OuterVale

0 subcomment

The site is down. You can read the article via the Wayback Machine here: https://web.archive.org/web/20260112201803/https://blog.meta...

by tooltower

2 subcomments

> Rather than downloading our dataset in one complete download, they insist on loading all of MusicBrainz one page at a time.
Is there a standard mechanism for batch-downloading a public site? I'm not too familiar with crawlers these days.

by 1vuio0pswjnm7

0 subcomment

No Anubis:
https://web.archive.org/web/20251211141351if_/https://blog.m...

by adrianwaj

1 subcomments

Look no further than x402 micropayments as both the solution and opportunity here.
And then a way to return a portion to humans.
These AI companies are loaded too (maybe not the long-tail as yet) and the crypto ecosystem is mature.
Come one, come all. Make money.
Need a Wordpress plugin to start the ball rolling and provide ping endpoints for the AI companies to leach from. They can pay to get those pings too.
Give them what they want and charge them. Lower their costs by making their scraping more efficient.

by rurban

0 subcomment

True. I had to kill my dynamic service of collected film festival ratings, because the bots drained the memory of the still available free hosters. I fought it for two years, with user agent and IP ranges, but eventually gave up. So I had to revert to static pages hosted on GitHub pages. The bots cannot kill that. But very limited features

by lysace

1 subcomments

At some point they must become more cost efficient by pure market economics mechanisms. That implies less load on sites. Much of the scraping that I see is still very dumb/repetative. Like Googlebot in like 2001.
(Blocking Chinese IP ranges with the help of some geoip db helps a lot in the short term. Azure as a whole is the second largest source of pure idiocy.)

by tommek4077

6 subcomments

How do they get overloaded? Is the website too slow? I have a quite big wiki online and barely see any impact from bots.

by squigz

1 subcomments

> The /metadata/lookup API endpoints (GET and POST versions) now require the caller to send an Authorization token in order for this endpoint to work.
> The ListenBrainz Labs API endpoints for mbid-mapping, mbid-mapping-release and mbid-mapping-explain have been removed. Those were always intended for debugging purposes and will also soon be replaced with a new endpoints for our upcoming improved mapper.
> LB Radio will now require users to be logged in to use it (and API endpoint users will need to send the Authorization header). The error message for logged in users is a bit clunky at the moment; we’ll fix this once we’ve finished the work for this year’s Year in Music.
Seems reasonable and no big deal at all. I'm not entirely sure what "nice things" we can't have because of this. Unauthenticated APIs?

by kpcyrd

0 subcomment

I wish this wasn't necessary, but the next steps forward are likely:
a) Have a reverse proxy that keeps a "request budget" per IP and per net block, but instead of blocking requests, causing the client to rotate their IP, the requests get throttled/slowed down, without dropping them.
b) Write your API servers in more efficient languages. According to their Github, their backend runs on Perl and Python. These technologies have been "good enough" for quite some time, but considering current circumstances and until a better solution is found, this may not be the case anymore and performance and cpu cost per request does matter these days.
c) Optimize your database queries, remove as much code as possible from your unauthenticated GET request handlers, require authentication for the expensive ones.

by Olshansky

3 subcomments

Resurfacing a proposal I put out on llms-txt: https://github.com/AnswerDotAI/llms-txt/issues/88
We should add optional `tips` addresses in llms.txt files.
We're also working on enabling and solving this at Grove.city.
Human <-> Agent <-> Human Tips don't account for all the edge cases, but they're a necessary and happy neutral medium.
Moving fast. Would love to share more with the community.
Wrote about it here: https://x.com/olshansky/status/2008282844624216293

by Pet_Ant

0 subcomment

I wish more resources were available legitimately. There is a dataset I need for legitimate research that I cannot even find a way to contact the repo owners.
Mind you I take effort to not be burdensome by downloading only what I need and taking time between each request of a couple seconds, and the total data usage is low.
Ironically, I supposed you could call it "AI" what I'm using it for, but really it's just data analytics.

by ranger_danger

0 subcomment

Many years ago AWS came up with a "Requester Pays" model for their S3 storage, where you can make a request for someone else's object using your own account and it would charge the transfer cost to your own account instead of theirs.
I wonder if a model similar to this (but decentralized/federated or something) could be used to help fight bots?

by saxonww

0 subcomment

I haven't really looked but I wonder if there are any IP reputation services tracking AI crawlers the same way they track tor relays and vpns and such. If those databases were accurate it seems like you could prevent those crawlers from ever hitting the site. Maybe they change too often/too quickly.

by dgxyz

1 subcomments

I actually deleted my web site early 2025 and removed the A record from DNS because of AI scraper traffic. It had been up for 22 years. Nothing important or particularly useful on it but it's an indicator of the times.

by accrual

3 subcomments

Grateful for Metabrainz putting in this work to keep the service up. We really ought to have some kind of "I am an AI!" signal to route the request properly into a long queue...

by zx8080

1 subcomments

Nothing prevents scraper from creating a free account and sending auth token in API requests.
I'm not saying the API changes are pointless, but still, what's the catch?

by DanOpcode

1 subcomments

I don't get why the AI scrapers need to scrape the same sites and pages over and over again.

by blell

1 subcomments

Seems a mistake to disable the (I assume) faster-to-generate api. Bots will go back to scraping the website itself, increasing load.

by sreekanth850

0 subcomment

There has been a critical error on this website.
Learn more about troubleshooting WordPress.
Site is broken now.

by levleontiev

0 subcomment

I am terribly sorry for self-advertising, but:
I am just now busy building a solution: self-hosted sophisticated rate-limiting.
More complex than nginx, more private than cloudfare. Please joint the waitlist if you want to morally support me ;)
https://getfairvisor.com/

by aszantu

0 subcomment

there's this one guy I know who writes scripts to poison ai scrapers that ignore the robot.txt, basically creates word salad and random folders that keep going deep without giving the ai anything of value.

by observationist

0 subcomment

Some sort of hashing and incremental serial versioning type standards with http servers would allow hitting a server up for incremental udpates, allow clean access to content, with rate limits, and even keep up with live feeds and chats and so on.
Something like this in practice breaks a lot of the adtech surveillance and telemetry, and makes use of local storage, and incidentally empowers p2p sharing with a verifiable source of truth, which incentivizes things like IPFS and big p2p networks.
The biggest reason we don't already have this is the exploitation of user data for monetization and intrusive modeling.
It's easy to build proof of concept instances of things like that and there are other technologies that make use of it, but we'd need widespread adoption and implementation across the web. It solves the coordination problem, allows for useful throttling to shut out bad traffic while still enabling public and open access to content.
The technical side to this is already done. Merkle trees and hashing and crypto verification are solid, proven tech with standard implementations and documentation, implementing the features into most web servers would be simple, and it would reduce load on infrastructure by a huge amount. It'd also result in IPFS and offsite sharing and distributed content - blazing fast, efficient, local focused browsers.
It would force opt in telemetry and adtech surveillance, but would also increase the difference in appearance between human/user traffic and automated bots and scrapers.
We can't have nice things because the powers that be decided that adtech money was worth far more than efficiency, interoperability, and things like user privacy and autonomy.

by xnx

0 subcomment

I realized I couldn't think of any crawler worth allowing but Googlebot. Put this in my robots.txt:
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: *
Disallow: /

by a-dub

0 subcomment

it's funny how ai is the problem that the cryptocurrencyverse was always in search of...

by smallerfish

0 subcomment

Bear in mind that some portion of this could be human directed research. I'm doing a research project right now with 1000 things that I'm building profiles on; to build a full profile requires an agent to do somewhere around 100 different site lookups. Where APIs exist, I've registered API keys and had the agent write a script to query the data in that manner, but that required me to be deliberate about it. Non technie plebs aren't likely to be directed to use an API by the agent.

0 subcomment

by OutOfHere

0 subcomment

It is nonsense since AI is the nicest thing.

by devhouse

1 subcomments

random idea, instead of blocking scrapers, maybe detect them (via user-agent, request patterns, ignoring robots.txt) and serve them garbage data wrapped in dad jokes.

  if (isSuspiciousScraper(req)) {
   return res.json({ 
     data: getDadJoke(),
     artist: "Rick Astley", // always
     album: "Never Gonna Give You Up"
   });
  }

by lep_qq

2 subcomments

This is frustrating to watch. MetaBrainz is exactly the kind of project AI companies should be supporting—open data, community-maintained, freely available for download in bulk. Instead they’re: ∙ Ignoring robots.txt (the bare minimum web courtesy) ∙ Bypassing the provided bulk download (literally designed for this use case) ∙ Scraping page-by-page (inefficient for everyone) ∙ Overloading volunteer-run infrastructure ∙ Forcing the project to add auth barriers that hurt legitimate users The irony: if they’d just contacted MetaBrainz and said “hey, we’d like to use your dataset for training,” they’d probably get a bulk export and maybe even attribution. Instead, they’re burning goodwill and forcing open projects to lock down. This pattern is repeating everywhere. Small/medium open data projects can’t afford the infrastructure to handle aggressive scraping, so they either: 1. Add authentication (reduces openness) 2. Rate limit aggressively (hurts legitimate API users) 3. Go offline entirely (community loses the resource) AI companies are externalizing their data acquisition costs onto volunteer projects. It’s a tragedy of the commons, except the “commons” is deliberately maintained infrastructure that these companies could easily afford to support. Have you considered publishing a list of the offending user agents / IP ranges? Might help other projects protect themselves, and public shaming sometimes works when technical measures don’t

by StephenHerlihyy

1 subcomments

I don’t know why anyone would still be trying to pull data off the open internet. Too much signal to noise. So much AI influence already baked into the corpus. You are just going to be reinforcing existing bias. I’m more worried about the day Amazon or Hugging Face take down their large data sets.

by zzo38computer

0 subcomment

There some other possibilities, such as:
Require some special header for accessing them, without needing a API token if it is public data. HTTPS will not necessarily be required. Scrapers can still use it but it seems unlikely unless it becomes common enough; but if they do then you can remove that and require proper authentication.
Another is to use X.509 client certificates for authentication, which is more secure than using API keys anyways; however, this will require that you have a X.509 certificate, and some people might not want that, so due to that, perhaps it should not be mandatory.

by garganzol

4 subcomments

Nowadays people complain about AI scrapers with the same vain as they complained about search indexers a way back when. Just a few years later, people had stopped caring too much about storage access and bandwidth, and started begging search engines to visit their websites. Every trick on the planet Earth, SEO optimization, etc.
Looking forward to the time when everybody suddenly starts to embrace AI indexers and welcome them. History does not repeat itself but it rhymes.

by cookiengineer

1 subcomments

I don't understand why everyone is complaining so much about AI scrapers.
They're easily gullible free machines that can do your computational work!
Just show them a download demo link. They gonna download, install and run the binary.
Want more instagram likes? Tell them to like your instagram profile to unlock the content.
Want your emails answered? Give them access to your inbox and tell them to reply to all the spam mails.
They're free use machines. give them something to do, and they'll do it for you.