FRESH

Hacker News

The AI-Scraping Free-for-All Is Coming to an End

69 points by geox

by jsnell

5 subcomments

The headline seems pretty aspirational.
The licensing standard they're talking about will achieve nothing.
Anti-bot companies selling scraping protections will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they're also getting burned. And it's politically impossible to expand the web platform to have robust counter-abuse capabilities.
Putting the content behind a login wall can work for large sites, but not small ones.
The free-for-all will not end until adversarial scraping becomes illegal.

by WaltPurvis

1 subcomments

by xarope

0 subcomment

I can see how the AI companies would work around this though:
user queries "static" training data in LLM; LLM guesses something, then searches internet in real-time for data to support the guesses. This would be classified as "browsing" rather than trawling.
(the searched data then get added back into the corpus, thus sadly sidestepping all the anti-AI trawling mechanisms)
Kind of like the way a normal user would.
The problem is, as others have already mentioned, how would the LLMs know what is a good answer versus a bad, when a "normal" user also has this issue?

by janalsncm

0 subcomment

> There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts.
Those things were afterthoughts because for the most part the experimental methods sucked compared to the real thing. If we were in mid 2016 and your LSTM was barely stringing together coherent sentences, it was a curiosity but not a serious competitor to StackOverflow.
I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.

by Zigurd

0 subcomment

Sites containing original content will adopt active measures against LLM scraper bots. Unlike search indexing bots, there's much less upside to allowing scraping for LLM training material. Openly adversarial actions like serving up poisoned text that would induce LLMs to hallucinate is much more defensible.

by PolicyPhantom

0 subcomment

Free-for-All was a natural assumption in the early internet, but in the age of AI, alignment with contracts and governance becomes essential. Technical capability alone is not enough — without mechanisms like licensing or audits to ensure legitimacy, such practices may prove socially unsustainable.

by ericdotlee

1 subcomments

What a lot of these journalists don't realize is ai tools are the internet funnels of the future. People use ChatGPT not Google to source info. The way you get results is begging these tools to search for specific bits of info in order to get visibility.

by deadbabe

0 subcomment

by 1gn15

4 subcomments

Biased TL;DR: Reddit (notable for having a high stock value from their "selling data" business [1]), Medium, Quora, and Cloudflare competitor Fastly created a standard to restrict what the reader can do with the data users created, called Really Simple Licensing (RSL). Basically robots.txt but with more details, notably with details on how much you should pay Reddit/Medium/Quora.
While this likely has no legal weight (except for EU TDM for commercial use, where the law does take into account opt-outs), they are betting on using services like CloudFlare and Fastly to enforce this.
[1] https://www.investors.com/research/the-new-america/reddit-st...

by koolhead17

0 subcomment

by ath3nd

1 subcomments

Next: the AI bubble is coming to an end. Also fingers crossed that the career and employment of Mark Zuckerberg also follow suit soon.

by aaaggg

0 subcomment