I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.
UNESCO surprised me a bit. The sub-site in question is pretty big, with thousands of documents, but the content is static, so it should be trivial to serve. What's going on? It looks like a poorly deployed WordPress on top of Apache, with no caching enabled, no content compression, and no HTTP/2 or HTTP/3. It would likely be fairly easy to get this serving cheaply on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.
Sure, you could ask an LLM, but they still aren't good at helping when you have no clue what to ask. If you don't even know the site is slower than it should be, why would you ask? You'd just hear about things getting crushed and reach for the furry defender.
> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums
https://anubis.techaro.lol/docs/design/how-anubis-works
This is pretty cool, I have a project or two that might benefit from it.
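As I read the linked design doc, the challenge boils down to finding a nonce whose SHA-256 hash clears a difficulty target. A toy sketch of that shape in Go (my own illustration; the challenge string, difficulty, and nonce encoding are made up, not Anubis's actual wire format):

```go
// Toy client-side solver: find a nonce such that SHA-256(challenge || nonce)
// starts with a given number of zero bits. Parameters are illustrative only.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts how many leading zero bits the digest has.
func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve brute-forces nonces until the difficulty target is met.
func solve(challenge string, difficulty int) uint64 {
	var buf [8]byte
	for nonce := uint64(0); ; nonce++ {
		binary.BigEndian.PutUint64(buf[:], nonce)
		sum := sha256.Sum256(append([]byte(challenge), buf[:]...))
		if leadingZeroBits(sum) >= difficulty {
			return nonce
		}
	}
}

func main() {
	// Difficulty 16 needs ~65k hashes on average: trivial for one visitor,
	// costly when multiplied across millions of scraped pages.
	fmt.Println("found nonce:", solve("example-challenge", 16))
}
```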
A funny line from his docs
Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]
> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.
> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
I built my own solution that blocks these "bad bots" at the network level. I block several large "Big Tech / Big LLM" networks entirely at the ASN (BGP) level, using MaxMind's database and a custom WAF and reverse proxy I put together.
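A stripped-down sketch of that idea (not my actual WAF): an HTTP middleware that looks up the client's ASN in a local GeoLite2-ASN database via the oschwald/geoip2-golang reader and rejects listed networks. The database path and the blocked ASNs below are just examples.

```go
// Sketch of ASN-level blocking as HTTP middleware. Assumes a local
// GeoLite2-ASN.mmdb and github.com/oschwald/geoip2-golang; the blocked ASN
// list is purely illustrative.
package main

import (
	"log"
	"net"
	"net/http"

	"github.com/oschwald/geoip2-golang"
)

var blockedASNs = map[uint]bool{
	15169: true, // example: Google
	8075:  true, // example: Microsoft
}

// asnBlocker rejects requests whose source IP maps to a blocked ASN.
func asnBlocker(db *geoip2.Reader, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			host = r.RemoteAddr
		}
		if rec, err := db.ASN(net.ParseIP(host)); err == nil && blockedASNs[rec.AutonomousSystemNumber] {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	db, err := geoip2.Open("GeoLite2-ASN.mmdb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	origin := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", asnBlocker(db, origin)))
}
```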
The goal is to make web scraping infeasible by forcing scrapers into computationally expensive OCR. It's a cat-and-mouse game right now, and I want to change the odds a little. The HTML source would be effectively useless without the user session, and OTP-like behavior could also make pages unreadable once the assets fall out of cache.
This would also make it possible to build a captcha that modifies the local seed window until the user can read a specified word: "Move the slider until you can read the word Foxtrot", for example.
I sure would love to hear your input, Xe. Maybe we can combine our efforts?
My tech stack is Go, though, because it was the only language where I could easily modify the webfont files directly without issues.
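A simplified sketch of what I mean (not my actual implementation), covering only the text-substitution half: each session gets a random permutation of the alphabet, and page text is rewritten through it. Generating the matching per-session webfont, whose cmap undoes the permutation, is assumed to happen elsewhere.

```go
// Rough sketch of per-session glyph scrambling, text-substitution half only.
// The session-specific webfont that renders the scrambled text correctly is
// assumed to be generated elsewhere and is not shown.
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

const alphabet = "abcdefghijklmnopqrstuvwxyz"

// newSessionMap builds a per-session substitution: 'a' might become 'q', etc.
// A session font with a swapped cmap would still draw the right glyphs.
func newSessionMap(seed int64) map[rune]rune {
	r := rand.New(rand.NewSource(seed))
	shuffled := []rune(alphabet)
	r.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	m := make(map[rune]rune, len(alphabet))
	for i, c := range alphabet {
		m[c] = shuffled[i]
	}
	return m
}

// scramble rewrites text through the session's permutation, so the raw HTML
// is meaningless without the session's font (or the seed that produced it).
func scramble(text string, m map[rune]rune) string {
	var b strings.Builder
	for _, c := range text {
		if out, ok := m[c]; ok {
			b.WriteRune(out)
		} else {
			b.WriteRune(c)
		}
	}
	return b.String()
}

func main() {
	m := newSessionMap(42)
	fmt.Println(scramble("scraped text is gibberish without the font", m))
}
```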
Genuine question: why not turn the proof-of-work challenge into actual mining that generates some revenue for the website? Not a new idea, but when I looked at the docs, the challenge didn't seem to be tied to any monetary coin value.
This is coming from someone who is NOT a big crypto person, but it strikes me that this would be a much better way to monetize organic, high-quality content in this day and age. Basically the idea the Brave browser started with, meeting its moment.
I'm sure Xe has already considered this. Do they have a blog post about this anywhere?
It is really sad that the World Wide Web has been pushed to the point where this is needed.
Seems like a good solution to badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like BitTorrent anyway.
I'm looking for something where:

* the server appears on the outside as an HTTPS server/reverse proxy
* the server supports self-signed certificates or Let's Encrypt
* when a client goes to a certain (sub)site or route, HTTP auth can be used
* after HTTP auth, all traffic tunneled over that subsite/route is protected against traffic analysis, for example the way obfsproxy does it
Does anyone know something like that? I am tempted to ask xeiaso to add such features, but I do not think his tool is meant for that...
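For the first three points, the rough shape of what I have in mind is a TLS-terminating reverse proxy with HTTP basic auth on a route; a minimal Go sketch (backend address, credentials, and cert paths are placeholders, and the obfsproxy-style traffic-analysis protection is not covered):

```go
// Minimal sketch: TLS-terminating reverse proxy with HTTP basic auth on one
// route. Backend address, credentials, and cert/key paths are placeholders;
// the traffic-obfuscation part is out of scope here.
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// basicAuth wraps a handler with HTTP basic authentication.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="restricted"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend, err := url.Parse("http://127.0.0.1:9000") // placeholder upstream
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	mux := http.NewServeMux()
	mux.Handle("/protected/", basicAuth("user", "secret", proxy)) // placeholder creds
	mux.Handle("/", proxy)

	// Self-signed or Let's Encrypt certificate and key files go here.
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", mux))
}
```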
What is the problem with bot traffic, exactly?
Context of my perspective: I am a contractor for a team that hosts thousands of websites on a Kubernetes cluster. All of the websites are on a storage cluster (combination of ZFS and Ceph) with SATA and NVMe SSDs. The machines in the storage cluster and also the machines the web endpoints run on have tons of RAM.
We see a lot of traffic from what are obviously scraping bots. They haven't caused any problems.
I'll be interested to hear about that. In the meantime, at least I learned of JShelter.
Edit:
Why not use the passage of time as the limiter? I guess it would still require JS, though, unless there's some hack possible with CSS animations, like requesting an image with certain URL params only after an animation finishes.
This does remind me how all of these additional hoops are making web browsing slow.
Edit #2:
Thinking about it even more, time could be made a hurdle just by serving incoming requests slowly. No fancy timestamp signing + CSS animation trickery required.
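For concreteness, "serving slowly" could be as dumb as a delay middleware; a minimal sketch (the two-second delay is an arbitrary choice):

```go
// Sketch of "time as the limiter": a middleware that simply delays every
// response. The per-request delay is an arbitrary choice for illustration.
package main

import (
	"log"
	"net/http"
	"time"
)

func slowServe(delay time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// A scraper fetching thousands of pages pays this delay on every
		// request; a human reading one page barely notices it.
		time.Sleep(delay)
		next.ServeHTTP(w, r)
	})
}

func main() {
	page := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("content\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", slowServe(2*time.Second, page)))
}
```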
I'm also not sure time would make at-scale scraping as much more expensive as PoW does. Time is money, sure, but that much? I'm also not sold on the UX, though that could be mitigated somewhat with the news-site trick of only serving the first 20% of the content initially.
So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?
Edit #3:
Turning reasoning on for a moment, this whole thing is a bit iffy.
First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training. In principle, this is nonsensical. The goal of sharing information with the general public (so, people) involves said information eventually traversing through a non-technological medium (air, as light), to reach a non-technological entity (a person). This means that any technological measure will be limited to before that medium, and won't be able to affect said target either. Put differently, I can rote copy your website out into a text editor, or hold up a camera with OCR and scan the screen, if scale is needed.
So in principle we're definitely hosed, but in practice you can try to hold onto the modality of "scraping for AI training" by leveraging the various technological fingerprints of such activity, which is how we get to at-scale PoW. But then this also combats any other kind of at-scale scraping, such as search engines. You could whitelist specific search engines, but then you're engaging in anti-competitive measures, since smaller third party search engines now have to magically get themselves on your list. And even if they do, they might be lying about being just a search engine, because e.g. Google may scrape your website for search, but will 100% use it for AI training then too.
So I don't really see any technological modality that could properly discriminate AI-training-purposed scraping traffic for you to use PoW or other methods against. You may decide to engage in this regardless based on statistical data, and just live with the negative aspects of your efforts, but then it's a bit iffy.
Finally, what about the energy-consumption-shaped elephant in the room? Using PoW for this goes basically exactly against the spirit of wanting less energy spent on AI and co. That said, this may not be a goal for the author.
The more I think about this, the less sensible and agreeable it is. I don't know man.
... yeah, that will totally work.
"If you are using Anubis .. please donate on Patreon. I would really love to not have to work in generative AI anymore..."
This is what it actually does: instead of only the provider bearing the cost of hosting content (traffic, storage), the client also bears a cost when accessing it, in the form of computation. Basically it runs additional expensive computation on the client, which makes accessing thousands of your webpages at a high rate expensive for crawlers.
> Anubis uses a proof of work in order to validate that clients are genuine. The reason Anubis does this was inspired by Hashcash, a suggestion from the early 2000's about extending the email protocol to avoid spam. The idea is that genuine people sending emails will have to do a small math problem that is expensive to compute, but easy to verify such as hashing a string with a given number of leading zeroes. This will have basically no impact on individuals sending a few emails a week, but the company churning out industrial quantities of advertising will be required to do prohibitively expensive computation.
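The flip side is that verification stays cheap: the server re-hashes once and counts leading zero bits, while the client had to try on the order of 2^difficulty nonces. A toy sketch of that check (parameters and layout are mine, not Anubis's actual protocol):

```go
// Toy server-side check: verifying a submitted nonce costs one SHA-256 call,
// while finding it cost the client ~2^difficulty attempts on average.
// Difficulty and encoding are illustrative only.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// verify re-hashes challenge||nonce once and checks the leading zero bits.
func verify(challenge string, nonce uint64, difficulty int) bool {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], nonce)
	sum := sha256.Sum256(append([]byte(challenge), buf[:]...))

	zeros := 0
	for _, b := range sum {
		if b == 0 {
			zeros += 8
			continue
		}
		zeros += bits.LeadingZeros8(b)
		break
	}
	return zeros >= difficulty
}

func main() {
	// A client that actually did the work submits (challenge, nonce);
	// 12345 here is a dummy value and will almost certainly fail the check.
	fmt.Println(verify("example-challenge", 12345, 16))
}
```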
Wouldn't it be ironic if the amount of JS served to a "bot" costs even more bandwidth than the content itself? I've seen that happen with CF before. Also keep in mind that if you anger the wrong people, you might find yourself receiving a real DDoS.
If you want to stop blind bots, perhaps consider asking questions that would easily trip LLMs but not humans. I've seen and used such systems for forum registrations to prevent generic spammers, and they are quite effective.