Hacker News

How I protect my Forgejo instance from AI web crawlers

189 points by todsacerdoti

by pedrozieg

4 subcomments

What I like about this approach is that it quietly reframes the problem from “detect AI” to “make abusive access patterns uneconomical”. A simple JS+cookie gate is basically saying: if you want to hammer my instance, you now have to spin up a headless browser and execute JS at scale. That’s cheap for humans, expensive for generic crawlers that are tuned for raw HTTP throughput.
The deeper issue is that git forges are pathological for naive crawlers: every commit/file combo is a unique URL, so one medium repo explodes into Wikipedia-scale surface area if you just follow links blindly. A more robust pattern for small instances is to explicitly rate limit the expensive paths (/raw, per-commit views, “download as zip”), and treat “AI” as an implementation detail. Good bots that behave like polite users will still work; the ones that try to BFS your entire history at line rate hit a wall long before they can take your box down.
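To make the "rate limit the expensive paths" idea concrete, here is a rough sketch of a per-client token bucket that only applies to costly URLs. The path markers, the limits and the `ExpensivePathLimiter` name are all made up for illustration; they are not taken from Forgejo or from the article.

```python
import time

# Hypothetical path fragments treated as expensive; adjust for your forge.
EXPENSIVE_MARKERS = ("/raw/", "/commit/", "/archive/")


class TokenBucket:
    """Allows bursts up to `capacity`, then `refill_per_sec` requests/second."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)
        self.updated = None

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        if self.updated is not None:
            elapsed = max(0.0, now - self.updated)
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ExpensivePathLimiter:
    """Keeps one bucket per client IP; cheap paths are never limited."""

    def __init__(self, capacity=5, refill_per_sec=0.5):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def check(self, client_ip, path, now=None):
        """Return an HTTP status: 200 to proceed, 429 to reject."""
        if not any(marker in path for marker in EXPENSIVE_MARKERS):
            return 200
        now = time.monotonic() if now is None else now
        bucket = self.buckets.setdefault(
            client_ip, TokenBucket(self.capacity, self.refill_per_sec))
        return 200 if bucket.allow(now) else 429
```

A polite client that opens a handful of commit views never notices; a crawler walking every commit at line rate runs out of tokens almost immediately. The `buckets` dict grows without bound in this sketch, so a real deployment would need to expire idle entries.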

by BLKNSLVR

3 subcomments

I really don't know how effective my little system would be against these scrapers, but I've set up a system that blocks IP addresses if they've attempted to connect to ports on my system(s) behind which there are no services, and therefore their connections must be 'uninvited', which I classify as malicious.
Since I do actually host a couple of websites / services behind port 443, it means I can't just block everything that tries to scan my IP address at port 443. However, I've set up Cloudflare in front of those websites, so I do log and block any non-Cloudflare (using Cloudflare's ASN: 13335) traffic coming into port 443.
I also log and block IP addresses attempting to connect on port 80, since that is essentially deprecated.
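The decision logic described above can be sketched roughly like this. It is not the code from the linked project: it matches on a CIDR allowlist rather than an ASN lookup, the ranges are documentation placeholders rather than Cloudflare's published ranges, and `classify_connection` is a made-up name. In practice the verdict would be enforced by the firewall, not by Python.

```python
import ipaddress

# Placeholder ranges from the documentation blocks (RFC 5737 / RFC 3849).
# Assumption: a real setup would load the CDN's published ranges instead.
ALLOWED_PROXY_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def classify_connection(src_ip, dst_port, blocklist):
    """Return "accept" or "drop" and record uninvited sources in blocklist."""
    addr = ipaddress.ip_address(src_ip)
    if any(addr in net for net in ALLOWED_PROXY_NETWORKS):
        # The trusted proxy is only expected on 443; never blocklist it.
        return "accept" if dst_port == 443 else "drop"
    # Anything else arriving directly is uninvited: remember it and drop it.
    blocklist.add(str(addr))
    return "drop"
```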
This, of course, does not block traffic coming via the DNS names of the sites, since that will be routed through Cloudflare - but as someone mentioned, Cloudflare has its own anti-scraping tools. And then as another person mentioned, this does require the use of Cloudflare, which is a vast centralising force on the Internet and therefore part of a different problem...
I don't currently split out a separate list for IP addresses that have connected to HTTP(S) ports, but maybe I'll do that over Christmas.
This is my current simple project: https://github.com/UninvitedActivity/UninvitedActivity
Apologies if the README is a bit rambling. It's evolved over time, and it's mostly for me anyway.
P.S. I always thought it was Yog Sothoth (not Sototh). Either way, I'm partial to Nyarlathotep. "The Crawling Chaos" always sounded like the coolest of the elder gods.

by immibis

3 subcomments

My issue with Gitea (which Forgejo is a fork of) was that crawlers would hit the "download repository as zip" link over and over. Each access creates a new zip file on disk which is never cleaned up. I disabled that (by setting the temporary zip directory to read-only, so the feature won't work) and haven't had a problem since then.
It's easy to assume "I received a lot of requests, therefore the problem is too many requests" but you can successfully handle many requests.
This is a clever way of doing a minimally invasive botwall though - I like it.

by Simplita

1 subcomments

We ran into similar issues with aggressive crawling. What helped was rate limiting combined with making intent explicit at the entry point, instead of letting requests fan out blindly. It reduced both load and unexpected edge cases.

by maelito

3 subcomments

I'm getting lots of connections every day from Singapore. It's now the main country... despite the whole website being French-only. AI crawlers, for sure.
Thanks for this tip.

by andai

6 subcomments

Can someone help me understand where all this traffic is coming from? Are there thousands of companies all doing it simultaneously? How come even small sites get hammered constantly? At some point haven't you scraped the whole thing?

by everfrustrated

2 subcomments

I think what gets lost in this is that we should expect a lot more traffic from AI, if only because when I ask an AI to answer my question it will do a lot more work and fetch from a lot of websites in generating a reply to me. And yes, searching over git repos will absolutely be part of that.
This is all "legitimate" traffic in that it isn't about crawling the internet but in service of a real human.
Put another way, search is moving from a model of crawl the internet and query on cached data to being able to query on live data.

by petee

0 subcomment

You can also add honeypot URLs to your robots.txt to trap bots that are using it as an index.
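A minimal sketch of that trap, assuming a path that nothing on the site links to. `TRAP_PATH` and `HoneypotGuard` are invented names; the idea is that only a client that read robots.txt and ignored the Disallow rule would ever request the path.

```python
TRAP_PATH = "/trap-7f3a/"  # hypothetical path; nothing on the site links here


def robots_txt():
    """robots.txt body that tells well-behaved crawlers to stay out of the trap."""
    return "User-agent: *\nDisallow: " + TRAP_PATH + "\n"


class HoneypotGuard:
    def __init__(self):
        self.flagged = set()

    def handle(self, client_ip, path):
        """Return 403 for clients that have ever touched the trap, else 200."""
        if path.startswith(TRAP_PATH):
            self.flagged.add(client_ip)
        return 403 if client_ip in self.flagged else 200
```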

by flexagoon

0 subcomment

Oh hey, I wrote the "you don't need anubis" post you (or the post author, if that's not you) got inspiration from! Glad to hear it helped!

by s_ting765

1 subcomments

I use the same exact trick from the source the article mentions.
I call it `temu` anubis. https://github.com/rhee876527/expert-octo-robot/blob/f28e48f...
Jokes aside, the whole web seems to be trending towards some kind of wall (pay, login, app etc.) and this ultimately sucks for the open internet.

by userbinator

0 subcomment

Unfortunately this means, my website could only be seen if you enable javascript in your browser.
Or have a web-proxy that matches on the pattern and extracts the cookie automatically. ;-)

by apples_oranges

1 subcomments

HTTP 412 would be better I guess..

by justsomehnguy

0 subcomment

A similar approach is to have the proxy/webserver itself set a cookie when you visit some path, e.g. example.net/sesame/open.
For a single user or a small team this could be enough.
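Something like the following would do it, written as plain request-handling logic rather than config for any particular proxy. The secret, the cookie name and the `handle` function are assumptions for the sake of the example; only the /sesame/open path comes from the comment.

```python
import hashlib
import hmac

SECRET = b"change-me"        # assumption: a per-site secret kept on the server
OPEN_PATH = "/sesame/open"   # the "magic" path from the comment above
COOKIE_NAME = "sesame"       # hypothetical cookie name


def expected_token():
    # One shared token for everyone who knows the path.
    return hmac.new(SECRET, b"sesame-v1", hashlib.sha256).hexdigest()


def handle(path, cookies):
    """Return (status, headers) for a request; `cookies` is a dict."""
    if path == OPEN_PATH:
        cookie = f"{COOKIE_NAME}={expected_token()}; Path=/; HttpOnly"
        return 204, {"Set-Cookie": cookie}
    presented = cookies.get(COOKIE_NAME, "")
    if hmac.compare_digest(presented.encode(), expected_token().encode()):
        return 200, {}
    return 403, {}
```

Everyone shares the same token here, which is why this only makes sense for one person or a small team: whoever learns the path gets in.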

by frogperson

1 subcomments

I think it would be really cool if someone built a reverse proxy just for dealing with these bad actors.
I would really like to easily serve some Markov chain nonsense to AI bots.
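The nonsense generator itself is only a few lines. This is a toy word-level Markov chain, not a reverse proxy; `build_chain` and `babble` are made-up names, and it assumes the seed corpus has at least two words and that `length` is at least 1.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words seen directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def babble(chain, length, seed=None):
    """Generate `length` words by walking the chain at random."""
    rng = random.Random(seed)
    starts = sorted(chain)
    word = rng.choice(starts)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        # Dead end (word only seen at the very end of the corpus): restart.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)
```

Feed it any text you already host and it produces plausible-looking word salad using only words from that text.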

by reconnecting

3 subcomments

tirreno (1) guy here.
Our open-source system can block IP addresses based on rules triggered by specific behavior.
Can you elaborate on what exact type of crawlers you would like to block? Like, a leaky bucket of a certain number of requests per minute?
1. https://github.com/tirrenotechnologies/tirreno
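For anyone unfamiliar with the term, a leaky bucket of N requests per minute looks roughly like this. It is a generic sketch, not tirreno's rule engine or API; the class and parameter names are invented.

```python
class LeakyBucket:
    """Each request adds one unit; the bucket drains at a fixed rate."""

    def __init__(self, capacity, leak_per_min):
        self.capacity = float(capacity)
        self.leak_per_sec = leak_per_min / 60.0
        self.level = 0.0
        self.updated = None

    def offer(self, now):
        """Return True if the request fits, False if the bucket would overflow."""
        if self.updated is not None:
            drained = (now - self.updated) * self.leak_per_sec
            self.level = max(0.0, self.level - drained)
        self.updated = now
        if self.level + 1.0 > self.capacity:
            return False
        self.level += 1.0
        return True
```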

by stronglikedan

0 subcomment

> Unfortunately this means, my website could only be seen if you enable javascript in your browser. I feel this is acceptable.
I wouldn't be surprised if all this AI stuff was just a global conspiracy to get everyone to turn on JS.

by KronisLV

0 subcomment

We should just have some standard for crawlable archived versions of pages with no back end or DB interaction behind them etc., for example if there's a reverse proxy, whatever it outputs is archived and it wouldn't actually pass on any call in the archive version. Same for translating the output of any dynamic JS into fully static HTML. Then add some proof-of-work that works without JS and is a web standard (e.g. server sends header, client sends correct response, gets access to archive) and mainstream the culture for low-cost hosting for such archives and you're done, also make sure that this sort of feature is enabled in the most basic configuration for all web servers and such, logged separately.
Obviously such a thing will never happen, because the web and culture went in a different direction. But if it were a mainstream thing, you'd get easy to consume archives (also for regular archival and data hoarding) and the "live" versions of sites wouldn't have their logs be bogged down by stupid spam.
Or if PoW was a proper web standard with no JS, then ppl who want to tell AI and other crawlers to fuck off, they could at least make it uneconomical to crawl their stuff en masse. In my view, proof of work that would work through headers in the current day world should be as ubiquitous as TLS.
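To illustrate what header-based proof of work could look like, here is a hashcash-style sketch. No such web standard exists; the idea would be that the server sends the challenge and difficulty in a response header and the client sends the nonce back in a request header, but those headers, like the function names below, are hypothetical.

```python
import hashlib
import os


def new_challenge():
    """Server side: a random value to be sent to the client."""
    return os.urandom(16).hex()


def leading_zero_bits(digest):
    """Count zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty):
    """Server side: one hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge, difficulty):
    """Client side: brute-force a nonce, about 2**difficulty hashes on average."""
    n = 0
    while not verify(challenge, str(n), difficulty):
        n += 1
    return str(n)
```

The asymmetry is the point: verifying costs the server one hash, while each extra bit of difficulty doubles the expected work for the client, which adds up quickly for anyone fetching millions of pages.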

by agentifysh

1 subcomments

Never heard of Forgejo, should one switch from Gitea?

by Roark66

3 subcomments

I'm glad the author clarified he wants to prevent his instance from crashing not simply "block robots and allow humans".
I think the idea that you can block bots and allow humans is fallacious.
We should focus on a specific behaviour that causes problems (like making a bajillion requests, one for each commit, instead of cloning the repo). To fix this we should block clients that work in such ways. If these bots learn to request at a reasonable pace, who cares if they are bots, humans, bots under the control of an individual human, or bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, trying to limit access to only a certain class of consumers is a waste of effort.
Also, perhaps I'm biased, because I run SearXNG and Crawl4AI (and a few ancillaries like jina rerank etc.) in my homelab so I can tell my AI to perform live internet searches and fetch any website. For code it has a way to clone stuff, but for things like issues, discussions, PRs it goes mostly to GitHub.
I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).
The models sometimes hit sites they can't fetch. For this I use Firecrawl. I use an MCP proxy that lets me rewrite the tool descriptions so my models get access to both my local Crawl4AI and hosted (and rather expensive) Firecrawl, but they are told to use Firecrawl as a last resort.
The more people use these kinds of solutions, the more incentive there will be for sites not to block users that use automation. Of course they will have to rely on alternative monetisation methods, but I think eventually these stupid CAPTCHAs will disappear and reasonable rate limiting will prevail.

by mintflow

0 subcomment

Recently I noticed GitHub trying (but failing) to charge for self-hosted runners, so I took an afternoon to set up a mini PC with FreeBSD and Gitea on it, then set up Tailscale so it only listens on the 100.64.x.x IP address.
Since this node isn't publicly accessible, I don't have to worry about AI web crawlers :)

by csilker

2 subcomments

Cloudflare has a solution to protect routes from crawlers.
https://blog.cloudflare.com/introducing-pay-per-crawl/