FRESH

Hacker News

Home

Aggressive bots ruined my weekend

207 points by shaunpud

by asplake

6 subcomments

> What's wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I'm still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
Wild indeed, and potentially horrific for the owners of the affected devices also! Any corroboration for that out there?

by ItsBob

3 subcomments

I had a website earlier this year running on Hetzner. It was purely experimenting with some ASP.NET stuff but when looking at the logs, I noticed a shit-load of attempts at various WordPress-related endpoints.
I then read something about a guy who deliberately put a honeypot in his robots.txt file. It was pointing to a completely bogus endpoint. Now, the theory was, humans won't read robots.txt so there's no danger, but bots and the like will often read robots.txt (at least to figure out what you have... they'll ignore the "deny" for the most part!) and if they try and go to that fake endpoint you can be 100% sure (well, as close as possible) that it's not a human and you can ban them.
So I tried that.
I auto-generated a robots.txt file on the fly. It was cached for 60 seconds or so as I didn't want to expend too many resource on it. When you asked for it, you either got the cached one or I created a new one. The CPU-usage was negligible.
However, I changed the "deny" endpoint each time I built the file in case the baddies cached it, however, it still went to the same ASP.NET controller method. By hitting it, I sent a 10GB zip bomb and your IP was automatically added to the FW block list.
It was quite simple: anyone that hit that endpoint MUST be dodgy... I believe I even had comments for the humans that stumbled across it letting them know that if they went to this endpoint in their browser it was an automatic addition to the firewall blocklist.
Anyway... at first I caught a shit load of bad guys. There were thousands at first and then the numbers dropped and dropped to only tens per day.
Anyway, this is a single data point but for me, it worked... I have no regrets about the zip bomb either :)
I have another site that I'm working on so I may evolve it a bit so that you are banned for a short time and if you come back to the dodgy endpoint then I know you're a bot so into the abyss with you!
It's not perfect but it worked for me anyway.

by cupofjoakim

3 subcomments

We feel this at work too. We run a book streaming platform with all books, booklists, authors, narrators and publishers available as standalone web pages for SEO, in the multiple millions. Last 6 months have turned into a hellscape - for a few reasons:
1. It's become commonplace to not respect rate limits
2. Bots no longer identify themselves by UA
3. Bots use VPNs or similar tech to bypass ip rate limiting
4. Bots use tools like NobleTLS or JA3Cloak to go around ja3 rate limiting
5. Some valid LLM companies seem to also follow the above to gather training data. We want them to know about our company, so we don't necessarily want to block them
I'm close to giving up on this front tbh. There's no longer safe methods of identifying malignant traffic at scale, and with the variations we have available we can't statically generate these. Even with a CDN cache (shoutout fastly) our catalog is simply too broad to fully saturate the cache while still allowing pages to be updated in a timely manner.
I guess the solution is to just scale up the origin servers... /shrug
In all seriousness, i'd love if we somehow could tell the bots about more efficient ways of fetching the data. Use our open api for fetching book informations instead of causing all that overhead by going to marketing pages please.

by pingoo101010

0 subcomment

You may want to take a look at Pingoo (https://github.com/pingooio/pingoo), a reverse proxy with automatic TLS that can also block bots with advanced rules that go beyond simple IP blocking.

by uvaursi

1 subcomments

Do we shift over everything to le Dark Web and let the corpos use this one for selling their shit to consumers? These toys don’t want to play nice and there’s no real way to stop them without bringing in things like Real ID and other verifications that infringe on anonymity.

by y-zon128

1 subcomments

> Auto-restart the reverse-proxy if bandwidth usage drops to zero for more than 2 minutes
It's understandable in your case as you have traffic coming in constantly, but first thing that came to my mind is a loop of contant reboots - again, very unlikely in your case. Sometimes such blanket rules hit me due to most unexpected reasons, like the proxy somehow failed to start serving traffic in the given timeframe.
Though I completely appreciate and agree with the 'ship now something that works now' approach!

by r_singh

3 subcomments

The Internet isn’t possible without scraping. For all the sentiment against scraping public data, doing so remains legal and essential to a lot of the services we use everyday. I think setting guidelines and shaping the web for reduced friction aimed at fair usage rather than turning it political would be the right thing to do.

by Havoc

0 subcomment

> monetise their apps by offering them for free, and selling tunnel access to scrapers.
I bet it’s free VPN apps

by kwa32

1 subcomments

crazy how scraping became an industry

by 2OEH8eoCRo0

1 subcomments

Why don't we sue the abusive scrapers? Scraping is legal but DDoSing is not!

by kelvinjps10

1 subcomments

Maybe moving the blog service to completely static and letting cloudfare pages handle it, could help?

by deepstateisfbi

0 subcomment

[flagged]

0 subcomment

by TimorousBestie

3 subcomments

I think he should consider getting out of the indie blog hosting business. It’s only going to get worse as the internet continues to decay and he can’t be making all that much off the service.