Ask HN: Scaling a targeted web crawler beyond 500M pages/day
22 points by honungsburk
by faangguyindia
1 subcomments
If you want to access data from websites that try to prevent it, you gotta use a headless browser with a residential proxy network like Bright Data (formerly Luminati).
by 4lx87
2 subcomments
I'm curious, how do you deal with Cloudflare and similar anti-bot systems? Just keep shopping the job around to different proxies?
by fragmede
1 subcomments
Have you already incorporated Common Crawl into your index?