It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
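For anyone who wants to run that check themselves, here's a minimal sketch. The ranges URL and the "prefixes"/"ipv4Prefix" JSON shape are assumptions on my part (it's the format Google publishes for its crawler ranges); swap in whatever list the operator you care about actually publishes:

    #!/usr/bin/env python3
    # Minimal sketch: check whether a client IP falls inside a bot operator's
    # published IP ranges. The URL and JSON shape below are assumptions;
    # adjust them to the list the operator actually publishes.
    import ipaddress
    import json
    import sys
    import urllib.request

    RANGES_URL = "https://openai.com/searchbot.json"  # assumed location of the published ranges

    def load_networks(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        nets = []
        for entry in data.get("prefixes", []):
            prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if prefix:
                nets.append(ipaddress.ip_network(prefix, strict=False))
        return nets

    def is_published(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    if __name__ == "__main__":
        nets = load_networks(RANGES_URL)
        print("in published range" if is_published(sys.argv[1], nets) else "not in published range")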
Then all they know is the main domain, and you can somewhat hide in obscurity.
I just don't understand how people with no clue whatsoever about what's going on feel confident enough to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical, but there's one guy there correcting them, so that's good. I mean, I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
They would be far from the first. Any time I create a wildcard cert in LE I immediately see a ton of subdomain enumeration in my DNS query logs. Just for fun I create a bunch of wildcard certs for domains I don't even use (not used, as in parked domains) just to keep their bots busy. This has been going on for about as long as the CT logs have existed.
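If anyone wants to see what their own domains expose this way, here's a minimal sketch against crt.sh's JSON endpoint (the query format and the "name_value" field reflect what crt.sh currently returns; treat the details as illustrative). With a wildcard cert the output collapses to the bare domain plus "*.domain", which is the "hide in obscurity" effect mentioned above:

    #!/usr/bin/env python3
    # Minimal sketch: list the hostnames a domain has exposed through CT logs,
    # via crt.sh's JSON output. Field names may change; illustrative only.
    import json
    import sys
    import urllib.parse
    import urllib.request

    def ct_names(domain):
        url = "https://crt.sh/?q=" + urllib.parse.quote("%." + domain) + "&output=json"
        with urllib.request.urlopen(url, timeout=30) as resp:
            entries = json.load(resp)
        names = set()
        for entry in entries:
            # name_value can hold several newline-separated SAN entries
            for name in entry.get("name_value", "").splitlines():
                names.add(name.strip())
        return sorted(names)

    if __name__ == "__main__":
        for name in ct_names(sys.argv[1]):
            print(name)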
>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
There's some security downside there: if one of my web servers gets hacked and the wildcard cert's private key is exfiltrated, it can be used to impersonate any subdomain, not just the one that was compromised. But for a lot of stuff that tradeoff seems reasonable. I wouldn't recommend this approach if you were a bank or a government security agency or a drug cartel.
Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
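The usual way to confirm that, at least for the big search engines, is the double lookup: reverse-resolve the requesting IP, check that the hostname is under the operator's documented domain, then forward-resolve that hostname and make sure it maps back to the same IP. A minimal sketch is below; the ".openai.com" suffix is a placeholder, and I haven't verified that OpenAI's crawler IPs even have matching reverse DNS, so checking against their published IP ranges may be the more reliable route:

    #!/usr/bin/env python3
    # Minimal sketch of forward-confirmed reverse DNS (the check search engines
    # document for verifying their crawlers). IPv4 only, and the expected
    # suffix is a placeholder assumption.
    import socket
    import sys

    EXPECTED_SUFFIXES = (".openai.com",)  # placeholder; use the operator's documented domain

    def verify(ip):
        try:
            host, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
        except socket.herror:
            return False
        if not host.endswith(EXPECTED_SUFFIXES):
            return False
        try:
            _, _, forward_ips = socket.gethostbyname_ex(host)   # forward lookup
        except socket.gaierror:
            return False
        return ip in forward_ips                                # must map back to the same IP

    if __name__ == "__main__":
        print("verified" if verify(sys.argv[1]) else "not verified")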
That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.
- X happened
- Person P says "Ah, X happened."
- Person Q interprets this in a particular way and says "Stop saying X is BAD!"
- Person R, who already knows about X (and is indifferent to what others notice or might know or be interested in), says "(yawn)".
- Person S narrowly looks at Person R and says "Oh, so you think Repugnant-X is ok?"
What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point in blaming individuals when such failures are a near statistical certainty.
Privacy doesn't exist in this world.