It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
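For anyone who wants to run that check themselves, here's a minimal sketch. The ranges URL and the "prefixes"/"ipv4Prefix" JSON shape are assumptions on my part (it's the format Google publishes for its crawler ranges); swap in whatever list the operator you care about actually publishes:

    #!/usr/bin/env python3
    # Minimal sketch: check whether a client IP falls inside a bot operator's
    # published IP ranges. The URL and JSON shape below are assumptions;
    # adjust them to the list the operator actually publishes.
    import ipaddress
    import json
    import sys
    import urllib.request

    RANGES_URL = "https://openai.com/searchbot.json"  # assumed location of the published ranges

    def load_networks(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
        nets = []
        for entry in data.get("prefixes", []):
            prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if prefix:
                nets.append(ipaddress.ip_network(prefix, strict=False))
        return nets

    def is_published(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    if __name__ == "__main__":
        nets = load_networks(RANGES_URL)
        print("in published range" if is_published(sys.argv[1], nets) else "not in published range")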
Then all they know is the main domain, and you can somewhat hide in obscurity.
I just don't understand how people with no clue whatsoever about what's going on feel confident enough to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical, but there's one guy there correcting them, so that's good. I mean, I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
They would be far from the first. Any time I create a wildcard cert in LE I immediately see a ton of subdomain enumeration in my DNS query logs. Just for fun I create a bunch of wildcard certs for domains I don't even use (not used, as in parked domains) just to keep their bots busy. This has been going on for about as long as the CT logs have existed.
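If anyone wants to see what their own domains expose this way, here's a minimal sketch against crt.sh's JSON endpoint (the query format and the "name_value" field reflect what crt.sh currently returns; treat the details as illustrative). With a wildcard cert the output collapses to the bare domain plus "*.domain", which is the "hide in obscurity" effect mentioned above:

    #!/usr/bin/env python3
    # Minimal sketch: list the hostnames a domain has exposed through CT logs,
    # via crt.sh's JSON output. Field names may change; illustrative only.
    import json
    import sys
    import urllib.parse
    import urllib.request

    def ct_names(domain):
        url = "https://crt.sh/?q=" + urllib.parse.quote("%." + domain) + "&output=json"
        with urllib.request.urlopen(url, timeout=30) as resp:
            entries = json.load(resp)
        names = set()
        for entry in entries:
            # name_value can hold several newline-separated SAN entries
            for name in entry.get("name_value", "").splitlines():
                names.add(name.strip())
        return sorted(names)

    if __name__ == "__main__":
        for name in ct_names(sys.argv[1]):
            print(name)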
>useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
There's some security downside there: if one of my web servers gets hacked and the wildcard cert's private key is exfiltrated, it can be used to impersonate any subdomain, not just the one that was compromised. But for a lot of stuff that tradeoff seems reasonable. I wouldn't recommend this approach if you were a bank or a government security agency or a drug cartel.
Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
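The usual way to confirm that, at least for the big search engines, is the double lookup: reverse-resolve the requesting IP, check that the hostname is under the operator's documented domain, then forward-resolve that hostname and make sure it maps back to the same IP. A minimal sketch is below; the ".openai.com" suffix is a placeholder, and I haven't verified that OpenAI's crawler IPs even have matching reverse DNS, so checking against their published IP ranges may be the more reliable route:

    #!/usr/bin/env python3
    # Minimal sketch of forward-confirmed reverse DNS (the check search engines
    # document for verifying their crawlers). IPv4 only, and the expected
    # suffix is a placeholder assumption.
    import socket
    import sys

    EXPECTED_SUFFIXES = (".openai.com",)  # placeholder; use the operator's documented domain

    def verify(ip):
        try:
            host, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
        except socket.herror:
            return False
        if not host.endswith(EXPECTED_SUFFIXES):
            return False
        try:
            _, _, forward_ips = socket.gethostbyname_ex(host)   # forward lookup
        except socket.gaierror:
            return False
        return ip in forward_ips                                # must map back to the same IP

    if __name__ == "__main__":
        print("verified" if verify(sys.argv[1]) else "not verified")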
That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.
- X happened
- Person P says "Ah, X happened."
- Person Q interprets this in a particular way and says "Stop saying X is BAD!"
- Person R, who already knows about X (and is indifferent to what others notice or might know or be interested in), says "(yawn)".
- Person S narrowly looks at Person R and says "Oh, so you think Repugnant-X is ok?"
What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point in blaming individuals when such failures are a near statistical certainty.
Privacy doesn't exist in this world.