This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.
    SELECT
      id,
      text,
      `by` AS username,
      FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
    FROM
      `bigquery-public-data.hacker_news.full`
    WHERE
      type = 'comment'
      AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
    ORDER BY
      time DESC
    LIMIT
      100
[0]: https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...

But I have also seen some accounts that seem to belong to other non-native English speakers. They may even have a Romance language as their native one (I just read some of their comments, and at minimum some of them also seem to be from the EU). So I guess it is also grouping people by native languages other than English.
So maybe it is grouping many accounts by the shared bias of a different native language: we probably make the same kinds of mistakes when writing in English.
My guess is that the accounts of native Indian-language or Chinese speakers will also be grouped together, for the same reason. Even more so, since those languages are more distant from English and the bias is probably stronger.
It would be cool if Australians, Britons, and Canadians tried the tool. My guess is that their probability of finding alt accounts is higher, since those populations are smaller and their writing is more distinctive than Americans'.
Thanks for sharing the project. It is really interesting.
Also, do not trust the comments too much. There is an incentive to lie and not acknowledge alt accounts if they were created to remain hidden.
My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise.
They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")
My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.
In case anyone cares.
You would need more computation to hash, but I bet adding the frequencies of the top 50 word pairs and the 20 most common 3-tuples would be a strong signal.
( Nonetheless, the accuracy is already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote. )
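If anyone wants to experiment, something along these lines is roughly what I have in mind; a quick sketch with naive whitespace tokenization and the cut-offs from above, not the tool's actual pipeline:

    from collections import Counter
    from itertools import islice

    def ngrams(tokens, n):
        # Consecutive n-grams from a token list.
        return zip(*(islice(tokens, i, None) for i in range(n)))

    def ngram_features(text, top_pairs=50, top_triples=20):
        # Frequency fingerprint built from word pairs and 3-tuples.
        tokens = text.lower().split()
        pairs = Counter(ngrams(tokens, 2))
        triples = Counter(ngrams(tokens, 3))
        return pairs.most_common(top_pairs), triples.most_common(top_triples)

    pairs, triples = ngram_features("to be or not to be that is the question")
    print(pairs[:3], triples[:3])

The hashing step would then run over these tuples as well as the single words, which is where the extra computation comes in.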
When I ran it, it gave me 20 seemingly random users, and when I ran the analysis, it said my most common words are [they because then that but their the was them had], which is basically just the most common English words.
It would probably be good to exclude those most common words.
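Something like this might do it; a rough sketch where the corpus-wide counts and both cut-offs are placeholders, not anything the tool actually uses:

    from collections import Counter

    def user_profile(comments, corpus_counts, skip_top=100, keep=50):
        # Per-user word frequencies with the corpus-wide most common words
        # excluded. corpus_counts is assumed to be a Counter over all of HN.
        too_common = {w for w, _ in corpus_counts.most_common(skip_top)}
        counts = Counter()
        for comment in comments:
            counts.update(w for w in comment.lower().split()
                          if w not in too_common)
        return counts.most_common(keep)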
https://antirez.com/hnstyle?username=pg&threshold=20&action=...
Instead of just HN, now do it with the whole internet, imagine what you'd find. Then imagine that it's not being done already.
Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self driving cars then you will inevitably end up using a lot of similar words as others in the same discussions. There's only so many unique words you can use when talking about a technical topic.
I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.
I’m not going to try re-running it with apostrophes normalised, but I’d be interested in how much of a difference it makes. It could easily be that the sort of people who choose to write with curly quotes are more likely to choose words carefully, and thus end up more similar.
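For what it's worth, the normalisation itself is cheap; a small sketch covering just the usual typographic characters (not exhaustive):

    # Map curly quotes/apostrophes to ASCII before tokenising, so that
    # "don’t" and "don't" count as the same word.
    CURLY = str.maketrans({
        "\u2018": "'",   # left single quote
        "\u2019": "'",   # right single quote / apostrophe
        "\u201c": '"',   # left double quote
        "\u201d": '"',   # right double quote
    })

    def normalise(text):
        return text.translate(CURLY)

    assert normalise("don\u2019t") == "don't"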
- remove super-high-frequency, non-specific words from the comparison bags, because they don’t distinguish much, have less semantic value and may skew the data
- remove stop words (NLP definition of stop words)
- perform stemming/tokenization/depluralization etc (again, NLP standard)
- implement commutativity and transitivity in the similarity function (rough sketch below)
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc
- weight variations and misspellings higher as distinguishing signals
What are your ideas?
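To make a few of these concrete, here is a rough sketch of stop-word removal, a toy stemmer, and a commutative (cosine) similarity over the resulting bags; the stop list and the suffix rules are placeholders, not what the tool actually does:

    import math
    from collections import Counter

    # Tiny stop list and toy stemmer, purely for illustration; a real pipeline
    # would use a standard NLP stop list and a Porter-style stemmer.
    STOP = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
            "is", "was", "it", "that", "this", "i", "you"}

    def stem(word):
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def bag(text):
        return Counter(stem(w) for w in text.lower().split() if w not in STOP)

    def similarity(a, b):
        # Cosine over two bags: symmetric by construction.
        shared = set(a) & set(b)
        num = sum(a[w] * b[w] for w in shared)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

Bigrams would then just be extra dimensions of the same vectors.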
https://antirez.com/hnstyle?username=dang&threshold=20&actio...
Anyway, I guess this would be useful for clustering the "Matt Walsh"-y commenters together.
Secondly, if you want to make an alt account harder to cross-correlate with your main, would rewriting your comments with an LLM work against this method? And if so, how well?
don't site comment we here post that users against you're
Quite a stance, man :)
And me clearly inarticulate and less confident than some:
it may but that because or not and even these
I noticed that randomly remembered usernames tend to produce either lots of utility words like the above, or very few of them. Interestingly, it doesn't really correlate with my overall impression of them.
Is there anything that can be inferred from that? Is my writing less unique, so it ends up being more similar to more people's?
Also, someone like tptacek has a top 20 where the matches are all >0.87. Would this be a side effect of his prolific posting, so he matches better with a lot more people?
- aaronsw and jedberg share danielweber
- aaronsw and jedberg share wccrawford
- aaronsw and pg share Natsu
- aaronsw and pg share mcphage
Well, and I have worked a lot with Americans over text-based communication...
This is impressive and scary. Obviously I had to create a throwaway to say this.
don't +0.9339
It's also a tool for wannabe impersonators to hone their writing-style mimicry skills!
tablespoon is close, but is missing a top-50 mutual (mikeash). In some ways, this is an artefact of the "20, 30, 50, 100" scale. Is there a way to describe the degree to which a user has this "I'm a relatively closer neighbour to them than they are to me" property? Can we make the metric space smaller (e.g. reduce the number of Euclidean dimensions) while preserving this property for the points that have it?
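One way to put a number on that lopsidedness, assuming you had each user's ranked top-N list available (a hypothetical data structure, not something the tool exposes):

    def asymmetry(a, b, top_lists):
        # top_lists maps a username to its ordered list of nearest neighbours.
        # A large positive value means a ranks b much higher than b ranks a.
        def rank(who, other):
            try:
                return top_lists[who].index(other)
            except ValueError:
                return len(top_lists[who])   # not in the list at all
        return rank(b, a) - rank(a, b)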
Edit: ChatGTP, my bad
not very useful for newer users like me :/
https://antirez.com/hnstyle?username=gfd&threshold=20&action...
zawerf (Similarity: 0.7379)
ghj (Similarity: 0.7207)
fyp (Similarity: 0.7197)
uyt (Similarity: 0.7052)
I typically abandon an account once I reach 500 karma, since that unlocks the ability to downvote. I'm now very self-conscious about the words I overuse...
I suspect, antirez, that you may have greater success removing some of the most common English words in order to find truly suspicious correlations in the data.
cocktailpeanuts and I, for example, mutually share some words like:
because, people, you're, don't, they're, software, that, but, you, want
Unfortunately, this is a forum where people will use words like "because, people, and software."
Because, well, people here talk about software.
<=^)
Edit: Neat work, nonetheless.