FRESH

Hacker News

Home

XBOW, an autonomous penetration tester, has reached the top spot on HackerOne

284 points by summarity

by hinterlands

5 subcomments

Xbow has really smart people working on it, so they're well-aware of the usual 30-second critiques that come up in this thread. For example, they take specific steps to eliminate false positives.
The #1 spot in the ranking is both more of a deal and less of a deal than it might appear. It's less of a deal in that HackerOne is an economic numbers game. There are countless programs you can sign up for, with varied difficulty levels and payouts. Most of them pay not a whole lot and don't attract top talent in the industry. Instead, they offer supplemental income to infosec-minded school-age kids in the developing world. So I wouldn't read this as "Xbow is the best bug hunter in the US". That's a bit of a marketing gimmick.
But this is also not a particularly meaningful objective. The problem is that there's a lot of low-hanging bugs that need squashing and it's hard to allocate sufficient resources to that. Top infosec talent doesn't want to do it (and there's not enough of it). Consulting companies can do it, but they inevitably end up stretching themselves too thin, so the coverage ends up being hit-and-miss. There's a huge market for tools that can find easy bugs cheaply and without too many false positives.
I personally don't doubt that LLMs and related techniques are well-tailored for this task, completely independent of whether they can outperform leading experts. But there are skeptics, so I think this is an important real-world result.

by tecleandor

2 subcomments

First:
> To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.
Is it dogfooding if you're not doing it to yourself? I'd considerit dogfooding only if they were flooding themselves in AI generated bug reports, not to other people. They're not the ones reviewing them.
Also, honest question: what does "best" means here? The one that has sent the most reports?

by vmayoral

1 subcomments

It’s humans who:
- Design the system and prompts
- Build and integrate the attack tools
- Guide the decision logic and analysis
This isn’t just semantics — overstating AI capabilities can confuse the public and mislead buyers, especially in high-stakes security contexts.
I say this as someone actively working in this space. I participated in the development of PentestGPT, which helped kickstart this wave of research and investment, and more recently, I’ve been working on Cybersecurity AI (CAI) — the leading open-source project for building autonomous agents for security:
- CAI GitHub: https://github.com/aliasrobotics/cai
- Tech report: https://arxiv.org/pdf/2504.06017
I’m all for pushing boundaries, but let’s keep the messaging grounded in reality. The future of AI in security is exciting — and we’re just getting started.

by Sytten

0 subcomment

Since I am the cofounder of a mostly manual based testing in that space we do follow the new AI hackbots closely. There is a lot of money being raised (Horizon3 at 100M, Xbow at 87M, Mindfort will probably soon raise).
The future is definitely a combination of human and bots like anything else, it won't replace the humans just like coding bots won't replace devs. In fact this will allow humans to focus ob the fun/creative hacking instead of the basic/boring tests.
What I am worried about is on the triage/reproduction side, right now it is still mostly manual and it is a hard problem to automate.

by mellosouls

1 subcomments

Have XBow provided a link to this claim, I could only find:
https://hackerone.com/xbow?type=user
Which shows a different picture. This may not invalidate their claim (best US), but a screenshot can be a bit cherry-picked.

by martinald

1 subcomments

This does not surprise me. In a couple of 'legacy' open source projects I found DoS attacks within 10 minutes, with a working PoC. It crashed the server entirely. I suspect with more prompting it could have found RCE but it was an idle shower thought to try.
While niche and not widely used; there are at least thousands of publicly available servers for each of these projects.
I genuinely think this is one of the biggest near term issues with AI. Even if we get great AI "defence" tooling, there are just so many servers and (IoT or otherwise) devices out there, most of which is not trivial to patch. While a few niche services getting pwned isn't probably a big deal, a million niche services all getting pwned in quick succession is likely to cause huge disruption. There is so much code out there that hasn't been remotely security checked.
Maybe the end solution is some sort of LLM based "WAF" that inspects all traffic that ISPs deploy.

by keisborg

1 subcomments

«XBOW submitted nearly 1,060 vulnerabilities. All findings were fully automated, though our security team reviewed them pre-submission to comply with HackerOne’s policy on automated tools»
That seems a bit unethical. I’ve thought companies specifically deny usage of automated tools. A bit too late ey…?

by imglorp

0 subcomment

And so we've arrived at William Gibson's black ice and ice-breaker (Russian military) systems.
https://en.wikipedia.org/wiki/Burning_Chrome

by mkagenius

2 subcomments

> XBOW submitted nearly 1,060 vulnerabilities.
Yikes, explains why my manually submitted single vulnerability is taking weeks to triage.

by spacecadet

0 subcomment

Ive also invested some time in this space over the last several years. A group Im in took the approach of custom building agents for each CTF we approached. Our best so far was an agent participating in an AI CTF against current top injection/jailbreak and leakage defense techniques, the agent autonomously completed 22 of the 40 challenges and at one point held 8th place out of 380 teams. It eventually plateaued and slipped to 12th by the end.
The tooling and models are maturing quickly and there is definitely some value in autonomous security agents, both offensive and defensive- but also still requires alot of work, knowledge(my group is all ML people), skill, planning- if you want to approach anything more than bug bashing.
This recent paper from Dreadnode discusses a benchmark for this sort of challenge: https://arxiv.org/abs/2506.14682

by TZubiri

0 subcomment

This would be impressive even as a human assisted project.
But there's a claim that it is unsupervised, which I doubt. See how these two claims contradict each other.
>"XBOW is a fully autonomous AI-driven penetration tester. It requires no human input, "
>"To ensure accuracy, we developed the concept of validators, automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks."
I mean, I doubt you deploy this thing collecting thousands of dollars in bounties and you sit there twiddling your thumbs. Whatever work you put into the AI, whether fine tuned or generic and reusable, counts as supervised, and that's ok. Take the win, don't try to sell the automated dream to get investors or whatever, don't get caught up in fraud.
As I understand it, when you discover a type of vulnerabilities, it's very common to automate the detection and find other clients with such vulnerability, these are usually short lived and the well dries up fast, you need to constantly stay on top of the latest trends. I just don't buy that if you leave this thing unattended for even 3 months it would keep finding gold, that's a property of the engineers that is not scaleable (and that's ok).

by jp0001

1 subcomments

I want to know how much they made in bounties versus how much they spent on compute.
The thing about bug bounties, the only way to win is to not play the game.

by chc4

0 subcomment

I'm generally pretty bearish on AI security research, and think most people don't know anything about what they're talking about, but XBOW is frankly one of the few legitimately interesting and competent companies in the space, and their writeups and reports have good and well thought out results. Congrats!

by ryandrake

7 subcomments

Receiving hundreds of AI generated bug reports would be so demoralizing and probably turn me off from maintaining an open source project forever. I think developers are going to eventually need tools to filter out slop. If you didn’t take the time to write it, why should I take the time to read it?

by skeptrune

1 subcomments

I'm confused on whether or not this actually outperformed humans. The more interesting statistic would be how much money it made versus the average hacker one top ranked contributor.

by moktonar

0 subcomment

While impressive, a lot of manual human work was involved both to filter the input and the output, this is not a “fully” automated workflow, sorry. But, yeah, kudos to them.

0 subcomment

by billy99k

0 subcomment

While I think this is great progress, It will be a hard sell for many business owners. I've been in this space (infosec) for awhile now and customers can be very apprehensive about non-AI/humans looking for vulnerabilities on their networks/systems.

0 subcomment

by andrewstuart

1 subcomments

All the fun vanishes.

by ikmckenz

1 subcomments

Related: https://arstechnica.com/gadgets/2025/05/open-source-project-...

by nottorp

1 subcomments

[flagged]

by jekwoooooe

2 subcomments

They should ban this or else they will get swallowed up and companies will stop working with them. The last thing I want is a bunch of llm slop sent to me faster than a human would

by bgwalter

1 subcomments

"XBOW is an enterprise solution. If your company would like a demo, email us at info@xbow.com."
Like any "AI" article, this is an ad.
If you are willing to tolerate a high false positive rate, you can as well use Rational Purify or various analyzers.

by wslh

0 subcomment

I'm looking forward to the LLM's ELI5 explanation. If I understand correctly, XBOW is genuinely moving the needle and pushing the state of the art.
Another great reading is [1](2024).
[1] "LLM and Bug Finding: Insights from a $2M Winning Team in the White House's AIxCC": https://news.ycombinator.com/item?id=41269791