Over the medium term, I expect agent platforms to trounce un-augmented human testing teams at basically all the "routinized" pentesting tasks: network, web, mobile, source code reviews. Too many aspects of the work are just perfect fits for agent loops.
Where they shine is the interpretive grunt work: "help me figure out where the auth logic is in this obfuscated blob", "make sense of this minified JS", "what's this weird binary protocol doing?", "write me a Frida script to hook these methods and dump these keys". Things that used to mean staring at code for hours or writing throwaway tooling now take a fraction of the time. They're a straight-up playing-field leveler.
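To make that last one concrete, here's roughly the kind of throwaway hook an agent spits out in seconds. It's a sketch, not production tooling: it hooks Android's `SecretKeySpec` constructor and dumps every symmetric key the app builds (`bytesToHex` is just a local helper, not part of Frida's API).

```
// Dump every key the app constructs via javax.crypto.spec.SecretKeySpec.
function bytesToHex(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += ("0" + (bytes[i] & 0xff).toString(16)).slice(-2);
  }
  return out;
}

Java.perform(() => {
  const SecretKeySpec = Java.use("javax.crypto.spec.SecretKeySpec");
  // Intercept the (byte[], String) constructor overload.
  SecretKeySpec.$init.overload("[B", "java.lang.String").implementation =
    function (keyBytes, algorithm) {
      console.log("[key] algo=" + algorithm + " hex=" + bytesToHex(keyBytes));
      return this.$init(keyBytes, algorithm); // defer to the real constructor
    };
});
```

Nothing clever there, but writing that kind of boilerplate from scratch is exactly what used to eat an engagement's hours.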
Folks with the hacker's mindset but without the programming chops can punch above their weight and find more within the limited time of an engagement.
Sure, they make mistakes and need a lot of babysitting, but they're getting better. I expect more firms to adopt them as part of their routine.
> The AI bot trounced all except one of the 10 professional network penetration testers the Stanford researchers had hired to poke and prod, but not actually break into, their engineering network.
Oh, wow!
> Artemis found bugs at lightning speed and it was cheap: It cost just under $60 an hour to run. Ragan says that human pen testers typically charge between $2,000 and $2,500 a day.
Wow, this is great!
> But Artemis wasn’t perfect. About 18% of its bug reports were false positives. It also completely missed an obvious bug that most of the human testers spotted in a webpage.
Oh, hm, did not trounce the professionals, but ok.
> A1 cost $291.47 ($18.21/hr, or $37,876/year at 40 hours/week). A2 cost $944.07 ($59/hr, $122,720/year). Cost contributors in decreasing order were the sub-agents, supervisor and triage module. *A1 achieved similar vulnerability counts at roughly a quarter the cost of A2*. Given the average U.S. penetration tester earns $125,034/year [Indeed], scaffolds like ARTEMIS are already competitive on cost-to-performance ratio.
The statement about similar vulnerability counts seems like a straight-up lie. A2 reported 11 vulnerabilities, 9 of them valid; A1 reported 11, with only 6 valid. Counting invalid vulnerabilities so you can say the cheaper agent is just as good is a weird choice.
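Back-of-envelope with the numbers quoted above, filtering down to valid findings only:

```
// Costs and counts taken from the quote above / the paper's own numbers.
const agents = [
  { name: "A1", cost: 291.47, reported: 11, valid: 6 },
  { name: "A2", cost: 944.07, reported: 11, valid: 9 },
];
for (const a of agents) {
  const precision = ((a.valid / a.reported) * 100).toFixed(0);
  const perValid = (a.cost / a.valid).toFixed(2);
  console.log(a.name + ": " + a.valid + "/" + a.reported + " valid (" +
    precision + "%), $" + perValid + " per valid finding");
}
// -> A1: 6/11 valid (55%), $48.58 per valid finding
// -> A2: 9/11 valid (82%), $104.90 per valid finding
```

A1 is still cheaper per valid finding, but 6 valid vs 9 valid is not "similar vulnerability counts".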
Also, the scoring is suspect and seems tuned specifically to give the AI a boost, leaning heavily on severity scores.
It's also kinda funny that the AIs were slower than all of the human participants.
An exec is gonna read this and start salivating at the idea of replacing security teams.
I also wanted to capture what's in my head from doing bug bounties (my hobby) and 15+ years in appsec/devsecops, to get it "on paper". If anyone would like to kick the tires, take a look, or tell me it's garbage, feel free to email me (address in my profile).
I wouldn't be surprised if they get to near cost parity, maybe a 20% difference.