Hacker News

GLM 5.2 beats Claude in our benchmarks

1097 points by jms703

by pimeys

18 subcomments

I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars...
This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools. Because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in Rust that has access to my homelab.
Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality and was much cheaper than Opus or GPT.
I used it unquantized through Fireworks, but there are multiple other providers too.

by SwellJoe

6 subcomments

I added GLM 5.2 to my security bug hunting benchmark when it came out, and found it to be a good performer, but not the best open model. The benchmark tests whether models can find bugs Mythos found. The best open models in the initial benchmark were DeepSeek V4 Pro or MiMo 2.5 Pro. But it turned out MiMo got lucky: it's performed worse on almost every test I've done since, while DeepSeek has consistently been among the best performers, and its extreme caching performance makes it cheaper than just about anything, including much smaller models.
https://swelljoe.com/post/will-it-mythos/
Also of note, I found giving models access to the open source semgrep as a tool makes some perform worse and none perform better, though it's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it (my theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once: figure out how to use semgrep and find security bugs, and both tasks suffer for the lack of focus...most small models, and some big models, can't do that well).
Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.

by bArray

7 subcomments

Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?
[1] https://huggingface.co/zai-org/GLM-5.2
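A rough back-of-the-envelope sketch for the weights alone, assuming the 753B figure above. The precisions and bytes-per-parameter values are illustrative assumptions, and KV cache, activations, and runtime overhead are ignored, so real requirements would be higher:

```python
# Weight-only memory estimate for a 753B-parameter model.
# Ignores KV cache, activations, and runtime overhead.
PARAMS = 753e9  # parameter count quoted in the comment above

# Assumed storage widths per parameter, in bytes (illustrative, not from the model card).
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in decimal GB."""
    return params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(PARAMS, width)
    for name, width in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB for weights")
```

Even under the most aggressive assumption here, the weights alone come to several hundred GB, which is why the question tends to lead to multi-GPU servers or large unified-memory machines rather than a single consumer card.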

by himata4113

3 subcomments

These numbers seem pretty low compared to what I was able to achieve, specifically around the Windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if China started surpassing the models that the US makes public, at least in specific categories such as cyber.
GLM 5.2 is already capable enough to assist in self-training which is similar to what we saw happen with frontier models and they appear to be getting there at a significantly lower cost than openai/anthropic.

by Roark66

0 subcomment

Has anyone compared the costs between maxing out a Claude Max x5 subscription (one for €120 a month) and the same amount of work on GLM 5.2 via API at a cost of $4 per million output tokens?
I have a feeling Anthropic may still come out cheaper (mainly thanks to enterprises subsidising the Max subscriptions).
But I'm very excited with the possibility of using fully EU based inference rivalling Opus in quality.
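As a starting point for that comparison, a minimal sketch of the break-even point using only the two prices quoted above. The exchange rate is a placeholder assumption, input-token and cached-token pricing are ignored, and how many tokens a maxed-out subscription actually delivers is unknown here:

```python
# Break-even between a flat subscription and pay-per-token API usage.
SUBSCRIPTION_EUR = 120.0          # monthly price quoted above
API_USD_PER_MILLION_OUT = 4.0     # API output price quoted above
EUR_TO_USD = 1.0                  # placeholder assumption, not a real rate


def break_even_millions(subscription_usd: float, usd_per_million: float) -> float:
    """Millions of output tokens at which API spend equals the subscription."""
    return subscription_usd / usd_per_million


break_even = break_even_millions(
    SUBSCRIPTION_EUR * EUR_TO_USD, API_USD_PER_MILLION_OUT
)
print(f"Break-even: {break_even:.1f}M output tokens per month")
```

Under these assumptions the API is cheaper below the break-even volume and the subscription is cheaper above it, provided the subscription's rate limits allow that much output.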

by simplyluke

0 subcomment

I've been using it for a week via opencode in a large, mature codebase for some moderately ambitious feature development, and a bit of debugging. Explicit purpose is evaluating if it may be a good substitute to save money for many tasks. For several tasks I've had both it and opus 4.8 attempt the same task and compared them.
In general, it's comparable across the board. Claude is less "verbose" -- GLM really likes to comment a ton. There were a few things where I think claude would have needed a little bit less back and forth. So opus still has an edge, but it's marginal, very much unlike previous open/competitor models where benchmarks looked good but actual day to day performance was pretty bad. I'm sure fable is "better" but it's so expensive + data retention policies are such that for the moment it was generally available I couldn't use it for work. This is still notably better performance than when claude code took the industry by storm.
I'm understanding why Dario is trying to regulate open weight models away.

by solenoid0937

8 subcomments

GLM export controls incoming? I predict Commerce will force OpenRouter, HuggingFace to take some open models down within the next few months.
Not that it would make any sense.

by WithinReason

4 subcomments

> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found
Claude Code is an agent harness, not an LLM.
Claude is a brand (or group of LLMs), not an LLM.

by softwaredoug

4 subcomments

Are open labs just loss leaders backed by Chinese govt? Is this like electric cars where the goal is to flood the market with good enough quality for free so they end up dominating the market?
Or is there a business model I’m missing?

by dmix

1 subcomments

I hope someone is also building a Claude Design competitor. One that is similarly HTML based instead of the Figma/Magic Patterns approach.
I have more vendor lock-in with Design than I do with Code, and will switch over as soon as Claude loses the smallest technical advantage

by kelnos

1 subcomments

Title is misleading (and is editorialized from the actual article title). GLM 5.2 did better than Claude in one specific cybersecurity-related benchmark (finding vulnerabilities of one certain type). I don't think you can draw any general conclusions about the relative utility of the two models.

by jackdawed

1 subcomments

I use GLM 5.2 via Neuralwatt and it's gotten so cheap I wouldn't mind cancelling my personal Claude subscription if work gave me one. I've spent 374M tokens this month and it only cost me $18 on energy-based pricing.
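For scale, a small sketch that converts the two figures in this comment into an effective price per million tokens. It assumes the 374M count covers all tokens billed (input, output, and cached), which the comment does not specify:

```python
# Effective blended price implied by the figures in the comment above.
TOKENS_USED = 374e6   # tokens reported for the month
TOTAL_COST_USD = 18.0 # reported monthly cost


def usd_per_million(total_cost: float, tokens: float) -> float:
    """Blended cost per one million tokens."""
    return total_cost / (tokens / 1e6)


effective_rate = usd_per_million(TOTAL_COST_USD, TOKENS_USED)
print(f"Effective rate: ${effective_rate:.3f} per million tokens")
```

That works out to roughly five cents per million tokens, well under typical per-token list prices, which suggests most of the volume is cheap input or cached tokens.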

by danslo

5 subcomments

It reads like an ad.
Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.
Thirdly it compares to GPT 5.5 and Opus 4.8.
No, we don't have Mythos at home.

by andai

0 subcomment

Most interesting things to me from their benchmarks:
GPT does way worse than Opus without their harness, but better with it.
Opus 4.7 and 4.8 do way worse than 4.6. (Intentional nerfing?)
Would have been interesting to see GLM in the custom harness.
Would also be interesting to run GLM in Claude Code, which it has presumably been fine tuned on.

by armcat

0 subcomment

I find it astounding that ppl still comment “it’s still behind” or “it’s not the best model”. Everything is about the harness. Even the big AI labs are focusing on managing agents - sandboxes, memory, context, skills, loops. With the right harness GLM 5.2 can do no wrong.

by XCSme

2 subcomments

Does a bit worse than Opus 4.8 in my tests[0], but it's 5x cheaper and 3x slower.
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...

by croemer

0 subcomment

They should also at least run Opus through the same Pydantic harness they used for GLM. As is, it's apples vs pears.
Where's the cost per vulnerability for all the other models than GLM?
Also, without code this isn't very trustworthy. Could all be made up as well.

by mattmcdonagh

0 subcomment

GLM-5.2 suggests long-horizon agentic work is becoming open, cheap, and deployable.
What does that mean for the frontier?
https://lifeinthesingularity.com/p/glm-52-proves-ai-comes-fo...

by aubanel

0 subcomment

There's no question to me, after trying both, that Fable is much better than GLM-5.2 when left alone in front of hard coding tasks. Now maybe what plateaus is the human collaboration efficiency, because at some point it will be bottlenecked by the human.
Thus companies who still try to have humans perform intertwined work with their AI won't see an improvement, while the ones who find the right conditions to give their AI more free rein will see it.
Kind of like it's no use having a workhorse pull a combine harvester: at some point, when machines reach sufficient efficiency, you just give wheels to the harvester and let it run.

by xlii

0 subcomment

I switch from Codex to GLM 5.2 when I'm out of tokens. The main difference for me is time to completion.
GPT gets there <5 minutes, GLM 5.2 without context takes ~1H.
Though the harness makes a significant difference. On Pi GLM5.2 dreams for minutes, with OpenCode it's more on the point and gets to editing quicker.

by cmrdporcupine

1 subcomments

I like GLM 5.2... ish. It's ok.
I'd be mostly fine switching to it.
I just can't find a cost effective way to do that. z.AI's coding plan is both overpriced and unreliable. ollama's is also overpriced. Paying by the token for it on openrouter etc is more expensive than just having a Codex or Claude coding plan.
If you have to pay by the token, it's clearly cheaper. It's not competitive with a coding plan though.

by uluckydev

1 subcomments

I used Claude a lot, but with Claude Code it takes a lot of context window, and it's very pricey, to be honest. Then I shifted towards Minimax. I used the coding plan because it's cheaper, but it still gets the job done. When M3 came out, I started using it, and it was actually really good. After that, I shifted towards OpenCode for my AI agent, and that's been really good as well. The best thing I realized is that it uses less context, works better, and gives me access to a lot of different models from one place. I never actually used GLM, but I recently found QuanCode, which is amazing. I used it to build a full-stack application. Now I'm shifting my focus more toward SaaS distribution. I'm still figuring out how to automate different workflows, and using QuanCode has been really fast and effective for building those automations.

by admax88qqq

3 subcomments

> beats Claude in our Cyber Benchmarks
Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).
It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.

by tmach32

0 subcomment

I think one thing people are missing about this article is that they are arguing that the harness can make a bigger difference than the model. They aren't merely hyping GLM 5.2.

by brammertottens

0 subcomment

This is an interesting finding, but very specialised. It would also be great to get some more information about the benchmark. Is it just a collection of files with vulnerabilities, or are they hidden in a real codebase, where LLM-based approaches will not be able to scan every file like a static code scanner is able to do?

by dist-epoch

1 subcomments

Anthropic is saying other models were good at detecting vulnerabilities, where Mythos excelled was in creating functional exploits for them.
This article only talks about detecting vulnerabilities, so it's unclear if it's a true Mythos equivalent.

by childintime

1 subcomments

About running models locally and why data centers win (for now): they can stream the model weights to many neural engines at the same time, so each of these only needs enough RAM to hold the KV cache. So each engine is cheaper to operate, plus they are time-shared, resulting in massive wins for data centers.
So one can see businesses owning their own such cluster, next to their database infra, in the near future.

by theteapot

1 subcomments

> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...
What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better because it has access to more knowledge about vulns in these OSS projects, and possibly even knowledge of your own "prior research"?

by veselin

1 subcomments

Here, it appears they compare a single prompt "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.
Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?

by _cs2017_

0 subcomment

I don't feel the numbers without the harness are useful.
People will use the model with the harness. I know that harness may not be optimized to this model, but it's still more useful to see the numbers from an imperfect harness than from a no harness setup.

by rvz

1 subcomments

Many people here are now realizing that open weight models are now able to compete against frontier closed models.
This is where we are heading and why many closed labs are terrified of this affecting their bottom line and the reason why they want them banned from being released.

by gurjeet

1 subcomments

The text twice quotes Claude Code's F1 score as 32%, but the table shows the score is 37%. It's very likely that the actual score is 32% (because it is referenced 2 times, and a third time indirectly as the difference 'seven').
Oddly, this is a strong indication of the text being hand-written rather than LLM-assisted; it's very likely that a human made a mistake in creating the table.
```
  > ... beating Claude Code (32%) ...

  > ... GLM 5.2 ... beat Claude Code by seven points (39% vs. 32%).


  > Rank | Configuration           | Harness         | F1
  > ...
  > 4    | Claude Code (Opus 4.6)  | Claude Code SDK | 37%
```
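The arithmetic behind that inference can be sketched as a quick consistency check. The numbers are the ones quoted above; the variable and function names are made up for illustration:

```python
# Check which quoted Claude Code F1 score is consistent with the stated gap.
GLM_F1 = 39            # GLM 5.2 score quoted in the article text
STATED_GAP = 7         # "beat Claude Code by seven points"
CLAUDE_F1_TEXT = 32    # value quoted twice in the text
CLAUDE_F1_TABLE = 37   # value shown in the table


def is_consistent(winner: int, loser: int, gap: int) -> bool:
    """True when the two scores differ by exactly the stated gap."""
    return winner - loser == gap


text_ok = is_consistent(GLM_F1, CLAUDE_F1_TEXT, STATED_GAP)
table_ok = is_consistent(GLM_F1, CLAUDE_F1_TABLE, STATED_GAP)
print(f"text value consistent: {text_ok}, table value consistent: {table_ok}")
```

Only the 32% figure agrees with the seven-point gap; with 37% the gap would be two points.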

by _s_a_m_

3 subcomments

I tried GLM many times and it is bad, I have no clue what these people are talking about.

by 40four

1 subcomments

It’s hard to argue against the open weight models if your only concern is coding. Which, for many of us hackers here in this forum, it is.
But I would like to point out that the overwhelming majority of people using LLMs aren’t programmers, don’t care about coding, and couldn’t even be bothered to “vibe code”.
So we should consider the bias of the output of these open weight models, and what that looks like, outside of the context of writing code.

by synergy20

0 subcomment

but it's $160/month (unless you buy a one-year plan that gets cheaper), not too far from $200/month for Claude and Codex? why should I switch?

by theptip

0 subcomment

But… what effort level? “Opus 4.8” is a massive capability range. If you just ran it on medium that is a completely different result than vs. max.

by dvduval

0 subcomment

If it’s not quite as good as the hype yet, I expect it probably will be in the near future. To do a lot of the primary coding tasks needed for most situations, it’s probably gonna be good enough if it isn’t already. The harness will be there as well.

by kordlessagain

2 subcomments

You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8
After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.
Signup for GLM-5.2 here: https://z.ai

by blcknight

1 subcomments

Chinese models are almost certainly cheating on benchmarks, I would bet if you saw the training data that the benchmark canaries are in there.
GLM may be a good model in general but it's benchmaxxed and definitely not as good as Opus 4.8.

by flowghost_24

2 subcomments

I am using this with a workflow of Claude Code, Codex, Kimi and GLM, and the results are pretty astounding: almost 90% of the time Claude's findings and plans are overturned, with Claude's agreement.

by sidcool

2 subcomments

Genuinely curious. Say GLM 5.2 is better than Opus. But how does one go about using it by themselves?

by mohitpaddhariya

0 subcomment

open-weight models routinely match or even outperform previous-generation proprietary APIs

by g42gregory

2 subcomments

If only the "cybersecurity" crowd were focused on patching the vulnerabilities.
Instead of shilling for the LLM providers.

by ben8bit

1 subcomments

Definitely a +1 from me. I've really enjoyed using it via OpenCode/Zen. Not loving the pricing with OC so will probably switch to OpenRouter once my credits are done.

by Art9681

1 subcomments

This is because of the safeguards and not the model capabilities. If these folks signed up for the proper cyber service offered by Anthropic where refusals are removed then the open weight model wouldn't look as capable.

by chonghaoju

0 subcomment

Every agent run writes an audit record. Not for compliance theater — because when something breaks at 2am, you need to know exactly what happened and why.

by ni5arga

0 subcomment

> We ran a set of popular open-source models against our IDOR benchmark.
"our IDOR benchmark", there you go.

by mpfect

0 subcomment

Feeling proud of these open models. It's just that they need to focus on efficiency as well, especially in terms of size.

by jacomoRodriguez

0 subcomment

Which harness do you recommend to run coding tasks with GLM 5.2?
Any good resources about this (also for setup and recommended config)?

by lowbloodsugar

0 subcomment

Felt like I was reading advertising for their harness.

by lenerdenator

0 subcomment

The incentive to develop Claude further is to make money.
The incentive to develop these Chinese models further is to trash the business case of most American AI labs.

by spaceman_2020

0 subcomment

Opus 4.8 is genuinely one of the most frustrating models in casual use. It has a tendency to completely lose context in the middle of a conversation. It’s also too pedantic and nitpicky, and relies on language that’s way too specific to get any work done. I always end up being frustrated with it and revert to opus 4.6

by yieldcrv

0 subcomment

who is your favorite hosted GLM 5.2 provider? I'm looking for fastest tokens/sec and best cost
additionally, reliable API, because z.ai can be finicky
also, not for Enterprise use, but I like non-US providers, I don't care if the party happens to be the one reading my information and stealing my trade secrets, if they won't respond to a US subpoena

by tomerbd

0 subcomment

GLM 5.2 - Super Clear
GPT-5.5 - Super Smart
Auto/Composer - Super Fast (cursor)

by protonisafk

0 subcomment

It seems benchmarks keep changing and preferring the latest AI agent literally every time.

by johnnyAghands

0 subcomment

The title of the post on their blog is really misleading "We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks". Mythos (or Fable) isn't even benchmarked, and there's giant caveat literally at the bottom: "We have a caveat: This is one task, one dataset, one run."
I think the post is still informative, but a little disingenuous and clickbaity.

by Alien1Being

1 subcomments

The current US administration has gone a long way towards handing over leadership in AI to China.

by utunga

0 subcomment

Just popping in to say that no you can't use the word "tokenomics" to mean that. Argh.

by rbbydotdev

0 subcomment

Argh, agent benchmarks are so bad and can be gamed easier than bmw emissions tests.

by cake-rusk

0 subcomment

How do you run this thing? What kind of hardware do you need?

by mnauf

0 subcomment

exactly what I needed to hear!

by laybak

0 subcomment

how representative are Semgrep's benchmarks? everyone seems to have their own benchmark these days (guess it's good "content marketing") I'm honestly losing track

by bingemaker

1 subcomments

How do you run GLM? Are there any hosted services?

by dools

0 subcomment

I think Opus 4.8 is deliberately nobbled. Kimi k2.6 with Kimi code beats Opus models at finding vulnerabilities, even though it produces some false positives. When I give the same issues to Opus and ask it to verify, most of the time it concurs it’s a real issue, even though it failed to find the issue itself.

by questionreality

0 subcomment

hope open source continues to improve

by m3kw9

0 subcomment

There are 2 suspicious phrases: "Beats" and "our benchmarks"

by unnouinceput

0 subcomment

OK, half the article is on and on about harness and scaffolding and whatnot. I kept reading waiting for a benchmark where they give the same scaffolding to GLM like they did to Opus. Where is that one?

by TacticalCoder

1 subcomments

How to reconcile that with the recent, highly upvoted, article titled: "The gap between open weights LLMs and closed source LLMs"?
What explains it?
Is TFA lying? Is the most upvoted comment here lying?

by csjh

0 subcomment

I found it to spiral into complete nonsense a few times when I tested it out, but it's possible that was a bug in the provider

by rode1974

2 subcomments

Hopefully i get a macbook pro soon enough to run some small or medium sized LLMs

by BikiniPrince

1 subcomments

This is a joke right? I wouldn't install this in a sandbox.