It's funny that even one the strongest research labs in china (deepseek) has said there's still a gap to opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27b dense model is "as good as opus"...
I'm a huge fan of local models, have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same as with other "cheap", "opus killers" from china. Some work, some look like they work, but they go haywire at the first contact with a real, non benchmarked task.
λ-bench
A benchmark of 120 pure lambda calculus programming problems for AI models.
→ Live results
What is this?
λ-bench evaluates how well AI models can implement algorithms using pure lambda calculus. Each problem asks the model to write a program in Lamb, a minimal lambda calculus language, using λ-encodings of data structures to implement a specific algorithm.
The model receives a problem description, data encoding specification, and test cases. It must return a single .lam program that defines @main. The program is then tested against all input/output pairs — if every test passes, the problem is solved.
"Live results" wrongly links to https://victortaelin.github.io/LamBench/ rather than the correct https://victortaelin.github.io/lambench/An example task (writing a lambda calculus evaluator) can be seen at https://github.com/VictorTaelin/lambench/blob/main/tsk/algo_...
Curiously, gpt-5.5 is noticeably worse than gpt-5.4, and opus-4.7 is slightly worse than opus-4.6.
I don't think they understand how the LLM models work. To truly benchmark a non-deterministic probabilistic model, they are going to need to run each about 45 times. LLM models are distributions and behave accordingly.
The better story is how do the models behave on the same problem after 5 samples, 15 samples, and 45 samples.
That said, using lambda calculus is a brilliant subject for benchmarking.
The models are reliably incorrect. [0]
Also, being from Victor Taelin, shouldn't this be benching Interaction Combinators? :)
That said, single-attempt results are a bit hard to read into. For anything code-like, things like retries, test feedback, or just letting the model iterate tend to change the outcome quite a bit.