Hacker News

Lambda Calculus Benchmark for AI

142 points by marvinborner

by NitpickLawyer

7 subcomments

New, unbenched problems are really the only way to differentiate the models, and every time I see one it's along the same lines. Models from top labs are neck and neck, and the rest of the bunch are nowhere near. Should kinda calm down the "opus killer" marketing that we've seen these past few months every time a new model releases, esp. the small ones from China.
It's funny that even one of the strongest research labs in China (deepseek) has said there's still a gap to opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27b dense model is "as good as opus"...
I'm a huge fan of local models, have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same as with other "cheap" "opus killers" from China. Some work, some look like they work, but they go haywire at the first contact with a real, non-benchmarked task.
[1] - https://x.com/julien_c/status/2047647522173104145

by tromp

1 subcomments

The corresponding repo https://github.com/VictorTaelin/LamBench describes this as:

    λ-bench
    A benchmark of 120 pure lambda calculus programming problems for AI models.
    → Live results
    What is this?
    λ-bench evaluates how well AI models can implement algorithms using pure lambda calculus. Each problem asks the model to write a program in Lamb, a minimal lambda calculus language, using λ-encodings of data structures to implement a specific algorithm.
    The model receives a problem description, data encoding specification, and test cases. It must return a single .lam program that defines @main. The program is then tested against all input/output pairs — if every test passes, the problem is solved.

"Live results" wrongly links to https://victortaelin.github.io/LamBench/ rather than the correct https://victortaelin.github.io/lambench/

An example task (writing a lambda calculus evaluator) can be seen at https://github.com/VictorTaelin/lambench/blob/main/tsk/algo_...

Curiously, gpt-5.5 is noticeably worse than gpt-5.4, and opus-4.7 is slightly worse than opus-4.6.
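
For readers unfamiliar with the "λ-encodings of data structures" mentioned in the README quote above, here is a minimal sketch of Church numerals. It uses Python lambdas as a stand-in and is not Lamb syntax, which is not shown in this thread. All names (`zero`, `succ`, `add`, `mul`, `to_int`) are illustrative assumptions, not taken from the benchmark.

```python
# Church numerals: the number n is the function that applies f to x n times.
# Python lambdas stand in for lambda calculus terms; this is NOT Lamb syntax.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition: apply f m times on top of applying it n times.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication: repeat "apply f n times" m times.
mul = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Decode a Church numeral by counting applications (helper for inspection only)."""
    return n(lambda k: k + 1)(0)


two = succ(succ(zero))
three = succ(two)

print(to_int(add(two)(three)))  # 5
print(to_int(mul(two)(three)))  # 6
```

Even plain arithmetic has to be built from function application alone, which gives a rough sense of why algorithms that are short in a conventional language can become long once every data type must be encoded this way.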

by dataviz1000

2 subcomments

lambench is single-attempt, one shot per problem.
I don't think they understand how LLMs work. To truly benchmark a non-deterministic, probabilistic model, they are going to need to run each problem about 45 times. LLMs are distributions and behave accordingly.
The better story is how the models behave on the same problem after 5 samples, 15 samples, and 45 samples.
That said, using lambda calculus is a brilliant subject for benchmarking.
The models are reliably incorrect. [0]
[0] https://adamsohn.com/reliably-incorrect/
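
One common way to summarise multi-sample runs like the ones described above is the pass@k metric: given n samples of a problem of which c passed, estimate the probability that at least one of k randomly drawn samples passes. The sketch below uses the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k). The function name and the example counts are assumptions for illustration and do not come from lambench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed.

    Computes 1 - C(n-c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all failures.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: one problem sampled 45 times with 9 passing runs.
for k in (1, 5, 15, 45):
    print(k, pass_at_k(45, 9, k))
```

Reporting the curve over several k values separates a model that solves a problem occasionally from one that solves it reliably, which a single attempt cannot show.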

by internet_points

0 subcomment

Would love to see where the mistral stuff lands.
Also, being from Victor Taelin, shouldn't this be benching Interaction Combinators? :)

by maciejzj

1 subcomments

Can anyone more familiar with lambda calculus speculate why all models fail to implement FFT? There are a gazillion FFT implementations in various languages over the web, and the actual Cooley-Tukey algorithm is rather short.
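
For reference, a minimal sketch of the recursive radix-2 Cooley-Tukey FFT in Python, to show how short the algorithm is in a conventional language. It assumes the input length is a power of two. How the lambench task encodes numbers is not stated in this thread, so nothing here reflects the benchmark's actual specification.

```python
import cmath


def fft(xs):
    """Recursive radix-2 Cooley-Tukey FFT; len(xs) must be a power of two."""
    n = len(xs)
    if n == 0:
        raise ValueError("input must be non-empty")
    if n == 1:
        return [complex(xs[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(xs[0::2])  # transform of even-indexed samples
    odd = fft(xs[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor e^(-2*pi*i*k/n) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


print(fft([1, 1, 1, 1]))
```

Note that even this short version leans on built-in complex numbers, floating-point arithmetic, and list slicing. In pure lambda calculus each of those would have to be encoded by hand, which may be one factor in the difficulty, though that is speculation.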

by flashdesk

0 subcomment

I like this kind of benchmark, especially since it uses problems that are harder to overfit to.
That said, single-attempt results are a bit hard to read into. For anything code-like, things like retries, test feedback, or just letting the model iterate tend to change the outcome quite a bit.

by cmrdporcupine

1 subcomments

Odd to see GPT 5.5 behind 5.4?

by jakeinsdca

0 subcomment

codex 5.5 is worse than 5.4 but 10x faster?

by the_data_nerd

1 subcomments

[flagged]

by vijgaurav

0 subcomment

[dead]