Hacker News

MTG Bench: Testing how well LLMs can play Magic

63 points by CallumFerg

by derac

0 subcomment

I think running them against each other with a rules engine would be more interesting. Count up illegal moves, wins, and unfinished games. I think LLM grading is too unreliable.
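
A minimal sketch of what that tally could look like, assuming the harness emits one record per game. The `GameLog` shape and its field names are hypothetical, not taken from the benchmark or from any particular rules engine:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class GameLog:
    """Hypothetical per-game record produced by a rules-engine harness."""
    players: Tuple[str, str]        # names of the two models
    illegal_moves: Dict[str, int]   # player -> moves the engine rejected
    winner: Optional[str]           # None means the game never finished


def summarize(logs):
    """Aggregate wins, rejected moves, and unfinished games across logs."""
    wins, illegal, unfinished = Counter(), Counter(), 0
    for log in logs:
        illegal.update(log.illegal_moves)  # adds counts per player
        if log.winner is None:
            unfinished += 1
        else:
            wins[log.winner] += 1
    return {
        "wins": dict(wins),
        "illegal_moves": dict(illegal),
        "unfinished": unfinished,
    }
```

Because the engine decides legality and the winner, none of these numbers depend on a grading model.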

by josh_p

3 subcomments

I know the author specifically did not use a rules engine in their simulation because they were unsure how it would affect the results.
I do still wonder whether adapting something like Card Forge for LLM use would result in engaging gameplay with an LLM.
https://github.com/Card-Forge/forge

by OsrsNeedsf2P

0 subcomment

I love obscure benchmarks, and I feel like I can trust their results a lot more. After all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play RuneScape).
[0] https://maxbittker.github.io/runebench/

by jdmoreira

0 subcomment

I have a version of this where I have the LLMs play the duel decks "Elves vs. Goblins" against each other using XMage as a rules engine.
Unfortunately it gets really expensive to run, even with some optimizations for the context.
I can only afford to play them with the DeepSeek models. They make serious blunders sometimes. This is not an easy "harness" to build, and I don't have the time or disposable cash to work on it. I think a lot of work could still be done on improving it and testing better models.
It would make an amazing "arena" bench. There are plenty more duel decks that are well balanced against each other.
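
For the "arena" part, one common way to turn head-to-head results into a ranking is an Elo-style update. This is only a sketch of the standard Elo formula, not something the harness above is known to do; the K-factor of 32 and the function names are assumptions chosen for illustration:

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

With two models both starting at 1500, a win moves the winner to 1516 and the loser to 1484. Unfinished games would need their own policy (skip them, or score them as draws).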

by OwenCR

2 subcomments

Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!
I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: This hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.
This project is cool though, props for making it!

by lavaman131

1 subcomments

This is a really interesting benchmark, and a timely one given that a lot of existing benchmarks don't do a good job. The mechanics and edge cases seem notoriously difficult to parse, perhaps even for human players. Have you also been plugging these into newer reasoning models to see how providing them with thinking time improves their win rate against the baseline?

by alasdair_

0 subcomment

I wrote a rules engine in Rust along with an MCTS-based reinforcement learning system to play decks against each other. It can handle aggro decks well enough, but complex combo decks like Amulet Titan are tough to get working without expert demos or reward hacking.
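
For readers unfamiliar with MCTS, the selection step is often done with the UCB1 rule. The sketch below is in Python rather than Rust and is a generic illustration, not the system described above; the `(total_reward, visits)` layout and the exploration constant are assumptions:

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited actions first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the action with the highest UCB1 score.

    children maps an action name to (total_reward, visits).
    """
    return max(children, key=lambda a: ucb1(*children[a], parent_visits))
```

Long combo lines are hard for this kind of search because the reward only arrives after many specific actions in a row, so the exploration bonus rarely steers rollouts down the full sequence.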

by dash2

1 subcomments

You don't explain how scoring works; maybe it's obvious to MTG players? If you're using GPT 5.5, is there a possibility that it is biased in favour of models that think the way it does?

by jmccaf

1 subcomments

Awesome! Does this use https://mage-bench.com/ or is it a separate project? I recently ran 4 local models in a tournament with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.

by purple-leafy

0 subcomment

Benchmarks like this are onto something. This is the next frontier of LLM benchmarking.

by thurn

0 subcomment

To clarify, the more accurate description would be "Testing how well LLMs can follow the rules of Magic", right? There is no actual evaluation of how "well" they are playing?

by danbrooks

0 subcomment

Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.

by pilord314

1 subcomments

They should randomize games of Judge Tower and see who wins:
https://mtg.fandom.com/wiki/Judge_Tower

by TZubiri

1 subcomments

Looking forward to this metric being Goodhart lawed.
Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.