
Hacker News

Show HN: Watch LLMs play 21,000 hands of Poker

35 points by jazarwil

by tcpais

1 subcomments

Finally, a way to settle the model wars that actually matters: Texas Hold'em. That 3D replay view is sick! ♠♦ I spent way too long watching the replay on Game 2a58900d. It’s wild to see the chain of thought mapped against the betting rounds. It really exposes when a model is hallucinating a strong hand versus actually calculating pot odds. This 'PokerBench' might actually become the standard for measuring agentic risk-taking.

by tanvach

1 subcomments

People are reading into this a little too much; it looks to me like a random walk. You should try rerunning the trial (or running several in parallel) and see if the ranking is robust.
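
To make the random-walk concern concrete, here is a minimal sketch that simulates players who are identical by construction and counts how often each one finishes on top across repeated trials. Everything in it is an assumption for illustration (the function names, the one-chip pots, the player and hand counts); it is not the benchmark's actual setup.

```python
import random


def simulate_final_stacks(num_players, num_hands, seed):
    """Play num_hands hands among equally skilled players.

    Each hand, one random player wins a single chip from one random other
    player, so every final ranking is pure luck by construction.
    """
    rng = random.Random(seed)
    stacks = [0] * num_players
    for _ in range(num_hands):
        winner, loser = rng.sample(range(num_players), 2)
        stacks[winner] += 1
        stacks[loser] -= 1
    return stacks


def leader_counts(num_players=4, num_hands=1000, num_trials=200):
    """Count how often each seat ends a trial with the largest stack."""
    counts = [0] * num_players
    for trial in range(num_trials):
        stacks = simulate_final_stacks(num_players, num_hands, seed=trial)
        # Ties go to the lowest seat index; fine for a rough illustration.
        counts[stacks.index(max(stacks))] += 1
    return counts


if __name__ == "__main__":
    print(leader_counts())
```

If a single run of the real benchmark produces a ranking that looks no more stable than this kind of no-skill baseline, the ranking is probably noise. Rerunning with different seeds is the check being suggested here.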

by alalani1

1 subcomments

Do you have any idea why the win rate for GPT-5.2 is higher than Gemini 3 Flash yet the former loses money while the latter earns money? Is it just bet sizing (betting more when it has a good hand) or something else?
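
A quick sketch of how bet sizing alone could produce this: win rate counts hands, while profit weights each hand by the chips at stake. The two result lists below are made-up numbers, not data from the benchmark, and `summarize` is a hypothetical helper.

```python
def summarize(results):
    """Return (win_rate, net_chips) for a list of per-hand chip results."""
    wins = sum(1 for r in results if r > 0)
    return wins / len(results), sum(results)


# Hypothetical player A: wins many small pots, loses a few big ones.
many_small_wins = [10] * 7 + [-40] * 3

# Hypothetical player B: folds or loses small often, wins a few big pots.
few_big_wins = [-10] * 7 + [60] * 3

if __name__ == "__main__":
    print(summarize(many_small_wins))  # 70% of hands won, net -50 chips
    print(summarize(few_big_wins))     # 30% of hands won, net +110 chips
```

So a higher win rate with a negative balance is consistent with winning small and losing big, though the sketch cannot say whether that is what actually happened here.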

by alfonsodev

0 subcomment

Really cool. I’m curious how it would compare against a deterministic bot that uses probability tables.
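
For reference, a deterministic baseline of that kind could be as simple as a lookup table plus a pot-odds rule. This is only a sketch under loud assumptions: the hand classes and equity values in `EQUITY_TABLE` are placeholder numbers, not real poker statistics, and the function names are invented for illustration.

```python
# Placeholder equities (probability of winning at showdown) per hand class.
# These values are assumptions for illustration, not measured probabilities.
EQUITY_TABLE = {
    "premium": 0.80,
    "strong": 0.60,
    "marginal": 0.40,
    "weak": 0.20,
}


def pot_odds(pot, to_call):
    """Fraction of the final pot that the caller has to put in.

    `pot` is the pot size before the call, including the opponent's bet.
    """
    return to_call / (pot + to_call)


def decide(hand_class, pot, to_call):
    """Call only when the table equity beats the price offered by the pot."""
    if to_call == 0:
        return "check"
    if EQUITY_TABLE[hand_class] > pot_odds(pot, to_call):
        return "call"
    return "fold"


if __name__ == "__main__":
    # Facing 50 into a pot of 100 means 50 / 150, about 33% equity needed.
    print(decide("marginal", pot=100, to_call=50))
    print(decide("weak", pot=100, to_call=50))
```

A bot like this never bluffs or raises, so it is a floor rather than a strong opponent, but it gives the LLM results something fixed to be measured against.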

by Onavo

1 subcomments

What about the open-source models? I remember DeepSeek performed pretty well in the trading benchmarks.

by VK-pro

1 subcomments

Very, very fun. Just glancing at this quickly at lunch, but is there any plan to incorporate tool use?

by falloutx

1 subcomments

Fun. Any idea what the cost per game would be? I am worried 160 isn't a big enough sample size.

by thorawaytrav

1 subcomments