FRESH

Hacker News

Home

ARC-AGI-3

491 points by lairv

by Tiberium

10 subcomments

https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):
- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving and you don't compare the score against a human average but against the second best human solution
- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% ((10/100)^2)
- 100% just means that all levels are solvable. The 1% number uses uses completely different and extremely skewed scoring based on the 2nd best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (ik not quite right). If the median human takes 1.5x more steps than your 2nd fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom 10% guy, who maybe solves 30% of levels, but they take 3x more steps to solve it. this guy would get a score of 3%
- The scoring is designed so that even if AI performs on a human level it will score below 100%
- No harness at all and very simplistic prompt
- Models can't use more than 5X the steps that a human used
- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"

by BeetleB

14 subcomments

> As long as there is a gap between AI and human learning, we do not have AGI.
Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.
One AI researcher's quote stood out to me:
"It's silly to say airplanes don't fly because they don't flap their wings the way birds do."
He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

by jwpapi

4 subcomments

This is a very good estimation of AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games.
I really wonder why so many people fight against this. We know that AI is useful, we know that AI is researchful, but we want to know if they are what we vaguely define as intelligence.
I’ve read the airplanes don’t use wings, or submarines don’t swim. Yes, but this is is not the question. I suggest everyone coming up with these comparisons to check their biases, because this is about Artificial General Intelligence.
General is the keyword here, this is what ARC is trying to measure. If it’s useful or not. Isn’t the point. If AI after testing is useful or not isn’t the point either.
This so far has been the best test.
And I also recommend people to ask AI about specialized questions deep in your job you know the answer to and see how often the solution is wrong. I would guess it’s more likely that we perceive knowledge as intelligence than missing intelligence. Probably commom amongst humans as well.

by typs

7 subcomments

My takeaway from playing a number of levels is that I am definitely not AGI

by Real_Egor

2 subcomments

I'll probably be the skeptic here, but:
- Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.
- BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM.
As soon as models are "natively" trained on a massive dataset of these types of games, they'll easily adapt and start crushing these challenges.
This is not AGI at all.

by Zedseayou

1 subcomments

I was a human tester (I think) for this set of games. I did 25 games in the 90 minutes allotted. IIRC the instructions did mention to minimize action count but the incentives/setup ($5 per game solved) pushed for solve speed over action count. I do recall trying to not just randomly move around while thinking but that was not the primary goal, so I would expect that the baseline for the human solutions have more actions than might otherwise be needed.

by lukev

2 subcomments

I'm not sure how this relates to AGI.
This measures the ability of a LLM to succeed in a certain class of games. Sure, that could be a valuable metric on how powerful (or even generally powerful) a LLM is.
Humans may or may not be good at the same class of games.
We know there exists a class of games (including most human games like checkers/chess/go) that computers (not LLMs!) already vastly outpace humans.
So the argument for whether a LLM is "AGI" or not should not be whether a LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that.)
Seems unlikely that this set of games is a definition meaningful for any practical, philosophical or business application?

by MadxX79

2 subcomments

Same question I have for all these benchmarks:
What's going to stop e.g. OpenAI from hiring a bunch of teenagers to play these games non-stop for a month and annotate the game with their logic for deriving the rules, generate a data set based on those playthroughs and fine tuning the next version of chatgpt on all those playthroughs?

by mvkel

1 subcomments

Was just at the YC launch event for this. Haven't felt this much inspiration in a while. Incredible minds confronting on tech that will change our society.
I met a guy who, for fun, started working on ARC2, and as he got the number to go up in the eval, a novel way to more efficiently move a robotic arm emerged. All that to say: chasing evals per se can have tangible real world benefits.
Talking to the ARC folks tonight, it sounds like there will be an ARC-4,5,6,etc. I mean of course there will be.
But with them will be an increasing expectation that these models can eventually figure things out with zero context, and zero pretraining; you drop a brain into any problem and it'll figure out how to dig its way out.
That's really exciting.

by culi

0 subcomment

The thing I most appreciate about the ARC-AGI leaderboards is how the graph also takes into account cost per task. All of the recent major advancements in benchmarks seem a little less impressive when also taking into account the massive rise in cost they're paired with. The fact is we can always get a little bit better output if we're willing to use more electricity

by Stevvo

4 subcomments

Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.

by strongpigeon

0 subcomment

This is a good and clever benchmark and a worthy successor to the previous two. That being said, I find that the "No tools" approach is a bit odd. They're basically saying that it's OK to have tools as long as they're hidden behind the API layer. Isn't this an odd line to draw?
It feels like it should be about having no ARC-AGI-3-specific tools, not "no not-built-in-tool"...

by cedws

2 subcomments

It's like playing The Witness. Somebody should set LLMs loose on that.

by vessenes

0 subcomment

I’m not a Chollet booster. Well, I might be a little bit of one in that I admire his persistence.
I really like these puzzles. There’s a lot to them both in design and scoring — models trained to do well on these are going to be genuinely much more useful, so I’m excited about it. As opposed to -1 and -2, to do well at these, you need to be able to do:
- Visual reasoning
- Path planning (and some fairly long paths)
- Mouse/screen interaction
- color and shape analysis
- cross-context learning/remembering
Probably more, I only did like five or six of these. We really want models that are good at all this; it covers a lot of what current agentic loops are super weak at. So I hope M. Chollet is successful at getting frontier labs to put a billion or so into training for these.

by visarga

0 subcomment

ARC is trying to isolate a unitary intelligence signal, so it strips away coordination, specialization, and division of labor. But that also means it removes one of the dominant mechanisms by which intelligence actually scales in the real world. Their view on intelligence implicitly treats redundancy as necessary - one agent must do them all - and treats efficiency as something achieved internally rather than through restructuring the system. At the very least they should create environments that can help an agent compound intelligence, to self amplify, support itself, that is not happening in ARC.
Anyone wondered if ARC is a measure of intelligence or just a collection of hand picked tasks? was there a proof they encode anything meaningful about intelligence in such short tasks in miniature environments? One shot intelligence?

by tantalor

2 subcomments

The controls just feel really bad. The inputs are too small, and there is way too much lag.

by ranyume

0 subcomment

This is an interesting update. And a big challenge for companies and labs. The new tools for measurement are indeed what I'd like out of future agents, and agents that solve the games will need to use different subsystems to do so. This is basically optimization for achieving goals (as opposed to prompt engineering / magic spells to make the LLM do what is told to do) which imo is the future we should aspire to build.

by arjie

0 subcomment

Perhaps actual AGI will be when the models create ARC-HGI-1 to test if humans have general intelligence.

by andai

0 subcomment

In the year 2032: ARC-AGI-13: Almost definitely AGI this time!

by baron816

0 subcomment

Looks like I’m generally unintelligent

by WarmWash

0 subcomment

Captcha's about to get wild.
Maybe the internet will briefly go back to a place mainly populated with outliers.

by aogaili

0 subcomment

honestly the most interesting thing about ARC-AGI-3 isn't the 0.25% scores everyone is doomposting about. it's the Duke harness result.
if you give Opus just three generic tools (READ, GREP, BASH with Python) and literally zero game-specific help, it completes all three preview games in 1,069 actions. for comparison, humans do it in like ~900. that's actually insane. it writes its own BFS, builds a grid parser from scratch, and even solves a Lights Out puzzle with Gaussian elimination. all on its own.
i really think the benchmark is testing two different things and just smashing them together. can the model reason about novel interactive environments? yeah, clearly it can. can it do spatial reasoning over a 64x64 grid from raw JSON with zero tools? no. but then again, neither can a human if you ripped out their visual cortex lol.
humans come "pre-installed" with specialized subsystems for this exact stuff: a visual cortex for spatial perception, a hippocampus for persistent memory, etc. these aren't "tools" in Chollet's framing but they're basically identical to what the Duke harness provides. the model is just building its own version of those (Python for the cortex, grep for memory). it just needs the permission to build them.
the real gap the Duke team found isn't perception or memory anyway, it is hypothesis quality. some runs solve vc33 in 441 actions, others just plateau past 1,500. the variance is just down to whether the model commits early to the right explanation of how the game works. that's a way more interesting and targetable finding than just saying "frontier models score below 1%."
Chollet is probably right philsophically that AGI should handle any input format without help. but reporting 0.25% when the actual reasoning gap is in hypothesis formation (not spatial perception) makes the benchmark a way worse progress indicator than it could be imo.

by convexly

0 subcomment

My issue with AGI benchmarks is you can never tell if you're measuring actual capability or just how much the training data overlapped with the test.

by abraxas

0 subcomment

Even if tomorrow's models get good enough to complete these games we won't be able to proclaim AGI. In the realm of silly computer games alone I'm going on record saying that there are plenty of 8 bit games that AIs will trip on even when this benchmark is crushed. 2D platformers like Manic Miner or Mario need skills that none of these games appear to capture.

by spprashant

3 subcomments

I played the demo, but it definitely took me a minute to grok the rules.
I don't know if this is how we want to measure AGI.
In general I believe the we should probably stop this pursuit for human equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things, lets focus on how we can augment and empower the existing workforce.

by levmiseri

0 subcomment

For a loosely similar 'benchmark', I recently tried to test major LLMs on my coding game (models write code controlling their units in a 1v1 RTS) - https://yare.io/ai-arena

by j1000

0 subcomment

I feel like AGI test would be sense of humor. Somehow I cannot force any LLM to output any even normal level joke.

0 subcomment

by jesse_dot_id

2 subcomments

At this point, I'm pretty sure we'll just know when it happens.

by semiinfinitely

2 subcomments

i feel bad that we make the LLMs play this

by OsrsNeedsf2P

6 subcomments

Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25

by chaise

1 subcomments

The official leaderboard for ARC-AGI-3 for current LLMs : https://arcprize.org/leaderboard (yous should select the 3th leaderboard)
CRAZY 0.1% in average lmao

by nick49488171

0 subcomment

Arc AGI 4 can be Chip's Challenge!

by NiloCK

0 subcomment

I hope at least some of these are direct Chip's Challenge ports. Waiting for some old muscle memory to kick in here.

by EternalFury

0 subcomment

The real question is: Can it be generated using programs? If it can be, then LLMs will eventually monkey type these programs.

by largbae

0 subcomment

I feel like we've got tunnel vision. Things you can do on a computer are a tiny subset of what a human can do.
If the AI has to control a body to sit on a couch and play this game on a laptop that would be a step in the right direction.

by jmkni

0 subcomment

ok clearly I'm a robot because I can't figure out wtf to do

by Wintamute

0 subcomment

Unplayably laggy on an iPhone. Sad people can’t produce a performant experience that a ZX81 could have eaten for breakfast, on a relative super computer

by Geee

0 subcomment

Would be fun to play but the controls are janky.

by 6thbit

2 subcomments

Not clear to me the diff with v2?

by k2xl

1 subcomments

I submitted puzzle game Pathology (https://thinky.gg) for ARC Prize 3. Sad to see didn’t hear back from the committee.
It is a simple game with simple rules that solvers have an incredibly difficult time solving compared to humans at a certain level. Solutions are easy to validate but hard to find.

by CamperBob2

3 subcomments

Without reading the .pdf, I tried the first game it gave me, at https://arcprize.org/tasks/ls20, and I couldn't begin to guess what I was supposed to do. Not sure what this benchmark is supposed to prove.
Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.

by dinkblam

8 subcomments

what is the evidence that being able to play games equates to AGI?

by vonneumannstan

0 subcomment

>As long as there is a gap between AI and human learning, we do not have AGI.
This is an absurd constraint. You could have a vastly superhuman AI that doesn't learn as efficiently as a human and it would not pass this definition while it simultaneously goes on to colonize the galaxy...

by vonneumannstan

1 subcomments

It's getting pretty old now when Francois Chollet puts out a new ARC challenge, claims definitively that no system is going to crack it without being full blown AGI, the benchmark gets saturated in a few months, he claims the systems definitely aren't AGI then puts out a new challenge that no non AGI system can clear and a few months later.... etc. etc.

by baalimago

0 subcomment

You can tell it's an AI by it not becoming utterly by playing the "game". I could personally not stand any more than the first level.

by nubg

1 subcomments

Any benchmarks?

by saberience

0 subcomment

So this is another ARC-"AGI" benchmark which is again designed around using eyesight for LLMs which are trained to be great at text, what is the point?
Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.
The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.
Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where we have LLMs try and beat Myst or Riven? Will we have AGI then?
This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.

by maxothex

0 subcomment

[dead]

by pugchat

0 subcomment

[dead]

by ryguz

0 subcomment

[dead]

by diablevv

0 subcomment

[dead]

by sstart

0 subcomment

[dead]

by hikaru_ai

0 subcomment

[dead]

by tasuki

7 subcomments

So ARC-AGI was released in 2019. That's been solved, then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057 ?

by 3836293648

0 subcomment

Ew. Cool demo, what idiot thought it was ok to have a half second cooldown between inputs? If I hit up three times I should move up three steps, not two steps because I pressed too quickly.

by elAhmo

0 subcomment

I find it quite funny that we are still debating whether models are intelligent or not, while we know they are just statistical models.
Even with billions of dollars spent on training, we had this situation a few weeks ago where models were suggesting to walk instead of drive to a car wash in case you want to wash your car. While a 3 year old would know the answer to the question. And yet, we are designing elaborate tests to 'show whether AGI is here it not', while being fully aware of what these models represent under the hood.