Note that this uses a harness so it doesn't qualify for the official ARC-AGI-3 leaderboard
According to the authors the harness isn't ARC-AGI specific though https://x.com/agenticasdk/status/2037335806264971461
It is 100% ARC-AGI-3 specific though, just read through the prompts https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
What a dick move. Making that prompt open source will probably mean that every other model that doesn't want to cheat will scrape that and accidentally cheat in the next models.
(disclaimer: i worked on early versions of agentica_sdk; but wasn't involved in recent developments and the ARC solver)
As other comments point out, this is about harness development and harness efficiency. Agentica SDK is a sort of meta harness that makes things easy: plug any "internal API" (as defined natively in your codebase) directly into your agent. Agentica SDK itself is not application specific; but the APIs of your application are... application specific.
Re: the linked prompt. A harness is a set of tools and descriptions of how best to use those tools, and sometimes some external control flow based on the outcome of using those tools. How to "best use the tools" should always be part of the prompt (like in this case).
So this work tries to answer: "short of telling the agent any solutions, make a simple but efficient API to play the games, hand it to the agent, and see how it does". In the world of harness development I think that's an interesting question to answer!
>In the world of harness development I think that's an interesting question to answer!
The challenge isn't about harness development though, and a sufficiently complex harness can solve these tasks rather easily.
And presenting it as if you've made a novel development for solving ARC-AGI-3 leads me to believe you're willing to waste all of our time for your benefit at every step in the future.
> a sufficiently complex harness can solve these tasks rather easily.
I claim this is not so easily done, and earlier iterations of ARC-AGI did not have the constraint in the first place. You want something that generalizes across all puzzles (hopefully even the private ones), and these puzzles are extremely diverse ... and hard; telling the model the controls and some basic guidelines for the game is the only "obvious" thing you can do.
The other point of my reply was efficiency, both in terms of creating and using the harness; the discussed solution is something that anyone (in fact, likely even an LLM itself) can cook up in a few minutes; it's not much more than a game control wrapper so the agent can play around with the game in live python and some generalities as laid out in the prompt.
(But I'm always happy to be proven wrong. What harnesses did you have in mind?)
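To make "a game control wrapper so the agent can play around with the game in live python" concrete, here is a minimal sketch. Everything in it is hypothetical: `ToyGame` and `GameWrapper` are made-up names for illustration and have nothing to do with the actual ARC-AGI-3 API or the Agentica code.

```python
from typing import Dict, List, Tuple


class ToyGame:
    """Hypothetical 1-D game: move a marker from position 0 to the goal."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.pos = 0

    def step(self, action: str) -> Tuple[int, bool]:
        if action not in ("left", "right"):
            raise ValueError(f"unknown action: {action}")
        self.pos += 1 if action == "right" else -1
        return self.pos, self.pos == self.goal


class GameWrapper:
    """Thin control layer an agent could call from a live Python session."""

    def __init__(self, game: ToyGame) -> None:
        self.game = game
        self.history: List[Tuple[str, int]] = []

    def act(self, action: str) -> Dict[str, object]:
        # Forward the action, record it, and return a plain observation dict.
        pos, won = self.game.step(action)
        self.history.append((action, pos))
        return {"position": pos, "won": won, "actions_used": len(self.history)}


wrapper = GameWrapper(ToyGame(goal=3))
observation = {}
for _ in range(3):
    observation = wrapper.act("right")
print(observation)
```

The wrapper knows the controls and keeps a history, but it says nothing about how to win; that part is left to whoever (or whatever) is calling `act`.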
this is so disingenuous on symbolica's part. these insincere announcements just make it harder for genuine attempts and novel ideas
Um, yes, this is an extremely specific benchmark harness. It has a ton of knowledge encoded about the tasks at hand. The tweet is dishonest even in the best light.
The hard part of these tests isn't purely reasoning ability ffs.
> this uses a harness
This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.
Right, fair, but look at the prompt. For the purpose of testing general intelligence, this seems kind of pointless.
It isn't arbitrary. They want to measure the capability of the general LLM.
So if I say "I want to measure your capability as a mechanic" but then also "to ensure an accurate score you're forbidden to use any tools" how are you the human mechanic planning to diagnose and fix the engine problem without wrenches and jack stands and the like? It makes no sense.
That said their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and above all pointless hoop jumping but that doesn't make the linked "achievement" any less fraudulent.
It is more like restricting the mechanic to only using commercially available tools and not allowing them to create CUSTOM tools.
No, that would be analogous to disallowing customized harnesses, ie tooling specially crafted by someone else for the specific task at hand. Insisting that an LLM solve something without the ability to make use of any external tooling whatsoever is almost perfectly analogous to insisting that a human mechanic work on a car with nothing but his own bare hands.
The wrench is to the mechanic as the stock python repl is to the LLM.
They want the LLM that does the ARC-AGI-3 to be the same LLM that everyone uses.
Rephrase that in terms of the human mechanic and hopefully you can see the error of that reasoning. LLMs that perform tasks (as opposed to merely holding conversations) use tools just like we do. That's literally how we design them to operate.
In fact the LLMs that everyone uses today typically have access to specialized task specific tooling. Obviously specialized tools aren't appropriate for a test that measures the ability to generalize but generic tools are par for the course. Writing a bot to play a game for you would certainly serve to demonstrate an understanding of the task.
I'm pretty sure the LLM can use tools while doing arc-agi-3, but it has to have the same tools available all the time, not an incredibly elaborate custom harness.
To quote someone else from upthread, tool use requires a harness. Without one an LLM as commonly understood is a bare model that receives inputs and directly produces outputs the same as talking to an unaided person.
Then the LLM has to write the harness.
I'd like to suggest that prior to expressing disagreement you really ought to reread the comment you're replying to and make sure your understanding is correct.
Quoting this for the second time now - tool use requires a harness.
Without a harness the LLM has no ability to interact with the world. It has no agency. It's just spitting out text (or whatever else) into the void. There's no programming tools, no filesystem, no shell, nothing.
And by the rules of arc-agi-3 the LLM will have to write any harness it needs. I'm not sure what we are even arguing about at this point.
Doesn't the chat version of chatgpt or gemini also have interleaved tool calls, so do those also count as with harnesses?
A harness is fine. I think people here are arguing that what was provided here to take the test is not a harness.
We're calling agents harnesses now?
The point of this test is to check if an AI system can figure out the game. This isn't what happened here. A human figured out the game, wrote in their prompts exactly how the game works and THEN put the AI on the problem. This is 100% cheating and imo quite stupid.
The harness would be fine if the agent coded its own harness in a controlled environment while observing the game.
Not sure if the specific rules of this prize allow that, but I would accept that
ELI5 what is a harness?
EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:
> We seek to fight two forms of overfitting that would muddy public sensefinding:
> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
I think generally people regard a harness as the system instructions + tools made available to the LLM (and probably the thing that runs the LLM conversation in a loop.) An agent is collectively, the LLM plus the harness.
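A minimal sketch of that definition (system instructions + tools + the loop that runs the conversation), assuming a scripted stand-in for the model. `fake_model`, `run_harness`, and the `CALL`/`DONE` message format are all invented for illustration, not any real provider's API.

```python
from typing import Callable, Dict, List


def fake_model(messages: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: asks for a tool once, then answers."""
    results = [m for m in messages if m.startswith("TOOL_RESULT ")]
    if not results:
        return "CALL add 2,3"
    return "DONE " + results[-1].split(" ", 1)[1]


def add(arg: str) -> str:
    """A generic tool exposed to the model."""
    a, b = (int(x) for x in arg.split(","))
    return str(a + b)


def run_harness(
    model: Callable[[List[str]], str],
    system_prompt: str,
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> str:
    # The harness owns the instructions, the tool table, and the loop.
    messages = [f"SYSTEM {system_prompt}", f"USER {task}"]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(f"MODEL {reply}")
        if reply.startswith("DONE "):
            return reply[len("DONE "):]
        _, name, arg = reply.split(" ", 2)
        messages.append(f"TOOL_RESULT {tools[name](arg)}")
    return "gave up"


answer = run_harness(fake_model, "Use tools when helpful.", {"add": add}, "What is 2+3?")
print(answer)
```

Under this framing the argument in the thread is about what goes into `system_prompt` and `tools`: generic entries versus ones written by someone who already knows the games.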
I for one think that harness development is perhaps the most interesting part at the moment and would love to have an alternative leaderboard with harnesses.
I'm so into harness development right now. Once it clicked that harnesses can bring more safety and determinism to LLMs, I started to wonder where I'd need that and why (vs MCP or just throwing Claude Code at everything), and my brain gears have been turning endlessly since then. I'd love to see more of what people do with them. My use cases are admittedly lame and boring, but it's such a fun paradigm to think and develop around.
Could you point me to some resources to learn about harnesses? I’d love to hear an example of a use case you’re thinking of.
There is. Official leaderboard is without harness, and community leaderboard is with harness. Read ARC-AGI-3 Technical Paper for details.
I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.
Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?
Here it is: https://arcprize.org/leaderboard/community
Ah, it's based on this repo [0] and there's only 1 non-example submission there [1], from 2 weeks ago (so it only covers the preview games), and their schema doesn't have a field to show that it's only the preview, nor does the thing properly parse the score or cost into the table. And the biggest thing is that apparently there's no validation whatsoever - submissions are never run on the hidden test games, so it is essentially useless as a comparison.
[0] https://github.com/arcprize/ARC-AGI-Community-Leaderboard [1] https://github.com/arcprize/ARC-AGI-Community-Leaderboard/bl...
The fact that this was on the set of training problems with a custom harness basically makes the headline a lie.
What if you give opus the same harness? Do people even care about meaningful comparisons any more or is it all just “numbers go up”
When you're on the hunt for VC cash "numbers go up" is the main criteria.
Would the single sentence „Imagine you are a regular computer player and accustomed to the usual elements of games“ count as a harness?
Does it matter though? If it accomplishes the task, it accomplishes the task. Everyone uses a harness anyway, and finding the best harness is relevant. Also perhaps this hints at something bigger, i.e.: we're wasting our time focusing on the model when we could be focusing on the harness.
Yes it matters, because it’s not a measurement of whether it accomplishes the task if a human tells it how to solve it.
...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".
Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).
But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):
> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration
but these are limitations placed in order to compare LLMs from frontier labs on equal footing, not limitations that apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.
I’m not saying it invalidates the result. I am saying that they knew the headline and comparison was not correct and they still decided to roll with it. It’s an incorrect representation of what happened, designed to get eyeballs and possibly vc dollars.
https://en.wikipedia.org/wiki/Goodhart's_law
> Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the dataset. Only the private problems actually matter.
In this case the code is public and you can see they are not cheating in that sense.
The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.
Here is the ARC-AGI-3 specific harness by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try like 10^10 variations of harnesses and select the one that performs best. And probably if you then look at it, it will not look like it's necessarily cheating. But you have biased the estimator by selecting the harness according to the value.
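A quick simulation of that selection effect, as a sketch under made-up numbers: every harness variant here has exactly the same true per-game success rate (`TRUE_SKILL`, an assumed value), each is scored on 25 games, and we report the best one. None of these numbers come from the actual submission.

```python
import random


def noisy_score(true_skill: float, n_games: int, rng: random.Random) -> float:
    """Fraction of games won when each game is won with probability true_skill."""
    wins = sum(rng.random() < true_skill for _ in range(n_games))
    return wins / n_games


rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_SKILL = 0.30        # assumed: identical for every variant
N_GAMES = 25             # size of the public set mentioned in the thread
N_VARIANTS = 1000        # assumed number of harness variants tried

scores = [noisy_score(TRUE_SKILL, N_GAMES, rng) for _ in range(N_VARIANTS)]
best = max(scores)
average = sum(scores) / len(scores)

print(f"true skill: {TRUE_SKILL:.2f}")
print(f"average measured score: {average:.2f}")
print(f"best measured score: {best:.2f}")
```

The average stays close to the true value, while the best-of-many score sits well above it even though no variant is actually better. That gap is the bias, and it is why a fresh held-out set is needed to score whichever variant was picked.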
Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.
They aren't training new models for this. This is an agent harness for Opus 4.6.
All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.
ok! So if someone uses an existing, checkpointed, open source model then the answer is yes the results are valid and it doesn't matter that the tests are public.
Yes, assuming the checkpoint was before the announcement & public availability of the test set.
You live in a conspiracy world. Those AI providers don't update the models that fast. You can try asking them to solve ARC-AGI-3 without a harness yourself and watch them struggle just as they did yesterday.
Which part is the conspiracy? Be as concrete as possible.
They are definitely cheating, they have crafted prompts[1] that explain the game rules rather than have the model explore and learn.
1. https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
Where do you see that? I only skimmed the prompts but don't see any aspects of any of the games explained in there. There are a few hints which are legitimate prior knowledge about games in general, though some look too inflexible to me. Prior knowledge ("Core priors") is a critical requirement of the ARC series, read the reports.
Uses public dataset to evaluate which is not meant for evaluation. Writes super specific prompt[1] and claims eye catching results.
This is the state of "AI" these days I guess...
[1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
The dataset miscomparison is a big problem. The prompt is super specific to ARC-AGI-3, which is perfectly fine to do, but skimming it I saw nothing that appears specific to the 25 games in the dataset. Especially considering they've only had one day for overfitting. Could be quite subtle leakage though.
Of course it is... we are in an era where a well-timed blog post showing "SOTA results" on a benchmark can net millions in funding
Knowing the nature of a test ahead of time, building out your capabilities and tooling before entering the exam hall when your peers don't have that advantage, makes you a cheater.
Lots of people doing the same with extra steps (generating synthetic data from test questions with the LLM then training on it)
I wish we'd move past public test sets for LLM benchmarks: publish a plain English explanation of the tasks, allow questions and clarifications, but never release a single question from the test set verbatim.
It made sense back when models needed to be finetuned on the task to even reliably answer. If we're saying this is the path to AGI we should be able to rely on the generalization of the model to get it right.
You have a problem with generating synthetic data from test questions? Humans simulate experiences in their mind. What's the problem?
Models don't generalize as well as humans.
If a model was trained on <|begin_text|> <|end_text|> and you change the tokens passed to <|start_text|> <|end_text|>, it loses several 'IQ points' if it can even answer back at all anymore.
Synthetic data is fine. Synthetic data on very similar questions generated based on the description is typically fine. But once the shape of what you're training on gets too close to the actual holdout questions, you're getting an uplift that's not realistic for unseen tasks.
Humans who have played games should also not be allowed to test in ARC AGI. Cavemen only.
Apparently the score would be a little higher if it weren't for the fact that scores are penalized for being worse than the human baseline, but aren't rewarded for being better than the human baseline (which seems like an arbitrary decision. The human baseline is not optimal).
Once you have matched humans on a problem then further progress on that problem is not necessarily meaningful anymore, in terms of quantitative measurement of intelligence. ARC-AGI-3 is designed to compare AIs to humans, not to measure arbitrarily high levels of superhuman intelligence. For that you would want a different benchmark.
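To illustrate the capping being discussed, here is a sketch that assumes a per-game score of human actions divided by agent actions, clipped at 1. That formula is an assumption made for this example only, not the official ARC-AGI-3 scoring rule, and the action counts are invented.

```python
from typing import List, Tuple


def capped_score(human_actions: int, agent_actions: int) -> float:
    """Assumed rule: efficiency relative to the human baseline, never above 1."""
    return min(1.0, human_actions / agent_actions)


def uncapped_score(human_actions: int, agent_actions: int) -> float:
    """Same ratio without the cap, so beating the baseline is rewarded."""
    return human_actions / agent_actions


# (human_actions, agent_actions) per game; invented numbers.
games: List[Tuple[int, int]] = [(100, 200), (100, 50), (100, 100)]

capped = sum(capped_score(h, a) for h, a in games) / len(games)
uncapped = sum(uncapped_score(h, a) for h, a in games) / len(games)

print(f"capped mean: {capped:.3f}")
print(f"uncapped mean: {uncapped:.3f}")
```

With a cap, the game solved in half the human's actions counts the same as the game that merely matches the human, so it cannot offset the game where the agent was twice as slow.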
Anybody used this Agentica of theirs?
we constantly underestimate the power of inference scaffolding. I have seen it in all domains: coding, ASR, ARC-AGI benchmarks you name it. Scaffolding can do a lot! And post-training too. I am confident our currently pre-trained models can beat this benchmark over 80% with the right post-training and scaffolding. That being said I don't think ARC-AGI proves much. It is not a useful task at all in the wild. it is just a game; a strange and confusing one. For me this is just a pointless pseudo-academic exercise. Good to have, but by no means measures intelligence and even less utility of a model.
That's unsurprising given that a lot of our own abilities as humans come from having painstakingly acquired practices and methodologies and tools (like pencil and paper, note taking, let alone algebra, formal methods and electromechanical aids). We call this "education" but it works in a way that is more similar to agentic harnesses than to pretraining or fine-tuning. This is reflected in the fundamentally different ways in which children and adults learn new skills.
Scaffolding is all you need. I am absolutely certain about that. It's about finding good ways to approximate the reward function being used during post-training, but at inference time. A general enough reward that can score candidates well will inevitably improve the abilities of LLMs when put inside scaffolds.
what exactly does scaffolding mean in this context? genuine question
anything that doesn't touch the model parameters at all once it has been compiled. for example, in streaming ASR of an encoder-decoder you can get gains in accuracy just by enhancing the encoder-decoder orchestration and ratio, frequency of fwd passes, dynamically adjusting the length of rolling windows (if using full attention). Prompting would be part of this too, including few-shot examples. Decoding strategy is also part of this (top-k, nucleus, speculative decoding, greedy or anything else). Applying signal processing or any kind of processing to the input before getting it into the model, or to the output. There are a lot of things you can do.
Also think about the program-synthesis approach proposed by Poetiq.ai. python programs are being generated and evaluated against previous examples. Then in-context learning is done programmatically via prompt concatenation. If you can "score" online the working and non working examples, then you have a very strong reward signal.
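A toy sketch of that generate, score, and concatenate loop. The candidate programs are hard-coded strings standing in for model output, and `score` and `build_feedback` are hypothetical helpers; this is not Poetiq's actual system, just the general shape described above.

```python
from typing import Dict, List, Tuple

Example = Tuple[List[int], List[int]]

# Hypothetical candidates, standing in for Python source a model would generate.
CANDIDATES: Dict[str, str] = {
    "reverse": "lambda xs: xs[::-1]",
    "sort": "lambda xs: sorted(xs)",
    "double": "lambda xs: [2 * x for x in xs]",
}

# Previous input/output examples the programs are evaluated against.
EXAMPLES: List[Example] = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5]), ([1, 2], [1, 2])]


def score(source: str, examples: List[Example]) -> float:
    """Fraction of examples the program reproduces; crashing programs score 0."""
    try:
        program = eval(source)  # acceptable here only because the strings are ours
        hits = sum(program(inp) == out for inp, out in examples)
    except Exception:
        return 0.0
    return hits / len(examples)


def build_feedback(scores: Dict[str, float]) -> str:
    """Concatenate scored attempts into text for the next prompt."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    lines = [f"{CANDIDATES[name]} -> score {s:.2f}" for name, s in ranked]
    return "Previous attempts and their scores:\n" + "\n".join(lines)


scores = {name: score(src, EXAMPLES) for name, src in CANDIDATES.items()}
best = max(scores, key=scores.get)
feedback = build_feedback(scores)

print(best)
print(feedback)
```

The score on known examples plays the role of the reward signal, and `feedback` is what would be appended to the prompt before asking for the next batch of candidates.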