Even before this, Gemini 3 has always felt unbelievably 'general' for me.
It can beat Balatro (ante 8) with text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:
1. It's an LLM, not something trained to play Balatro specifically
2. Most (probably >99.9%) players can't do that at the first attempt
3. I don't think there are many people who posted their Balatro playthroughs in text form online
I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.
Weren't we barely scraping 1-10% on this with state of the art models a year ago and it was considered that this is the final boss, ie solve this and its almost AGI-like?
I ask because I cannot distinguish all the benchmarks by heart.
François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient.
But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)
How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers hardware. So any test, any benchmark, anything you do, does leak per definition. Considering the nature of us humans and the typical prisoners dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?
I tell this as a person who really enjoys AI by the way.
Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
the best way I've seen this describes is "spikey" intelligence, really good at some points, those make the spikes
humans are the same way, we all have a unique spike pattern, interests and talents
ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"
I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".
I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators- and the fact that the token generators are somehow beating it anyway really says something.
Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
It's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!
So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"
My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.
I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"
Children have great levels of fluid intelligence, that's how they are able to learn to quickly navigate in a world that they are still very new to. Seniors with decreasing capacity increasingly rely on crystallised intelligence, that's why they can still perform tasks like driving a car but can fail at completely novel tasks, sometimes even using a smartphone if they have not used one before.
Is it me or is the rate of model release is accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had Kimi K2.5.
I think it is because of the Chinese new year.
The Chinese labs like to publish their models arround the Chinese new year, and the US labs do not want to let a DeepSeek R1 (20 January 2025) impact event happen again, so i guess they publish models that are more capable then what they imagine Chinese labs are yet capable of producing.
I'm having trouble just keeping track of all these different types of models.
Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier "deep think" models have been increasingly requiring the use of their own platform.
OpenRouter is pretty great but I think litellm does a very good job and it's not a platform middle man, just a python library. That being said, I have tried it with the deep think models.
Their models might be impressive, but their products absolutely suck donkey balls. I’ve given Gemini web/cli two months and ran away back to ChatGPT. Seriously, it would just COMPLETELY forget context mid dialog. When asked about improving air quality it just gave me a list of (mediocre) air purifiers without asking for any context whatsoever, and I can list thousands of conversations like that. Shopping or comparing options is just nonexistent.
It uses Russian propaganda sources for answers and switches to Chinese mid sentence (!), while explaining some generic Python functionality.
It’s an embarrassment and I don’t know how they justify 20 euro price tag on it.
I agree. On top of that, in true Google style, basic things just don't work.
Any time I upload an attachment, it just fails with something vague like "couldn't process file". Whether that's a simple .MD or .txt with less than 100 lines or a PDF. I tried making a gem today. It just wouldn't let me save it, with some vague error too.
I also tried having it read and write stuff to "my stuff" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.
Their models are seriously impressive. But as usual Google sucks at making them work well in real products.
Not a single person is using it for coding (outside of Google itself).
Maybe some people on a very generous free plan.
Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.
But that isn’t “the model” that’s an old model backed by massive money.
But wait two hours for what OpenAI has! I love the competition and how someone just a few days ago was telling how ARC-AGI-2 was proof that LLMs can't reason. The goalposts will shift again. I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI.
"AGI" doesn't mean anything concrete, so it's all a bunch of non-sequiturs. Your goalposts don't exist.
Anyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured.
I agree. I think the emergence of LLMs have shown that AGI really has no teeth. I think for decades the Turing test was viewed as the gold standard, but it's clear that there doesn't appear to be any good metric.
Gemini's UX (and of course privacy cred as with anything Google) is the worst of all the AI apps. In the eyes of the Common Man, it's UI that will win out, and ChatGPT's is still the best.
Been using Gemini + OpenCode for the past couple weeks.
Suddenly, I get a "you need a Gemini Access Code license" error but when you go to the project page there is no mention of this or how to get the license.
You really feel the "We're the phone company and we don't care. Why? Because we don't have to." [0] when you use these Google products.
PS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a "near" monopoly but close enough).
They don't even let you have multiple chats if you disable their "App Activity" or whatever (wtf is with that ass naming? they don't even have a "Privacy" section in their settings the last time I checked)
and when I swap back into the Gemini app on my iPhone after a minute or so the chat disappears. and other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.
ChatGPT and Grok work so much better without accounts or with high privacy settings.
You mean AI Studio or something like that, right? Because I can't see a problem with Google's standard chat interface. All other AI offerings are confusing both regarding their intended use and their UX, though, I have to concur with that.
No projects, completely forgets context mid dialog, mediocre responses even on thinking, research got kneecapped somehow and is completely uses now, uses propaganda Russian videos as the search material (what’s wrong with you, Google?), janky on mobile, consumes GIGABYTES of RAM on web (seriously, what the fuck?). Left a couple of tabs over night, Mac is almost complete frozen because 10 tabs consumed 8 GBs of RAM doing nothing. It’s a complete joke.
Trick? Lol not a chance. Alphabet is a pure play tech firm that has to produce products to make the tech accessible. They really lack in the latter and this is visible when you see the interactions of their VP's. Luckily for them, if you start to create enough of a lead with the tech, you get many chances to sort out the product stuff.
it is interesting that the video demo is generating .stl model.
I run a lot of tests of LLMs generating OpenSCAD code (as I have recently launched https://modelrift.com text-to-CAD AI editor) and Gemini 3 family LLMs are actually giving the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So, I had to implement a full fledged "screenshot-vibe-coding" workflow where you draw arrows on 3d model snapshot to explain to LLM what is wrong with the geometry. Without human in the loop, all top tier LLMs hallucinate at debugging 3d geometry in agentic mode - and fail spectacularly.
The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash).
But they probably decided to market it as Deep Think because why not charge more for it.
Each one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.
I think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non thinking model perform the same as a smaller thinking?
Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
What's even more interesting than maj@n or best of n is pass@n. For a lot of applications youc an frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care is that you find more as you spend more time. Literally throwing money at the problem.
The problem here is that it looks like this is released with almost no real access. How are people using this without submitting to a $250/mo subscription?
According to benchmarks in the announcement, healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3 though.
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).
> Trouble is some benchmarks only measure horse power.
IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.
I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
I can't shake of the feeling that Googles Deep Think Models are not really different models but just the old ones being run with higher number of parallel subagents, something you can do by yourself with their base model and opencode.
The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results arent conflicting, they are complimentary. And you just have a system that merges them.. likely another agent.
Its possibly label noise. But you can't tell from a single number.
You would need to check to see if everyone is having mistakes on the same 20% or different 20%. If its the same 20% either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. Old MMLU non pro had a lot of wrong answers. Simple things like MNIST have digits labeled incorrect or drawn so badly its not even a digit anymore.
It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish.
Arc-AGI score isn't correlated with anything useful.
>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
And get back an automatic coupon code app like the user actually wanted.
Do we get any model architecture details like parameter size etc.? Few months back, we used to talk more on this, now it's mostly about model capabilities.
Its really weird how you all are begging to be replaced by llms, you think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?
If Agents get good enough it's not going to build some profitable startup for you (or whatever people think they're doing with the llm slot machines) because that implies that anyone else with access to that agent can just copy you, its what they're designed to do... launder IP/Copyright. Its weird to see people get excited for this technology.
None of this good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note on how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".
Most folks don't seem to think that far down the line, or they haven't caught on to the reality that the people who actually make decisions will make the obvious kind of decisions (ex: fire the humans, cut the pay, etc) that they already make.
Well I honestly think this is the solution. It's much harder to do French Revolution V2 though if they've used ML to perfect people's recommendation algorithms to psyop them into fighting wars on behalf of capitalists.
I imagine llm job automation will make people so poor that they beg to fight in wars, and instead of turning that energy against he people who created the problem they'll be met with hours of psyops that direct that energy to Chinese people or whatever.
The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
For every combination of animal and vehicle? Very unlikely.
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
I was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.
If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
I disagree. The task asks for an SVG; which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.
In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
I also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.
The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.
It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!
Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.
It’s impossible for it to do anything but cut code down, drop features, lose stuff and give you less than the code you put in.
It’s puzzling because it spent months at the head of the pack now I don’t use it at all because why do I want any of those things when I’m doing development.
I’m a paid subscriber but there’s no point any more I’ll spend the money on Claude 4.6 instead.
It seems to be adept at reviewing/editing/critiquing, at least for my use cases. It always has something valuable to contribute from that perspective, but has been comparatively useless otherwise (outside of moats like "exclusive access to things involving YouTube").
I need to test the sketch creation a s a p. I need this in my life because learning to use Freecad is too difficult for a busy person like me (and frankly, also quite lazy)
They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.
They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.
It's all so tiresome.
Try making models that are actually competitive, Google.
Sell them on the actual market and win on actual work product in millions of people lives.
Does anyone actually use Gemini 3 now? I cant stand its sleek salesy way of introduction, and it doesnt hold to instructions hard – makes it unapplicable for MECE breakdowns or for writing.
Praying this isn't another Llama4 situation where the benchmark numbers are cooked. 84.6% on Arc-AGI is incredible!
Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6)
Wow.
https://blog.google/innovation-and-ai/models-and-research/ge...
Even before this, Gemini 3 has always felt unbelievably 'general' for me. It can beat Balatro (ante 8) with text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:
1. It's an LLM, not something trained to play Balatro specifically
2. Most (probably >99.9%) players can't do that at the first attempt
3. I don't think there are many people who posted their Balatro playthroughs in text form online
I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.
[0]: https://balatrobench.com/
But... there's Deepseek v3.2 in your link (rank 7)
> . I don't think there are many people who posted their Balatro playthroughs in text form online
There are *tons* of balatro content on YouTube though, and it makes absolutely zero doubt that Google is using YouTube content to train their model.
Yeah, or just the steam text guides would be a huge advantage.
I really doubt it's playing completely blind
DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years
What about Kimi and GLM?
Weren't we barely scraping 1-10% on this with state of the art models a year ago and it was considered that this is the final boss, ie solve this and its almost AGI-like?
I ask because I cannot distinguish all the benchmarks by heart.
François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
https://x.com/fchollet/status/2022036543582638517
I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient.
But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)
Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?
How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers hardware. So any test, any benchmark, anything you do, does leak per definition. Considering the nature of us humans and the typical prisoners dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?
I tell this as a person who really enjoys AI by the way.
* that you weren't supposed to be able to
Could it also be that the models are just a lot better than a year ago?
https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3
I don't understand what you want to tell us with this image.
they're accusing GGP of moving the goalposts.
Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
Here's a good thread over 1+ month, as each model comes out
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general.
the best way I've seen this describes is "spikey" intelligence, really good at some points, those make the spikes
humans are the same way, we all have a unique spike pattern, interests and talents
ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"
You can get more spiky with AIs, whereas with human brain we are more hard wired.
So maybe we are forced to be more balanced and general whereas AI don't have to.
I suspect the non-spikey part is the more interesting comparison
Why is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.
There are an infinite number of little things like this you think zero about, take near zero energy, yet which are extremely hard for Ai
I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".
I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators- and the fact that the token generators are somehow beating it anyway really says something.
The average ARC AGI 2 score for a single human is around 60%.
"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."
https://arcprize.org/arc-agi/2/
Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
That's a bit like saying just give blind people cameras so they can see.
https://arcprize.org/leaderboard
$13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?
But the real question is if they just fit the model to the benchmark.
Why 5-10 years?
At current rates, price per equivalent output is dropping at 99.9% over 5 years.
That's basically $0.01 in 5 years.
Does it really need to be that cheap to be worth it?
Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
Wow that's incredible! Could you show your work?
What’s reasonable? It’s less than minimum hourly wage in some countries.
Burned in seconds.
That's not a long time in the grand scheme of things.
Speak for yourself. Five years is a long time to wait for my plans of world domination.
Yes, you better hurry.
Well, fair comparison would be with GPT-5.x Pro, which is the same class of a model as Gemini Deep Think.
Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
It's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!
So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"
The puzzles are calibrated for human solve rates, but otherwise I agree.
My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.
I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"
You are confusing fluid intelligence with crystallised intelligence.
I think you are making that confusion. Any robotic system in the place of his parents would fail with a few hours.
There are more novel tasks in a day than ARC provides.
Children have great levels of fluid intelligence, that's how they are able to learn to quickly navigate in a world that they are still very new to. Seniors with decreasing capacity increasingly rely on crystallised intelligence, that's why they can still perform tasks like driving a car but can fail at completely novel tasks, sometimes even using a smartphone if they have not used one before.
It is over
I for one welcome our new AI overlords.
Is it me or is the rate of model release is accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had Kimi K2.5.
I think it is because of the Chinese new year. The Chinese labs like to publish their models arround the Chinese new year, and the US labs do not want to let a DeepSeek R1 (20 January 2025) impact event happen again, so i guess they publish models that are more capable then what they imagine Chinese labs are yet capable of producing.
I'm having trouble just keeping track of all these different types of models.
Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
There are hints this is a preview to Gemini 3.1.
They are using the current models to help develop even smarter models. Each generation of model can help even more for the next generation.
I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.
I must be holding these things wrong because I'm not seeing any of these God like superpowers everyone seem to enjoy.
There's more compute now than before.
Fast takeoff.
Anthropic took the day off to do a $30B raise at a $380B valuation.
Most ridiculous valuation in the history of markets. Cant wait to watch these compsnies crash snd burn when people give up on the slot machine.
Why? They had $10+ billion arr run rate in 2025 trippeled from 2024 I mean 30x is a lot but also not insane at that growth rate right?
WeWork almost IPO’s at $50bn. It was also a nice crash and burn.
It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier "deep think" models have been increasingly requiring the use of their own platform.
OpenRouter is pretty great but I think litellm does a very good job and it's not a platform middle man, just a python library. That being said, I have tried it with the deep think models.
https://docs.litellm.ai/docs/
Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.
Their models might be impressive, but their products absolutely suck donkey balls. I’ve given Gemini web/cli two months and ran away back to ChatGPT. Seriously, it would just COMPLETELY forget context mid dialog. When asked about improving air quality it just gave me a list of (mediocre) air purifiers without asking for any context whatsoever, and I can list thousands of conversations like that. Shopping or comparing options is just nonexistent. It uses Russian propaganda sources for answers and switches to Chinese mid sentence (!), while explaining some generic Python functionality. It’s an embarrassment and I don’t know how they justify 20 euro price tag on it.
I agree. On top of that, in true Google style, basic things just don't work.
Any time I upload an attachment, it just fails with something vague like "couldn't process file". Whether that's a simple .MD or .txt with less than 100 lines or a PDF. I tried making a gem today. It just wouldn't let me save it, with some vague error too.
I also tried having it read and write stuff to "my stuff" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.
Their models are seriously impressive. But as usual Google sucks at making them work well in real products.
Their models are absolutely not impressive.
Not a single person is using it for coding (outside of Google itself).
Maybe some people on a very generous free plan.
Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.
But that isn’t “the model” that’s an old model backed by massive money.
Peacetime Google is not like wartime Google.
Peacetime Google is slow, bumbling, bureaucratic. Wartime Google gets shit done.
OpenAI is the best thing that happened to Google apparently.
But wait two hours for what OpenAI has! I love the competition and how someone just a few days ago was telling how ARC-AGI-2 was proof that LLMs can't reason. The goalposts will shift again. I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI.
"AGI" doesn't mean anything concrete, so it's all a bunch of non-sequiturs. Your goalposts don't exist.
Anyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured.
I agree. I think the emergence of LLMs have shown that AGI really has no teeth. I think for decades the Turing test was viewed as the gold standard, but it's clear that there doesn't appear to be any good metric.
Soon they can drop the bioweapon to welcome our replacement.
Those black nazis in the first image model were a cause of inside trading.
Gemini's UX (and of course privacy cred as with anything Google) is the worst of all the AI apps. In the eyes of the Common Man, it's UI that will win out, and ChatGPT's is still the best.
> Gemini's UX ... is the worst of all the AI apps
Been using Gemini + OpenCode for the past couple weeks.
Suddenly, I get a "you need a Gemini Access Code license" error but when you go to the project page there is no mention of this or how to get the license.
You really feel the "We're the phone company and we don't care. Why? Because we don't have to." [0] when you use these Google products.
PS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a "near" monopoly but close enough).
0 - https://vimeo.com/355556831
Google privacy cred is ... excellent? The worst data breach I know of them having was a flaw that allowed access to names and emails of 500k users.
If you consider "privacy" to be 'a giant corporation tracks every bit of possible information about you and everyone else'?
Link? Are you conflating with "500k Gmail accounts leaked [by a third party]" with Gmail having a breach?
Afaik, Google has had no breaches ever.
They don't even let you have multiple chats if you disable their "App Activity" or whatever (wtf is with that ass naming? they don't even have a "Privacy" section in their settings the last time I checked)
and when I swap back into the Gemini app on my iPhone after a minute or so the chat disappears. and other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.
ChatGPT and Grok work so much better without accounts or with high privacy settings.
You mean AI Studio or something like that, right? Because I can't see a problem with Google's standard chat interface. All other AI offerings are confusing both regarding their intended use and their UX, though, I have to concur with that.
The lack of "projects" alone makes their chat interface really unpleasant compared to ChatGPT and Claude.
No projects, completely forgets context mid dialog, mediocre responses even on thinking, research got kneecapped somehow and is completely uses now, uses propaganda Russian videos as the search material (what’s wrong with you, Google?), janky on mobile, consumes GIGABYTES of RAM on web (seriously, what the fuck?). Left a couple of tabs over night, Mac is almost complete frozen because 10 tabs consumed 8 GBs of RAM doing nothing. It’s a complete joke.
AI Studio is also significantly improved as of yesterday.
Google is still behind the largest models I'd say, in real world utility. Gemini 3 Pro still has many issues.
Trick? Lol not a chance. Alphabet is a pure play tech firm that has to produce products to make the tech accessible. They really lack in the latter and this is visible when you see the interactions of their VP's. Luckily for them, if you start to create enough of a lead with the tech, you get many chances to sort out the product stuff.
You sound like Russ Hanneman from SV
It's not about how much you earn. It's about what you're worth.
it is interesting that the video demo is generating .stl model. I run a lot of tests of LLMs generating OpenSCAD code (as I have recently launched https://modelrift.com text-to-CAD AI editor) and Gemini 3 family LLMs are actually giving the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So, I had to implement a full fledged "screenshot-vibe-coding" workflow where you draw arrows on 3d model snapshot to explain to LLM what is wrong with the geometry. Without human in the loop, all top tier LLMs hallucinate at debugging 3d geometry in agentic mode - and fail spectacularly.
Here is the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...
The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.
I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.
edit: they just removed the reference to "3.1" from the pdf
I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.
The Deep Think moniker is for parallel compute models though, not long CoT like pro models.
It's possible though that deep think 3 is running 3.1 models under the hood.
That's odd considering 3.0 is still labeled a "preview" release.
The rumor was that 3.1 was today's drop
Where are these rumors floating around?
One of many https://x.com/synthwavedd/status/2021983382314660075
> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
They never will do on private set, because it would mean its being leaked to google.
OT but my intuition says that there’s a spectrum
- non thinking models
- thinking models
- best of N models like deep think an gpt pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.
I think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non thinking model perform the same as a smaller thinking?
> best of N models like deep think an gpt pro
Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
What's even more interesting than maj@n or best of n is pass@n. For a lot of applications youc an frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care is that you find more as you spend more time. Literally throwing money at the problem.
> can a sufficiently large non thinking model perform the same as a smaller thinking?
Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
its interesting that opus 4.6 added a paramter to make it think extra hard.
The problem here is that it looks like this is released with almost no real access. How are people using this without submitting to a $250/mo subscription?
According to benchmarks in the announcement, healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3 though.
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).
Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior.
The general purpose ChatGpt 5.3 hasn’t been released yet, just 5.3-codex.
It's ahead in raw power but not in function. Like it's got the worlds fast engine but one gear! Trouble is some benchmarks only measure horse power.
> Trouble is some benchmarks only measure horse power.
IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.
Google models and CLI harness feels behind in agentic coding compared OpenAI and Antrophic
I gather that 4.6 strengths are in long context agentic workflows? At least over Gemini 3 pro preview, opus 4.6 seems to have a lot of advantages
It's a giant game of leapfrog, shift or stretch time out a bit and they all look equivalent
The comparison should be with GPT 5.2 pro which has been used successfully to solve open math problems.
I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
[1] https://1stproof.org/
The 1st proof original solutions are due to be published in about 24h, AIUI.
I can't shake of the feeling that Googles Deep Think Models are not really different models but just the old ones being run with higher number of parallel subagents, something you can do by yourself with their base model and opencode.
And after i do that, how do i combine the output of 1000 subagents into one output? (Im not being snarky here, i think it's a nontrivial problem)
The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results arent conflicting, they are complimentary. And you just have a system that merges them.. likely another agent.
You just pipe it to another agent to do the reduce step (i.e. fan-in) of the mapreduce (fan-out)
It's agents all the way down.
Start with 1024 and use half the number of agents each turn to distill the final result.
Less than a year to destroy Arc-AGI-2 - wow.
It's still useful as a benchmark of cost/efficiency.
I unironically believe that arc-agi-3 will have a introduction to solved time of 1 month
The AGI bar has to be set even higher, yet again.
wow solving useless puzzles, such a useful metric!
But why only a +0.5% increase for MMMU-Pro?
Its possibly label noise. But you can't tell from a single number.
You would need to check to see if everyone is having mistakes on the same 20% or different 20%. If its the same 20% either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. Old MMLU non pro had a lot of wrong answers. Simple things like MNIST have digits labeled incorrect or drawn so badly its not even a digit anymore.
Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.
It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish.
Arc-AGI score isn't correlated with anything useful.
how would we actually objectively measure a model to see if it is AGI if not with benchmarks like arc-AGI?
Give it a prompt like
>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
And get back an automatic coupon code app like the user actually wanted.
Do we get any model architecture details like parameter size etc.? Few months back, we used to talk more on this, now it's mostly about model capabilities.
I'm honestly not sure what you mean? The frontier labs have kept arch as secrets since gpt3.5
Not trained for agentic workflows yet unfortunately - this looks like it will be fantastic when they have an agent friendly one. Super exciting.
Its really weird how you all are begging to be replaced by llms, you think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?
If Agents get good enough it's not going to build some profitable startup for you (or whatever people think they're doing with the llm slot machines) because that implies that anyone else with access to that agent can just copy you, its what they're designed to do... launder IP/Copyright. Its weird to see people get excited for this technology.
None of this good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note on how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".
Most folks don't seem to think that far down the line, or they haven't caught on to the reality that the people who actually make decisions will make the obvious kind of decisions (ex: fire the humans, cut the pay, etc) that they already make.
Or we just end capitalism.
French revolution style.
shrugs
Well I honestly think this is the solution. It's much harder to do French Revolution V2 though if they've used ML to perfect people's recommendation algorithms to psyop them into fighting wars on behalf of capitalists.
I imagine llm job automation will make people so poor that they beg to fight in wars, and instead of turning that energy against he people who created the problem they'll be met with hours of psyops that direct that energy to Chinese people or whatever.
We will see.
The pelican riding a bicycle is excellent. I think it's the best I've seen.
https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/
Is there a list of these for each model, that you've catalogued somewhere?
How likely this problem is already on the training set by now?
If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
Why would they train on that? Why not just hire someone to make a few examples.
I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.
The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.
For every combination of animal and vehicle? Very unlikely.
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
More likely you would just train for emitting svg for some description of a scene and create training data from raster images.
You can always ask for a tyrannosaurus driving a tank.
I've heard it posited that the reason the frontier companies are frontier is because they have custom data and evals. This is what I would do too
The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)
It's not actually, look up some photos of the sun setting over the ocean. Here's an example:
https://stockcake.com/i/sunset-over-ocean_1317824_81961
That’s only if the sun is above the horizon entirely.
No, it's not.
https://stockcake.com/i/serene-ocean-sunset_1152191_440307
Do you have to still keep trying to bang on about this relentlessly?
It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.
Again, like I said before, it's also a terrible benchmark.
It being a terrible benchmark is the bit.
Eh, i find it more of a not very informative but lighthearted commentary
Highly disagree.
I was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.
If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
I disagree. The task asks for an SVG; which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.
In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
I also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.
The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.
It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!
Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
maybe you're a pro vector artist but I couldn't create such a cool one myself in illustrator tbh
Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.
top 10 elo in codeforces is pretty absurd
Gemini was awesome and now it’s garbage.
It’s impossible for it to do anything but cut code down, drop features, lose stuff and give you less than the code you put in.
It’s puzzling because it spent months at the head of the pack now I don’t use it at all because why do I want any of those things when I’m doing development.
I’m a paid subscriber but there’s no point any more I’ll spend the money on Claude 4.6 instead.
It seems to be adept at reviewing/editing/critiquing, at least for my use cases. It always has something valuable to contribute from that perspective, but has been comparatively useless otherwise (outside of moats like "exclusive access to things involving YouTube").
I never found it useful for code. It produced garbage littered with gigantic comments.
Me: Remove comments
Literally Gemini: // Comments were removed
It would make more sense to me if it had never been awesome.
Gemini 3 Pro/Flash is stuck in preview for months now. Google is slow but they progress like a massive rock giant.
Unfortunately, it's only available in the Ultra subscription if it's available at all.
I need to test the sketch creation a s a p. I need this in my life because learning to use Freecad is too difficult for a busy person like me (and frankly, also quite lazy)
FWIW, the FreeCAD 1.1 nightlies are much easier and more intuitive to use due to the addition of many on-canvas gizmos.
Always the same with Google.
Gemini has been way behind from the start.
They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.
They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.
It's all so tiresome.
Try making models that are actually competitive, Google.
Sell them on the actual market and win on actual work product in millions of people lives.
Why a Twitter post and not the official Google blog post… https://blog.google/innovation-and-ai/models-and-research/ge...
Just normal randomness I suppose. I've put that URL at the top now, and included the submitted URL in the top text.
The official blog post was submitted earlier (https://news.ycombinator.com/item?id=46990637), but somehow this story ranked up quickly on the homepage.
@dang will often replace the post url & merge comments
HN guidelines prefer the original source over social posts linking to it.
Agreed - blog post is more appropriate than a twitter post
Does anyone actually use Gemini 3 now? I cant stand its sleek salesy way of introduction, and it doesnt hold to instructions hard – makes it unapplicable for MECE breakdowns or for writing.
I do. It's excellent when paired with an MCP like context7.
I dont agree, Gemini 3 is pretty good, even the Lite version.
What do you use it for and why? Genuinely curious
It indeed departs from instructions pretty regularly. But I find it very useful and for the price it beats the world.
"The price" is the marginal price I am paying on top of my existing Google 1, YouTube Premium, and Google Fi subs, so basically nothing on the margin.