Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and therefore very likely outside its training data.
Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14min, 20min and 1h14min respectively.
Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned for, it's wild that frontier models can now solve in minutes what would take me days.
I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use web search.
Still a useful tool though. It definitely gets the majority of the insights.
Exactly this - and it's how ChatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.
The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.
On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.
If we've learned anything so far it's that the parlor tricks of one-shot efficacy only get you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart in roughly the same way. Even when I've used Sonnet 4.5 with 1M-token context, the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented, problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
It depends on your definition of safe. Most of the code that gets written is pretty simple — basic CRUD web apps, WP theme customization, simple mobile games… stuff that can easily be written by the current gen of tooling. That has already cost a lot of people a lot of money or their jobs outright, and most of them probably haven't reached their skill limit as developers.
As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is upskill or get a new career. Rather than switch careers to something they have zero experience in, most will upskill, and that puts pressure on the moderately higher-skill job market, which has far fewer people; those people then upskill to outrun the implosion, which puts pressure on the tier above them to move upward, and so on. With even modest productivity gains across the whole industry, it's not hard for me to envision a world where general software development just isn't a particularly valuable skill anymore.
Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need a real throat to choke to show your leadership skills.
I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.
Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.
Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.
A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.
You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.
Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low
I have had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.
To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I’d love it if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.
I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.
Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.
Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
I don't predict the future and I'm very skeptical of anybody who claims to do so; correctly predicting the present is already hard enough. I'm just saying that given the progress we've already made, I would find it plausible that a system like that could be built in a few years. The details of what it would look like are beyond my pay grade.
---
[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.
Yeah, it is often called out as a brilliancy in game analysis when a GM makes a move that an engine says is bad and it turns out to be good. However, it only happens in very specific positions.
Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?
From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".
So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.
That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.
This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.
Poker is funny because you have humans emulating human-beating machines, but that's hard enough to do that players who don't do it can still win.
I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.
It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare these days. It was funny in certain positions a few years back to see the engine "change its mind", including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.
But mostly what happens is that a move isn't so good, but it isn't so bad either, and while the computer will tell you it is sub-optimal, a human won't be able to refute it in finite time, so the opponent's practical (as opposed to theoretical) chances are reduced. One great recent example is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception that the computer says is borderline incorrect, but it leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.
I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).
That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)
The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.
If you look on YouTube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
You explained yourself right. The issue is that you keep qualifying your statements.
> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...
> ... in tournament conditions.
I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?
The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.
My background is more on math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.
Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.
I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3) and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.
gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.
According to a bunch of philosophers (https://ai-2027.com/), doom is likely imminent. Kokotajlo was on Breaking Points today. Breaking Points is usually less gullible, but the top comment shows that "AI" hype strategy detection is now mainstream (https://www.youtube.com/watch?v=zRlIFn0ZIlU):
AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."
I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.
Does it matter if it is out of the training data? The models integrate web search quite well.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
In its defence, the code actually specifically calls that edge case out and justifies it:
```
// Calculate rotations
// We use a cumulative calculation logic mentally, but here simple degrees work because of the transition reset trick or specific animation style.
// To prevent the "spin back" glitch at 360->0, we can use a simple tick without transition for the wrap-around,
// but for simplicity in this specific React rendering, we will stick to standard 0-360 degrees.
// A robust way to handle the spin-back on the second hand is to accumulate degrees, but standard clock widgets often reset.
```
The video shows closer to 2 seconds for it to finally throw itself over in what could only be described as a "Thunk". I figured it would be a little smoother.
Station clocks in Switzerland receive a signal from a master clock each minute that advances the minute hand; the second hand moves completely independently of the minute hand. This allows them to stay synced to the minute.
> The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.
in defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage which I was amused by. it uses kiloseconds/megaseconds/etc. there are 86.4ks/day. The "seconds" hand goes around 1000 seconds, which ticks over the "hour" hand. Instead of saying 4am, you'd say it's 14.
as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
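For anyone curious, the arithmetic behind the scheme is easy to sketch. This is a minimal Python version of the conversion described above (my own illustration, not the webpage Flash 2.5 generated):

```python
import time

def metric_unix_clock(ts: float | None = None) -> str:
    """Render a timestamp in the 'metric UNIX' scheme sketched above:
    kiloseconds since local midnight for the time of day (86.4 ks per day),
    and gigaseconds since the UNIX epoch as the coarse 'date'."""
    ts = time.time() if ts is None else ts
    local = time.localtime(ts)
    secs_today = local.tm_hour * 3600 + local.tm_min * 60 + local.tm_sec
    kiloseconds = secs_today / 1000     # e.g. 04:00 -> 14.4, i.e. "it's 14"
    gigaseconds = ts / 1e9              # currently ~1.76 Gs AUNIX
    return f"{kiloseconds:.2f} ks today, {gigaseconds:.3f} Gs AUNIX"

print(metric_unix_clock())
```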
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
That is not the same prompt as the other person was using. In particular, this doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This also includes JavaScript.
The prompt the other person was using is:
```
Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.
```
"Allow access to Google Drive to load this Prompt."
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info.
Because most likely (at least according to Hanlon's razor) they somehow decided that using Google Drive as the only persistent storage backing AI Studio was a reasonable UX decision.
It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.
This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so.
Considering how important this benchmark has become to the judgement of state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person, who's working around the clock on training the model to make better and better SVG pelicans on bikes.
Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.
I don't think the big labs would waste their time on it. If a model is great at making the pelican but sucks at all other svg it becomes obvious. But so far the good pelicans are strong indicators of good general SVG ability.
Unless training on the pelican increases all SVG ability, then good job.
It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?
I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
Maybe that date is a rule of thumb for when AI generated content became so widespread that it is likely to have contaminated future data. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. GPT-5 Thinking is a strong model for basic coding and it’d be interesting to see a simple Python task it reliably fails at.
they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.
This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.
I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have all things considered? That is, saying something to the effect of "well I know its not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if its not like totally unbelievable I guess.
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
After taking a walk for a bit I decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While I still believe in the importance of a personalized suite of benchmarks, my Python one needs to be down-weighted or supplanted.
My bad to the Google team for the cursory brush-off.
> This probably means my test is a little too niche.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.
Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.
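As a concrete illustration of that two-stage flow, here is a minimal sketch. The client and model name are placeholders for whatever chat-style API you actually use; only the staging idea comes from the comment above.

```python
from openai import OpenAI

client = OpenAI()  # placeholder: any chat-completions-style client works

def call_model(prompt: str) -> str:
    # One round trip to the model; the model name here is illustrative only.
    resp = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Stage 1: have the model write a detailed specification prompt for itself.
spec_prompt = call_model(
    "Create a prompt that produces a specification for a Pac-Man game in a "
    "single HTML page. Call out edge cases and implementation details that "
    "commonly cause bugs (ghost AI, collisions, tunnel wrap-around, pellet state)."
)

# Stage 2: execute the generated prompt to get the actual implementation.
html_page = call_model(spec_prompt)
print(html_page)
```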
It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.
The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.
That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").
I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
You're correct - LLMs may get better at any task, of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at this particular task, if the eval were actually picked up and used in the training loop.
That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah, I have my own set of tests and the results are a bit unsettling, in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it is now.
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....
I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
Everything is about context. When you give it a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. That's where all the "thinking" goes.
Ask it to implement tic-tac-toe in Python for command line. Or even just bring your own tic-tac toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
You already sent the prompt to the Gemini API, and they likely recorded it. So in a way they can access it anyway; posting it here or not doesn't matter in that respect.
I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in filenames), and a near-complete inability to do relatively basic image processing tasks (if they don't rely on template matches).
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple Mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.
Something like
1.) Split audio into multiple smaller tracks.
2.) Perform first pass audio extraction
3.) Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off)
4.) Seed the next stage with that information (yay multimodality) and generate the audio transcript for it
Obviously it would be ideal if a model could handle ultra-long-context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution. A rough sketch of such a harness is below.
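A minimal sketch of that kind of harness, assuming pydub for the splitting and leaving the actual model call as a hypothetical placeholder you would wire up to your transcription API of choice:

```python
from pydub import AudioSegment  # splitting is the only concrete dependency here

def transcribe_chunk(path: str, carry_over: str) -> tuple[str, str]:
    """Hypothetical placeholder: send the audio chunk plus a short summary of
    the speakers and where the conversation left off to your model, and return
    (labelled_transcript, updated_summary)."""
    raise NotImplementedError("wire this to your transcription API")

def chunked_transcript(path: str, chunk_minutes: int = 15) -> str:
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    carry_over, parts = "", []
    for start in range(0, len(audio), chunk_ms):
        # Export the current slice and transcribe it, seeding the model with
        # the carry-over summary from the previous chunk.
        audio[start:start + chunk_ms].export("chunk.mp3", format="mp3")
        transcript, carry_over = transcribe_chunk("chunk.mp3", carry_over)
        parts.append(transcript)
    return "\n".join(parts)
```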
I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.
I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
[00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
I made a simple webpage to grab text from YouTube videos:
https://summynews.com
Great for this kind of testing?
(want to expand to other sources in the long run)
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
"Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"
> Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
> Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.
Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
> Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons Paradox, the world may never know the answer to either of these questions.
It does improve productivity, just like a good IDE. But engineers didn't get replaced by IDEs and they haven't yet been replaced by AI.
By the time it's good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots; then every job will be automated, not just computer-based ones.
Your assertion was not that "an engineer has never been replaced by AI". It is that no engineer has been laid off because of AI.
You agree AI improves engineer productivity. So the last remaining question is: does greater productivity mean that fewer people are required to satisfy a given demand?
The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.
That's because of overhiring and other non-AI-related reasons (e.g. higher interest rates mean less VC funding available).
In reality, getting AI to do actual human work, as of the moment, takes much more effort and cost than you get back in cost savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.
The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.
Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg, however, is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
I think in this case, tokenization and perception are somewhat analogous. It is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with, if you allow the analogy.
The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think catching up to the brain abilities of even a bug might be very hard, but that does not mean there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms, or whatever the precise jargon is.
> How do brains do it with such a low amount of energy?
Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
My armchair is equally comfy, and I have an actual paper to point to:
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.
For the above pic I asked "What is wrong with the image?"
Result:
- It totally missed the most obvious one - six fingers.
Instead it said this:
Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:
- The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.
- Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.
- Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.
- Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.
[Edit: 3.0 is same as 2.5 - both answered almost identically]
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
What I would do if I were in the position of a large company in this space is arrange for an internal team to create an ARC replica, covering very similar puzzles, and use that as part of the training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
If there is something new an LLM/AI model can't solve today, plenty of humans can't either.
But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train on new scenarios, it will start to bleed out the humans in the loop.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real-world/reasoning performance, since the lack of a memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but currently its a race to lock-in users to your model and make switching costs high.
Benchmarks are intended as proxy for real usage, and they are often useful to incrementally improve a system, especially when the end-goal is not well-defined.
The trick is to not put more value in the score than what it is.
The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
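A rough sketch of what that could look like as a harness, with the model call left as a placeholder and the red herrings purely illustrative:

```python
import random
from typing import Callable

def noisy_pass_rate(ask_model: Callable[[str], str], question: str,
                    expected: str, n_runs: int = 20) -> float:
    """Inject random junk into the prompt and average correctness over n runs.
    ask_model is a placeholder for whatever call runs the benchmark item."""
    red_herrings = [
        "Note: an earlier draft of this question contained a typo.",
        "Unrelated: remember to submit your expense report by Friday.",
        "The previous answer was rejected for formatting reasons.",
    ]
    correct = 0
    for _ in range(n_runs):
        noise = random.choice(red_herrings)
        # Prepend or append the distractor at random, then check the answer.
        prompt = f"{noise}\n\n{question}" if random.random() < 0.5 else f"{question}\n\n{noise}"
        if expected.strip().lower() in ask_model(prompt).strip().lower():
            correct += 1
    return correct / n_runs
```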
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway.
So you have to build a very large training set on your own, using the same kind of puzzles.
It leads on ARC-AGI-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to Google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works but that I would never want to have to interact with.
When looking at the code, you can't tell exactly why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feel to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up
I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad-joke generator and then have it also create a comic-styled 4-cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1,500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
It's an artifact of the fact that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. It sucks from a GDPR compliance perspective as well as in terms of transparent pricing: you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high, but if the model decides to think longer, "low" can result in more tokens used than "high" for a prompt where the model decides not to think that much. "Thinking budgets" are now "legacy", and thus while you can constrain output length, you cannot constrain cost.
You also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant, only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. You can try different variations of different parts of your system prompt or tool descriptions, but just because a variation results in fewer thinking tokens does not mean it is better: those reasoning steps may have actually been beneficial (if only in edge cases). With access to the full chain of thought this would be immediately apparent upon inspection, but without it it's hard or impossible to find out.
For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were (A) to prevent rapid distillation, which they suspected DeepSeek had used for R1, and (B) to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater and will, for certain applications, certainly outweigh the performance delta to Google/OpenAI.
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have.
This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.
I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks but I wasn't able to convince it to present a plan before starting implementation.
I have instructions describing that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.
For Codex, those instructions make plan mode redundant.
What I usually test is trying to get them to build a full, scalable SaaS application from scratch... It seemed very impressive in how it did the early code organization using Antigravity, but then at some point, all of a sudden, it started really getting stuck and constantly stopped producing, and I had to trigger continue or babysit it. I don't know if I could've been doing something better, but that was just my experience. Seemed impressive at first, but otherwise, at least vs Antigravity, Codex and Claude Code scale more reliably.
Just an early anecdote from trying to build that one SaaS application, though.
Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.
Honestly, Google models have this mix of smart/dumb that is scary. Like, if the universe gets turned into paperclips, it'll probably be a Google model that does it.
Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren’t necessarily those with the best models, but those who already control the surfaces where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
AI Overviews have arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.
> Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. At the time of its founding Google was the undisputed leader of AI anywhere in the world. Musk was then very worried about AGI being developed behind closed doors particularly Google, which was why he was the driving force behind the founding of OpenAI.
The book Empire of AI describes him as being particularly fixated on Demis as some kind of evil genius. From the book, early OAI employees couldn’t take the entire thing too seriously and just focused on the work.
I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist.
That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” meant search and RL, which Google and DeepMind were dominating, and self-driving, which Waymo was leading. OpenAI was conceptualized as a research org to compete. A lot has changed, and OpenAI has been good at seeing around those corners.
That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future.
I think that Google didn't see the business case in that generation of models, and also saw significant safety concerns. If AI had been delayed by... 5 years... would the world really be a worse place?
Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit.
It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold then become progressively less open once they get bigger. Android is a great example.
Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently?
I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said.
There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.
For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].
Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].
In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.
Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source.
Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.
That's the same thing what GrapheneOS, /e/ OS and others are doing - building on top of AOSP.
They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Much of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.
This is an understandable, but simplistic, way of looking at the world. Are you also gonna blame Apple for mining rare earths, because they made a successful product that requires exotic materials which need to be mined from the earth? How about the hundreds of thousands of factory workers being subjected to inhumane conditions to assemble iPhones each year?
For every "OMG, the internet is filled with ads", people are conveniently forgetting the real-world impact of ALL COMPANIES (not just Apple), btw. You should be upset with the system, not selectively with Google.
I don't think your comment justifies calling out any form of simplistic view; it doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it, most of which does not serve humankind.
It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system.
Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
I also hold Americans and Western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.
If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.
The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.
And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.
I haven't and won't use Google AI for anything, ever, because of all the big labs, they are the most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.
If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.
The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.
> laundering cobalt from illegal "artisanal" mines in the DRC
They don't, all cobalt in Apple products is recycled.
> and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.
One can claim 100% recycled cobalt under the mass-balance system even if recycled and non-recycled cobalt were mixed, as long as the total amount used in production is less than or equal to the recycled cobalt purchased on the books.
At least here[0] they claim their recycled cobalt references are under the mass balance system.
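A rough sketch of the bookkeeping (the numbers are invented for illustration): under mass balance, the claim is checked against purchase records, not against any physical batch.

```python
# Illustrative mass-balance bookkeeping with made-up numbers. The "100% recycled"
# claim only requires that recycled cobalt purchased on paper covers the amount
# attributed to the claimed products, even if the physical supply is mixed.
recycled_purchased_t = 120              # tonnes of recycled cobalt bought, per the books
attributed_to_claimed_products_t = 110  # tonnes attributed to products carrying the claim
total_cobalt_used_t = 300               # tonnes actually consumed across all production

claim_allowed = attributed_to_claimed_products_t <= recycled_purchased_t
print("'100% recycled' claim permitted:", claim_allowed)  # True, despite the mixed supply
```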
People love getting their content for free and that's what Google does.
Even 25 years ago, people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want, YouTube will be responsible for promoting it, they'll serve it to however many billions of users want to view it, and they'll pay you 55% of the revenue it makes?
Yep, it's hard to believe it exists for free and without a lot of ads when you have a good ad blocker... though the content creators' ads are inescapable, which I think is OK since they're making a little money in exchange for what, your little inconvenience of a minute or so (if you're not skipping the ad, which you aren't, right??), after which you can watch some really good content.
The history channels on YT are amazing, maybe world-changing: they get people to learn history and actually enjoy it. Same with some math channels like 3Blue1Brown, which are just outstanding, and many more.
Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber - they all play in two-sided markets exploiting both its users and providers at the same time.
They're not a moral entity. Corporations aren't people.
I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.
If I were a Google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad market, and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate these things, the same crowd that decries them opposes it as government overreach.
Voters are people and they are moral entities, direct any moral outrage at us.
Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)?
It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
They're accountable as individuals, not as a collective. And as it happens, they are responsible for their government in a democracy, but corporations aren't responsible for running countries.
> It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.
The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.
You're trying to live in a feudal society when you treat corporations like this.
If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.
And then people have the nerve to complain about PACs, money in politics, billionaires influencing the government, bribery, etc. You can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.
When we criticize corporations, we really are criticizing the people who make the decisions in the corporations. I don’t see why we shouldn’t apply exactly the same moral standards to people’s decision in the context of a corporation as we do to people’s decisions made in any other context. You talk about lawfulness, but we wouldn’t talk about morals if we meant lawfulness. It’s also lawful to vote for the hyper-capitalist party, so by the same token moral outrage shouldn’t be directed towards the voters.
I get that, but those CEOs are not elected officials; they don't represent us and have no part in the discourse of law making (despite the state of things). In their capacity as executives of a company, they have no rights, no say in what we find acceptable or not in society. We tell them what they can and cannot do, or else. That's the social contract we have with companies and their executives.
Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just as equally as others at the voting booth. they can participate in their capacity as individuals in politics. But neither money, nor corporate influence have places in the governance of a democratic society.
I talk about lawfulness because that is the only rule of law a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals; they are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.
If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though the poor people who frequent them will be adversely affected? Amp that up to the scale of a megacorp, and the morality is still the same.
Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the thing you're criticizing it of doing. The CEO of google is acting in his official capacity, doing the job they were hired to do when they are resisting adblocking. The investors of Google are risking their money in anticipation of ROI, so their expectation from Google is valid as well.
When you find something to be immoral, the only meaningful avenue of expressing that with corporations is the law. You're criticizing Google as if it were an elected official we could vote in or out of office, or as if it were an entity that could be convinced of its moral failings.
When we don't speak up and use our voice, we lose it.
Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't.
I could have, just the same, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly; it wasn't praise of their moral character. They did what was best for their business, and that turned out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit.
I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways".
Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?
You are not a "victim" for using or purchasing something which is completely unnecessary. Or if that's the case, then you have no agency and have to be medicinally declared unfit to govern yourself and be appointed a legal guardian to control your affairs.
What kind of world do you live in? Actually, Google ads tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, versus the pure junk ads that aren't personalized and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for the NYT, the Washington Post, The Information, etc. -- virtually any high-quality website (including Search).
Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user, which means ads are aimed at manipulating the user's behaviour in one way or another.
Outlook is not better in ways that email or Gmail users necessarily care about, and in my experience it gets in the way more than it helps with productivity or anything else it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter.
For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."
Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.
If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".
All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated, if not enshittified. Remember when Google told us never to worry about deleting anything? Then they started backing up my photos without me asking, and now they're constantly nagging me to pay a monthly fee.
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones.
But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.
Don't paint everything with a single, large, coarse brush stroke.
Google has always been there; it's just that many didn't realize DeepMind even existed, and I said years ago that they needed to be put to commercial use. [0] Also, Google AI != DeepMind.
You are now seeing their valuation finally adjusting to that fact, all thanks to DeepMind finally being put to use.
Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.
I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine).
Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro Models) get halfway decent results on many cases while Gemini Pro 2.5 struggled (it was overconfident in its initial assumptions).
So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is way more nuanced than their older model, but it still makes mistakes that the other two SOTA competitor models don't make. For example, in a law benchmark it forgets that certain principles don't apply in the country of the provided case. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the cultural context assumed by the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
I'm not surprised. I'm French, and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French and other languages where there is no such thing. A 100% American thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on US-centric data). At the very least it makes it easy to tell when someone is just copy-pasting LLM output into some other website.
Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is it just an incremental improvement, where some tests are solved a bit better, or is this a breakthrough where this model can do something all the others cannot?
That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
From an initial testing of my personal benchmark it works better than Gemini 2.5 pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state, and when the player has to do something, it asks me what card to play, discard, etc. The game is similar to something like Magic: The Gathering or Slay the Spire, with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it).
The test is just feeding the model the game rules document (Markdown) with a prompt asking it to simulate the game while delegating the player decisions to me; nothing special here.
It seems to forget rules less than Gemini 2.5 Pro with the thinking budget set to max. It's not perfect, but it helps a lot to test little changes to the game, rewind to a previous turn while changing a card on the fly, etc.
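For anyone curious, the harness is basically just a loop like the sketch below. `ask_model` is a hypothetical stand-in for whatever chat API you use, and the prompt wording is paraphrased, not my exact prompt.

```python
# Minimal sketch of the "LLM as game engine, human as player" loop described above.
# ask_model is a hypothetical helper: it should send the running transcript to
# whatever chat API you use and return the model's reply as text.
from typing import Dict, List

def ask_model(messages: List[Dict[str, str]]) -> str:
    raise NotImplementedError("wire this up to your chat API of choice")

def playtest(rules_markdown: str) -> None:
    messages = [{
        "role": "user",
        "content": (
            "Here are the rules of my card game:\n\n" + rules_markdown +
            "\n\nSimulate a full game. Track the board state yourself, but whenever the "
            "player must make a decision (play, discard, pay a cost), stop and ask me "
            "which cards to use, then wait for my answer before continuing."
        ),
    }]
    while True:
        reply = ask_model(messages)
        print(reply)
        messages.append({"role": "assistant", "content": reply})
        answer = input("> ")  # e.g. "discard cards 2 and 4 to pay for the Strike"
        if answer.strip().lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": answer})
```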
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.
Great improvement by only adding one feedback prompt:
Change the rotation axis of the wheels by 90 degrees in the horizontal plane. Same for the legs and arms
Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're actually going for. What does a good pelican-riding-a-bicycle SVG actually look like?
IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?
It would be next to impossible for anyone without insider knowledge to prove that to be the case.
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one-shot code.
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
Are you sure it's available in Cursor? (I get: "We're having trouble connecting to the model provider. This might be temporary - please try again in a moment.")
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort.
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
> Gemini app surpasses 650 million users per month
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can", especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice users make to go there, rather than an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else.
As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.
I don't know for sure, but they have to be counting users like me, whose phone had Gemini force-installed in an update and who have only opened the app by accident while trying to figure out how to invoke the old, actually useful Assistant app.
This is the benefit of bundling. I've been forecasting this for a long time: the only companies who would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI, due to sheer marketing dominance.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
For me it’s up and running.
I was doing some work in AI Studio when it was released and have already rerun a few prompts. It's also interesting that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.
You can bring your own Google API key to try it out, and Google used to give $300 free when signing up for billing and creating a key.
When I signed up for billing via the Cloud console and entered my credit card, I got $300 in "free credits".
I haven't thrown a difficult problem at Gemini 3 Pro yet, but I'm sure I got to see it in some of the A/B tests in AI Studio for a while. I could not tell which model was clearly better; one was always more succinct and I liked its "style", but they usually offered about the same solution.
I've been playing with the Gemini CLI with the Gemini 3 Pro preview. First impressions are that it's still not really ready for prime time within existing complex codebases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this, as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason. But then they also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, other than the way a typical test would (i.e., giving any answer at all is better than saying "I don't know", since the answer at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
Those numbers are too good to expect.
If 90% right / 10% wrong is the baseline, which of these would you take as an improvement (right / "I don't know" / wrong)?
- 80% right 18% I don't know 2% wrong
- 50%/48%/2%
- 10%/90%/0%
- 80%/15%/5%
The general point being that to reduce wrong answers, you will need to accept some reduction in right answers if the change is to be made purely through trade-offs. Otherwise you're just saying "I'd like a better system", which is rather obvious.
Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
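One way to make the comparison concrete is to score abstentions as zero and penalize wrong answers, the way some calibrated exams do. A sketch follows; the penalty weight is an arbitrary assumption, not something any benchmark prescribes.

```python
# Sketch of an abstention-aware score: +1 for a right answer, 0 for "I don't know",
# minus a penalty for a wrong answer. The penalty of 4 is an arbitrary choice; pick
# it to reflect how costly a confidently wrong answer is in your setting.
def penalized_score(right: float, idk: float, wrong: float, penalty: float = 4.0) -> float:
    assert abs(right + idk + wrong - 1.0) < 1e-9
    return right - penalty * wrong

splits = {
    "90 / 0 / 10 (baseline)": (0.90, 0.00, 0.10),
    "80 / 18 / 2":            (0.80, 0.18, 0.02),
    "50 / 48 / 2":            (0.50, 0.48, 0.02),
    "10 / 90 / 0":            (0.10, 0.90, 0.00),
    "80 / 15 / 5":            (0.80, 0.15, 0.05),
    "70 / 27 / 3":            (0.70, 0.27, 0.03),
}
for name, (r, i, w) in splits.items():
    print(f"{name:24s} -> {penalized_score(r, i, w):+.2f}")
# With penalty=4, 80/18/2 scores 0.72 vs 0.50 for the 90/0/10 baseline, matching the
# intuition that trading a few right answers for fewer wrong ones is an improvement.
```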
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising.
As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many topics are heavily censored. Obviously, a huge company like Google faces much heavier regulatory pressure than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.
Like replacing named concepts with nonsense words in reasoning benchmarks.
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl; riding a hang glider, a tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that secret itself, but more likely they just had humans generate an enormous number of examples and then synthetically built on that.
No benchmark is safe, when this much money is on the line.
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
I'm sure each of the frontier labs have some secret methods, especially in training the models and the engineering of optimizing inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (per 1M tokens)
Output: $10.00 -> $12.00 (per 1M tokens)
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
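For a rough sense of scale, here's a back-of-the-envelope sketch; the request shape is invented, and only the per-million prices come from the numbers above.

```python
# Back-of-the-envelope comparison for a single (invented) request shape, using the
# under-200k-context prices per million tokens quoted above.
def request_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

in_tok, out_tok = 50_000, 4_000                      # hypothetical agentic coding turn
old = request_cost(in_tok, out_tok, 1.25, 10.00)     # Gemini 2.5 Pro rates
new = request_cost(in_tok, out_tok, 2.00, 12.00)     # Gemini 3 Pro rates
print(f"old ${old:.4f} -> new ${new:.4f} ({100 * (new / old - 1):.0f}% increase)")
# old $0.1025 -> new $0.1480 (44% increase) for this input-heavy request
```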
Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.
I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker.
If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far; I'll just live without the desktop app.
I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.
Genuinely curious here: why is the desktop app so important?
I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?
Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.
(1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, e.g., I hit Option-SPC, which brings up a prompt bar. From there, taking a screenshot that will get sent along with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via a web UI. It may also be inconvenient to give context to a web UI in some cases where, e.g., I may have a folder of PDFs I want it to be able to reference.
(2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.
(3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.
I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
> The ability to add context via a local apps integration into OS level resources is big
Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
> Its own icon that I can CMD-TAB to is so much nicer
Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.
> having access to my chats for context has been repeatedly valuable to me.
I didn't at all consider this. Point ceded.
> I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.
> Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
Access to OS level resources != context pollution. You still have control, just more direct and less manual.
> The ones I've listed are open source and auditable.
Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.
But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s.
The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles.
Amazingly this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt.
The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem.
A reasonably decent programmer would solve the easiest problems in a very short amount of time.
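To give a flavour of the kind of language involved, here's a toy interpreter for a Herbert-like setup. To be clear, this is an illustrative approximation and not the real Herbert syntax: I'm assuming single-letter moves (s = step, l = turn left, r = turn right) and single-letter procedures that can call themselves.

```python
# Toy interpreter for a Herbert-LIKE language -- an illustrative approximation, not
# the real Herbert syntax. Assumed primitives: s = step forward, l = turn left,
# r = turn right. Single-letter procedures (e.g. "a": "sa") may call themselves;
# here recursion is simply cut off by a step budget.
from typing import Dict, List, Tuple

DIRS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # up, right, down, left

def run(program: str, procs: Dict[str, str],
        start: Tuple[int, int] = (0, 0), facing: int = 0,
        max_steps: int = 1000) -> List[Tuple[int, int]]:
    """Return the list of cells visited while executing `program`."""
    x, y = start
    visited = [(x, y)]
    pending = list(program)          # commands still to execute, front first
    steps = 0
    while pending and steps < max_steps:
        cmd = pending.pop(0)
        steps += 1
        if cmd == "s":
            dx, dy = DIRS[facing]
            x, y = x + dx, y + dy
            visited.append((x, y))
        elif cmd == "l":
            facing = (facing - 1) % 4
        elif cmd == "r":
            facing = (facing + 1) % 4
        elif cmd in procs:
            pending = list(procs[cmd]) + pending   # expand the procedure in place
    return visited

# "a" steps forward then calls itself: an endless straight line, capped by max_steps.
print(run("a", procs={"a": "sa"}, max_steps=12))
```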
Tested it on a bug that Claude and ChatGPT Pro struggled with; it nailed the diagnosis but only partially solved it (it was about matching data using a bipartite graph).
Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues.
For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better.
For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate Python files. I asked it for a website with a frontend and a Python backend, and it only gave a frontend. I asked again for a Python backend, and it just gives repeated server errors trying to write the Python files. Pretty shit experience.
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the Pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately; none of the other models did this, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
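For reference, the goal I'm asking the models to prove has roughly this shape in Lean 4 (the names are illustrative and my actual quicksort definition is omitted; `l.count a` is the number of occurrences of `a` in `l`, i.e. its multiplicity):

```lean
-- Shape of the goal only; the actual quicksort definition and proof attempt are
-- omitted (the `sorry`s mark the parts not shown here).
def quicksort : List Nat → List Nat := sorry

theorem quicksort_preserves_multiplicity (l : List Nat) (a : Nat) :
    (quicksort l).count a = l.count a := by
  sorry
```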
Hit the Gemini 3 quota on the second prompt in Antigravity even though I'm a Pro user. I highly doubt I hit a context window limit based on my prompt. Hopefully it's just first-day, near-general-availability jitters.
These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.
It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly).
You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.
Disallowing people from making terrible trades seems…paternalistic? Idk
The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets.
But what's the point of predicting how many times Elon will say "Trump" on an earnings call (or some random event Kalshi or Polymarket make up)? At least the stock market serves a purpose. People will claim "prediction markets are great for price discovery!" Ok. I'm so glad we found out the chance of Nicki Minaj saying "Bible" during some recent remarks. In case you were wondering, the chance peaked at around 45% and she did not say 'bible'! She passed up a great opportunity to buy the "yes" and make a ton of money!
I agree that the "will [person] say [word]" markets are stupid. "Will Brian Armstrong say the word 'Bitcoin' in the Q4 earnings call" is a stupid market because nobody a actually cares whether or not he actually says 'Bitcoin', they care about whether or not Coinbase is focusing on Bitcoin. If Armstrong manipulates the market by saying the words without actually doing anything, nobody wins except Armstrong. "Will Coinbase process $10B in Bitcoin transactions in Q4" is a much better market because, though Armstrong could still manipulate the market's outcome, his manipulation would influence a result that people actually care about. The existence of stupid markets doesn't invalidate the concept.
And? Insider trading is bad because it's unfair, and the stock market is supposed to be fair. Prediction markets are not fair. If you are looking for a fair market, prediction markets are not that. Insider trading is accepted and encouraged in prediction markets because it makes the predictions more accurate, which is the entire point.
By 'fair', I mean 'all parties have access to the same information'. The stock market is supposed to give everyone the same information. Trading with privileged information (insider trading), is illegal. Publicly traded companies are required to file 10-Qs and 10-Ks. SEC rule 10b5-1 prohibits trading with material non-public information. There are measures and regulations in place to try to make the stock market fair. There are, by design, zero such measures with prediction markets. Insider trading improves the accuracy of prediction markets, which is their whole purpose to begin with.
> Brian Armstrong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call.
For the kind of person playing these sorts of games, that's actually really "hype".
I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released.
The mention markets are pure degenerate gambling and everyone involved knows that
Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all)
I would like to try controlling my browser with this model. Any ideas how to do this?
Ideally I would like something like OpenAI's Atlas or Perplexity's Comet, but powered by Gemini 3.
They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.
Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).
I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.
"Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
They probably invested a couple of billion into this release (it's great as far as I can tell), but can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you merely return to a tab that has already finished generating).
entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Even if the benchmarks themselves are kept secret, the process to create them is not that difficult, and anyone with a small team of engineers could make a replica in their own lab to train their models on.
Given the nature of how those models work, you don't need exact replicas.
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
Really exciting results on paper. But truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data used from users. Hope the data used to train was obtained with consent from the creators
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.
It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.
It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is January 2025. I do wonder how long they have been pre-training this thing :D
If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's AI Vertex platform. If you don't even want a Google Cloud account, then I think the answer is no unless they announce a partnership with an inference cloud like cerebras.
I still need a Google account to use it, and it always asks me for phone verification, which I don't want to give to Google. That prevents me from using Gemini. I would even pay for it.
While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required.
Boring. Tried to explore sexuality-related topics, but Alphabet is stuck in some Christian Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models, and it seems you have to be extra careful with wording with Gemini. Once you have the right wording, like "explore my own sexuality", and avoid certain words, you can get it going pretty interestingly.
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
Interesting that they added an option to select your own API key right in AI Studio's input field.
I sincerely hope the times of generous free AI Studio usage are not over.
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
> We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".
I've seen more and more of these kinds of hallucinations. As these models seem to be RL'd to not be sycophants, they're slowly inching in the opposite direction, where they tell small fibs or embellish in a way that seems meant to add more weight to their answers.
I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise
I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro has been clearly smarter than everything else for a long time, but now 3 seems not even as good as that in comparison to the latest releases (5.1 etc.). What is going on?
I'm not a mathematician, but I think we underestimate how useful pure mathematics can be for telling whether we are approaching AGI.
Can the mathematicians here try asking it to invent novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be gated to a maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.
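For context, min_p itself is trivial to implement when you control the decoding loop. A minimal sketch of the idea, assuming you have raw next-token probabilities from a local model (hosted APIs like AI Studio don't expose them):

```
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Keep tokens whose probability is at least min_p * p(top token), then renormalize."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# Toy example: with a peaked distribution few tokens survive the filter, with a flat
# one many do, which is why min_p tolerates much higher temperatures than a fixed
# top_k or top_p cutoff.
rng = np.random.default_rng(0)
logits = rng.normal(size=50)
probs = np.exp(logits) / np.exp(logits).sum()
token_id = rng.choice(len(probs), p=min_p_filter(probs, min_p=0.1))
print(token_id)
```

The point is that the filter adapts to the shape of the distribution, which a fixed top_k (like the hidden 64 here) does not.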
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
Yea, great... when will I be able to have it dial a number on my Google Pixel? Seriously... Gemini absolutely sucks on Pixel since it can't interact with the phone itself, so it can't dial numbers.
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail is being read by Gemini and is being put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze users' data, which potentially includes Gmail by default.
Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.
I'm very doubtful Gmail messages are used to train the model by default, because emails contain private data, and as soon as that private data shows up in model output, Gmail is done.
"Gmail being read by Gemini" does NOT mean "Gemini is trained on your private Gmail correspondence". It can mean Gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.
I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
By the year 2025, I think most HN regulars and IT people in general are so jaded about privacy that it doesn't even surprise anyone. I suspect all Gmail has been analyzed and read since the beginning of the Google age, so nothing really changed; they might as well just admit it.
Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.
It seems that Google didn't prepare the Gemini 3 release well and leaked a lot of content early, including the model card earlier today and Gemini 3 on aistudio.google.com.
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story.
https://xcancel.com/xundecidability/status/19908286970881311...
He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy.
> Eliezer Yudkowsky is a central figure in the article, mentioned multiple times as the intellectual originator of the community from which the "Zizians" splintered. His ideas and organizations are foundational to the entire narrative.
And yet you could eliminate him entirely and the story is still coherent.
The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.
It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let us just return to the discussion of Gemini 3: do you think the model did a bad job, then, in its second response?
It literally does not matter how much they are connected out here in reality; the AI was asked to summarize the information in the article, and that is exactly what it did.
>do you think the model did a bad job then in it's second response
Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.
Also, that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure in the story.
Here's another attempt:
llm --cid 01kabxtjq10exgk56yf802028f "I notice you did not mention Yudkowsky?" --no-log -m gem3-t1
Based on the text provided, Eliezer Yudkowsky is a central background figure to this story, serving as the intellectual progenitor of the movement from which the Zizians splintered.
Here is specifically how he features in the article:
* *The Founder:* Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.
* *The Source Code:* The Zizians based their radical worldview on Yudkowsky’s concepts, specifically *"timeless decision theory."* However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky’s organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.
Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky.
Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky.
It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and gpt-5.1 and sonnet 4.5 (thinking) do not come close.
The token efficiency and context are also mindblowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real-world applications that are larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory allows them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.
Out of curiosity, I gave it the latest project euler problem published on 11/16/2025, very likely out of the training data
Gemini thought for 5m10s before giving me a python snippet that produced the correct answer. The leaderboard says that the 3 fastest human to solve this problem took 14min, 20min and 1h14min respectively
Even thought I expect this sort of problem to very much be in the distribution of what the model has been RL-tuned to do, it's wild that frontier model can now solve in minutes what would take me days
I also used Gemini 3 Pro Preview. It finished it 271s = 4m31s.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.
Still a useful tool though. It definitely gets the majority of the insights.
Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
> It also returned 8 "sources"
Well, there's your problem. It behaves like a search summary tool and not like a problem solver if you enable Google Search.
Exactly this - and how chatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.
The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.
I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild.
On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.
Imagine the metrics though. "This quarter we've had a 12% increase in people using AI solutions in their Google Drive."
Why is this sad? You should be rooting for these LLMs to be as bad as possible.
If we've learned anything so far it's that the parlor tricks of one-shot efficacy only gets you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart roughly the same. Even when I've used Sonnet 4.5 with 1M token context the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
It depends on your definition of safe. Most of the code that gets written is pretty simple — basic crud web apps, WP theme customization, simple mobile games… stuff that can easily get written by the current gen of tooling. That already has cost a lot of people a lot of money or jobs outright, and most of them probably haven’t reached their skill limit a as developers.
As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are which is easy and reliable; the current choice is upskill or get a new career. Rather than switch careers to something you have zero experience in. That puts pressure on the moderately higher-skill job market with far fewer people, and they start to upskill to outrun the implosion, which puts pressure on them to move upward, and so on. With even modest productivity gains in the whole industry, it’s not hard for me to envision a world where general software development just isn’t a particularly valuable skill anymore.
Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need a real throat to choke to show your leadership skills.
Generally, any expert hopes their tool/paintbrush/etc is as performant as possible.
That ship has sailed long ago.
I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.
Confirmed less wrong psyop victim
Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.
How is it useful, other than for people making money off token output? Continue to fry your brain.
They’re fantastic learning tools, for a start. What you get out of them is proportional to what you put in.
You’ve probably heard of the Luddites, the group who destroyed textile mills in the early 1800s. If not: https://en.wikipedia.org/wiki/Luddite
Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.
A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.
You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.
are you pretending to be confused?
Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low
I have had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.
ChatGPT solves this problem now as well with 5.1. Time for a new litmus test.
To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I’d love if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.
I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.
I'm not explaining myself right.
Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.
Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
I don't predict the future and I'm very skeptical of anybody who claims to do so, correctly predicting the present is already hard enough, I'm just saying that given the progress we've already made I would find plausible that a system like that could be made in a few years. The details of what it would look like are beyond my pay grade.
---
[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.
Yeah, it is often pointed out as a brilliancy in game analysis if a GM makes a move that an engine says is bad and it turns out to be good. However, it only happens in very specific positions.
> Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good.
Do you have any links? I haven't seen any such (forget GM, not even Magnus), barring the opponent making mistakes.
Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?
Both, but perhaps more often neither.
From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".
So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.
That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.
This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.
Poker is funny because you have humans emulating human-beating machines, but that’s hard enough to do that you have players who don’t do this win as well.
I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.
It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare those days. It was funny in certain positions a few years back to see the engine "change its mind" including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.
But mostly what happens is that a move isn't so good, but it isn't so bad either; even though the computer will tell you it is sub-optimal, a human won't be able to refute it in finite time, so his practical (as opposed to theoretical) chances are reduced. One great recent example is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception, a move the computer says is borderline incorrect, but one that leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.
It can be either one. In closed positions, it is often the latter.
It's only the latter if it's a weak browser engine, and it's early enough in the game that the player had studied the position with a cloud engine.
I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).
That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)
The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.
If you look on Youtube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
I recommend Matthew Sadler's Game Changer and The Silicon Road To Chess Improvement.
You explained yourself right. The issue is that you keep qualifying your statements.
> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...
> ... in tournament conditions.
I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?
I can wreck stockfish in chess boxing. Mostly because stockfish can't box, and it's easy for me to knock over a computer.
If it runs on a mainframe you would lose both the chess and the boxing.
The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.
How are they faster? I don't think any Elo report actually comes from participating in a live coding contest on previously unseen problems.
My background is more on math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.
Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.
Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970
I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3) and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
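For anyone who hasn't met RK4 before: it's just the classical fixed-step Runge-Kutta integrator, nothing exotic. A generic sketch (not the specific ODE from the problem, which I won't reproduce here):

```
def rk4_step(f, t, y, h):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y) with step size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Sanity check on dy/dt = y with y(0) = 1: integrating to t = 1 should give
# something very close to e = 2.718281828...
y, h = 1.0, 0.01
for i in range(100):
    y = rk4_step(lambda t, y: y, i * h, y, h)
print(y)
```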
The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
This is very obviously an AI generated comment.
“This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.”
Not sure about the motive here.
Hmmm, I wrote those words myself, maybe I've spent too much time with LLMs and now I'm talking like them??
I'd be interested in any evidence-based arguments you might have beyond attacking my writing style and insinuating bad intent.
I found this commenter had sage advice about how to use HN well, I try to follow it: https://news.ycombinator.com/item?id=38944467
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many peoples' brains.
I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p
Is there a solution to this exact problem, or to related notions (renewal equations etc.)? Anyway, it seems like nothing beats training on the test set.
gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.
Are you sure it did not retrieve the answer using websearch?
Did it search the web?
Yeah, LLMs used to not be up to par for new Project Euler problems, but GPT-5 was able to do a few of the recent ones which I tried a few weeks ago.
So when does the developer admit defeat? Do we have a benchmark for that yet?
According to a bunch of philosophers (https://ai-2027.com/), doom is likely imminent. Kokotajlo was on Breaking Points today. Breaking Points is usually less gullible, but the top comment shows that "AI" hype strategy detection is now mainstream (https://www.youtube.com/watch?v=zRlIFn0ZIlU):
AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."
I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.
Does it matter if it is out of the training data? The models integrate web search quite well.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
Wow. Sounds pretty impressive.
This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
This is quite likely to be in the training data, since it's one of the projects in Wes Bos's free 30 days of Javascript course[0].
[0] https://javascript30.com/
I was under the impression for this to work like that, training data needs to be plenty. One project is not enough since it’s too "sparse".
But maybe this example was used by many other people and so it proliferated?
The repo[0] currently has been forked ~41300 times.
[0] https://github.com/wesbos/JavaScript30
The subtle "wiggle" animation that the second hand makes after moving doesn't fire when it hits 12. Literally unwatchable.
In its defence, the code actually specifically calls that edge case out and justifies it:
The Swiss and German railway clocks actually work the same way and stop for (half a?) second while the minute hand advances.
https://youtu.be/wejbVtj4YR0
The video shows closer to 2 seconds for it to finally throw itself over in what could only be described as a "Thunk". I figured it would be a little more smooth.
Station clocks in Switzerland receive a signal from a master clock each minute that advances the minute hand; the second hand moves completely independently of the minute hand. This allows them to sync to the minute.
> The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.
https://en.wikipedia.org/wiki/Swiss_railway_clock
In defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage, which I was amused by. It uses kiloseconds/megaseconds/etc.; there are 86.4 ks/day. The "seconds" hand goes around every 1000 seconds, which ticks over the "hour" hand. Instead of saying 4 am, you'd say it's 14.
As a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. Of course, this can work off-Earth, where you would simply use 88.775 ks as the "day"; the "dates" a Martian and an Earthling share with each other would be interchangeable.
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
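If anyone wants to play with the idea without the webpage, the conversion itself is a one-liner per unit. A rough sketch (illustrative only; off-Earth you'd swap 86,400 s for the 88.775 ks Martian day):

```
import time

def metric_unix(t: float) -> str:
    """Break a UNIX timestamp into gigaseconds / megaseconds / kiloseconds / seconds."""
    gs, rem = divmod(t, 1_000_000_000)
    ms, rem = divmod(rem, 1_000_000)
    ks, s = divmod(rem, 1_000)
    return f"{int(gs)} Gs {int(ms)} Ms {int(ks)} ks {s:05.2f} s AUNIX"

now = time.time()
print(metric_unix(now))  # currently a bit over 1.76 Gs since the epoch
print(f"{(now % 86_400) / 1_000:.3f} ks into the current 86.4 ks Earth day")
```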
https://ai.studio/apps/drive/1oGzK7yIEEHvfPqxBGbsue-wLQEhfTP...
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
That is not the same prompt as the other person was using. In particular this doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This also includes javascript.
The prompt the other person was using is:
```
Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.
```
Which is much more difficult.
For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...
URL not found :(
This is cool. Gemini 2.5 Pro was also capable of this. Gemini was able to recreate famous piece of clock artwork in July: https://gemini.google.com/app/93087f373bd07ca2
"Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo
https://ai.studio/apps/drive/1yAxMpwtD66vD5PdnOyISiTS2qFAyq1... <- this is very nice, I was able to make seconds smooth with three iterations (it used svg initially which was jittery, but eventually this).
"Allow access to Google Drive to load this Prompt."
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
You don't want to give Google access to files you've stored in Google Drive? It's also only access to an application specific folder, not all files.
Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and it probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info.
Because most likely (at least according to Hanlon's razor) they somehow decided that using Google Drive as the only persistent storage backing AI studio was a reasonable UX decision.
It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.
It looks quite nice, though to nitpick, it has “quartz” and “design & engineering” for no reason.
Just like actual cheap but not bottom of the barrel clocks
holy shit! This is actually a VERY NICE clock!
Having seen the page the other day this is pretty incredible. Does this have the same 2000 token limit as the other page?
This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so.
It also includes javascript, which was verboten in the original prompt, and it doesn't specify the time the clock should be set to.
Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/
Considering how important this benchmark has become to the judgement of state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person who's working around the clock on training the model to make better and better SVG pelicans on bikes.
That would mean my dastardly scheme has finally come to fruition: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Pelican guy may be one of the last jobs to be automated
They've been training for months to draw that pelican, just for you to move the goalposts.
Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.
I don't think the big labs would waste their time on it. If a model is great at making the pelican but sucks at all other svg it becomes obvious. But so far the good pelicans are strong indicators of good general SVG ability.
Unless training on the pelican increases all SVG ability, then good job.
I absolutely think they would given the amount of money and hype being pumped into it.
It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?
I updated my benchmark of 30 pelican-bicycle alternatives that I posted here a couple of weeks ago:
https://gally.net/temp/20251107pelican-alternatives/index.ht...
There seem to be one or two parsing errors. I'll fix those later.
I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
The model card says: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
> This model is not a modification or a fine-tune of a prior model.
I'm curious why they decided not to update the training data cutoff date too.
Maybe that date is a rule of thumb for when AI generated content became so widespread that it is likely to have contaminated future data. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.
>>benchmarks are meaningless
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.
This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.
I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have, all things considered? That is, saying something to the effect of "well, I know it's not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if it's not totally unbelievable, I guess.
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
After taking a walk for a bit, I decided you're right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I've run.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.
My bad to the Google team for the cursory brush-off.
Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.
> This probably means my test is a little too niche.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.
Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.
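Mechanically it's just two calls chained together. A sketch of that flow, where `call_model` is a stand-in for whatever client you actually use (a hypothetical helper, not a real API):

```
def call_model(prompt: str) -> str:
    """Placeholder for your actual LLM client call (SDK, HTTP request, CLI, etc.)."""
    raise NotImplementedError

def two_stage(task: str) -> str:
    # Stage 1: ask the model to write a detailed spec prompt for the task,
    # explicitly calling out edge cases and bug-prone implementation details.
    spec = call_model(
        f"Create a prompt that creates a specification for: {task}. "
        "Consider edge cases and key implementation details that result in bugs."
    )
    # Stage 2: execute the generated spec as the actual prompt.
    return call_model(spec)

# e.g. two_stage("a pacman game in a single html page")
```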
I thought this kind of chaining was already part of these systems.
It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.
Your benchmarks should not involve IP.
The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.
It's not an ethics thing. It's a guardrails thing.
That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").
Why? This seems like a reasonable task to benchmark on.
Because you hit guard rails.
Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.
correction: pacman is not a human and has no soul.
Why do you have to willfully misinterpret the person you're replying to? There's truth in their comment.
How can you be sure that your benchmark is meaningful and well designed?
Is the only thing that prevents a benchmark from being meaningful publicity?
I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
Thanks
[0] https://news.ycombinator.com/item?id=45968665
> if it's not public, presumably LLMs would never get better at them.
Why? This is not obvious to me at all.
You're correct of course - LLMs may get better at any task of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task. If the eval was actually picked up / used in the training loop, of course.
That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
Yeah, great point.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
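As a toy sketch of that idea (dynamically generated questions with a definitive, programmatically known answer), assuming a much harder problem family than this in practice:

```
import random

def make_question(rng: random.Random) -> tuple[str, int]:
    """Generate a small word problem whose answer is computed, not memorized."""
    boxes, items, removed = rng.randint(15, 50), rng.randint(2, 50), rng.randint(1, 10)
    question = (
        f"A warehouse has {boxes} boxes with {items} items each. "
        f"{removed} boxes are removed. How many items remain?"
    )
    return question, (boxes - removed) * items

def score(model_reply: str, expected: int) -> bool:
    """Crude check: does the expected number appear in the model's reply?"""
    return str(expected) in model_reply

rng = random.Random(42)
q, a = make_question(rng)
print(q, "->", a)
```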
Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....
GPT-4/o3 might be the best we will ever have
I moved from using the model for Python coding to Go coding and got incredible speedups in arriving at a correct version of the code
Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?
I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much
it's easy to focus on what they can't do
Everything is about context. When you ask a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. That's where all the "thinking" goes.
Ask it to implement tic-tac-toe in Python for the command line. Or even just bring your own tic-tac-toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
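For what it's worth, the "bring your own code" baseline can be tiny. A minimal sketch of the kind of board/win-check helper you might paste in (purely illustrative, not any particular poster's code):

```python
# Minimal command-line tic-tac-toe helpers to hand the model as context.
def render(board):
    rows = [" | ".join(board[i:i + 3]) for i in (0, 3, 6)]
    return "\n---------\n".join(rows)

def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

if __name__ == "__main__":
    board = list("X O XO  X")  # example position with X winning on the diagonal
    print(render(board))
    print("winner:", winner(board))
```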
The prompt was very concrete: draw a tic-tac-toe ASCII table and let's play. Gemini 2.5 thought for pages about particular moves.
What's the benchmark?
I don't think it would be a good idea to publish it on a prime source of training data.
He could post an encrypted version and post the key with it to avoid it being trained on?
What makes you think it wouldn't end up in the training set anyway?
Every AI corp has people reading HN.
I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today.
Good personal benchmarks should be kept secret :)
nice try!
You already sent the prompt to the Gemini API, and they likely recorded it. So in a way they can access it anyway; posting it here or not doesn't matter in that respect.
curious if you tried grok 4.1 too
Could also just be rollout issues.
Could be. I'll reply to my comment later with pass/fail results of a re-run.
I'm dying to know what you're giving it that it's choking on. It's actually really impressive if that's the case.
I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in a filename), and a near-complete inability to do relatively basic image processing tasks (if they don't rely on template matches).
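The vector-projection case is the easiest one to sanity-check against a reference. A minimal sketch of the standard formula (numpy assumed, names are mine):

```python
import numpy as np

def project(a, b):
    """Orthogonal projection of vector a onto vector b: (a·b / b·b) * b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a @ b) / (b @ b) * b

# Quick sanity check: projecting (3, 4) onto the x-axis keeps only the x component.
assert np.allclose(project([3, 4], [1, 0]), [3, 0])
```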
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
that's why everyone using AI for code should code in rust only.
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.
Something like
1. Split audio into multiple smaller tracks.
2. Perform a first-pass audio extraction.
3. Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off).
4. Seed the next stage with that information (yay multimodality) and generate the audio transcript for it.
Obviously it would be ideal if a model could handle the ultra long context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution.
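Something like this, as a rough sketch of that harness; split_audio and transcribe_chunk are hypothetical placeholders for your audio tooling and model API, not real library calls:

```python
# Chunk the audio, then carry a rolling summary and a speaker roster forward
# so each chunk is transcribed with context from the previous ones.

def split_audio(path, chunk_minutes):
    # Placeholder: return handles to smaller tracks.
    raise NotImplementedError

def transcribe_chunk(audio, known_speakers, context_summary):
    # Placeholder: one multimodal call returning segments, speakers, and a summary.
    raise NotImplementedError

def transcribe_long_recording(path, chunk_minutes=10):
    speakers, summary, transcript = {}, "", []
    for chunk in split_audio(path, chunk_minutes):   # 1) smaller tracks
        result = transcribe_chunk(                    # 2) per-chunk pass
            audio=chunk,
            known_speakers=speakers,                  # 3) carry the roster forward
            context_summary=summary,                  # 4) seed with where we left off
        )
        speakers.update(result["speakers"])
        summary = result["summary"]
        transcript.extend(result["segments"])
    return transcript
```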
The worst is when it fails to ingest simple PDF documents and then lies and gaslights in an attempt to cover it up. Why not just admit you can't read the file?
This is specifically why I don't use Gemini. The gaslighting is ridiculous.
I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.
What prompt do you use for that?
I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
Super impressive! Does it deduce everyone's name?
It does! I redacted them, but yes. This was a 3-person call.
I made a simple webpage to grab text from YouTube videos (https://summynews.com), which is great for this kind of testing. (I want to expand to other sources in the long run.)
It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.
Parakeet TDT v3 would be really good at that
Yes, this is the best solution for that goal. Use the MacWhisper app + Parakeet 3.
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
Excellent reference. I tried to name an AI project at work "Electric Monk" but it was too 'controversial'.
Had to change it to "Electric Mentor"....
SMBC had a pretty great take on this: https://www.smbc-comics.com/comic/summary
This feels too real to laugh at
Now let’s hope that it will also save labour on resolving cloud infrastructure downtimes too.
After outsourcing developer jobs, we can outsource all of the manager jobs too, leaving the CEO with agentic AI code as their servant.
Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
"Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"
> Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
> Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.
Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
> Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons Paradox, the world may never know the answer to either of these questions.
It does improve productivity, just like a good IDE. But engineers didn't get replaced by IDEs and they haven't yet been replaced by AI.
By the time it's good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots; then every job will be automated, not just computer-based ones.
Your assertion was not that "an engineer has never been replaced by AI". It is that no engineer has been laid off because of AI.
You agree AI improves engineer productivity. So last remaining question is, does greater productivity mean that fewer people are required to satisfy a given demand?
The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.
"Not a single engineer has ever been laid off because of AI."
Are you insane??? Big tech has literally made some of its biggest layoffs over the past few months.
That's because of overhiring and other non-ai related reasons (i.e. Higher interest rates means less VC funding available).
In reality, getting AI to do actual human work, as of the moment, takes much more effort and cost than you get back in cost savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.
The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.
Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg however is clearly a leg, despite being where you would expect the dogs member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
I think in this case, tokenization and perception are somewhat analogous. It is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with, if you allow the analogy.
The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
(Source: "The mind is flat" by Nick Chater)
It's also easy to spot: when you are tired you might misrecognize objects. I've caught myself doing this on long road trips.
Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think matching the abilities of even a bug's brain might be very hard, but that doesn't mean there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms, or whatever the precise jargon is.
> How do brains do it with such a low amount of energy?
Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
[1] https://en.wikipedia.org/wiki/Etched_(company)
[2] https://spectrum.ieee.org/neuromorphic-computing-ibm-northpo...
This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
My armchair is equally comfy, and I have an actual paper to point to:
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
[1] https://www.biorxiv.org/content/10.1101/2024.08.21.608979v2....
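To make the contrast concrete, even a leaky integrate-and-fire neuron (still far simpler than the nine-channel biophysical models in the paper) already has state and dynamics that a ReLU lacks. A toy sketch, my own numbers:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def lif_spikes(inputs, tau=20.0, v_thresh=1.0, dt=1.0):
    """Toy leaky integrate-and-fire neuron: membrane state decays, spikes reset it."""
    v, spikes = 0.0, []
    for i in inputs:
        v += dt * (-v / tau + i)   # leaky integration of input current
        if v >= v_thresh:
            spikes.append(1)
            v = 0.0                # reset after a spike
        else:
            spikes.append(0)
    return spikes

print(relu(np.array([-1.0, 0.5])))   # stateless: negative input just clips to 0
print(lif_spikes([0.3] * 10))        # stateful: spike timing depends on input history
```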
Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.
ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.
Op is right.
https://imgcdn.stablediffusionweb.com/2024/4/19/8e54925a-004...
For the above pic I asked "What is wrong with the image?"
Result: - It totally missed the most obvious one - six fingers.
Instead it said this:
Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:
- The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.
- Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.
- Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.
- Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.
[Edit: 3.0 is same as 2.5 - both answered almost identically]
JFYI, Qwen managed to recognize the sixth finger:
Max: https://chat.qwen.ai/c/ca671562-7a56-4e2f-911f-40c37ff3ed79
VL-235B: https://chat.qwen.ai/c/21cc5f4e-5972-4489-9787-421943335150
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
What I would do if I was in the position of a large company in this space is to arrange an internal team to create an ARC replica, covering very similar puzzles and use that as part of the training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I think this is also fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
The real strength of current neural nets/transformers relies on huge datasets.
ARC does not provide this kind of dataset, only a small public one and a private one where they do the benchmarks.
Building your own large private ARC set does not seem too difficult if you have enough resources.
Doesn't even matter at this point.
We have a global RL pipeline on our hands.
If there is something new an LLM/AI model can't solve today, plenty of humans can't either.
But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train with new scenarios, it will start to bleed out the humans in the loop.
That's ok; just start publishing your real problems to solve as "AI benchmarks" and then it'll work in ~6 months.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real-world/reasoning performance, since the lack of a memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
[1] https://www.youtube.com/watch?v=f58kEHx6AQ8
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but currently its a race to lock-in users to your model and make switching costs high.
Benchmarks are intended as proxy for real usage, and they are often useful to incrementally improve a system, especially when the end-goal is not well-defined.
The trick is to not put more value in the score than what it is.
> internal team to create an ARC replica, covering very similar puzzles
they can target benchmark directly, not just replica. If google or OAI are bad actors, they already have benchmark data from previous runs.
The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
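A rough sketch of that protocol; the grader is a simplistic substring check and ask_model is whatever callable wraps your model API, both assumptions of mine:

```python
import random

def perturb(question, noise_snippets, n_noise=3, seed=None):
    """Inject irrelevant 'red herring' sentences into the prompt and shuffle."""
    rng = random.Random(seed)
    parts = [question] + rng.sample(noise_snippets, n_noise)
    rng.shuffle(parts)
    return "\n".join(parts)

def robust_score(question, expected, ask_model, noise_snippets, runs=20):
    """Average correctness over many perturbed variants of the same question."""
    correct = 0
    for seed in range(runs):
        answer = ask_model(perturb(question, noise_snippets, seed=seed))
        correct += int(expected in answer)   # crude grading for illustration only
    return correct / runs
```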
they have two sets:
- semi-private, which they use to test proprietary models and which could be leaked
- private: used to test downloadable open-source models.
The ARC-AGI prize itself is for open-source models.
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.
It is not that hard, really, just tedious.
Humans study for tests. They just tend to forget.
Agreed, it also leads in performance on ARC-AGI-1. Here's the leaderboard, where you can toggle between ARC-AGI-1 and 2: https://arcprize.org/leaderboard
It leads on arc-agi-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3
There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.
ARC-AGI has a hidden private test suite, right? No model will have access to that set.
I doubt they have offline access to the model, i.e. the prompts are sent to the model provider.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
That looks great, but what we all care about is how it translates to real-world problems like programming, where it isn't really excelling by 2x.
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works, but that I would never want to have to interact with.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up
I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad joke generator and then have it also create a comic styled 4 cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.
I can relate to this, it's doing exactly what I want, but it ain't pretty.
It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself
A nice Easter egg in the Gemini 3 docs [1]:
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...
It's an artifact of the problem that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. It sucks from a GDPR compliance perspective as well as in terms of transparent pricing: you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high, but if the model decides to think longer, "low" could result in more tokens used than "high" does for a prompt where the model decides not to think much. "Thinking budgets" are now "legacy", and thus while you can constrain output length you cannot constrain cost.
Obviously you also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. Obviously you can try different variations of different parts of your system prompt or tool descriptions, but just because they result in fewer thinking tokens does not mean they are better, if those reasoning steps were actually beneficial (if only in edge cases). This would be immediately apparent upon inspection but is hard/impossible to find out without access to the full chain of thought.
For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were A. to prevent rapid distillation, which they suspected DeepSeek of having used for R1, and B. to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater and will, for certain applications, certainly outweigh the performance delta to Google/OpenAI.
I was under the impression that those reasoning outputs that you get back aren't references but simply raw CoT strings that are encrypted.
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
To what?
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.
Anecdotal, but I've also not experienced any regression in Gemini quality, whereas Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have.
This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.
I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks but wasn't able to convince it to first present a plan before starting implementation.
I have instructions describing that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.
For Codex, those instructions make plan mode redundant.
just say "don't code yet" at the end. I never use plan mode because plan mode is just a prompt anyways.
I've been working with it, and so far it's been very impressive. Better than Opus in my feels, but I have to test more, it's super early days
What I usually test is trying to get them to build a full, scalable SaaS application from scratch... It seemed very impressive in how it did the early code organization using Antigravity, but then at some point it suddenly started getting stuck and constantly stopped producing, and I had to trigger it to continue, or babysit it. I don't know if I could have been doing something better, but that was just my experience. It seemed impressive at first, but otherwise, at least vs Antigravity, Codex and Claude Code scale more reliably.
Just an early anecdote from trying to build that one SaaS application, though.
API pricing is up to $2/M for input and $12/M for output
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.
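Back-of-the-envelope comparison at those listed rates, for a hypothetical task with 100k input and 20k output tokens (the workload is my assumption):

```python
# Rough per-task cost in USD at the listed rates (per million tokens).
rates = {
    "gemini-3-pro":   (2.00, 12.00),
    "gemini-2.5-pro": (1.25, 10.00),
    "sonnet-4.5":     (3.00, 15.00),
}
in_tok, out_tok = 100_000, 20_000
for model, (cin, cout) in rates.items():
    cost = in_tok / 1e6 * cin + out_tok / 1e6 * cout
    print(f"{model}: ${cost:.3f}")
# gemini-3-pro: $0.440, gemini-2.5-pro: $0.325, sonnet-4.5: $0.600
```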
It is so impressive that Anthropic has been able to maintain this pricing still.
Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.
Honestly, Google models have this mix of smart/dumb that is scary. Like, if the universe gets turned into paperclips, it'll probably be a Google model that does it.
Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.
Idk Anthropic has the least consistent models out there imho.
Because every time I try to move away I realize there’s nothing equivalent to move to.
People insist upon Codex, but it takes ages and has an absolutely hideous lack of taste.
Taste in what?
Wines!
It's interesting that grounding with search cost changed from
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
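Whether that's cheaper depends on how many searches a grounded prompt triggers; a quick sketch of the break-even point (searches-per-prompt figures are hypothetical):

```python
# Old: $35 per 1,000 grounded prompts. New: $14 per 1,000 search queries.
# Break-even: a grounded prompt costs the same once it averages 35/14 = 2.5 searches.
old_per_prompt = 35 / 1000
new_per_search = 14 / 1000
for searches_per_prompt in (1, 2, 2.5, 4):
    new_per_prompt = searches_per_prompt * new_per_search
    verdict = "cheaper" if new_per_prompt < old_per_prompt else "not cheaper"
    print(f"{searches_per_prompt} searches/prompt: ${new_per_prompt:.4f} ({verdict})")
```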
With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.
There's a waitlist for using Gemini 3 for Gemini CLI free users: https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
In case anyone wants to confirm if this link is official, it is.
https://goo.gle/enable-preview-features
-> https://github.com/google-gemini/gemini-cli/blob/release/v0....
--> https://goo.gle/geminicli-waitlist-signup
---> https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
Thrilled to see the cost is competitive with Anthropic.
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren’t necessarily those with the best models, but those who already control the surface where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
AI Overviews has arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.
> The winners aren’t necessarily those with the best models
Is there evidence that's true? That the other models are significantly better than the ones you named?
> Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads: https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...
Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.
Microsoft hasn't been very quiet about it, at least in my experience. Every time I boot up Windows I get some kind of blurb about an AI feature.
Man, remember the days where we'd lose our minds at our operating systems doing stuff like that?
The people who lost their minds jumped ship. And I'm not going to work at a company that makes me use it, either. So, not my problem.
I've been so happy to see Google wake up.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life. Particularly in their P/E, which was unjustly low for a while.
Ironically, OpenAI was conceived as a way to balance Google's dominance in AI.
Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. At the time of its founding Google was the undisputed leader of AI anywhere in the world. Musk was then very worried about AGI being developed behind closed doors particularly Google, which was why he was the driving force behind the founding of OpenAI.
The book Empire of AI describes him as being particularly fixated on Demis as some kind of evil genius. From the book, early OAI employees couldn’t take the entire thing too seriously and just focused on the work.
I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist.
That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” was search and RL which Google and deep mind were dominating, and self driving, which Waymo was leading. And OpenAI was conceptualized as a research org to compete. A lot has changed and OpenAI has been good at seeing around those corners.
That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future.
I think that Google didn't see the business case in that generation of models, and also saw significant safety concerns. If AI had been delayed by... 5 years... would the world really be a worse place?
Yes - less exciting! But worse?
Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit.
Pffft. OpenAI was conceived to be Open, too.
It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold then become progressively less open once they get bigger. Android is a great example.
Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently?
I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said.
There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.
For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].
Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].
In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.
[1] https://arstechnica.com/gadgets/2018/07/googles-iron-grip-on...
[2] https://arstechnica.com/gadgets/2025/08/google-will-block-si...
Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source.
The Android Open Source Project is not Android.
> The Android Open Source Project is not Android.
Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.
That's the same thing what GrapheneOS, /e/ OS and others are doing - building on top of AOSP.
They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Many of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.
This is an understandable, but simplistic, way of looking at the world. Are you also going to blame Apple for mining rare earths, because they made a successful product that requires exotic materials which need to be mined from the earth? How about the hundreds of thousands of factory workers who are subjected to inhumane conditions to assemble iPhones each year?
For every "OMG, the internet is filled with ads", people conveniently forget the real-world impact of ALL COMPANIES (and not just Apple), btw. You should be upset with the system, not selectively at Google.
> How about hundreds of thousands of factory workers that are being subjected to inhumane conditions to assemble iPhones each year?
That would be bad if it happened, which is why it doesn't happen. Working in a factory isn't an inhumane condition.
I don't think your comment justifies calling out any form of simplistic view. It doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it, most of which does not serve humankind.
Compared to what?
It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system.
Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
I also hold Americans and Western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.
If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.
The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.
And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.
I haven't and won't use Google AI for anything, ever, because of any of the big labs, they are most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.
If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.
The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.
> laundering cobalt from illegal "artisanal" mines in the DRC
They don't, all cobalt in Apple products is recycled.
> and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.
One can claim 100% recycled cobalt under the mass balance system even if recycled and non-recycled cobalt were mixed, as long as the total amount used in production is less than or equal to the recycled cobalt purchased on the books. At least here [0] they claim their recycled cobalt references are under the mass balance system.
0. https://www.apple.com/newsroom/2023/04/apple-will-use-100-pe...
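A toy illustration of what mass-balance accounting permits (the numbers are made up):

```python
# The claim is checked against purchases on the books,
# not against which physical cobalt actually went into the products.
recycled_purchased_kg = 1_000     # recycled cobalt bought on the books
total_used_in_products_kg = 900   # cobalt actually used in production (mixed sources)

claim_ok = total_used_in_products_kg <= recycled_purchased_kg
print("Can claim 100% recycled:", claim_ok)  # True, even if the physical mix was partly non-recycled
```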
Where is the fairy godmother's magic wand that will allow you to make all the governments of the world instantly agree to all of this?
People love getting their content for free and that's what Google does.
Even 25 years ago people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want, YouTube will be responsible for promoting it, they'll serve it to however many billions of users want to view it, and they'll pay you 55% of the revenue it makes?
Yep, it's hard to believe it exists for free and with not a lot of ads when you have a good ad blocker... though the content creators' ads are inescapable, which I think is OK since they're making a little money in exchange for what, your little inconvenience for a minute or so (if you're not skipping the ad, which you aren't, right??), after which you can watch some really good content. The history channels on YT are amazing, maybe world-changing: they get people to learn history and actually enjoy it. Same with some math channels like 3Blue1Brown, which are just outstanding, and many more.
> People love getting their content for free and that's what Google does.
They are forcing a payment method on us. It's basically like they have their hand in our pockets.
Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber - they all play in two-sided markets exploiting both its users and providers at the same time.
They're not a moral entity. Corporations aren't people.
I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.
If I were a google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad-market and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate things, the same crowd that decries this oppose government overreach.
Voters are people and they are moral entities, direct any moral outrage at us.
Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)?
It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
> Why should the collective of voters..
They're accountable as individuals not as a collective. And it so happens, they are responsible for their government in a democracy but corporations aren't responsible for running countries.
> It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.
The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.
You're trying to live in a feudal society when you treat corporations like this.
If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.
And then people have the nerve to complain about PACs, money in politics, billionaires influencing the government, bribery, etc. You can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.
When we criticize corporations, we really are criticizing the people who make the decisions in the corporations. I don’t see why we shouldn’t apply exactly the same moral standards to people’s decision in the context of a corporation as we do to people’s decisions made in any other context. You talk about lawfulness, but we wouldn’t talk about morals if we meant lawfulness. It’s also lawful to vote for the hyper-capitalist party, so by the same token moral outrage shouldn’t be directed towards the voters.
I get that, but those CEOs are not elected officials; they don't represent us and have no part in the discourse of law-making (despite the state of things). In their capacity as executives of a company, they have no rights, no say in what we find acceptable or not in society. We tell them what they can and cannot do, or else. That's the social contract we have with companies and their executives.
Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just as equally as others at the voting booth. they can participate in their capacity as individuals in politics. But neither money, nor corporate influence have places in the governance of a democratic society.
I talk about lawfulness because that is the only rule of law a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals; they are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.
If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though poor people who frequent them will be adversely affected? Amp that up to the scale of a megacorp, and the morality is still the same.
Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the thing you're criticizing it of doing. The CEO of google is acting in his official capacity, doing the job they were hired to do when they are resisting adblocking. The investors of Google are risking their money in anticipation of ROI, so their expectation from Google is valid as well.
When you find something to be immoral, the only meaningful avenue of expressing that with corporations is the law. You're criticizing Google as if it were an elected official we could vote in or out of office, or as if it were an entity that can be convinced of its moral failings.
When we don't speak up and use our voice, we lose it.
Because of the inherent capitalist structure that leads to the inevitable: the tragedy of the commons.
Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't.
I could have, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly; it wasn't praise of their moral character. They did what was best for their business, and that turned out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit.
> They've poisoned the internet
And what of the people that ravenously support ads and ad-supported content, instead of paying?
What of the consumptive public? Are they not responsible for their choices?
I do not consume algorithmic content, I do not have any social media (unless you count HN for either).
You can't have it both ways. Lead by example, stop using the poison and find friends that aren't addicted. Build an offline community.
I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways".
Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?
You are not a "victim" for using or purchasing something which is completely unnecessary. Or if that's the case, then you have no agency and have to be medicinally declared unfit to govern yourself and be appointed a legal guardian to control your affairs.
What kind of world do you live in? Actually, Google ads tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, vs. the pure junk ads that aren't personalized and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for the NYT, Washington Post, The Information, etc. -- virtually any high-quality website (including Search).
Ads. Beneficial to the user.
Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user - which means they are aimed at manipulating the user's behaviour in one way or another.
From suppressed wages to colluding with Apple not to poach.
Outlook is much better than Gmail and so is the office suite.
It's good there's competition in the space though.
Outlook is not better in ways that email or gmail users necessarily care about, and in my experience gets in the way more than it helps with productivity or anything it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter.
I couldn't disagree more
> Drive vs Word
You mean Drive vs OneDrive or, maybe Docs vs Word?
Workspace vs Office
Surely they meant Writely vs Word
- Making money vs general computing
For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."
>most of those examples are acquisitions
Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.
But there's also plenty that fail, it's just that you won't know about those.
I don't think what you're saying proves that the companies that were acquired couldn't have done that themselves.
If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".
All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated if not enshittified- remember when google told us not to ever worry about deleting anything?- and then started backing up my photos without me asking and are now constantly nagging me to pay them a monthly fee?
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
Forgot to mention absolutely milking every ounce of their users attention with Youtube, plus forcing Shorts!
Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones.
But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.
Don't paint everything with a single, large, coarse brush stroke.
Frankly, when compared against TikTok, Insta, etc., YouTube is a force for good. Just script the Shorts away...
Something about bringing balance to the force not destroying it.
Google has always been there; it's just that many didn't realize DeepMind even existed, and I said years ago that it needed to be put to commercial use. [0] And Google AI != DeepMind.
You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.
[0] https://news.ycombinator.com/item?id=34713073
Google is using the typical monopoly playbook, like most other large orgs, and the world would be a "better place" if they were kept in check.
But at least this company is not run by a narcissistic sociopath.
Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.
I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine). Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions). So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is much more nuanced than in their older model, but it still makes mistakes that the other two SOTA competitor models don't make. For example, in a law benchmark it forgets that certain principles don't apply in the country the provided case comes from. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the cultural context assumed by the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new, nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
> It seems very US centric in its thinking
I'm not surprised. I'm French, and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% American thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on US-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website.
Supposedly this is the model card. Very impressive results.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
Every time I see a table like this numbers go up. Can someone explain what this actually means? Is there just an improvement that some tests are solved in a better way or is this a breakthrough and this model can do something that all others can not?
This is a list of questions and answers that was created by different people.
The questions AND the answers are public.
If the LLM manages through reasoning OR memory to repeat back the answer then they win.
The scores represent the % of correct answers they recalled.
That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
I estimate another 7 months before models start getting 115% on Humanity's Last Exam.
If you believe another thread the benchmarks are comparing Gemini-3 (probably thinking) to GPT-5.1 without thinking.
The person also claims that with thinking on the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
It's easy to verify that the numbers are for GPT-5.1 thinking (high).
Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard
DeepMind page: https://deepmind.google/models/gemini/
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
From an initial testing of my personal benchmark it works better than Gemini 2.5 pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...
Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot.
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
The default FPS it's analyzing video at is 1, and I'm not sure the max is anywhere near enough to catch a full speed tennis serve.
Ah, I should have mentioned it was a slow motion video.
> The default FPS it's analyzing video at is 1
Source?
https://ai.google.dev/gemini-api/docs/video-understanding#cu...
"By default 1 frame per second (FPS) is sampled from the video."
OK, I just used https://gemini.google.com/app, I wonder if it's the same there.
I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.
Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png
2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
Great improvement by only adding one feedback prompt: Change the rotation axis of the wheels by 90 degrees in the horizontal plane. Same for the legs and arms
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?
I hadn't! It looks like that is there to power the text box at the bottom of the app that allows for AI-powered changes to the scene.
This says Gemini 2.5 though.
Good observation. The app was created with Gemini 3 Pro Preview, but the app calls out to Gemini 2.5 if you use the embedded prompt box.
Incredible. Thanks for sharing.
Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're actually going for. What does a good pelican-riding-a-bicycle SVG actually look like?
IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?
At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.
I think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.
It would be next to impossible for anyone without insider knowledge to prove that to be the case.
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
It’s a good pelican. Not great but good.
The blue lines indicating wind really sell it.
How long does it typically take after this to become available on https://gemini.google.com/app ?
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.
On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro
> https://gemini.google.com/app
How come I can't even see prices without logging in... are they doing regional pricing?
Today, I guess. They were not releasing the preview models this time, and it seems they want to synchronize the release.
It's available in cursor. Should be there pretty soon as well.
Are you sure it's available in Cursor? (I get: We're having trouble connecting to the model provider. This might be temporary - please try again in a moment.)
It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort.
[1] https://lmarena.ai/leaderboard/text
Is it just me or is that link broken because of the cloudflare outage?
Edit: nvm it looks to be up for me again
Grok is heavily censored though
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
> Gemini app surpasses 650 million users per month
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than being an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else.
As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.
I constantly, accidentally hit some button and Gemini opens up on my Samsung Galaxy. I haven't bothered to figure this out.
I don't know for sure, but they have to be counting users like me, whose phone had Gemini force-installed in an update and who has only opened the app by accident while trying to figure out how to invoke the old, actually useful Assistant app.
> it's still a choice that users are making to go there rather than being an existing product they happen to user otherwise.
Yes and no, my power button got remapped to opening Gemini in an update...
I removed that but I can imagine that your average user doesn't.
This is the benefit of bundling. I've been forecasting this for a long time: the only companies who would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI due to sheer marketing dominance.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
It says Gemini App, not AI Overviews, AI Mode, etc
They claim AI overviews as having "2 billion users" in the sentences prior. They are clearly trying as hard as possible to show the "best" numbers.
> They are clearly trying as hard as possible to show the "best" numbers.
This isn't a hot take at all. Marketing (iPhone keynotes, product launches) is about showing impressive numbers. It isn't the gotcha you think it is.
Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
Gemini app != Google search.
You're implying they're lying?
And you're implying they're being 100% truthful?
Marketing is always somewhere in the middle
Companies can't get away with egregious marketing. See the Apple class-action lawsuit over Apple Intelligence.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
It's available to be selected, but the quota does not seem to have been enabled just yet.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
Looks to be available in Vertex.
I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.
For me it's up and running. I was doing some work with AI Studio when it was released and have already rerun a few prompts. Interesting also that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.
I hope some users will switch from cerebras to free up those resources
Works for me.
seeing the same issue.
You can bring your own Google API key to try it out, and Google used to give $300 free when signing up for billing and creating a key.
When I signed up for billing via the Cloud console and entered my credit card, I got $300 in "free credits".
I haven't thrown a difficult problem at Gemini 3 Pro yet, but I'm sure I got to see it in some of the A/B tests in AI Studio for a while. I could not tell which model was clearly better; one was always more succinct and I liked its "style", but they usually offered about the same solution.
I've been playing with the Gemini CLI w/ the gemini-pro-3 preview. First impressions are that its still not really ready for prime time within existing complex code bases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, beyond how a typical test would (i.e., giving any answer at all beats saying "I don't know", because it at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take as an improvement:
- 80% right / 18% I don't know / 2% wrong
- 50% / 48% / 2%
- 10% / 90% / 0%
- 80% / 15% / 5%
The general point being that to reduce wrong answers you will need to accept some reduction in right answers if you want the change to only be made through trade-offs. Otherwise you just say "I'd like a better system" and that is rather obvious.
Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
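One quick way to make this trade-off concrete is to score wrong answers with an explicit penalty. A minimal sketch; the penalty values are arbitrary and the three accuracy profiles are just the ones floated in the comments above:

```python
def penalized_score(right, idk, wrong, wrong_penalty=1.0):
    """Correct answers score +1, 'I don't know' scores 0,
    confidently wrong answers score -wrong_penalty."""
    total = right + idk + wrong
    return (right - wrong_penalty * wrong) / total

profiles = {"90/0/10": (90, 0, 10), "80/18/2": (80, 18, 2), "70/27/3": (70, 27, 3)}
for penalty in (1.0, 3.0):
    scores = {name: round(penalized_score(*p, wrong_penalty=penalty), 2)
              for name, p in profiles.items()}
    print(f"penalty={penalty}: {scores}")
# penalty=1.0 favours 90/0/10 (0.80 vs 0.78 vs 0.67);
# penalty=3.0 favours 80/18/2 (0.60 vs 0.74 vs 0.61).
```

Which profile wins depends entirely on how costly a confident wrong answer is in your domain, which is what the disagreement above comes down to.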
https://artificialanalysis.ai/evaluations/omniscience
OpenAI uses SimpleQA to assess hallucinations
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising. As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many things are heavily censored. Obviously, a huge company like Google is under much heavier regulation than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
What we have all been waiting for:
"Create me a SVG of a pelican riding on a bicycle"
https://www.svgviewer.dev/s/FfhmhTK1
That is pretty impressive.
So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.
Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.
Like replacing named concepts with nonsense words in reasoning benchmarks.
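A minimal sketch of that contamination check, assuming nothing beyond the standard library; the nonsense words and the sample question are invented for illustration:

```python
import re

NONSENSE = ["blorf", "quintar", "mazzle", "drindle"]

def swap_concepts(question, named_concepts):
    """Replace named concepts with nonsense words so a model that merely
    memorized the original benchmark item can no longer pattern-match it,
    while the underlying reasoning structure stays intact."""
    mapping = {name: NONSENSE[i % len(NONSENSE)] for i, name in enumerate(named_concepts)}
    for name, nonsense in mapping.items():
        question = re.sub(rf"\b{re.escape(name)}\b", nonsense, question)
    return question

q = ("A pelican carries more fish than a heron. "
     "A heron carries more fish than a gull. Who carries the most fish?")
print(swap_concepts(q, ["pelican", "heron", "gull"]))
# -> "A blorf carries more fish than a quintar. A quintar carries more fish
#     than a mazzle. Who carries the most fish?"
```

If accuracy drops sharply on the swapped version, memorization rather than reasoning is the likelier explanation.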
https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
Very aero
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that itself secret, but more likely they simply had humans generate an enormous number of examples and then built on that synthetically.
No benchmark is safe, when this much money is on the line.
Here's some insight from Jeff Dean and Noam Shazeer's interview with Dwarkesh Patel https://youtu.be/v0gjI__RyCY&t=7390
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
I'm sure each of the frontier labs have some secret methods, especially in training the models and the engineering of optimizing inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
I'd also be curious what kind of tools they are providing to get the jump from Pro to Deep Think (with tools) performance. ARC-AGI specialized tools?
They ran the tests themselves only on semi-private evals. Basically the same caveat as when o3 supposedly beat ARC1
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.
I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker.
If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far, in which case I will live without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.
Genuinely curious here: why is the desktop app so important?
I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?
Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.
I have a few reasons for the preference:
(1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, e.g., I hit Option-SPC, which brings up a prompt bar. From there, taking a screenshot that will get sent with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via a web UI. It may also be inconvenient to give context to a web UI in some cases where, e.g., I have a folder of PDFs I want it to be able to reference.
(2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.
(3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.
I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
> The ability to add context via a local apps integration into OS level resources is big
Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
> Its own icon that I can CMD-TAB to is so much nicer
Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.
> having access to my chats for context has been repeatedly valuable to me.
I didn't at all consider this. Point ceded.
> I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.
> Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
Access to OS level resources != context pollution. You still have control, just more direct and less manual.
> The ones I've listed are open source and auditable.
Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.
But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.
> If you are using it because you think it's Open Source I suggest you stop.
I did not know that. Thank you very much for the correction. I guess I have some keys to revoke now.
Out of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models.
I would love to see how Gemini 3 can solve this particular problem. https://lig-membres.imag.fr/benyelloul/uherbert/index.html
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
Tested it on a bug that Claude and ChatGPT Pro struggled with, it nailed it, but only solved it partially (it was about matching data using a bipartite graph). Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues. For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better. For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.
> it nailed it, but only solved it partially
Hey either it nailed it or it didn't.
Probably figured out the exact cause of the bug but not how to solve it
Yes; they nailed the root cause, but the implementation is not 100% correct.
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate python files. I asked it for a website, frontend and python back end, and it only gave a front end. I asked again for a python backend and it just gives repeated server errors trying to write the python files. Pretty shit experience.
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the Pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
Hit the Gemini 3 quota on the second prompt in antigravity even though I'm a pro user. I highly doubt I hit a context window based on my prompt. Hopefully, it is just first day of near general availability jitters.
Wow so the polymarket insider bet was true then..
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.
It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly).
You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.
Disallowing people from making terrible trades seems…paternalistic? Idk
You don't get it. Allowing insiders to trade disincentivizes normal people from putting money in. Why else is it not allowed in the stock market?
Why should normal people be incentivized to make trades on things they probably haven’t got the slightest idea about
The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets.
But what's the point of predicting how many times Elon will say "Trump" on an earnings call (or some random event Kalshi or Polymarket make up)? At least the stock market serves a purpose. People will claim "prediction markets are great for price discovery!" Ok. I'm so glad we found out the chance of Nicki Minaj saying "Bible" during some recent remarks. In case you were wondering, the chance peaked at around 45% and she did not say 'bible'! She passed up a great opportunity to buy the "yes" and make a ton of money!
https://kalshi.com/markets/kxminajmention/nicki-minaj/kxmina...
I agree that the "will [person] say [word]" markets are stupid. "Will Brian Armstrong say the word 'Bitcoin' in the Q4 earnings call" is a stupid market because nobody actually cares whether or not he says 'Bitcoin'; they care about whether or not Coinbase is focusing on Bitcoin. If Armstrong manipulates the market by saying the words without actually doing anything, nobody wins except Armstrong. "Will Coinbase process $10B in Bitcoin transactions in Q4" is a much better market because, though Armstrong could still manipulate the market's outcome, his manipulation would influence a result that people actually care about. The existence of stupid markets doesn't invalidate the concept.
That argument works for insider trading too.
And? Insider trading is bad because it's unfair, and the stock market is supposed to be fair. Prediction markets are not fair. If you are looking for a fair market, prediction markets are not that. Insider trading is accepted and encouraged in prediction markets because it makes the predictions more accurate, which is the entire point.
The stock market isn't supposed to be fair.
By 'fair', I mean 'all parties have access to the same information'. The stock market is supposed to give everyone the same information. Trading with privileged information (insider trading), is illegal. Publicly traded companies are required to file 10-Qs and 10-Ks. SEC rule 10b5-1 prohibits trading with material non-public information. There are measures and regulations in place to try to make the stock market fair. There are, by design, zero such measures with prediction markets. Insider trading improves the accuracy of prediction markets, which is their whole purpose to begin with.
> people need to realize there are real people on the other side of these bets
None of whom were forced by anyone to place bets in the first place.
>Brian Armstong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call.
For the kind of person playing these sorts of games, that's actually really "hype".
I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released.
The mention markets are pure degenerate gambling and everyone involved knows that
Correct, and this is actually how all markets work in the sense that they allow for price discovery :)
Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all)
In hindsight, one possible reason to bet on November 18 was the deprecation date of older models: https://www.reddit.com/r/singularity/comments/1oom1lq/google...
Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?
Can't wait to test it out. I've been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed:
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
And of course they hiked the API prices
Standard Context(≤ 200K tokens)
Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)
Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)
Long Context(> 200K tokens)
Input $4.00 vs $2.50 (same +60%)
Output $18.00 vs $15.00 (same +20%)
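For a rough sense of what that means per request, a back-of-the-envelope sketch; the 100k-input / 10k-output workload is a made-up example, and the prices per 1M tokens are the standard-context figures above:

```python
def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    # Prices are quoted per 1M tokens
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# Hypothetical <=200K-context request: 100k input tokens, 10k output tokens
g25 = cost_usd(100_000, 10_000, 1.25, 10.00)   # Gemini 2.5 Pro
g3  = cost_usd(100_000, 10_000, 2.00, 12.00)   # Gemini 3 Pro
print(f"2.5 Pro: ${g25:.3f}, 3 Pro: ${g3:.3f}, increase: {g3 / g25 - 1:.0%}")
# -> 2.5 Pro: $0.225, 3 Pro: $0.320, increase: 42%
```

The blended increase lands between the +20% and +60% headline numbers and shifts with the input/output mix of the workload.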
Claude Opus is $15 input, $75 output.
If the model solves your needs in fewer prompts, it costs less.
Is it the first time long context has separate pricing? I hadn’t encountered that yet
Anthropic is also doing this for long context >= 200k Tokens on Sonnet 4.5
Google has been doing that for a while.
Google has always done this.
Ok wow, then I've always overlooked that.
Make a pelican riding a bicycle in 3d: https://gemini.google.com/share/def18e3daa39
Amazing and hilarious
Similar hilarious results (one shot): https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
I just googled latest LLM models and this page appears at the top. It looks like Gemini Pro 3 can score 102% in high school math tests.
I would like to try controlling my browser with this model. Any ideas how to do this. Ideally I would like something like openAI's atlas or perplexity's comet but powered by gemini 3.
Gemini CLI can also control a browser: https://github.com/ChromeDevTools/chrome-devtools-mcp
Seems like their new Antigravity IDE specifically has this built in. https://antigravity.google/docs/browser
Wow, that is awesome.
"AI Overviews now have 2 billion users every month."
"Users"? Or people that get presented with it and ignore it?
They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.
Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).
I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.
You're right, I'm probably being a little uncharitable!
Normal users (i.e. not grumpy techies ;) ) probably just go with the flow rather than finding it irritating.
"Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
Do regular users know how to disable AI Overviews, if they don't love them?
it's as low tech as using adblock - select element and block
Probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you return to a tab that already finished generating).
entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
When will this be available in the cli?
Gemini CLI team member here. We'll start rolling out today.
How about for Pro (not Ultra) subscribers?
This is the heroic move everyone is waiting for. Do you know how this will be priced?
I'm already seeing it in https://aistudio.google.com/
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Even if the benchmark themselves are kept secret, the process to create them is not that difficult and anyone with a small team of engineers could make a replica in their own labs to train their models on.
Given the nature of how those models work, you don't need exact replicas.
Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.
https://www.youtube.com/watch?v=cUbGVH1r_1U
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
Don't self-promote without disclosure.
Really exciting results on paper. But truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data used from users. Hope the data used to train was obtained with consent from the creators
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.
It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.
It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
My only complaint is I wish the SWE and agentic coding results were better, to justify the 1-2x premium.
GPT-5.1 honestly looks very comfortable given the available usage limits and pricing,
although GPT-5.1 used from the ChatGPT website seems to be better for some reason.
Sonnet 4.5 agentic coding is still holding up well, which confirms my own experiences.
I guess my reaction to Gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month.
It also tops LMSYS leaderboard across all categories. However knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D.
Isn't it the same cutoff as 2.5?
Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?
If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's AI Vertex platform. If you don't even want a Google Cloud account, then I think the answer is no unless they announce a partnership with an inference cloud like cerebras.
You could probably do a new account. I have the odd junk google account.
Somebody "two-shotted" Mario Bros NES in HTML:
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
https://www.youtube.com/watch?v=cUbGVH1r_1U
side by side comparison of gemini with other models
I still need a google account to use it and it always asks me for a phone verification, which I don't want to give to google. That prevents me from using Gemini. I would even pay for it.
> I would even pay for it.
Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number?
You never have to add your phone number in order to pay.
While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required.
Perhaps it's country-specific?
I've never been asked for a phone number. Maybe it's country-specific, no idea.
What I'd prefer over benchmarks is the answer to a simple question:
What useful thing can it demonstrably do that its predecessors couldn't?
Keep the bubble expanding for a few months longer.
Boring. Tried to explore sexuality related topics, but Alphabet is stuck in some Christianity Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.
Feeling great to see something confidential
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
Anyone has any idea if/when it’s coming to paid Perplexity?
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
Suspicious that none of the benchmarks include Chinese models, even though they scored higher on these benchmarks than the models they are comparing to?
Here it makes a text based video editor that works:
https://youtu.be/MPjOQIQO8eQ?si=wcrCSLYx3LjeYDfi&t=797
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
Reading the introductory passage, all I can say now is: AI is here to stay.
First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5
GOOGLE: "We have a new product".
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- VS Code (Microsoft)
- Gemini
- Chrome Browser
Interesting that they added an option to select your own API key right in AI Studio's input field. I sincerely hope the times of generous free AI Studio usage are not over.
Is it coming to Google Jules?
> Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.
Come on, you can’t be serious.
This is so disingenuous that it hurts the credibility of the whole thing.
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
Gotta hand it to Gemini, those are some top-notch points.
The "Model card leak" point is worth negative points though, as it's clearly a misreading of reality.
yeah hahahahah, it made me think!
> We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".
I've seen more and more of these kinds of hallucinations; as these models seem to be RL'd to not be sycophants, they're slowly inching in the opposite direction, where they tell small fibs or embellish in a way that seems meant to add more weight to their answers.
I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise
I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro was clearly smarter than everything else for a long time, but now 3 seems not even as good as that in comparison to the latest releases (5.1 etc.). What is going on?
It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.
Maybe they're slow rolling the announcements to be in the news more
Most likely. And/or they use the full model to train the smaller ones somehow
The term of art is distillation
I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.
Can the mathematicians here try asking it to invent new, novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
Terry Tao is writing about this on his blog.
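As a small illustration of the SymPy suggestion: a closed form the model claims can be checked symbolically before trusting its derivation. The sum-of-cubes identity below is just a stand-in example, not something from the thread:

```python
import sympy as sp

k, n = sp.symbols("k n", positive=True, integer=True)
# Suppose the model claims: 1^3 + 2^3 + ... + n^3 == (n(n+1)/2)^2
lhs = sp.summation(k**3, (k, 1, n))
rhs = (n * (n + 1) / 2) ** 2
print(sp.simplify(lhs - rhs) == 0)  # True: the claimed identity checks out
```

For genuinely novel claims a counterexample search or a formal proof is still needed, but symbolic spot checks catch a lot of confident nonsense cheaply.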
Can't wait til Gemini 4 is out!
Okay, since Gemini 3 is in AI Mode now, I switched from the free Perplexity back to Google as my search default.
It is live in the API:
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name.
"Gemini 3 Pro Preview" is in Vertex
Is there even a puzzle or math problem Gemini 3 can't solve?
Trained models should be able to use formal tools (for instance a logical solver, a computer?).
Good. That said, I wonder if those models are still LLMs.
It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh
2025: solve the biking pelican problem
2026: cure cancer
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
I continue not to use Gemini, as I can't opt out of training on my data while also keeping chat history.
Yes, I know the Workspaces workaround, but that’s silly.
Wild
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be gated to a maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.
I love Google AI Studio, but I also hate it for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions in a tool that is ostensibly aimed at prosumers...
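For reference, min_p filtering itself is easy to sketch. A minimal NumPy illustration of the technique, not anything Google exposes today, and the 0.1 threshold is arbitrary:

```python
import numpy as np

def min_p_filter(logits, min_p=0.1):
    """Keep only tokens whose probability is at least min_p times the
    probability of the most likely token, then renormalize."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    mask = probs >= min_p * probs.max()
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()

logits = np.array([2.0, 1.5, 0.2, -1.0, -3.0])
print(min_p_filter(logits).round(3))  # low-probability tail is zeroed out
```

The appeal over plain top_p is that the kept set shrinks automatically when the model is confident and widens when it is not.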
yea great.... when will I be able to have it dial a number on my google pixel? Seriously... Gemini absolutely sucks on pixel since it can't interact with the phone itself so it can't dial numbers.
How's the pelican?
Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?
Honestly I liked 2.5 Pro preview much more than the final version
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
… agentic …
Meh, not interested already
The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.
Finally!
I expect almost no one to read the Gemini 3 model card. But here is a damning excerpt from the version that leaked early [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail messages are being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze users' data, which potentially includes Gmail by default [1].
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...
Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.
The real question is, "For how long?"
I'm very doubtful Gmail messages are used to train the model by default: emails contain private data, and as soon as that private data shows up in model output, Gmail is done.
"Gmail being read by Gemini" does NOT mean "Gemini is trained on your private Gmail correspondence." It can mean Gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.
I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
By the year 2025, I think most HN regulars and IT people in general are so jaded about privacy that this doesn't even surprise anyone. I suspect all Gmail has been analyzed and read since the beginning of the Google age, so nothing has really changed; they might as well just admit it.
Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.
This seems like a dubious conclusion. I think you missed this part:
> in accordance with Google’s relevant terms of service, privacy policy
It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.
Also probably over for OpenAI
@simonw wen pelican
It's amazing to see Google take the lead while OpenAI worsens their product every release.
Valve could learn from Google here
Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.
It seems Google didn't prepare the Gemini 3 release very well: a lot of content leaked, including the model card earlier today and Gemini 3 itself showing up on aistudio.google.com.
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story. https://xcancel.com/xundecidability/status/19908286970881311...
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...
He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy.
And yet you could eliminate him entirely and the story is still coherent.
The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.
You're absolutely right! The AI said it, so it must be true!
At least read what you respond to... Imagine thinking Yudkowsky was NOT a central figure in the Zizians story.
You literally quoted the LLMs output verbatim as your proof.
Edit: And upon skimming the article at the points where Yudkowsky's name is mentioned, I 100% agree with stickfigure.
I challenge you to name one way in which the story falls apart without the mention of Yudkowsky.
It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let's just return to the discussion of Gemini 3: do you think the model did a bad job in its second response, then?
It literally does not matter how connected they are out here in reality; the AI was asked to summarize the information in the article, and that is exactly what it did.
>do you think the model did a bad job then in it's second response
Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.
Why do you think I obviously wanted to hear that?
It's implicit in your prompt!
Also, that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure in the story.
Here's another attempt:
llm --cid 01kabxtjq10exgk56yf802028f "I notice you did not mention Yudkowsky?" --no-log -m gem3-t1
Based on the text provided, Eliezer Yudkowsky is a central background figure in this story, serving as the intellectual progenitor of the movement from which the Zizians splintered.
Here is specifically how he features in the article:
* The Founder: Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.
* The Source Code: The Zizians based their radical worldview on Yudkowsky's concepts, specifically "timeless decision theory." However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky's organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.
Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky.
Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky.
Maybe it has guard rails against such things? That would be my main guess on the Zizian one.
It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and GPT-5.1 and Sonnet 4.5 (thinking) do not come close.
The token efficiency and context handling are also mindblowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. GPT-5.1 feels particularly slow and weak in real-world applications larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory gives them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.