Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and therefore very likely outside its training data.
Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14min, 20min and 1h14min respectively.
Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned for, it's wild that frontier models can now solve in minutes what would take me days.
I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use web search.
Still a useful tool though. It definitely gets the majority of the insights.
Exactly this - and it's how ChatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.
The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.
On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.
If we've learned anything so far it's that the parlor tricks of one-shot efficacy only get you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart in roughly the same way. Even when I've used Sonnet 4.5 with 1M-token context, the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented, problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
It depends on your definition of safe. Most of the code that gets written is pretty simple — basic CRUD web apps, WP theme customization, simple mobile games… stuff that can easily be written by the current gen of tooling. That has already cost a lot of people a lot of money or their jobs outright, and most of them probably haven't reached their skill limit as developers.
As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is upskill or get a new career. Rather than switch careers to something they have zero experience in, most will upskill, and that puts pressure on the moderately higher-skill job market, which has far fewer people; those people then upskill to outrun the implosion, which puts pressure on the tier above them to move upward, and so on. With even modest productivity gains across the whole industry, it's not hard for me to envision a world where general software development just isn't a particularly valuable skill anymore.
Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need a real throat to choke to show your leadership skills.
I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.
Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.
Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.
A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.
You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.
Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low
I have had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.
To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I’d love it if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.
I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.
Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.
Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
I don't predict the future and I'm very skeptical of anybody who claims to do so; correctly predicting the present is already hard enough. I'm just saying that given the progress we've already made, I would find it plausible that a system like that could be built in a few years. The details of what it would look like are beyond my pay grade.
---
[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.
Yeah, it is often called out as a brilliancy in game analysis when a GM makes a move that an engine says is bad and it turns out to be good. However, it only happens in very specific positions.
Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?
From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".
So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.
That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.
This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.
Poker is funny because you have humans emulating human-beating machines, but that's hard enough to do that players who don't do it can still win.
I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.
It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare these days. It was funny in certain positions a few years back to see the engine "change its mind", including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.
But mostly what happens is that a move isn't so good, but it isn't so bad either, and while the computer will tell you it is sub-optimal, a human won't be able to refute it in finite time, so the opponent's practical (as opposed to theoretical) chances are reduced. One great recent example is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception that the computer says is borderline incorrect, but it leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.
I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).
That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)
The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.
If you look on YouTube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
You explained yourself right. The issue is that you keep qualifying your statements.
> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...
> ... in tournament conditions.
I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?
The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.
My background is more on math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.
Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.
I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3) and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.
gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.
According to a bunch of philosophers (https://ai-2027.com/), doom is likely imminent. Kokotajlo was on Breaking Points today. Breaking Points is usually less gullible, but the top comment shows that "AI" hype strategy detection is now mainstream (https://www.youtube.com/watch?v=zRlIFn0ZIlU):
AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."
I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.
Does it matter if it is out of the training data? The models integrate web search quite well.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
In its defence, the code actually specifically calls that edge case out and justifies it:
```
// Calculate rotations
// We use a cumulative calculation logic mentally, but here simple degrees work because of the transition reset trick or specific animation style.
// To prevent the "spin back" glitch at 360->0, we can use a simple tick without transition for the wrap-around,
// but for simplicity in this specific React rendering, we will stick to standard 0-360 degrees.
// A robust way to handle the spin-back on the second hand is to accumulate degrees, but standard clock widgets often reset.
```
The video shows closer to 2 seconds for it to finally throw itself over in what could only be described as a "Thunk". I figured it would be a little smoother.
Station clocks in Switzerland receive a signal from a master clock each minute that advances the minute hand; the second hand moves completely independently of the minute hand. This allows them to stay synced to the minute.
> The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.
in defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage which I was amused by. it uses kiloseconds/megaseconds/etc. there are 86.4ks/day. The "seconds" hand goes around 1000 seconds, which ticks over the "hour" hand. Instead of saying 4am, you'd say it's 14.
as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
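For anyone curious, the arithmetic behind the scheme is easy to sketch. This is a minimal Python version of the conversion described above (my own illustration, not the webpage Flash 2.5 generated):

```python
import time

def metric_unix_clock(ts: float | None = None) -> str:
    """Render a timestamp in the 'metric UNIX' scheme sketched above:
    kiloseconds since local midnight for the time of day (86.4 ks per day),
    and gigaseconds since the UNIX epoch as the coarse 'date'."""
    ts = time.time() if ts is None else ts
    local = time.localtime(ts)
    secs_today = local.tm_hour * 3600 + local.tm_min * 60 + local.tm_sec
    kiloseconds = secs_today / 1000     # e.g. 04:00 -> 14.4, i.e. "it's 14"
    gigaseconds = ts / 1e9              # currently ~1.76 Gs AUNIX
    return f"{kiloseconds:.2f} ks today, {gigaseconds:.3f} Gs AUNIX"

print(metric_unix_clock())
```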
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
That is not the same prompt as the other person was using. In particular, this doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This also includes JavaScript.
The prompt the other person was using is:
```
Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.
```
"Allow access to Google Drive to load this Prompt."
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info.
Because most likely (at least according to Hanlon's razor) they somehow decided that using Google Drive as the only persistent storage backing AI Studio was a reasonable UX decision.
It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.
This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so.
Considering how important this benchmark has become to the judgement of state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person, who's working around the clock on training the model to make better and better SVG pelicans on bikes.
Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.
I don't think the big labs would waste their time on it. If a model is great at making the pelican but sucks at all other svg it becomes obvious. But so far the good pelicans are strong indicators of good general SVG ability.
Unless training on the pelican increases all SVG ability, then good job.
It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?
I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
Maybe that date is a rule of thumb for when AI generated content became so widespread that it is likely to have contaminated future data. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. GPT-5 Thinking is a strong model for basic coding and it’d be interesting to see a simple Python task it reliably fails at.
they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.
This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.
I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have all things considered? That is, saying something to the effect of "well I know its not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if its not like totally unbelievable I guess.
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
After taking a walk for a bit I decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While I still believe in the importance of a personalized suite of benchmarks, my Python one needs to be down-weighted or supplanted.
My bad to the Google team for the cursory brush-off.
> This probably means my test is a little too niche.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.
Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.
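As a concrete illustration of that two-stage flow, here is a minimal sketch. The client and model name are placeholders for whatever chat-style API you actually use; only the staging idea comes from the comment above.

```python
from openai import OpenAI

client = OpenAI()  # placeholder: any chat-completions-style client works

def call_model(prompt: str) -> str:
    # One round trip to the model; the model name here is illustrative only.
    resp = client.chat.completions.create(
        model="gpt-5.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Stage 1: have the model write a detailed specification prompt for itself.
spec_prompt = call_model(
    "Create a prompt that produces a specification for a Pac-Man game in a "
    "single HTML page. Call out edge cases and implementation details that "
    "commonly cause bugs (ghost AI, collisions, tunnel wrap-around, pellet state)."
)

# Stage 2: execute the generated prompt to get the actual implementation.
html_page = call_model(spec_prompt)
print(html_page)
```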
It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.
The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.
That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").
I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
You're correct - LLMs may get better at any task, of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at this particular task, if the eval were actually picked up and used in the training loop.
That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah, I have my own set of tests and the results are a bit unsettling, in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it is now.
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....
I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
Everything is about context. When you give it a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. That's where all the "thinking" goes.
Ask it to implement tic-tac-toe in Python for command line. Or even just bring your own tic-tac toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
You already sent the prompt to the Gemini API, and they likely recorded it. So in a way they can access it anyway; posting it here or not doesn't matter in that respect.
I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in filenames), and a near-complete inability to do relatively basic image processing tasks (if they don't rely on template matches).
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple Mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.
Something like
1.) Split audio into multiple smaller tracks.
2.) Perform first pass audio extraction
3.) Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off)
4.) Seed the next stage with that information (yay multimodality) and generate the audio transcript for it
Obviously it would be ideal if a model could handle ultra-long-context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution. A rough sketch of such a harness is below.
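A minimal sketch of that kind of harness, assuming pydub for the splitting and leaving the actual model call as a hypothetical placeholder you would wire up to your transcription API of choice:

```python
from pydub import AudioSegment  # splitting is the only concrete dependency here

def transcribe_chunk(path: str, carry_over: str) -> tuple[str, str]:
    """Hypothetical placeholder: send the audio chunk plus a short summary of
    the speakers and where the conversation left off to your model, and return
    (labelled_transcript, updated_summary)."""
    raise NotImplementedError("wire this to your transcription API")

def chunked_transcript(path: str, chunk_minutes: int = 15) -> str:
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    carry_over, parts = "", []
    for start in range(0, len(audio), chunk_ms):
        # Export the current slice and transcribe it, seeding the model with
        # the carry-over summary from the previous chunk.
        audio[start:start + chunk_ms].export("chunk.mp3", format="mp3")
        transcript, carry_over = transcribe_chunk("chunk.mp3", carry_over)
        parts.append(transcript)
    return "\n".join(parts)
```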
I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.
I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
[00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
I made a simple webpage to grab text from YouTube videos:
https://summynews.com
Great for this kind of testing?
(want to expand to other sources in the long run)
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
"Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"
> Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
> Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.
Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
> Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons Paradox, the world may never know the answer to either of these questions.
It does improve productivity, just like a good IDE. But engineers didn't get replaced by IDEs and they haven't yet been replaced by AI.
By the time it's good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots; then every job will be automated, not just computer-based ones.
Your assertion was not that "an engineer has never been replaced by AI". It is that no engineer has been laid off because of AI.
You agree AI improves engineer productivity. So the last remaining question is: does greater productivity mean that fewer people are required to satisfy a given demand?
The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.
That's because of overhiring and other non-AI-related reasons (e.g. higher interest rates mean less VC funding available).
In reality, getting AI to do actual human work, as of the moment, takes much more effort and cost than you get back in cost savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.
The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.
Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg, however, is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
I think in this case, tokenization and perception are somewhat analogous. It is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with, if you allow the analogy.
The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think catching up to the brain abilities of even a bug might be very hard, but that does not mean there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms, or whatever the precise jargon is.
> How do brains do it with such a low amount of energy?
Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
My armchair is equally comfy, and I have an actual paper to point to:
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.
For the above pic I asked "What is wrong with the image?"
Result:
- It totally missed the most obvious one - six fingers.
Instead it said this:
Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:
- The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.
- Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.
- Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.
- Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.
[Edit: 3.0 is same as 2.5 - both answered almost identically]
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
What I would do if I were in the position of a large company in this space is arrange for an internal team to create an ARC replica, covering very similar puzzles, and use that as part of the training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
If there is something new an LLM/AI model can't solve today, plenty of humans can't either.
But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train on new scenarios, it will start to bleed out the humans in the loop.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real-world/reasoning performance, since the lack of a memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but currently its a race to lock-in users to your model and make switching costs high.
Benchmarks are intended as proxy for real usage, and they are often useful to incrementally improve a system, especially when the end-goal is not well-defined.
The trick is to not put more value in the score than what it is.
The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
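A rough sketch of what that could look like as a harness, with the model call left as a placeholder and the red herrings purely illustrative:

```python
import random
from typing import Callable

def noisy_pass_rate(ask_model: Callable[[str], str], question: str,
                    expected: str, n_runs: int = 20) -> float:
    """Inject random junk into the prompt and average correctness over n runs.
    ask_model is a placeholder for whatever call runs the benchmark item."""
    red_herrings = [
        "Note: an earlier draft of this question contained a typo.",
        "Unrelated: remember to submit your expense report by Friday.",
        "The previous answer was rejected for formatting reasons.",
    ]
    correct = 0
    for _ in range(n_runs):
        noise = random.choice(red_herrings)
        # Prepend or append the distractor at random, then check the answer.
        prompt = f"{noise}\n\n{question}" if random.random() < 0.5 else f"{question}\n\n{noise}"
        if expected.strip().lower() in ask_model(prompt).strip().lower():
            correct += 1
    return correct / n_runs
```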
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway.
So you have to build a very large training set on your own, using the same kind of puzzles.
It leads on ARC-AGI-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to Google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works but that I would never want to have to interact with.
When looking at the code, you can't tell exactly why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feel to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up
I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad-joke generator and then have it also create a comic-styled 4-cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1,500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
It's an artifact of the fact that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. It sucks from a GDPR compliance perspective as well as in terms of transparent pricing: you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high, but if the model decides to think longer, "low" can result in more tokens used than "high" for a prompt where the model decides not to think that much. "Thinking budgets" are now "legacy", and thus while you can constrain output length, you cannot constrain cost.
You also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant, only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. You can try different variations of different parts of your system prompt or tool descriptions, but just because a variation results in fewer thinking tokens does not mean it is better: those reasoning steps may have actually been beneficial (if only in edge cases). With access to the full chain of thought this would be immediately apparent upon inspection, but without it it's hard or impossible to find out.
For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were (A) to prevent rapid distillation, which they suspected DeepSeek had used for R1, and (B) to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater and will, for certain applications, certainly outweigh the performance delta to Google/OpenAI.
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have.
This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.
I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks but I wasn't able to convince it to present a plan before starting implementation.
I have instructions describing that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.
For Codex, those instructions make plan mode redundant.
What I usually test is trying to get them to build a full, scalable SaaS application from scratch... It seemed very impressive in how it did the early code organization using Antigravity, but then at some point, all of a sudden, it started really getting stuck and constantly stopped producing, and I had to trigger continue or babysit it. I don't know if I could've been doing something better, but that was just my experience. Seemed impressive at first, but otherwise, at least vs Antigravity, Codex and Claude Code scale more reliably.
Just an early anecdote from trying to build that one SaaS application, though.
Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.
Honestly, Google models have this mix of smart/dumb that is scary. Like, if the universe gets turned into paperclips, it'll probably be a Google model that does it.
Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren’t necessarily those with the best models, but those who already control the surfaces where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
AI Overviews have arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.
> Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. At the time of its founding Google was the undisputed leader of AI anywhere in the world. Musk was then very worried about AGI being developed behind closed doors particularly Google, which was why he was the driving force behind the founding of OpenAI.
The book Empire of AI describes him as being particularly fixated on Demis as some kind of evil genius. From the book, early OAI employees couldn’t take the entire thing too seriously and just focused on the work.
I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist.
That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” meant search and RL, which Google and DeepMind were dominating, and self-driving, which Waymo was leading. OpenAI was conceptualized as a research org to compete. A lot has changed, and OpenAI has been good at seeing around those corners.
That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future.
I think that Google didn't see the business case in that generation of models, and also saw significant safety concerns. If AI had been delayed by... 5 years... would the world really be a worse place?
Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit.
It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold then become progressively less open once they get bigger. Android is a great example.
Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently?
I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said.
There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.
For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].
Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].
In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.
Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source.
Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.
That's the same thing what GrapheneOS, /e/ OS and others are doing - building on top of AOSP.
They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Much of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.
This is an understandable, but simplistic, way of looking at the world. Are you also gonna blame Apple for mining rare earths, because they made a successful product that requires exotic materials which need to be mined from the earth? How about the hundreds of thousands of factory workers being subjected to inhumane conditions to assemble iPhones each year?
For every "OMG, the internet is filled with ads", people are conveniently forgetting the real-world impact of ALL COMPANIES (not just Apple), btw. You should be upset with the system, not selectively with Google.
I don't think your comment justifies calling out any form of simplistic view; it doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it, most of which does not serve humankind.
It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system.
Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
I also hold Americans and Western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.
If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.
The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.
And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.
I haven't and won't use Google AI for anything, ever, because of all the big labs, they are the most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.
If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.
The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.
> laundering cobalt from illegal "artisanal" mines in the DRC
They don't, all cobalt in Apple products is recycled.
> and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.
One can claim 100% recycled cobalt under the mass-balance system even if recycled and non-recycled cobalt were mixed, as long as the total amount used in production is less than or equal to the recycled cobalt purchased on the books.
At least here[0] they claim their recycled cobalt references are under the mass balance system.
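A rough sketch of the bookkeeping (the numbers are invented for illustration): under mass balance, the claim is checked against purchase records, not against any physical batch.

```python
# Illustrative mass-balance bookkeeping with made-up numbers. The "100% recycled"
# claim only requires that recycled cobalt purchased on paper covers the amount
# attributed to the claimed products, even if the physical supply is mixed.
recycled_purchased_t = 120              # tonnes of recycled cobalt bought, per the books
attributed_to_claimed_products_t = 110  # tonnes attributed to products carrying the claim
total_cobalt_used_t = 300               # tonnes actually consumed across all production

claim_allowed = attributed_to_claimed_products_t <= recycled_purchased_t
print("'100% recycled' claim permitted:", claim_allowed)  # True, despite the mixed supply
```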
People love getting their content for free and that's what Google does.
Even 25 years ago, people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want, YouTube will be responsible for promoting it, they'll serve it to however many billions of users want to view it, and they'll pay you 55% of the revenue it makes?
Yep, it's hard to believe it exists for free and without a lot of ads when you have a good ad blocker... though the content creators' ads are inescapable, which I think is OK since they're making a little money in exchange for what, your little inconvenience of a minute or so (if you're not skipping the ad, which you aren't, right??), after which you can watch some really good content.
The history channels on YT are amazing, maybe world-changing: they get people to learn history and actually enjoy it. Same with some math channels like 3Blue1Brown, which are just outstanding, and many more.
Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber - they all play in two-sided markets exploiting both its users and providers at the same time.
They're not a moral entity. Corporations aren't people.
I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.
If I were a Google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad market, and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate these things, the same crowd that decries them opposes it as government overreach.
Voters are people and they are moral entities, direct any moral outrage at us.
Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)?
It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
They're accountable as individuals, not as a collective. And as it happens, they are responsible for their government in a democracy, but corporations aren't responsible for running countries.
> It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.
The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.
You're trying to live in a feudal society when you treat corporations like this.
If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.
And then people have the nerve to complain about PACs, money in politics, billionaires influencing the government, bribery, etc. You can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.
When we criticize corporations, we really are criticizing the people who make the decisions in the corporations. I don’t see why we shouldn’t apply exactly the same moral standards to people’s decision in the context of a corporation as we do to people’s decisions made in any other context. You talk about lawfulness, but we wouldn’t talk about morals if we meant lawfulness. It’s also lawful to vote for the hyper-capitalist party, so by the same token moral outrage shouldn’t be directed towards the voters.
I get that, but those CEOs are not elected officials; they don't represent us and have no part in the discourse of law making (despite the state of things). In their capacity as executives of a company, they have no rights, no say in what we find acceptable or not in society. We tell them what they can and cannot do, or else. That's the social contract we have with companies and their executives.
Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just as equally as others at the voting booth. they can participate in their capacity as individuals in politics. But neither money, nor corporate influence have places in the governance of a democratic society.
I talk about lawfulness because that is the only rule of law a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals; they are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.
If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though the poor people who frequent them will be adversely affected? Amp that up to the scale of a megacorp, and the morality is still the same.
Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the thing you're criticizing it of doing. The CEO of google is acting in his official capacity, doing the job they were hired to do when they are resisting adblocking. The investors of Google are risking their money in anticipation of ROI, so their expectation from Google is valid as well.
When you find something to be immoral, the only meaningful avenue of expressing that with corporations is the law. You're criticizing Google as if it were an elected official we could vote in or out of office, or as if it were an entity that could be convinced of its moral failings.
When we don't speak up and use our voice, we lose it.
Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't.
I could have, just the same, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly; it wasn't praise of their moral character. They did what was best for their business, and that turned out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit.
I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways".
Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?
You are not a "victim" for using or purchasing something which is completely unnecessary. Or if that's the case, then you have no agency and have to be medicinally declared unfit to govern yourself and be appointed a legal guardian to control your affairs.
What kind of world do you live in? Actually, Google ads tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, versus the pure junk ads that aren't personalized and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for the NYT, the Washington Post, The Information, etc. -- virtually any high-quality website (including Search).
Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user, which means ads are aimed at manipulating the user's behaviour in one way or another.
Outlook is not better in ways that email or Gmail users necessarily care about, and in my experience it gets in the way more than it helps with productivity or anything else it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter.
For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."
Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.
If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".
All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated, if not enshittified. Remember when Google told us never to worry about deleting anything? Then they started backing up my photos without me asking, and now they're constantly nagging me to pay a monthly fee.
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones.
But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.
Don't paint everything with a single, large, coarse brush stroke.
Google has always been there; it's just that many didn't realize DeepMind even existed, and I said years ago that they needed to be put to commercial use. [0] Also, Google AI != DeepMind.
You are now seeing their valuation finally adjusting to that fact, all thanks to DeepMind finally being put to use.
Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.
I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine).
Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro Models) get halfway decent results on many cases while Gemini Pro 2.5 struggled (it was overconfident in its initial assumptions).
So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is way more nuanced than their older model, but it still makes mistakes that the other two SOTA competitor models don't make. For example, in a law benchmark it forgets that certain principles don't apply in the country of the provided case. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the cultural context assumed by the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
I'm not surprised. I'm French, and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French and other languages where there is no such thing. A 100% American thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on US-centric data). At the very least it makes it easy to tell when someone is just copy-pasting LLM output into some other website.
Every time I see a table like this, the numbers go up. Can someone explain what this actually means? Is it just an incremental improvement, where some tests are solved a bit better, or is this a breakthrough where this model can do something all the others cannot?
That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
From an initial testing of my personal benchmark it works better than Gemini 2.5 pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state, and when the player has to do something, it asks me what card to play, discard, etc. The game is similar to something like Magic: The Gathering or Slay the Spire, with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it).
The test is just feeding the model the game rules document (Markdown) with a prompt asking it to simulate the game while delegating the player decisions to me; nothing special here.
It seems to forget rules less than Gemini 2.5 Pro with the thinking budget set to max. It's not perfect, but it helps a lot to test little changes to the game, rewind to a previous turn while changing a card on the fly, etc.
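For anyone curious, the harness is basically just a loop like the sketch below. `ask_model` is a hypothetical stand-in for whatever chat API you use, and the prompt wording is paraphrased, not my exact prompt.

```python
# Minimal sketch of the "LLM as game engine, human as player" loop described above.
# ask_model is a hypothetical helper: it should send the running transcript to
# whatever chat API you use and return the model's reply as text.
from typing import Dict, List

def ask_model(messages: List[Dict[str, str]]) -> str:
    raise NotImplementedError("wire this up to your chat API of choice")

def playtest(rules_markdown: str) -> None:
    messages = [{
        "role": "user",
        "content": (
            "Here are the rules of my card game:\n\n" + rules_markdown +
            "\n\nSimulate a full game. Track the board state yourself, but whenever the "
            "player must make a decision (play, discard, pay a cost), stop and ask me "
            "which cards to use, then wait for my answer before continuing."
        ),
    }]
    while True:
        reply = ask_model(messages)
        print(reply)
        messages.append({"role": "assistant", "content": reply})
        answer = input("> ")  # e.g. "discard cards 2 and 4 to pay for the Strike"
        if answer.strip().lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": answer})
```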
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.
Great improvement by only adding one feedback prompt:
Change the rotation axis of the wheels by 90 degrees in the horizontal plane. Same for the legs and arms
Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're actually going for. What does a good pelican-riding-a-bicycle SVG actually look like?
IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?
It would be next to impossible for anyone without insider knowledge to prove that to be the case.
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one-shot code.
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
Are you sure it's available in Cursor? (I get: "We're having trouble connecting to the model provider. This might be temporary - please try again in a moment.")
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort.
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
> Gemini app surpasses 650 million users per month
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can", especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice users make to go there, rather than an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else.
As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.
I don't know for sure, but they have to be counting users like me, whose phone had Gemini force-installed in an update and who have only opened the app by accident while trying to figure out how to invoke the old, actually useful Assistant app.
This is the benefit of bundling. I've been forecasting this for a long time: the only companies who would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI, due to sheer marketing dominance.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
For me it’s up and running.
I was doing some work in AI Studio when it was released and have already rerun a few prompts. It's also interesting that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.
You can bring your own Google API key to try it out, and Google used to give $300 free when signing up for billing and creating a key.
When I signed up for billing via the Cloud console and entered my credit card, I got $300 in "free credits".
I haven't thrown a difficult problem at Gemini 3 Pro yet, but I'm sure I got to see it in some of the A/B tests in AI Studio for a while. I could not tell which model was clearly better; one was always more succinct and I liked its "style", but they usually offered about the same solution.
I've been playing with the Gemini CLI with the Gemini 3 Pro preview. First impressions are that it's still not really ready for prime time within existing complex codebases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this, as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason. But then they also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, other than the way a typical test would (i.e., giving any answer at all is better than saying "I don't know", since the answer at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
Those numbers are too good to expect.
If 90% right / 10% wrong is the baseline, which of these would you take as an improvement (right / "I don't know" / wrong)?
- 80% right 18% I don't know 2% wrong
- 50%/48%/2%
- 10%/90%/0%
- 80%/15%/5%
The general point being that to reduce wrong answers, you will need to accept some reduction in right answers if the change is to be made purely through trade-offs. Otherwise you're just saying "I'd like a better system", which is rather obvious.
Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
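One way to make the comparison concrete is to score abstentions as zero and penalize wrong answers, the way some calibrated exams do. A sketch follows; the penalty weight is an arbitrary assumption, not something any benchmark prescribes.

```python
# Sketch of an abstention-aware score: +1 for a right answer, 0 for "I don't know",
# minus a penalty for a wrong answer. The penalty of 4 is an arbitrary choice; pick
# it to reflect how costly a confidently wrong answer is in your setting.
def penalized_score(right: float, idk: float, wrong: float, penalty: float = 4.0) -> float:
    assert abs(right + idk + wrong - 1.0) < 1e-9
    return right - penalty * wrong

splits = {
    "90 / 0 / 10 (baseline)": (0.90, 0.00, 0.10),
    "80 / 18 / 2":            (0.80, 0.18, 0.02),
    "50 / 48 / 2":            (0.50, 0.48, 0.02),
    "10 / 90 / 0":            (0.10, 0.90, 0.00),
    "80 / 15 / 5":            (0.80, 0.15, 0.05),
    "70 / 27 / 3":            (0.70, 0.27, 0.03),
}
for name, (r, i, w) in splits.items():
    print(f"{name:24s} -> {penalized_score(r, i, w):+.2f}")
# With penalty=4, 80/18/2 scores 0.72 vs 0.50 for the 90/0/10 baseline, matching the
# intuition that trading a few right answers for fewer wrong ones is an improvement.
```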
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising.
As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many topics are heavily censored. Obviously, a huge company like Google faces much heavier regulatory pressure than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.
Like replacing named concepts with nonsense words in reasoning benchmarks.
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl; riding a hang glider, a tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that secret itself, but more likely they just had humans generate an enormous number of examples and then synthetically built on that.
No benchmark is safe, when this much money is on the line.
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
I'm sure each of the frontier labs have some secret methods, especially in training the models and the engineering of optimizing inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (per 1M tokens)
Output: $10.00 -> $12.00 (per 1M tokens)
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
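For a rough sense of scale, here's a back-of-the-envelope sketch; the request shape is invented, and only the per-million prices come from the numbers above.

```python
# Back-of-the-envelope comparison for a single (invented) request shape, using the
# under-200k-context prices per million tokens quoted above.
def request_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

in_tok, out_tok = 50_000, 4_000                      # hypothetical agentic coding turn
old = request_cost(in_tok, out_tok, 1.25, 10.00)     # Gemini 2.5 Pro rates
new = request_cost(in_tok, out_tok, 2.00, 12.00)     # Gemini 3 Pro rates
print(f"old ${old:.4f} -> new ${new:.4f} ({100 * (new / old - 1):.0f}% increase)")
# old $0.1025 -> new $0.1480 (44% increase) for this input-heavy request
```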
Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.
I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker.
If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far; I'll just live without the desktop app.
I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.
Genuinely curious here: why is the desktop app so important?
I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?
Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.
(1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, e.g., I hit Option-SPC, which brings up a prompt bar. From there, taking a screenshot that will get sent along with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via a web UI. It may also be inconvenient to give context to a web UI in some cases where, e.g., I may have a folder of PDFs I want it to be able to reference.
(2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.
(3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.
I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
> The ability to add context via a local apps integration into OS level resources is big
Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
> Its own icon that I can CMD-TAB to is so much nicer
Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.
> having access to my chats for context has been repeatedly valuable to me.
I didn't at all consider this. Point ceded.
> I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.
> Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
Access to OS level resources != context pollution. You still have control, just more direct and less manual.
> The ones I've listed are open source and auditable.
Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.
But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s.
The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles.
Amazingly this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt.
The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem.
A reasonably decent programmer would solve the easiest problems in a very short amount of time.
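To give a flavour of the kind of language involved, here's a toy interpreter for a Herbert-like setup. To be clear, this is an illustrative approximation and not the real Herbert syntax: I'm assuming single-letter moves (s = step, l = turn left, r = turn right) and single-letter procedures that can call themselves.

```python
# Toy interpreter for a Herbert-LIKE language -- an illustrative approximation, not
# the real Herbert syntax. Assumed primitives: s = step forward, l = turn left,
# r = turn right. Single-letter procedures (e.g. "a": "sa") may call themselves;
# here recursion is simply cut off by a step budget.
from typing import Dict, List, Tuple

DIRS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # up, right, down, left

def run(program: str, procs: Dict[str, str],
        start: Tuple[int, int] = (0, 0), facing: int = 0,
        max_steps: int = 1000) -> List[Tuple[int, int]]:
    """Return the list of cells visited while executing `program`."""
    x, y = start
    visited = [(x, y)]
    pending = list(program)          # commands still to execute, front first
    steps = 0
    while pending and steps < max_steps:
        cmd = pending.pop(0)
        steps += 1
        if cmd == "s":
            dx, dy = DIRS[facing]
            x, y = x + dx, y + dy
            visited.append((x, y))
        elif cmd == "l":
            facing = (facing - 1) % 4
        elif cmd == "r":
            facing = (facing + 1) % 4
        elif cmd in procs:
            pending = list(procs[cmd]) + pending   # expand the procedure in place
    return visited

# "a" steps forward then calls itself: an endless straight line, capped by max_steps.
print(run("a", procs={"a": "sa"}, max_steps=12))
```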
Tested it on a bug that Claude and ChatGPT Pro struggled with; it nailed the diagnosis but only partially solved it (it was about matching data using a bipartite graph).
Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues.
For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better.
For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate Python files. I asked it for a website with a frontend and a Python backend, and it only gave a frontend. I asked again for a Python backend, and it just gives repeated server errors trying to write the Python files. Pretty shit experience.
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the Pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately; none of the other models did this, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
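For reference, the goal I'm asking the models to prove has roughly this shape in Lean 4 (the names are illustrative and my actual quicksort definition is omitted; `l.count a` is the number of occurrences of `a` in `l`, i.e. its multiplicity):

```lean
-- Shape of the goal only; the actual quicksort definition and proof attempt are
-- omitted (the `sorry`s mark the parts not shown here).
def quicksort : List Nat → List Nat := sorry

theorem quicksort_preserves_multiplicity (l : List Nat) (a : Nat) :
    (quicksort l).count a = l.count a := by
  sorry
```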
Hit the Gemini 3 quota on the second prompt in Antigravity even though I'm a Pro user. I highly doubt I hit a context window limit based on my prompt. Hopefully it's just first-day, near-general-availability jitters.
These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.
It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly).
You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.
Disallowing people from making terrible trades seems…paternalistic? Idk
The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets.
But what's the point of predicting how many times Elon will say "Trump" on an earnings call (or some random event Kalshi or Polymarket make up)? At least the stock market serves a purpose. People will claim "prediction markets are great for price discovery!" Ok. I'm so glad we found out the chance of Nicki Minaj saying "Bible" during some recent remarks. In case you were wondering, the chance peaked at around 45% and she did not say 'bible'! She passed up a great opportunity to buy the "yes" and make a ton of money!
I agree that the "will [person] say [word]" markets are stupid. "Will Brian Armstrong say the word 'Bitcoin' in the Q4 earnings call" is a stupid market because nobody a actually cares whether or not he actually says 'Bitcoin', they care about whether or not Coinbase is focusing on Bitcoin. If Armstrong manipulates the market by saying the words without actually doing anything, nobody wins except Armstrong. "Will Coinbase process $10B in Bitcoin transactions in Q4" is a much better market because, though Armstrong could still manipulate the market's outcome, his manipulation would influence a result that people actually care about. The existence of stupid markets doesn't invalidate the concept.
And? Insider trading is bad because it's unfair, and the stock market is supposed to be fair. Prediction markets are not fair. If you are looking for a fair market, prediction markets are not that. Insider trading is accepted and encouraged in prediction markets because it makes the predictions more accurate, which is the entire point.
By 'fair', I mean 'all parties have access to the same information'. The stock market is supposed to give everyone the same information. Trading with privileged information (insider trading), is illegal. Publicly traded companies are required to file 10-Qs and 10-Ks. SEC rule 10b5-1 prohibits trading with material non-public information. There are measures and regulations in place to try to make the stock market fair. There are, by design, zero such measures with prediction markets. Insider trading improves the accuracy of prediction markets, which is their whole purpose to begin with.
> Brian Armstrong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call.
For the kind of person playing these sorts of games, that's actually really "hype".
I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released.
The mention markets are pure degenerate gambling and everyone involved knows that
Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all)
I would like to try controlling my browser with this model. Any ideas how to do this?
Ideally I would like something like OpenAI's Atlas or Perplexity's Comet, but powered by Gemini 3.
They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.
Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).
I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.
"Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
They probably invested a couple of billion into this release (it's great as far as I can tell), but can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you merely return to a tab that has already finished generating).
entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Even if the benchmarks themselves are kept secret, the process to create them is not that difficult, and anyone with a small team of engineers could make a replica in their own lab to train their models on.
Given the nature of how those models work, you don't need exact replicas.
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
Really exciting results on paper. But truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data used from users. Hope the data used to train was obtained with consent from the creators
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.
It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.
It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is January 2025. I do wonder how long they have been pre-training this thing :D
If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's AI Vertex platform. If you don't even want a Google Cloud account, then I think the answer is no unless they announce a partnership with an inference cloud like cerebras.
I still need a Google account to use it, and it always asks me for phone verification, which I don't want to give to Google. That prevents me from using Gemini. I would even pay for it.
While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required.
Boring. Tried to explore sexuality-related topics, but Alphabet is stuck in some Christian Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models, and it seems you have to be extra careful with wording with Gemini. Once you have the right wording, like "explore my own sexuality", and avoid certain words, you can get it going pretty interestingly.
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
Interesting that they added an option to select your own API key right in AI Studio's input field.
I sincerely hope the times of generous free AI Studio usage are not over.
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
> We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".
I've seen more and more of these kinds of hallucinations. As these models seem to be RL'd to not be sycophants, they're slowly inching in the opposite direction, where they tell small fibs or embellish in a way that seems meant to add more weight to their answers.
I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise
I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro has been clearly smarter than everything else for a long time, but now 3 seems not even as good as that in comparison to the latest releases (5.1 etc.). What is going on?
I'm not a mathematician, but I think we underestimate how useful pure mathematics can be for telling whether we are approaching AGI.
Can the mathematicians here try asking it to invent novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be gated to a maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.
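For context, min_p itself is trivial to implement when you control the decoding loop. A minimal sketch of the idea, assuming you have raw next-token probabilities from a local model (hosted APIs like AI Studio don't expose them):

```
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Keep tokens whose probability is at least min_p * p(top token), then renormalize."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# Toy example: with a peaked distribution few tokens survive the filter, with a flat
# one many do, which is why min_p tolerates much higher temperatures than a fixed
# top_k or top_p cutoff.
rng = np.random.default_rng(0)
logits = rng.normal(size=50)
probs = np.exp(logits) / np.exp(logits).sum()
token_id = rng.choice(len(probs), p=min_p_filter(probs, min_p=0.1))
print(token_id)
```

The point is that the filter adapts to the shape of the distribution, which a fixed top_k (like the hidden 64 here) does not.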
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
Yea, great... when will I be able to have it dial a number on my Google Pixel? Seriously... Gemini absolutely sucks on Pixel since it can't interact with the phone itself, so it can't dial numbers.
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail is being read by Gemini and is being put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze users' data, which potentially includes Gmail by default.
Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.
I'm very doubtful Gmail messages are used to train the model by default, because emails contain private data, and as soon as that private data shows up in model output, Gmail is done.
"Gmail being read by Gemini" does NOT mean "Gemini is trained on your private Gmail correspondence". It can mean Gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.
I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
By the year 2025, I think most HN regulars and IT people in general are so jaded about privacy that it doesn't even surprise anyone. I suspect all Gmail has been analyzed and read since the beginning of the Google age, so nothing really changed; they might as well just admit it.
Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.
It seems that Google didn't prepare the Gemini 3 release well and leaked a lot of content early, including the model card earlier today and Gemini 3 on aistudio.google.com.
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story.
https://xcancel.com/xundecidability/status/19908286970881311...
He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy.
> Eliezer Yudkowsky is a central figure in the article, mentioned multiple times as the intellectual originator of the community from which the "Zizians" splintered. His ideas and organizations are foundational to the entire narrative.
And yet you could eliminate him entirely and the story is still coherent.
The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.
It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let us just return to the discussion of Gemini 3: do you think the model did a bad job, then, in its second response?
It literally does not matter how much they are connected out here in reality; the AI was asked to summarize the information in the article, and that is exactly what it did.
>do you think the model did a bad job then in it's second response
Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.
Also, that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure in the story.
Here's another attempt:
llm --cid 01kabxtjq10exgk56yf802028f "I notice you did not mention Yudkowsky?" --no-log -m gem3-t1
Based on the text provided, Eliezer Yudkowsky is a central background figure to this story, serving as the intellectual progenitor of the movement from which the Zizians splintered.
Here is specifically how he features in the article:
* *The Founder:* Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.
* *The Source Code:* The Zizians based their radical worldview on Yudkowsky’s concepts, specifically *"timeless decision theory."* However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky’s organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.
Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky.
Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky.
It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and gpt-5.1 and sonnet 4.5 (thinking) do not come close.
The token efficiency and context are also mindblowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real-world applications that are larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory allows them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.
Out of curiosity, I gave it the latest project euler problem published on 11/16/2025, very likely out of the training data
Gemini thought for 5m10s before giving me a python snippet that produced the correct answer. The leaderboard says that the 3 fastest human to solve this problem took 14min, 20min and 1h14min respectively
Even thought I expect this sort of problem to very much be in the distribution of what the model has been RL-tuned to do, it's wild that frontier model can now solve in minutes what would take me days
I also used Gemini 3 Pro Preview. It finished it 271s = 4m31s.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.
Still a useful tool though. It definitely gets the majority of the insights.
Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
> It also returned 8 "sources"
Well, there's your problem. It behaves like a search summary tool and not like a problem solver if you enable Google Search.
Exactly this - and how chatGPT behaves too. After a few conversations with search enabled you figure this out, but they really ought to make the distinction clearer.
The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you.
I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild.
On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design.
Imagine the metrics though. "This quarter we've had a 12% increase in people using AI solutions in their Google Drive."
Why is this sad? You should be rooting for these LLMs to be as bad as possible.
If we've learned anything so far it's that the parlor tricks of one-shot efficacy only gets you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart roughly the same. Even when I've used Sonnet 4.5 with 1M token context the model starts to flake out and get confused with a codebase of less than 10k LoC. Everyone seems to keep claiming these huge leaps and bounds, but I really have to wonder how many of these are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple, yet not well documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
It depends on your definition of safe. Most of the code that gets written is pretty simple — basic crud web apps, WP theme customization, simple mobile games… stuff that can easily get written by the current gen of tooling. That already has cost a lot of people a lot of money or jobs outright, and most of them probably haven’t reached their skill limit a as developers.
As the available work increases in complexity, I reckon more will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are which is easy and reliable; the current choice is upskill or get a new career. Rather than switch careers to something you have zero experience in. That puts pressure on the moderately higher-skill job market with far fewer people, and they start to upskill to outrun the implosion, which puts pressure on them to move upward, and so on. With even modest productivity gains in the whole industry, it’s not hard for me to envision a world where general software development just isn’t a particularly valuable skill anymore.
Everything in tech is cyclical. AI will be no different. Everyone outsourced, realized the pain and suffering and corrected. AI isn't immune to the same trajectory or mistakes. And as corporations realize that nobody has a clue about how their apps or infra run, you're one breach away from putting a relatively large organization under.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co are going to take blame for Billy's shitty vibe coded breach? Yeah right. Welcome to the real world of the enterprise where you still need a real throat to choke to show your leadership skills.
Generally, any expert hopes their tool/paintbrush/etc is as performant as possible.
That ship has sailed long ago.
I'm rooting for biological cognitive enhancement through gene editing or whatever other crazy shit. I do not want to have some corporation's AI chip in my brain.
Confirmed less wrong psyop victim
Rooting is useless. We should be taking conscious action to reduce the bosses' manipulation of our lives and society. We will not be saved by hoping to sabotage a genuinely useful technology.
How is it useful, other than for people making money off token output? Continue to fry your brain.
They’re fantastic learning tools, for a start. What you get out of them is proportional to what you put in.
You’ve probably heard of the Luddites, the group who destroyed textile mills in the early 1800s. If not: https://en.wikipedia.org/wiki/Luddite
Luddites often get a bad rap, probably in large part because of employer propaganda and influence over the writing of history, as well as the common tendency of people to react against violent means of protest. But regardless of whether you think they were heroes, villains, or something else, the fact is that their efforts made very little difference in the end, because that kind of technological progress is hard to arrest.
A better approach is to find ways to continue to thrive even in the presence of problematic technologies, and work to challenge the systems that exploit people rather than attack tools which can be used by anyone.
You can, of course, continue to flail at the inevitable, but you might want to make sure you understand what you’re trying to achieve.
are you pretending to be confused?
Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low
I have had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.
ChatGPT solves this problem now as well with 5.1. Time for a new litmus test.
To be fair a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I’d love if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.
I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.
I'm not explaining myself right.
Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.
Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
I don't predict the future and I'm very skeptical of anybody who claims to do so, correctly predicting the present is already hard enough, I'm just saying that given the progress we've already made I would find plausible that a system like that could be made in a few years. The details of what it would look like are beyond my pay grade.
---
[0] With caveats in endgames, closed positions and whatnot, I'm using it as an example.
Yeah, it is often pointed out as a brilliancy in game analysis if a GM makes a move that an engine says is bad and it turns out to be good. However, it only happens in very specific positions.
> Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good.
Do you have any links? I haven't seen any such (forget GM, not even Magnus), barring the opponent making mistakes.
Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?
Both, but perhaps more often neither.
From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but requires the opponent to see it 45 moves ahead of time and play something that would otherwise appear to be completely sub-optimal until something like 35 moves in, at which point normal peak grandmasters would finally go "oh okay now I get the point of all of that confusing behavior, and I can now see that I'm going to get mated in 10 moves".
So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.
That said, this probably isn't what GP was referring to, as it wouldn't lead to an assignment of a "brilliant" move simply for failing to see the impossible-to-actually-play line.
This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.
Poker is funny because you have humans emulating human-beating machines, but that’s hard enough to do that you have players who don’t do this win as well.
I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.
It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare those days. It was funny in certain positions a few years back to see the engine "change its mind" including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.
But mostly what happens is that a move isn't so good, but it isn't so bad either; even though the computer will tell you it is sub-optimal, a human won't be able to refute it in finite time, so his practical (as opposed to theoretical) chances are reduced. One great recent example is Pentala Harikrishna's queen sacrifice in the World Cup: an amazing conception, a move the computer says is borderline incorrect, but one that leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.
It can be either one. In closed positions, it is often the latter.
It's only the latter if it's a weak browser engine, and it's early enough in the game that the player had studied the position with a cloud engine.
I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).
That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)
The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.
If you look on Youtube there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
I recommend Matthew Sadler's Game Changer and The Silicon Road To Chess Improvement.
You explained yourself right. The issue is that you keep qualifying your statements.
> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...
> ... in tournament conditions.
I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?
I can wreck stockfish in chess boxing. Mostly because stockfish can't box, and it's easy for me to knock over a computer.
If it runs on a mainframe you would lose both the chess and the boxing.
The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.
How are they faster? I don't think any Elo report actually comes from participating in a live coding contest on previously unseen problems.
My background is more on math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.
Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.
Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970
I just had ChatGPT explain that problem to me (I was unfamiliar with the mathematical background). It showed how to derive closed-form answers for H(2) and H(3) and then numerical solutions using RK4 for higher values. Truly impressive, and it explained the derivations beautifully. There are few maths experts I've encountered who could have hand-held me through it as well.
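For anyone who hasn't met RK4 before: it's just the classical fixed-step Runge-Kutta integrator, nothing exotic. A generic sketch (not the specific ODE from the problem, which I won't reproduce here):

```
def rk4_step(f, t, y, h):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y) with step size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Sanity check on dy/dt = y with y(0) = 1: integrating to t = 1 should give
# something very close to e = 2.718281828...
y, h = 1.0, 0.01
for i in range(100):
    y = rk4_step(lambda t, y: y, i * h, y, h)
print(y)
```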
The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5 and 1.0% for GPT 5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
This is very obviously an AI generated comment.
“This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.”
Not sure about the motive here.
Hmmm, I wrote those words myself, maybe I've spent too much time with LLMs and now I'm talking like them??
I'd be interested in any evidence-based arguments you might have beyond attacking my writing style and insinuating bad intent.
I found this commenter had sage advice about how to use HN well, I try to follow it: https://news.ycombinator.com/item?id=38944467
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many peoples' brains.
I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p
Is there a solution to this exact problem, or to related notions (renewal equations etc.)? Anyway, it seems like nothing beats training on the test set.
gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that.
Are you sure it did not retrieve the answer using websearch?
Did it search the web?
Yeah, LLMs used to not be up to par for new Project Euler problems, but GPT-5 was able to do a few of the recent ones which I tried a few weeks ago.
So when does the developer admit defeat? Do we have a benchmark for that yet?
According to a bunch of philosophers (https://ai-2027.com/), doom is likely imminent. Kokotajlo was on Breaking Points today. Breaking Points is usually less gullible, but the top comment shows that "AI" hype strategy detection is now mainstream (https://www.youtube.com/watch?v=zRlIFn0ZIlU):
AI researcher: "Just another trillion dollars. This time we'll reach superintelligence, I swear."
I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive.
Does it matter if it is out of the training data? The models integrate web search quite well.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
Wow. Sounds pretty impressive.
This is wild. I gave it some legacy XML describing a formula-driven calculator app, and it produced a working web app in under a minute:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
This is quite likely to be in the training data, since it's one of the projects in Wes Bos's free 30 days of Javascript course[0].
[0] https://javascript30.com/
I was under the impression for this to work like that, training data needs to be plenty. One project is not enough since it’s too "sparse".
But maybe this example was used by many other people and so it proliferated?
The repo[0] currently has been forked ~41300 times.
[0] https://github.com/wesbos/JavaScript30
The subtle "wiggle" animation that the second hand makes after moving doesn't fire when it hits 12. Literally unwatchable.
In its defence, the code actually specifically calls that edge case out and justifies it:
The Swiss and German railway clocks actually work the same way and stop for (half a?) second while the minute hand advances.
https://youtu.be/wejbVtj4YR0
The video shows closer to 2 seconds for it to finally throw itself over in what could only be described as a "Thunk". I figured it would be a little more smooth.
Station clocks in Switzerland receive a signal from a master clock each minute that advances the minute hand; the second hand moves completely independently of the minute hand. This allows them to sync to the minute.
> The station clocks in Switzerland are synchronised by receiving an electrical impulse from a central master clock at each full minute, advancing the minute hand by one minute. The second hand is driven by an electrical motor independent of the master clock. It takes only about 58.5 seconds to circle the face; then the hand pauses briefly at the top of the clock. It starts a new rotation as soon as it receives the next minute impulse from the master clock.[3] This movement is emulated in some of the licensed timepieces made by Mondaine.
https://en.wikipedia.org/wiki/Swiss_railway_clock
In defense of 2.5 (Pro, at least), it was able to generate for me a metric UNIX clock as a webpage, which I was amused by. It uses kiloseconds/megaseconds/etc.; there are 86.4 ks/day. The "seconds" hand goes around every 1000 seconds, which ticks over the "hour" hand. Instead of saying 4 am, you'd say it's 14.
As a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. Of course, this can work off-Earth, where you would simply use 88.775 ks as the "day"; the "dates" a Martian and an Earthling share with each other would be interchangeable.
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
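If anyone wants to play with the idea without the webpage, the conversion itself is a one-liner per unit. A rough sketch (illustrative only; off-Earth you'd swap 86,400 s for the 88.775 ks Martian day):

```
import time

def metric_unix(t: float) -> str:
    """Break a UNIX timestamp into gigaseconds / megaseconds / kiloseconds / seconds."""
    gs, rem = divmod(t, 1_000_000_000)
    ms, rem = divmod(rem, 1_000_000)
    ks, s = divmod(rem, 1_000)
    return f"{int(gs)} Gs {int(ms)} Ms {int(ks)} ks {s:05.2f} s AUNIX"

now = time.time()
print(metric_unix(now))  # currently a bit over 1.76 Gs since the epoch
print(f"{(now % 86_400) / 1_000:.3f} ks into the current 86.4 ks Earth day")
```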
https://ai.studio/apps/drive/1oGzK7yIEEHvfPqxBGbsue-wLQEhfTP...
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
That is not the same prompt as the other person was using. In particular this doesn't provide the time to set the clock to, which makes the challenge a lot simpler. This also includes javascript.
The prompt the other person was using is:
```
Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.
```
Which is much more difficult.
For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...
URL not found :(
This is cool. Gemini 2.5 Pro was also capable of this. Gemini was able to recreate famous piece of clock artwork in July: https://gemini.google.com/app/93087f373bd07ca2
"Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo
https://ai.studio/apps/drive/1yAxMpwtD66vD5PdnOyISiTS2qFAyq1... <- this is very nice, I was able to make seconds smooth with three iterations (it used svg initially which was jittery, but eventually this).
"Allow access to Google Drive to load this Prompt."
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
You don't want to give Google access to files you've stored in Google Drive? It's also only access to an application specific folder, not all files.
Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and it probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info.
Because most likely (at least according to Hanlon's razor) they somehow decided that using Google Drive as the only persistent storage backing AI studio was a reasonable UX decision.
It probably makes some sense internally in big tech corporation logic (no new data storage agreements on top of the ones the user has already agreed to when signing up for Drive etc.), but as a user, I find it incredibly strange too – especially since the text chats are in some proprietary format I can't easily open on my local GDrive replica, but the images generated or uploaded just look like regular JPEGs and PNGs.
It looks quite nice, though to nitpick, it has “quartz” and “design & engineering” for no reason.
Just like actual cheap but not bottom of the barrel clocks
holy shit! This is actually a VERY NICE clock!
Having seen the page the other day this is pretty incredible. Does this have the same 2000 token limit as the other page?
This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so.
It also includes javascript, which was verboten in the original prompt, and it doesn't specify the time the clock should be set to.
Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/
Considering how important this benchmark has become to the judgement of state-of-the-art AI models, I imagine each AI lab has a dedicated 'pelican guy', a highly accomplished and academically credentialed person who's working around the clock on training the model to make better and better SVG pelicans on bikes.
That would mean my dastardly scheme has finally come to fruition: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Pelican guy may be one of the last jobs to be automated
They've been training for months to draw that pelican, just for you to move the goalposts.
Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.
I don't think the big labs would waste their time on it. If a model is great at making the pelican but sucks at all other svg it becomes obvious. But so far the good pelicans are strong indicators of good general SVG ability.
Unless training on the pelican increases all SVG ability, then good job.
I absolutely think they would given the amount of money and hype being pumped into it.
It's interesting that you mentioned on a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But now looking at your updated benchmark results, I'm not sure I agree. Have the main labs been climbing the Pelican on a bike hill in secret this whole time?
I updated my benchmark of 30 pelican-bicycle alternatives that I posted here a couple of weeks ago:
https://gally.net/temp/20251107pelican-alternatives/index.ht...
There seem to be one or two parsing errors. I'll fix those later.
I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
The model card says: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
> This model is not a modification or a fine-tune of a prior model.
I'm curious why they decided not to update the training data cutoff date too.
Maybe that date is a rule of thumb for when AI generated content became so widespread that it is likely to have contaminated future data. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.
>>benchmarks are meaningless
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.
This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses. It takes time for people to really get to grips with a new model, not just a few prompt comparisons where luck and prompt selection will play a big role.
I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have, all things considered? That is, saying something to the effect of "well, I know it's not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if it's not totally unbelievable, I guess.
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
After taking a walk for a bit, I decided you're right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I've run.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.
My bad to the Google team for the cursory brush-off.
Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.
> This probably means my test is a little too niche.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.
Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.
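Mechanically it's just two calls chained together. A sketch of that flow, where `call_model` is a stand-in for whatever client you actually use (a hypothetical helper, not a real API):

```
def call_model(prompt: str) -> str:
    """Placeholder for your actual LLM client call (SDK, HTTP request, CLI, etc.)."""
    raise NotImplementedError

def two_stage(task: str) -> str:
    # Stage 1: ask the model to write a detailed spec prompt for the task,
    # explicitly calling out edge cases and bug-prone implementation details.
    spec = call_model(
        f"Create a prompt that creates a specification for: {task}. "
        "Consider edge cases and key implementation details that result in bugs."
    )
    # Stage 2: execute the generated spec as the actual prompt.
    return call_model(spec)

# e.g. two_stage("a pacman game in a single html page")
```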
I thought this kind of chaining was already part of these systems.
It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.
Your benchmarks should not involve IP.
The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.
It's not an ethics thing. It's a guardrails thing.
That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").
Why? This seems like a reasonable task to benchmark on.
Because you hit guard rails.
Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.
correction: pacman is not a human and has no soul.
Why do you have to willfully misinterpret the person you're replying to? There's truth in their comment.
How can you be sure that your benchmark is meaningful and well designed?
Is the only thing that prevents a benchmark from being meaningful publicity?
I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
Thanks
[0] https://news.ycombinator.com/item?id=45968665
> if it's not public, presumably LLMs would never get better at them.
Why? This is not obvious to me at all.
You're correct of course - LLMs may get better at any task of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task. If the eval was actually picked up / used in the training loop, of course.
That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
Yeah, great point.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
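As a toy sketch of that idea (dynamically generated questions with a definitive, programmatically known answer), assuming a much harder problem family than this in practice:

```
import random

def make_question(rng: random.Random) -> tuple[str, int]:
    """Generate a small word problem whose answer is computed, not memorized."""
    boxes, items, removed = rng.randint(15, 50), rng.randint(2, 50), rng.randint(1, 10)
    question = (
        f"A warehouse has {boxes} boxes with {items} items each. "
        f"{removed} boxes are removed. How many items remain?"
    )
    return question, (boxes - removed) * items

def score(model_reply: str, expected: int) -> bool:
    """Crude check: does the expected number appear in the model's reply?"""
    return str(expected) in model_reply

rng = random.Random(42)
q, a = make_question(rng)
print(q, "->", a)
```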
Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.
A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models....
GPT-4/o3 might be the best we will ever have
I moved from using the model for Python coding to Go coding and got incredible speedups in arriving at a correct version of the code
Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?
I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much
it's easy to focus on what they can't do
Everything is about context. When you ask a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. That's where all the "thinking" goes.
Ask it to implement tic-tac-toe in Python for the command line. Or even just bring your own tic-tac-toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
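For what it's worth, the "bring your own code" baseline can be tiny. A minimal sketch of the kind of board/win-check helper you might paste in (purely illustrative, not any particular poster's code):

```python
# Minimal command-line tic-tac-toe helpers to hand the model as context.
def render(board):
    rows = [" | ".join(board[i:i + 3]) for i in (0, 3, 6)]
    return "\n---------\n".join(rows)

def winner(board):
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

if __name__ == "__main__":
    board = list("X O XO  X")  # example position with X winning on the diagonal
    print(render(board))
    print("winner:", winner(board))
```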
The prompt was very concrete: draw a tic-tac-toe ASCII table and let's play. Gemini 2.5 thought for pages about particular moves.
What's the benchmark?
I don't think it would be a good idea to publish it on a prime source of training data.
He could post an encrypted version and post the key with it to avoid it being trained on?
What makes you think it wouldn't end up in the training set anyway?
Every AI corp has people reading HN.
I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today.
Good personal benchmarks should be kept secret :)
nice try!
You already sent the prompt to the Gemini API, and they likely recorded it. So in a way they can access it anyway; posting it here or not doesn't matter in that respect.
curious if you tried grok 4.1 too
Could also just be rollout issues.
Could be. I'll reply to my comment later with pass/fail results of a re-run.
I'm dying to know what you're giving it that it's choking on. It's actually really impressive if that's the case.
I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in a filename), and a near-complete inability to do relatively basic image processing tasks (if they don't rely on template matches).
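The vector-projection case is the easiest one to sanity-check against a reference. A minimal sketch of the standard formula (numpy assumed, names are mine):

```python
import numpy as np

def project(a, b):
    """Orthogonal projection of vector a onto vector b: (a·b / b·b) * b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a @ b) / (b @ b) * b

# Quick sanity check: projecting (3, 4) onto the x-axis keeps only the x component.
assert np.allclose(project([3, 4], [1, 0]), [3, 0])
```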
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
that's why everyone using AI for code should code in rust only.
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.
Something like
1. Split audio into multiple smaller tracks.
2. Perform a first-pass audio extraction.
3. Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off).
4. Seed the next stage with that information (yay multimodality) and generate the audio transcript for it.
Obviously it would be ideal if a model could handle the ultra long context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution.
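Something like this, as a rough sketch of that harness; split_audio and transcribe_chunk are hypothetical placeholders for your audio tooling and model API, not real library calls:

```python
# Chunk the audio, then carry a rolling summary and a speaker roster forward
# so each chunk is transcribed with context from the previous ones.

def split_audio(path, chunk_minutes):
    # Placeholder: return handles to smaller tracks.
    raise NotImplementedError

def transcribe_chunk(audio, known_speakers, context_summary):
    # Placeholder: one multimodal call returning segments, speakers, and a summary.
    raise NotImplementedError

def transcribe_long_recording(path, chunk_minutes=10):
    speakers, summary, transcript = {}, "", []
    for chunk in split_audio(path, chunk_minutes):   # 1) smaller tracks
        result = transcribe_chunk(                    # 2) per-chunk pass
            audio=chunk,
            known_speakers=speakers,                  # 3) carry the roster forward
            context_summary=summary,                  # 4) seed with where we left off
        )
        speakers.update(result["speakers"])
        summary = result["summary"]
        transcript.extend(result["segments"])
    return transcript
```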
The worst is when it fails to ingest simple PDF documents and then lies and gaslights in an attempt to cover it up. Why not just admit you can't read the file?
This is specifically why I don't use Gemini. The gaslighting is ridiculous.
I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.
What prompt do you use for that?
I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.
3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:
Super impressive! Does it deduce everyone's name?
It does! I redacted them, but yes. This was a 3-person call.
I made a simple webpage to grab text from YouTube videos (https://summynews.com), which is great for this kind of testing. (I want to expand to other sources in the long run.)
It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.
Parakeet TDT v3 would be really good at that
Yes, this is the best solution for that goal. Use the MacWhisper app + Parakeet 3.
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
Excellent reference. I tried to name an AI project at work "Electric Monk" but it was too 'controversial'.
Had to change it to "Electric Mentor"....
SMBC had a pretty great take on this: https://www.smbc-comics.com/comic/summary
This feels too real to laugh at
Now let’s hope that it will also save labour on resolving cloud infrastructure downtimes too.
After outsourcing developer jobs, we can outsource all of the manager jobs too, leaving the CEO with agentic AI code as their servant.
Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
"Were automating with AI" sounds better to investors than "We over hired and now need to downsize" or "We made some bad market bets, now need to free up cash flow"
> Not sure what you mean here, but the only real jobs at risk from AI right now are middle/upper management.
> Not a single engineer has ever been laid off because of AI. Any company claiming this is the case is trying to cover up bad decisions.
I don't suppose these assertions are based on anything. If "AI" reduces the amount of time an engineer spends writing crud, boilerplate, test cases, random scripts, etc., and they have 5% more time to do other things, then all else being equal a project can be done with 5% fewer engineers.
Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
> Does AI result in greater productivity for engineers, and does greater productivity per person mean demand can be satisfied with fewer people?
Between the disagreements regarding performance metrics, the fact that AI will happily increase its own scope of work as well as facilitate increasing any task's, sprint's, or project's scope of work, and Jevons Paradox, the world may never know the answer to either of these questions.
It does improve productivity, just like a good IDE. But engineers didn't get replaced by IDEs and they haven't yet been replaced by AI.
By the time it's good enough to replace actual engineers, any job done in front of a computer will be at risk. I'm hoping that will happen at the same time as AI embodiment in robots; then every job will be automated, not just computer-based ones.
Your assertion was not that "an engineer has never been replaced by AI". It is that no engineer has been laid off because of AI.
You agree AI improves engineer productivity. So last remaining question is, does greater productivity mean that fewer people are required to satisfy a given demand?
The answer is yes of course. So at this point, supporting the assertion requires handwaving about shortages and induced demand and demand for engineers to develop and support AI and so on. Which are all reasonable, but it should become pretty apparent that you can't be confident in an assertion like that. I would say it's pretty likely that AI has resulted in engineers being laid off in specific instances if not the net numbers.
"Not a single engineer has ever been laid off because of AI."
Are you insane??? Big tech has literally made some of its biggest layoffs over the past few months.
That's because of overhiring and other non-ai related reasons (i.e. Higher interest rates means less VC funding available).
In reality, getting AI to do actual human work, as of the moment, takes much more effort and cost than you get back in cost savings. These companies will claim they are using AI, even if it's just a few engineers using Windsurf.
The companies claim AI is the reason they laid off engineers to make it look like they're innovating, not downsizing, which makes them look better in the eyes of investors and shareholders.
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me.
Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg however is clearly a leg, despite being where you would expect the dogs member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve.
I think in this case, tokenization and perception are somewhat analogous. It is probably the case that our current tokenization schemes are really simplistic compared to what nature is working with, if you allow the analogy.
The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying.
(Source: "The mind is flat" by Nick Chater)
It's also easy to spot: when you are tired you might misrecognize objects. I've caught myself doing this on long road trips.
Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think matching the abilities of even a bug's brain might be very hard, but that doesn't mean there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms, or whatever the precise jargon is.
> How do brains do it with such a low amount of energy?
Physical analog chemical circuits whose physical structure directly is the network, and use chemistry/physics directly for the computations. For example, a sum is usually represented as the number of physical ions present within a space, not some ALU that takes in two binary numbers, each with some large number of bits, requiring shifting electrons to and from buckets, with a bunch of clocked logic operations.
There are a few companies working on more "direct" implementations of inference, like Etched AI [1] and IBM [2], for massive power savings.
[1] https://en.wikipedia.org/wiki/Etched_(company)
[2] https://spectrum.ieee.org/neuromorphic-computing-ibm-northpo...
This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet.
My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption.
My armchair is equally comfy, and I have an actual paper to point to:
Jaxley: Differentiable simulation enables large-scale training of detailed biophysical models of neural dynamics [1]
They basically created software to simulate real neurons and ran some realistic models to replicate typical AI learning tasks:
"The model had nine different channels in the apical and basal dendrite, the soma, and the axon [39], with a total of 19 free parameters, including maximal channel conductances and dynamics of the calcium pumps."
So yeah, real neurons are a bit more complex than ReLU or Sigmoid.
[1] https://www.biorxiv.org/content/10.1101/2024.08.21.608979v2....
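To make the contrast concrete, even a leaky integrate-and-fire neuron (still far simpler than the nine-channel biophysical models in the paper) already has state and dynamics that a ReLU lacks. A toy sketch, my own numbers:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def lif_spikes(inputs, tau=20.0, v_thresh=1.0, dt=1.0):
    """Toy leaky integrate-and-fire neuron: membrane state decays, spikes reset it."""
    v, spikes = 0.0, []
    for i in inputs:
        v += dt * (-v / tau + i)   # leaky integration of input current
        if v >= v_thresh:
            spikes.append(1)
            v = 0.0                # reset after a spike
        else:
            spikes.append(0)
    return spikes

print(relu(np.array([-1.0, 0.5])))   # stateless: negative input just clips to 0
print(lif_spikes([0.3] * 10))        # stateful: spike timing depends on input history
```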
Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers.
ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement.
Op is right.
https://imgcdn.stablediffusionweb.com/2024/4/19/8e54925a-004...
For the above pic I asked "What is wrong with the image?"
Result: - It totally missed the most obvious one - six fingers.
Instead it said this:
Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image:
- The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb.
- Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI.
- Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand.
- Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose.
[Edit: 3.0 is same as 2.5 - both answered almost identically]
JFYI, Qwen managed to recognize the sixth finger:
Max: https://chat.qwen.ai/c/ca671562-7a56-4e2f-911f-40c37ff3ed79
VL-235B: https://chat.qwen.ai/c/21cc5f4e-5972-4489-9787-421943335150
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
What I would do if I was in the position of a large company in this space is to arrange an internal team to create an ARC replica, covering very similar puzzles and use that as part of the training.
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I think this is also fair to use any means to beat it.
I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests.
However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training.
The real strength of current neural nets/transformers relies on huge datasets.
ARC does not provide this kind of dataset, only a small public one and a private one where they do the benchmarks.
Building your own large private ARC set does not seem too difficult if you have enough resources.
Doesn't even matter at this point.
We have a global RL pipeline on our hands.
If there is something new an LLM/AI model can't solve today, plenty of humans can't either.
But tomorrow every LLM/AI model can solve it, and again plenty of humans still can't.
Even if AGI is just the sum of companies adding more and more training data, as long as this learning pipeline becomes faster and easier to train with new scenarios, it will start to bleed out the humans in the loop.
That's ok; just start publishing your real problems to solve as "AI benchmarks" and then it'll work in ~6 months.
Is "good at benchmarks instead of real world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed and then move on. That's not great for Google
If they're memory/reference-constrained systems that can't directly "store" every solution, then doing well on benchmarks should result in better real-world/reasoning performance, since the lack of a memorized answer requires understanding.
Like with humans [1], generalized reasoning ability lets you skip the direct storage of that solution, and many many others, completely! You can just synthesize a solution when a problem is presented.
[1] https://www.youtube.com/watch?v=f58kEHx6AQ8
Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but currently its a race to lock-in users to your model and make switching costs high.
Benchmarks are intended as proxy for real usage, and they are often useful to incrementally improve a system, especially when the end-goal is not well-defined.
The trick is to not put more value in the score than what it is.
> internal team to create an ARC replica, covering very similar puzzles
they can target benchmark directly, not just replica. If google or OAI are bad actors, they already have benchmark data from previous runs.
The 'private' set is just a pinkie promise not to store logs or not to use the logs when the evaluator uses the API to run the test, so yeah. It's trivially exploitable.
Not only do you have the financial self-interest to do it (helps with capital raising to be #1), but you are worried that your competitors are doing it, so you may as well cheat to make things fair. Easy to do and easy to justify.
Maybe a way to make the benchmark more robust to this adversarial environment is to introduce noise and random red herrings into the question, and run the test 20 times and average the correctness. So even if you assume they're training on it, you have some semblance of a test still happening. You'd probably end up with a better benchmark anyway which better reflects real-world usage, where there's a lot of junk in the context window.
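A rough sketch of that protocol; the grader is a simplistic substring check and ask_model is whatever callable wraps your model API, both assumptions of mine:

```python
import random

def perturb(question, noise_snippets, n_noise=3, seed=None):
    """Inject irrelevant 'red herring' sentences into the prompt and shuffle."""
    rng = random.Random(seed)
    parts = [question] + rng.sample(noise_snippets, n_noise)
    rng.shuffle(parts)
    return "\n".join(parts)

def robust_score(question, expected, ask_model, noise_snippets, runs=20):
    """Average correctness over many perturbed variants of the same question."""
    correct = 0
    for seed in range(runs):
        answer = ask_model(perturb(question, noise_snippets, seed=seed))
        correct += int(expected in answer)   # crude grading for illustration only
    return correct / runs
```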
they have two sets:
- semi-private, which they use to test proprietary models and which could be leaked
- private: used to test downloadable open-source models.
The ARC-AGI prize itself is for open-source models.
My point is that it does not matter if the set is private or not.
If you want to train your model you'd need more data than the private set anyway. So you have to build a very large training set on your own, using the same kind of puzzles.
It is not that hard, really, just tedious.
Humans study for tests. They just tend to forget.
Agreed, it also leads in performance on ARC-AGI-1. Here's the leaderboard, where you can toggle between ARC-AGI-1 and 2: https://arcprize.org/leaderboard
It leads on arc-agi-1 with Gemini 3.0 Deep Think, which uses "tool calls" according to google's post, whereas regular Gemini 3.0 Pro doesn't use "tool calls" for the same benchmark. I am unsure how significant this difference is.
This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3
There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.
ARC-AGI has a hidden private test suite, right? No model will have access to that set.
I doubt they have offline access to the model, i.e. the prompts are sent to the model provider.
Even if the prompts are technically leaked to the provider, how would they be identified as something worth optimizing for out of the millions of other prompts received?
That looks great, but what we all care about is how it translates to real-world problems like programming, where it isn't really excelling by 2x.
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works, but that I would never want to have to interact with.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
I have the same experience with Gemini, that it’s incredibly accurate but puts in defensive code and error handling to a fault. It’s pretty easy to just tell it “go easy on the defensive code” / “give me the punchy version” and it cleans it up
I have had a similar experience vibe coding with Copilot (ChatGPT) in VSCode, against the Gemini API. I wanted to create a dad joke generator and then have it also create a comic styled 4 cel interpretation of the joke. Simple, right? I was able to easily get it to create the joke, but it repeatedly failed on the API call for the image generation. What started as perhaps 100 lines of total code in two files ended up being about 1500 LOC with an enormous built-in self-testing mechanism ... and it still didn't work.
I can relate to this, it's doing exactly what I want, but it ain't pretty.
It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself
A nice Easter egg in the Gemini 3 docs [1]:
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...
It's an artifact of the problem that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. It sucks from a GDPR compliance perspective as well as in terms of transparent pricing: you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high, but if the model decides to think longer, "low" could result in more tokens used than "high" does for a prompt where the model decides not to think much. "Thinking budgets" are now "legacy", and thus while you can constrain output length you cannot constrain cost.
Obviously you also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. Obviously you can try different variations of different parts of your system prompt or tool descriptions, but just because they result in fewer thinking tokens does not mean they are better, if those reasoning steps were actually beneficial (if only in edge cases). This would be immediately apparent upon inspection but is hard/impossible to find out without access to the full chain of thought.
For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were A. to prevent rapid distillation, which they suspected DeepSeek of having used for R1, and B. to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater and will, for certain applications, certainly outweigh the performance delta to Google/OpenAI.
I was under the impression that those reasoning outputs that you get back aren't references but simply raw CoT strings that are encrypted.
I just gave it a short description of a small game I had an idea for. It was 7 sentences. It pretty much nailed a working prototype, using React, clean CSS, TypeScript and state management. It even implemented a Gemini query using the API for strategic analysis given a game state. I'm more than impressed, I'm terrified. Seriously thinking of a career change.
To what?
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.
Anecdotal, but I've also not experienced any regression in Gemini quality, whereas Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have.
This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet.
I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot so I do not even compare to it).
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks but wasn't able to convince it to first present a plan before starting implementation.
I have instructions describing that it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code has.
For Codex, those instructions make plan mode redundant.
just say "don't code yet" at the end. I never use plan mode because plan mode is just a prompt anyways.
I've been working with it, and so far it's been very impressive. Better than Opus in my feels, but I have to test more, it's super early days
What I usually test is trying to get them to build a full, scalable SaaS application from scratch... It seemed very impressive in how it did the early code organization using Antigravity, but then at some point it suddenly started getting stuck and constantly stopped producing, and I had to trigger it to continue, or babysit it. I don't know if I could have been doing something better, but that was just my experience. It seemed impressive at first, but otherwise, at least vs Antigravity, Codex and Claude Code scale more reliably.
Just an early anecdote from trying to build that one SaaS application, though.
API pricing is up to $2/M for input and $12/M for output
For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output; Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.
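Back-of-the-envelope comparison at those listed rates, for a hypothetical task with 100k input and 20k output tokens (the workload is my assumption):

```python
# Rough per-task cost in USD at the listed rates (per million tokens).
rates = {
    "gemini-3-pro":   (2.00, 12.00),
    "gemini-2.5-pro": (1.25, 10.00),
    "sonnet-4.5":     (3.00, 15.00),
}
in_tok, out_tok = 100_000, 20_000
for model, (cin, cout) in rates.items():
    cost = in_tok / 1e6 * cin + out_tok / 1e6 * cout
    print(f"{model}: ${cost:.3f}")
# gemini-3-pro: $0.440, gemini-2.5-pro: $0.325, sonnet-4.5: $0.600
```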
It is so impressive that Anthropic has been able to maintain this pricing still.
Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.
Honestly, Google models have this mix of smart/dumb that is scary. Like, if the universe gets turned into paperclips, it'll probably be a Google model that does it.
Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.
Idk Anthropic has the least consistent models out there imho.
Because every time I try to move away I realize there’s nothing equivalent to move to.
People insist upon Codex, but it takes ages and has an absolutely hideous lack of taste.
Taste in what?
Wines!
It's interesting that grounding with search cost changed from
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
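Whether that's cheaper depends on how many searches a grounded prompt triggers; a quick sketch of the break-even point (searches-per-prompt figures are hypothetical):

```python
# Old: $35 per 1,000 grounded prompts. New: $14 per 1,000 search queries.
# Break-even: a grounded prompt costs the same once it averages 35/14 = 2.5 searches.
old_per_prompt = 35 / 1000
new_per_search = 14 / 1000
for searches_per_prompt in (1, 2, 2.5, 4):
    new_per_prompt = searches_per_prompt * new_per_search
    verdict = "cheaper" if new_per_prompt < old_per_prompt else "not cheaper"
    print(f"{searches_per_prompt} searches/prompt: ${new_per_prompt:.4f} ({verdict})")
```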
With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.
There's a waitlist for using Gemini 3 for Gemini CLI free users: https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
In case anyone wants to confirm if this link is official, it is.
https://goo.gle/enable-preview-features
-> https://github.com/google-gemini/gemini-cli/blob/release/v0....
--> https://goo.gle/geminicli-waitlist-signup
---> https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
Thrilled to see the cost is competitive with Anthropic.
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren’t necessarily those with the best models, but those who already control the surface where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
AI Overviews has arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.
> The winners aren’t necessarily those with the best models
Is there evidence that's true? That the other models are significantly better than the ones you named?
> Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads: https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...
Gemini genuinely has an edge over the others in its super-long context size, though. There are some tasks where this is the deal breaker, and others where you can get by with a smaller size, but the results just aren't as good.
Microsoft hasn't been very quiet about it, at least in my experience. Every time I boot up Windows I get some kind of blurb about an AI feature.
Man, remember the days where we'd lose our minds at our operating systems doing stuff like that?
The people who lost their minds jumped ship. And I'm not going to work at a company that makes me use it, either. So, not my problem.
I've been so happy to see Google wake up.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life. Particularly in their P/E, which was unjustly low for a while.
Ironically, OpenAI was conceived as a way to balance Google's dominance in AI.
Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. At the time of its founding Google was the undisputed leader of AI anywhere in the world. Musk was then very worried about AGI being developed behind closed doors particularly Google, which was why he was the driving force behind the founding of OpenAI.
The book Empire of AI describes him as being particularly fixated on Demis as some kind of evil genius. From the book, early OAI employees couldn’t take the entire thing too seriously and just focused on the work.
I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist.
That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” was search and RL which Google and deep mind were dominating, and self driving, which Waymo was leading. And OpenAI was conceptualized as a research org to compete. A lot has changed and OpenAI has been good at seeing around those corners.
That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future.
I think that Google didn't see the business case in that generation of models, and also saw significant safety concerns. If AI had been delayed by... 5 years... would the world really be a worse place?
Yes - less exciting! But worse?
Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit.
Pffft. OpenAI was conceived to be Open, too.
It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold then become progressively less open once they get bigger. Android is a great example.
Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently?
I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said.
There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open.
For example, by moving features that were in the AOSP into their proprietary Play Services instead [1].
Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2].
In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience.
[1] https://arstechnica.com/gadgets/2018/07/googles-iron-grip-on...
[2] https://arstechnica.com/gadgets/2025/08/google-will-block-si...
Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source.
The Android Open Source Project is not Android.
> The Android Open Source Project is not Android.
Was "Android" the way you define it ever open? Isnt it similar to chromium vs chrome? chromium is the core, and chrome is the product built on top of it - which is what allows Comet, Atlas, Brave to be built on.
That's the same thing what GrapheneOS, /e/ OS and others are doing - building on top of AOSP.
They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Many of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.
This is an understandable, but simplistic, way of looking at the world. Are you also going to blame Apple for mining rare earths, because they made a successful product that requires exotic materials which need to be mined from the earth? How about the hundreds of thousands of factory workers who are subjected to inhumane conditions to assemble iPhones each year?
For every "OMG, the internet is filled with ads", people conveniently forget the real-world impact of ALL COMPANIES (and not just Apple), btw. You should be upset with the system, not selectively at Google.
> How about hundreds of thousands of factory workers that are being subjected to inhumane conditions to assemble iPhones each year?
That would be bad if it happened, which is why it doesn't happen. Working in a factory isn't an inhumane condition.
I don't think your comment justifies calling out any form of simplistic view. It doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it, most of which does not serve humankind.
Compared to what?
It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system.
Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
I also hold Americans and Western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods.
If we want to have things like minimum wage and workers rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out - the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race to the bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast asia, have it refined and manufactured into a car, and then shipped back into the US, than to simply have it mined, refined, and manufactured locally.
The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power that tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred dollar bill. We need term limits, no more corporation people, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover.
And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and lack of any viable alternatives by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition.
I haven't and won't use Google AI for anything, ever, because of any of the big labs, they are most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants.
If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric.
The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out.
> laundering cobalt from illegal "artisanal" mines in the DRC
They don't, all cobalt in Apple products is recycled.
> and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all.
They don't, Apple audits their entire supply chain so it wouldn't hide anything if something moved to another subcontractor.
One can claim 100% recycled cobalt under the mass balance system even if recycled and non-recycled cobalt were mixed, as long as the total amount used in production is less than or equal to the recycled cobalt purchased on the books. At least here [0] they claim their recycled cobalt references are under the mass balance system.
0. https://www.apple.com/newsroom/2023/04/apple-will-use-100-pe...
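A toy illustration of what mass-balance accounting permits (the numbers are made up):

```python
# The claim is checked against purchases on the books,
# not against which physical cobalt actually went into the products.
recycled_purchased_kg = 1_000     # recycled cobalt bought on the books
total_used_in_products_kg = 900   # cobalt actually used in production (mixed sources)

claim_ok = total_used_in_products_kg <= recycled_purchased_kg
print("Can claim 100% recycled:", claim_ok)  # True, even if the physical mix was partly non-recycled
```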
Where is the fairy godmother's magic wand that will allow you to make all the governments of the world instantly agree to all of this?
People love getting their content for free and that's what Google does.
Even 25 years ago people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want, YouTube will be responsible for promoting it, they'll serve it to however many billions of users want to view it, and they'll pay you 55% of the revenue it makes?
Yep, it's hard to believe it exists for free and with not a lot of ads when you have a good ad blocker... though the content creators' ads are inescapable, which I think is OK since they're making a little money in exchange for what, your little inconvenience for a minute or so (if you're not skipping the ad, which you aren't, right??), after which you can watch some really good content. The history channels on YT are amazing, maybe world-changing: they get people to learn history and actually enjoy it. Same with some math channels like 3Blue1Brown, which are just outstanding, and many more.
> People love getting their content for free and that's what Google does.
They are forcing a payment method on us. It's basically like they have their hand in our pockets.
Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber - they all play in two-sided markets exploiting both its users and providers at the same time.
They're not a moral entity. Corporations aren't people.
I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power.
If I were a google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad-market and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate things, the same crowd that decries this oppose government overreach.
Voters are people and they are moral entities, direct any moral outrage at us.
Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)?
It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
> Why should the collective of voters..
They're accountable as individuals not as a collective. And it so happens, they are responsible for their government in a democracy but corporations aren't responsible for running countries.
> It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment.
In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees, they have no responsibility to the general public beyond that which is laid out in the law.
The increasing demand in corporations being part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice the individuals have.
You're trying to live in a feudal society when you treat corporations like this.
If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters.
And then people have the nerve to complain about PACs, money in politics, billionaires influencing the government, bribery, etc. You can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't.
When we criticize corporations, we really are criticizing the people who make the decisions in the corporations. I don’t see why we shouldn’t apply exactly the same moral standards to people’s decision in the context of a corporation as we do to people’s decisions made in any other context. You talk about lawfulness, but we wouldn’t talk about morals if we meant lawfulness. It’s also lawful to vote for the hyper-capitalist party, so by the same token moral outrage shouldn’t be directed towards the voters.
I get that, but those CEOs are not elected officials; they don't represent us and have no part in the discourse of law-making (despite the state of things). In their capacity as executives of a company, they have no rights, no say in what we find acceptable or not in society. We tell them what they can and cannot do, or else. That's the social contract we have with companies and their executives.
Being in charge of a corporation shouldn't elevate someone to a platform where they have a louder voice than the common man. They can vote just as equally as others at the voting booth. they can participate in their capacity as individuals in politics. But neither money, nor corporate influence have places in the governance of a democratic society.
I talk about lawfulness because that is the only rule of law a corporation can and should be expected to follow. Morals are for individuals. Corporations have no morals; they are neither moral nor immoral. Their owners have morals, and you can criticize their greed, but that is a construct of capitalism. They're supposed to enrich themselves. You can criticize them for valuing money over morals, but that's like criticizing the ocean for being wet or the sun for being too hot. It's what they do. It's their role in society.
If a small business owner raises prices to increase revenue, that isn't immoral, right? Even though poor people who frequent them will be adversely affected? Amp that up to the scale of a megacorp, and the morality is still the same.
Corporations are entities that exist for the sole purpose of generating revenue for their owners. So when you criticize Google, you're criticizing a logical organization designed to do the thing you're criticizing it of doing. The CEO of google is acting in his official capacity, doing the job they were hired to do when they are resisting adblocking. The investors of Google are risking their money in anticipation of ROI, so their expectation from Google is valid as well.
When you find something to be immoral, the only meaningful avenue of expressing that with corporations is the law. You're criticizing Google as if it were an elected official we could vote in or out of office, or as if it were an entity that can be convinced of its moral failings.
When we don't speak up and use our voice, we lose it.
Because of the inherent capitalist structure that leads to the inevitable: the tragedy of the commons.
Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't.
I could have, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly; it wasn't praise of their moral character. They did what was best for their business, and that turned out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit.
> They've poisoned the internet
And what of the people that ravenously support ads and ad-supported content, instead of paying?
What of the consumptive public? Are they not responsible for their choices?
I do not consume algorithmic content, I do not have any social media (unless you count HN for either).
You can't have it both ways. Lead by example, stop using the poison and find friends that aren't addicted. Build an offline community.
I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways".
Also, HN is by definition algorithmic content and social media, in your mind what do you think it is?
You are not a "victim" for using or purchasing something which is completely unnecessary. Or if that's the case, then you have no agency and have to be medicinally declared unfit to govern yourself and be appointed a legal guardian to control your affairs.
What kind of world do you live in? Actually, Google ads tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, vs. the pure junk ads that aren't personalized and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for the NYT, Washington Post, The Information, etc. -- virtually any high-quality website (including Search).
Ads. Beneficial to the user.
Most of the time, you need to pick one. Modern advertising is not based on finding the item with the most utility for the user - which means they are aimed at manipulating the user's behaviour in one way or another.
From suppressed wages to colluding with Apple not to poach.
Outlook is much better than Gmail and so is the office suite.
It's good there's competition in the space though.
Outlook is not better in ways that email or gmail users necessarily care about, and in my experience gets in the way more than it helps with productivity or anything it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter.
I couldn't disagree more
> Drive vs Word
You mean Drive vs OneDrive or, maybe Docs vs Word?
Workspace vs Office
Surely they meant Writely vs Word
- Making money vs general computing
For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation."
>most of those examples are acquisitions
Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition.
But there's also plenty that fail, it's just that you won't know about those.
I don't think what you're saying proves that the companies that were acquired couldn't have done that themselves.
If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up".
All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated if not enshittified- remember when google told us not to ever worry about deleting anything?- and then started backing up my photos without me asking and are now constantly nagging me to pay them a monthly fee?
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
Forgot to mention absolutely milking every ounce of their users attention with Youtube, plus forcing Shorts!
Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones.
But I hear you say - you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube as well. Many waste time on YouTube, but many learn and do productive things.
Don't paint everything with a single, large, coarse brush stroke.
Frankly, when compared against TikTok, Insta, etc., YouTube is a force for good. Just script the Shorts away...
Something about bringing balance to the force not destroying it.
Google has always been there; it's just that many didn't realize DeepMind even existed, and I said years ago that it needed to be put to commercial use. [0] And Google AI != DeepMind.
You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.
[0] https://news.ycombinator.com/item?id=34713073
Google is using the typical monopoly playbook, like most other large orgs, and the world would be a "better place" if they were kept in check.
But at least this company is not run by a narcissistic sociopath.
Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself.
I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine). Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions). So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is much more nuanced than in their older model, but it still makes mistakes that the other two SOTA competitor models don't make. For example, in a law benchmark it forgets that certain principles don't apply in the country the provided case comes from. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the cultural context assumed by the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new, nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
> It seems very US centric in its thinking
I'm not surprised. I'm French, and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% American thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on US-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website.
Supposedly this is the model card. Very impressive results.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
Every time I see a table like this numbers go up. Can someone explain what this actually means? Is there just an improvement that some tests are solved in a better way or is this a breakthrough and this model can do something that all others can not?
This is a list of questions and answers that was created by different people.
The questions AND the answers are public.
If the LLM manages through reasoning OR memory to repeat back the answer then they win.
The scores represent the % of correct answers they recalled.
That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
I estimate another 7 months before models start getting 115% on Humanity's Last Exam.
If you believe another thread the benchmarks are comparing Gemini-3 (probably thinking) to GPT-5.1 without thinking.
The person also claims that with thinking on the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
It's easy to verify that the numbers are for GPT-5.1 thinking (high).
Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard
DeepMind page: https://deepmind.google/models/gemini/
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/).
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
From an initial testing of my personal benchmark it works better than Gemini 2.5 pro.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...
Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot.
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047
The default FPS it's analyzing video at is 1, and I'm not sure the max is anywhere near enough to catch a full speed tennis serve.
Ah, I should have mentioned it was a slow motion video.
> The default FPS it's analyzing video at is 1
Source?
https://ai.google.dev/gemini-api/docs/video-understanding#cu...
"By default 1 frame per second (FPS) is sampled from the video."
OK, I just used https://gemini.google.com/app, I wonder if it's the same there.
I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff.
Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png
2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
Great improvement by only adding one feedback prompt: Change the rotation axis of the wheels by 90 degrees in the horizontal plane. Same for the legs and arms
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?
I hadn't! It looks like that is there to power the text box at the bottom of the app that allows for AI-powered changes to the scene.
This says Gemini 2.5 though.
Good observation. The app was created with Gemini 3 Pro Preview, but the app calls out to Gemini 2.5 if you use the embedded prompt box.
Incredible. Thanks for sharing.
Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're actually going for. What does a good pelican-riding-a-bicycle SVG actually look like?
IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?
At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.
I think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.
It would be next to impossible for anyone without insider knowledge to prove that to be the case.
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
It’s a good pelican. Not great but good.
The blue lines indicating wind really sell it.
How long does it typically take after this to become available on https://gemini.google.com/app ?
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.
On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro
> https://gemini.google.com/app
How come I can't even see prices without logging in... are they doing regional pricing?
Today, I guess. They were not releasing the preview models this time, and it seems they want to synchronize the release.
It's available in cursor. Should be there pretty soon as well.
Are you sure it's available in Cursor? (I get: We're having trouble connecting to the model provider. This might be temporary - please try again in a moment.)
It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...
Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort.
[1] https://lmarena.ai/leaderboard/text
Is it just me or is that link broken because of the cloudflare outage?
Edit: nvm it looks to be up for me again
Grok is heavily censored though
> The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
> Gemini app surpasses 650 million users per month
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than being an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else.
As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app.
I constantly, accidentally hit some button and Gemini opens up on my Samsung Galaxy. I haven't bothered to figure this out.
I don't know for sure, but they have to be counting users like me, whose phone had Gemini force-installed in an update and who has only opened the app by accident while trying to figure out how to invoke the old, actually useful Assistant app.
> it's still a choice that users are making to go there rather than being an existing product they happen to user otherwise.
Yes and no, my power button got remapped to opening Gemini in an update...
I removed that but I can imagine that your average user doesn't.
This is the benefit of bundling. I've been forecasting this for a long time: the only companies who would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI due to sheer marketing dominance.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
It says Gemini App, not AI Overviews, AI Mode, etc
They claim AI overviews as having "2 billion users" in the sentences prior. They are clearly trying as hard as possible to show the "best" numbers.
> They are clearly trying as hard as possible to show the "best" numbers.
This isn't a hot take at all. Marketing (iPhone keynotes, product launches) is about showing impressive numbers. It isn't the gotcha you think it is.
Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
Gemini app != Google search.
You're implying they're lying?
And you're implying they're being 100% truthful?
Marketing is always somewhere in the middle
Companies can't get away with egregious marketing. See the Apple class-action lawsuit over Apple Intelligence.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
It's available to be selected, but the quota does not seem to have been enabled just yet.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
Looks to be available in Vertex.
I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.
For me it's up and running. I was doing some work with AI Studio when it was released and have already rerun a few prompts. Interesting also that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more.
I hope some users will switch from cerebras to free up those resources
Works for me.
seeing the same issue.
You can bring your own Google API key to try it out, and Google used to give $300 free when signing up for billing and creating a key.
When I signed up for billing via the Cloud console and entered my credit card, I got $300 in "free credits".
I haven't thrown a difficult problem at Gemini 3 Pro yet, but I'm sure I got to see it in some of the A/B tests in AI Studio for a while. I could not tell which model was clearly better; one was always more succinct and I liked its "style", but they usually offered about the same solution.
I've been playing with the Gemini CLI w/ the gemini-pro-3 preview. First impressions are that its still not really ready for prime time within existing complex code bases. It does not follow instructions.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, beyond how a typical test would (i.e., giving any answer at all beats saying "I don't know", because it at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful than one that gives me an accurate answer 10% of the time and tells me "I don't know" the other 90%.
Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take as an improvement:
- 80% right / 18% I don't know / 2% wrong
- 50% / 48% / 2%
- 10% / 90% / 0%
- 80% / 15% / 5%
The general point being that to reduce wrong answers you will need to accept some reduction in right answers if you want the change to only be made through trade-offs. Otherwise you just say "I'd like a better system" and that is rather obvious.
Personally I'd take like 70/27/3. Presuming the 70% of right answers aren't all the trivial questions.
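One quick way to make this trade-off concrete is to score wrong answers with an explicit penalty. A minimal sketch; the penalty values are arbitrary and the three accuracy profiles are just the ones floated in the comments above:

```python
def penalized_score(right, idk, wrong, wrong_penalty=1.0):
    """Correct answers score +1, 'I don't know' scores 0,
    confidently wrong answers score -wrong_penalty."""
    total = right + idk + wrong
    return (right - wrong_penalty * wrong) / total

profiles = {"90/0/10": (90, 0, 10), "80/18/2": (80, 18, 2), "70/27/3": (70, 27, 3)}
for penalty in (1.0, 3.0):
    scores = {name: round(penalized_score(*p, wrong_penalty=penalty), 2)
              for name, p in profiles.items()}
    print(f"penalty={penalty}: {scores}")
# penalty=1.0 favours 90/0/10 (0.80 vs 0.78 vs 0.67);
# penalty=3.0 favours 80/18/2 (0.60 vs 0.74 vs 0.61).
```

Which profile wins depends entirely on how costly a confident wrong answer is in your domain, which is what the disagreement above comes down to.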
https://artificialanalysis.ai/evaluations/omniscience
OpenAI uses SimpleQA to assess hallucinations
I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising. As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because many things are heavily censored. Obviously, a huge company like Google is under much heavier regulation than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI.
What we have all been waiting for:
"Create me a SVG of a pelican riding on a bicycle"
https://www.svgviewer.dev/s/FfhmhTK1
That is pretty impressive.
So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.
Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.
The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.
Like replacing named concepts with nonsense words in reasoning benchmarks.
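A minimal sketch of that contamination check, assuming nothing beyond the standard library; the nonsense words and the sample question are invented for illustration:

```python
import re

NONSENSE = ["blorf", "quintar", "mazzle", "drindle"]

def swap_concepts(question, named_concepts):
    """Replace named concepts with nonsense words so a model that merely
    memorized the original benchmark item can no longer pattern-match it,
    while the underlying reasoning structure stays intact."""
    mapping = {name: NONSENSE[i % len(NONSENSE)] for i, name in enumerate(named_concepts)}
    for name, nonsense in mapping.items():
        question = re.sub(rf"\b{re.escape(name)}\b", nonsense, question)
    return question

q = ("A pelican carries more fish than a heron. "
     "A heron carries more fish than a gull. Who carries the most fish?")
print(swap_concepts(q, ["pelican", "heron", "gull"]))
# -> "A blorf carries more fish than a quintar. A quintar carries more fish
#     than a mazzle. Who carries the most fish?"
```

If accuracy drops sharply on the swapped version, memorization rather than reasoning is the likelier explanation.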
https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari
I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises.
It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.
Very aero
Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump.
Maybe they are keeping that itself secret, but more likely they simply had humans generate an enormous number of examples and then built on that synthetically.
No benchmark is safe, when this much money is on the line.
Here's some insight from Jeff Dean and Noam Shazeer's interview with Dwarkesh Patel https://youtu.be/v0gjI__RyCY&t=7390
> When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that?
> Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision.
I'm sure each of the frontier labs have some secret methods, especially in training the models and the engineering of optimizing inference. That said, I don't think them saying they'd keep a big breakthrough secret would be evidence in this case of a "secret sauce" on ARC-AGI-2.
If they had found something fundamentally new, I doubt they would've snuck it into Gemini 3. Probably would cook on it longer and release something truly mindblowing. Or, you know, just take over the world with their new omniscient ASI :)
I'd also be curious what kind of tools they are providing to get the jump from Pro to Deep Think (with tools) performance. ARC-AGI specialized tools?
They ran the tests themselves only on semi-private evals. Basically the same caveat as when o3 supposedly beat ARC1
Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro:
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one.
I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker.
If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure.
Gemini 3 is crushing my personal evals for research purposes.
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far, in which case I will live without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
I would personally settle for a web app that isn't slow. The difference in speed (latency, lag) between ChatGPT's fast web app and Gemini's slow web app is significant. AI Studio is slightly better than Gemini, but try pasting in 80k tokens and then typing some additional text and see what happens.
Genuinely curious here: why is the desktop app so important?
I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA?
Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks.
I have a few reasons for the preference:
(1) The ability to add context via a local app's integration with OS-level resources is big. With Claude, e.g., I hit Option-SPC, which brings up a prompt bar. From there, taking a screenshot that will get sent with my prompt is as simple as dragging a bounding box. This is great. Beyond that, I can add my own MCP connectors and give my desktop app direct access to relevant context in a way that doesn't work via a web UI. It may also be inconvenient to give context to a web UI in some cases where, e.g., I have a folder of PDFs I want it to be able to reference.
(2) Its own icon that I can CMD-TAB to is so much nicer. Maybe that works with a PWA? Not really sure.
(3) Even if I can't use an LLM when offline, having access to my chats for context has been repeatedly valuable to me.
I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
> The ability to add context via a local apps integration into OS level resources is big
Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
> Its own icon that I can CMD-TAB to is so much nicer
Fair enough. I personally prefer Firefox's tab organization to my OS's window organization, but I can see how separating the LLM into its own window would be helpful.
> having access to my chats for context has been repeatedly valuable to me.
I didn't at all consider this. Point ceded.
> I haven't looked at provider-agnostic apps and, TBH, would be wary of them.
Interesting. Why? Is it security? The ones I've listed are open source and auditable. I'm confident that they won't steal my API keys. Msty has a lot of advanced functionality that I haven't seen in other interfaces like allowing you to compare responses between different LLMs, export the entire conversation to Markdown, and edit the LLM's response to manage context. It also sidesteps the problem of '[provider] doesn't have a desktop app' because you can use any provider API.
> Good point. I can see why integrated support for local filesystem tools would be useful, even though I prefer manually uploading specific files to avoid polluting the context with irrelevant info.
Access to OS level resources != context pollution. You still have control, just more direct and less manual.
> The ones I've listed are open source and auditable.
Yeah I don't plan on spending who knows how much time auditing some major app's code (lol) before giving it my API keys and access to my chats. Unless there's a critical mass of people I know and trust using something like that it's not going to happen for me.
But also, I tried quickly looking up Msty to see if it is open source and what its adoption looked like and AFAICT it's not open source. Asked Gemini 3 if it was and it also said no. Frankly that makes it a very hard no for me. If you are using it because you think it's Open Source I suggest you stop.
> If you are using it because you think it's Open Source I suggest you stop.
I did not know that. Thank you very much for the correction. I guess I have some keys to revoke now.
Out of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models.
I would love to see how Gemini 3 can solve this particular problem. https://lig-membres.imag.fr/benyelloul/uherbert/index.html
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
Tested it on a bug that Claude and ChatGPT Pro struggled with, it nailed it, but only solved it partially (it was about matching data using a bipartite graph). Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues. For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better. For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude.
> it nailed it, but only solved it partially
Hey either it nailed it or it didn't.
Probably figured out the exact cause of the bug but not how to solve it
Yes; they nailed the root cause, but the implementation is not 100% correct.
The Gemini AI Studio app builder (https://aistudio.google.com/apps) refuses to generate python files. I asked it for a website, frontend and python back end, and it only gave a front end. I asked again for a python backend and it just gives repeated server errors trying to write the python files. Pretty shit experience.
As soon as I found out that this model launched, I tried giving it a problem that I have been trying to code in Lean4 (showing that quicksort preserves multiplicity). All the other frontier models I tried failed.
I used the Pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
Hit the Gemini 3 quota on the second prompt in antigravity even though I'm a pro user. I highly doubt I hit a context window based on my prompt. Hopefully, it is just first day of near general availability jitters.
Wow so the polymarket insider bet was true then..
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed.
It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly).
You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected.
Disallowing people from making terrible trades seems…paternalistic? Idk
You don't get it. Allowing insiders to trade disincentivizes normal people from putting money in. Why else is it not allowed in the stock market?
Why should normal people be incentivized to make trades on things they probably haven’t got the slightest idea about
The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets.
But what's the point of predicting how many times Elon will say "Trump" on an earnings call (or some random event Kalshi or Polymarket make up)? At least the stock market serves a purpose. People will claim "prediction markets are great for price discovery!" Ok. I'm so glad we found out the chance of Nicki Minaj saying "Bible" during some recent remarks. In case you were wondering, the chance peaked at around 45% and she did not say 'bible'! She passed up a great opportunity to buy the "yes" and make a ton of money!
https://kalshi.com/markets/kxminajmention/nicki-minaj/kxmina...
I agree that the "will [person] say [word]" markets are stupid. "Will Brian Armstrong say the word 'Bitcoin' in the Q4 earnings call" is a stupid market because nobody actually cares whether or not he says 'Bitcoin'; they care about whether or not Coinbase is focusing on Bitcoin. If Armstrong manipulates the market by saying the words without actually doing anything, nobody wins except Armstrong. "Will Coinbase process $10B in Bitcoin transactions in Q4" is a much better market because, though Armstrong could still manipulate the market's outcome, his manipulation would influence a result that people actually care about. The existence of stupid markets doesn't invalidate the concept.
That argument works for insider trading too.
And? Insider trading is bad because it's unfair, and the stock market is supposed to be fair. Prediction markets are not fair. If you are looking for a fair market, prediction markets are not that. Insider trading is accepted and encouraged in prediction markets because it makes the predictions more accurate, which is the entire point.
The stock market isn't supposed to be fair.
By 'fair', I mean 'all parties have access to the same information'. The stock market is supposed to give everyone the same information. Trading with privileged information (insider trading), is illegal. Publicly traded companies are required to file 10-Qs and 10-Ks. SEC rule 10b5-1 prohibits trading with material non-public information. There are measures and regulations in place to try to make the stock market fair. There are, by design, zero such measures with prediction markets. Insider trading improves the accuracy of prediction markets, which is their whole purpose to begin with.
> people need to realize there are real people on the other side of these bets
None of whom were forced by anyone to place bets in the first place.
>Brian Armstong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call.
For the kind of person playing these sorts of games, that's actually really "hype".
I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released.
The mention markets are pure degenerate gambling and everyone involved knows that
Correct, and this is actually how all markets work in the sense that they allow for price discovery :)
Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all)
In hindsight, one possible reason to bet on November 18 was the deprecation date of older models: https://www.reddit.com/r/singularity/comments/1oom1lq/google...
Is the "thinking" dropdown option on gemini.google.com what the blog post refers to as Deep Think?
Can't wait to test it out. I've been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed:
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
And of course they hiked the API prices
Standard Context(≤ 200K tokens)
Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)
Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)
Long Context(> 200K tokens)
Input $4.00 vs $2.50 (same +60%)
Output $18.00 vs $15.00 (same +20%)
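For a rough sense of what that means per request, a back-of-the-envelope sketch; the 100k-input / 10k-output workload is a made-up example, and the prices per 1M tokens are the standard-context figures above:

```python
def cost_usd(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    # Prices are quoted per 1M tokens
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# Hypothetical <=200K-context request: 100k input tokens, 10k output tokens
g25 = cost_usd(100_000, 10_000, 1.25, 10.00)   # Gemini 2.5 Pro
g3  = cost_usd(100_000, 10_000, 2.00, 12.00)   # Gemini 3 Pro
print(f"2.5 Pro: ${g25:.3f}, 3 Pro: ${g3:.3f}, increase: {g3 / g25 - 1:.0%}")
# -> 2.5 Pro: $0.225, 3 Pro: $0.320, increase: 42%
```

The blended increase lands between the +20% and +60% headline numbers and shifts with the input/output mix of the workload.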
Claude Opus is $15 input, $75 output.
If the model solves your needs in fewer prompts, it costs less.
Is it the first time long context has separate pricing? I hadn’t encountered that yet
Anthropic is also doing this for long context >= 200k Tokens on Sonnet 4.5
Google has been doing that for a while.
Google has always done this.
Ok wow, then I've always overlooked that.
Make a pelican riding a bicycle in 3d: https://gemini.google.com/share/def18e3daa39
Amazing and hilarious
Similar hilarious results (one shot): https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...
I just googled latest LLM models and this page appears at the top. It looks like Gemini Pro 3 can score 102% in high school math tests.
I would like to try controlling my browser with this model. Any ideas how to do this. Ideally I would like something like openAI's atlas or perplexity's comet but powered by gemini 3.
Gemini CLI can also control a browser: https://github.com/ChromeDevTools/chrome-devtools-mcp
Seems like their new Antigravity IDE specifically has this built in. https://antigravity.google/docs/browser
Wow, that is awesome.
"AI Overviews now have 2 billion users every month."
"Users"? Or people that get presented with it and ignore it?
They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that.
Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not).
I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them.
You're right, I'm probably being a little uncharitable!
Normal users (i.e. not grumpy techies ;) ) probably just go with the flow rather than finding it irritating.
"Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month."
Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well.
> it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month
Do regular users know how to disable AI Overviews, if they don't love them?
it's as low tech as using adblock - select element and block
Probably invested a couple of billion into this release (it is great as far as I can tell), but they can't bring a proper UI to AI Studio for long prompts and responses (e.g. it animates new text being generated even when you return to a tab that already finished generating).
entity.ts is in types/entity.ts. It can't grasp that it should import it as "../types/entity"; instead it always writes "../types". I am using https://aistudio.google.com/apps
When will this be available in the cli?
Gemini CLI team member here. We'll start rolling out today.
How about for Pro (not Ultra) subscribers?
This is the heroic move everyone is waiting for. Do you know how this will be priced?
I'm already seeing it in https://aistudio.google.com/
Every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not a part of the training set used for these models? It could easily have been trained to memorize the answers. Even if the datasets haven't been copy pasted directly, I'm sure it has leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Even if the benchmark themselves are kept secret, the process to create them is not that difficult and anyone with a small team of engineers could make a replica in their own labs to train their models on.
Given the nature of how those models work, you don't need exact replicas.
Seems to be the first model that one-shots my secret benchmark about nested SQLite, and it did it in 30s.
https://www.youtube.com/watch?v=cUbGVH1r_1U
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
Don't self-promote without disclosure.
Really exciting results on paper. But truly interesting to see what data this has been trained on. There is a thin line between accuracy improvements and the data used from users. Hope the data used to train was obtained with consent from the creators
What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score.
It's available for me now on gemini.google.com... but it's failing badly at accurate audio transcription.
It transcribes the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
My only complaint is I wish the SWE and agentic coding results were better, to justify the 1-2x premium.
GPT-5.1 honestly looks very comfortable given the available usage limits and pricing,
although GPT-5.1 used from the ChatGPT website seems to be better for some reason.
Sonnet 4.5 agentic coding is still holding up well, which confirms my own experiences.
I guess my reaction to Gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month.
It also tops LMSYS leaderboard across all categories. However knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D.
Isn't it the same cutoff as 2.5?
Is there a way to use this without being in the whole google ecosystem? Just make a new account or something?
If you mean the "consumer ecosystem", then Gemini 3 should be available as an API through Google's AI Vertex platform. If you don't even want a Google Cloud account, then I think the answer is no unless they announce a partnership with an inference cloud like cerebras.
You could probably do a new account. I have the odd junk google account.
Somebody "two-shotted" Mario Bros NES in HTML:
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
https://www.youtube.com/watch?v=cUbGVH1r_1U
side by side comparison of gemini with other models
I still need a google account to use it and it always asks me for a phone verification, which I don't want to give to google. That prevents me from using Gemini. I would even pay for it.
> I would even pay for it.
Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number?
You never have to add your phone number in order to pay.
While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required.
Perhaps it's country-specific?
I've never been asked for a phone number. Maybe it's country-specific, no idea.
What I'd prefer over benchmarks is the answer to a simple question:
What useful thing can it demonstrably do that its predecessors couldn't?
Keep the bubble expanding for a few months longer.
Boring. Tried to explore sexuality related topics, but Alphabet is stuck in some Christianity Dark Ages.
Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.
Feeling great to see something confidential
- Anyone have any idea why it says 'confidential'?
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
Anyone has any idea if/when it’s coming to paid Perplexity?
Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement.
Suspicious that none of the benchmarks include Chinese models, even though they scored higher on these benchmarks than the models they are comparing to?
Here it makes a text based video editor that works:
https://youtu.be/MPjOQIQO8eQ?si=wcrCSLYx3LjeYDfi&t=797
Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist).
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
Reading the introductory passage, all I can say now is: AI is here to stay.
First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5
GOOGLE: "We have a new product".
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- VS Code (Microsoft)
- Gemini
- Chrome Browser
Interesting that they added an option to select your own API key right in AI Studio's input field. I sincerely hope the times of generous free AI Studio usage are not over.
Is it coming to Google Jules?
> Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month.
Come on, you can’t be serious.
This is so disingenuous that it hurts the credibility of the whole thing.
I asked Gemini to write "a comment response to this thread. I want to start an intense discussion".
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
Gotta hand it to Gemini, those are some top-notch points.
The "Model card leak" point is worth negative points though, as it's clearly a misreading of reality.
yeah hahahahah, it made me think!
> We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
That feels like something between a hallucination and an intentional fallacy that popped up because you specifically said "intense discussion". The increase is 60% on input tokens from the old model, but it's not a markup, and especially not "sold back to us at X markup".
I've seen more and more of these kinds of hallucinations; as these models seem to be RL'd to not be sycophants, they're slowly inching in the opposite direction, where they tell small fibs or embellish in a way that seems meant to add more weight to their answers.
I wonder if it's a form of reward hacking, since it trades being maximally accurate for being confident, and that might result in better rewards than being accurate and precise
I don't want to shit on the much-anticipated G3 model, but I have been using it for a complex single-page task and find it underwhelming. Pro 2.5 level, beneath GPT 5.1. Maybe it's launch jitters. It struggles to produce more than 700 lines of code in a single file (AI Studio). It struggles to follow instructions. Revisions omit previous gains. I feel cheated! 2.5 Pro was clearly smarter than everything else for a long time, but now 3 seems not even as good as that in comparison to the latest releases (5.1 etc.). What is going on?
It's disappointing there's no flash / lite version - this is where Google has excelled up to this point.
Maybe they're slow rolling the announcements to be in the news more
Most likely. And/or they use the full model to train the smaller ones somehow
The term of art is distillation
I'm not a mathematician but I think we underestimate how useful pure mathematics can be to tell whether we are approaching AGI.
Can the mathematicians here try asking it to invent new, novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
Terry Tao is writing about this on his blog.
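As a small illustration of the SymPy suggestion: a closed form the model claims can be checked symbolically before trusting its derivation. The sum-of-cubes identity below is just a stand-in example, not something from the thread:

```python
import sympy as sp

k, n = sp.symbols("k n", positive=True, integer=True)
# Suppose the model claims: 1^3 + 2^3 + ... + n^3 == (n(n+1)/2)^2
lhs = sp.summation(k**3, (k, 1, n))
rhs = (n * (n + 1) / 2) ** 2
print(sp.simplify(lhs - rhs) == 0)  # True: the claimed identity checks out
```

For genuinely novel claims a counterexample search or a formal proof is still needed, but symbolic spot checks catch a lot of confident nonsense cheaply.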
Can't wait til Gemini 4 is out!
Okay, since Gemini 3 is in AI Mode now, I switched from the free Perplexity back to Google as my search default.
It is live in the API:
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name.
"Gemini 3 Pro Preview" is in Vertex
Is there even a puzzle or math problem Gemini 3 can't solve?
Trained models should be able to use formal tools (for instance a logical solver, a computer?).
Good. That said, I wonder if those models are still LLMs.
It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh
2025: solve the biking pelican problem
2026: cure cancer
So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it
Glad to see Google still can't get out of its own way.
I continue not to use Gemini, as I can't opt out of training on my data while also keeping chat history.
Yes, I know the Workspaces workaround, but that’s silly.
Wild
When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.
Temperature continues to be gated to a maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.
I love Google AI Studio, but I also hate it for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions in a tool that is ostensibly aimed at prosumers...
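For reference, min_p filtering itself is easy to sketch. A minimal NumPy illustration of the technique, not anything Google exposes today, and the 0.1 threshold is arbitrary:

```python
import numpy as np

def min_p_filter(logits, min_p=0.1):
    """Keep only tokens whose probability is at least min_p times the
    probability of the most likely token, then renormalize."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    mask = probs >= min_p * probs.max()
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()

logits = np.array([2.0, 1.5, 0.2, -1.0, -3.0])
print(min_p_filter(logits).round(3))  # low-probability tail is zeroed out
```

The appeal over plain top_p is that the kept set shrinks automatically when the model is confident and widens when it is not.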
yea great.... when will I be able to have it dial a number on my google pixel? Seriously... Gemini absolutely sucks on pixel since it can't interact with the phone itself so it can't dial numbers.
How's the pelican?
Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?
Honestly I liked 2.5 Pro preview much more than the final version
Interestingly, Gemini 3 Pro Thinking clocks in at 2nd place - behind Horizon Alpha, but ahead of any actually-released model checkpoints - on the TvP Benchmark, with a score of 71.8%.
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
… agentic …
Meh, not interested already
The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag.
Finally!
I expect almost no one to read the Gemini 3 model card. But here is a damning excerpt from the version that leaked early [0]:
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmail messages are being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze users' data, which potentially includes Gmail by default [1].
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...
Isn't Gmail covered under the Workspace privacy policy, which forbids using it for training data? So I'm guessing that's excluded by the "in accordance" clause.
The real question is, "For how long?"
I'm very doubtful Gmail messages are used to train the model by default: emails contain private data, and as soon as that private data shows up in model output, Gmail is done.
"Gmail being read by Gemini" does NOT mean "Gemini is trained on your private Gmail correspondence." It can mean Gemini loads your emails into a session context so it can answer questions about your mail, which is quite different.
I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail.
That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code.
If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on.
By the year 2025, I think most HN regulars and IT people in general are so jaded about privacy that this doesn't even surprise anyone. I suspect all Gmail has been analyzed and read since the beginning of the Google age, so nothing has really changed; they might as well just admit it.
Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible.
This seems like a dubious conclusion. I think you missed this part:
> in accordance with Google’s relevant terms of service, privacy policy
It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure.
Also probably over for OpenAI
@simonw wen pelican
It's amazing to see Google take the lead while OpenAI worsens their product every release.
Valve could learn from Google here
Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this.
It seems Google didn't prepare the Gemini 3 release very well: a lot of content leaked, including the model card earlier today and Gemini 3 itself showing up on aistudio.google.com.
I asked it to summarize an article about the Zizians which mentions Yudkowsky SEVEN times. Gemini-3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story. https://xcancel.com/xundecidability/status/19908286970881311...
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...
He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy.
And yet you could eliminate him entirely and the story is still coherent.
The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington.
You're absolutely right! The AI said it, so it must be true!
At least read what you respond to... Imagine thinking Yudkowsky was NOT a central figure in the Zizians story.
You literally quoted the LLMs output verbatim as your proof.
Edit: And upon skimming the article at the points where Yudkowsky's name is mentioned, I 100% agree with stickfigure.
I challenge you to name one way in which the story falls apart without the mention of Yudkowsky.
It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let's just return to the discussion of Gemini 3: do you think the model did a bad job in its second response, then?
It literally does not matter how connected they are out here in reality; the AI was asked to summarize the information in the article, and that is exactly what it did.
>do you think the model did a bad job then in it's second response
Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you.
Why do you think I obviously wanted to hear that?
It's implicit in your prompt!
Also, that is the position you've been defending this whole thread. This whole conversation is happening because you believe Yudkowsky is an important figure in the story.
Here's another attempt:
llm --cid 01kabxtjq10exgk56yf802028f "I notice you did not mention Yudkowsky?" --no-log -m gem3-t1
Based on the text provided, Eliezer Yudkowsky is a central background figure in this story, serving as the intellectual progenitor of the movement from which the Zizians splintered.
Here is specifically how he features in the article:
* The Founder: Yudkowsky is identified as the founder of the Machine Intelligence Research Institute (MIRI) and the Center for Applied Rationality (CFAR), as well as the founder of the LessWrong forum. These are the institutions the Zizians originally belonged to and eventually declared war against.
* The Source Code: The Zizians based their radical worldview on Yudkowsky's concepts, specifically "timeless decision theory." However, they believed they were applying his logic more rigorously than he was. They became convinced that Yudkowsky's organizations had betrayed these principles (specifically regarding a rumor about MIRI paying blackmail to cover up a scandal), which they viewed as a moral failing that justified their rebellion.
Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky.
Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky.
Maybe it has guard rails against such things? That would be my main guess on the Zizian one.
It's joeover for OpenAI and Anthropic. I have been using it for 3 hours now for real work, and GPT-5.1 and Sonnet 4.5 (thinking) do not come close.
The token efficiency and context handling are also mindblowing...
It feels like I am talking to someone who can think, instead of a **rider that just agrees with everything you say and then fails at basic changes. GPT-5.1 feels particularly slow and weak in real-world applications larger than a few dozen files.
Gemini 2.5 felt really weak considering the amount of data and the proprietary TPU hardware that in theory gives them way more flexibility, but Gemini 3 just works, and it truly understands, which is something I didn't think I'd be saying for a couple more years.