Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.
Did you guys do anything about GPT's motivation? I tried to use the GPT-5.4 API for my OpenClaw after the Anthropic Oauthgate, but I just couldn't drag it to do its job. I had the most hilarious dialogues along the lines of "You stopped, X would have been next." - "Yeah, I'm sorry, I failed. I should have done X next." - "Well, how about you just do it?" - "Yep, I really should have done it now." - "Do X, right now, this is an instruction." - "I didn't. You're right, I have failed you. There's no apology for that."
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
Get the actual prompt and have Claude Code / Codex try it out via curl / python requests. The full prompt will yield debugging information. You have to set a few parameters to make sure you get the full gpt-5 performance, e.g. if your reasoning budget is too low, you get gpt-4 grade performance.
IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
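For anyone wanting to try that replay step, here is a minimal sketch that only builds the request body and an equivalent curl command, without sending anything. It assumes the Responses API endpoint and a nested `reasoning.effort` field; `build_request` and `as_curl` are made-up helper names, and the model id is a guess at how GPT-5.4 is addressed.

```python
import json
import shlex

# Assumption: the Responses API endpoint. Nothing is sent in this sketch.
RESPONSES_URL = "https://api.openai.com/v1/responses"


def build_request(prompt: str, model: str = "gpt-5.4", effort: str = "high") -> dict:
    """Build the JSON body for one call so it can be inspected and replayed."""
    return {
        "model": model,  # assumed model id
        "input": prompt,
        # Assumption: reasoning effort is passed as a nested object.
        "reasoning": {"effort": effort},
    }


def as_curl(body: dict) -> str:
    """Render the body as a curl command that reads the key from the environment."""
    return (
        f"curl -s {RESPONSES_URL} "
        '-H "Authorization: Bearer $OPENAI_API_KEY" '
        '-H "Content-Type: application/json" '
        f"-d {shlex.quote(json.dumps(body))}"
    )


if __name__ == "__main__":
    print(as_curl(build_request("Do X, then report what you changed.")))
```

Dumping the exact body your harness sends and diffing it against something like this is usually the quickest way to spot a missing or too-low effort setting.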
I've had success asking it to specifically spawn a subagent to evaluate each work iteration according to some criteria, then to keep iterating until the subagent is satisfied.
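Roughly, that loop looks like the sketch below. The worker and evaluator are plain callables standing in for the main agent and the subagent; `iterate_until_accepted` and the toy functions are hypothetical names, not part of any harness.

```python
from typing import Callable, Tuple

Worker = Callable[[str], str]                 # feedback -> new draft
Evaluator = Callable[[str], Tuple[bool, str]]  # draft -> (accepted, feedback)


def iterate_until_accepted(work: Worker, evaluate: Evaluator, max_rounds: int = 5) -> Tuple[str, int]:
    """Keep producing drafts until the evaluator accepts one or rounds run out."""
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        draft = work(feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            return draft, round_no
    raise RuntimeError(f"evaluator still unsatisfied after {max_rounds} rounds: {feedback}")


# Toy stand-ins: the worker only adds tests once it is told to.
def toy_worker(feedback: str) -> str:
    return "patch + tests" if "tests" in feedback else "patch"


def toy_evaluator(draft: str) -> Tuple[bool, str]:
    if "tests" in draft:
        return True, ""
    return False, "add tests"


if __name__ == "__main__":
    print(iterate_until_accepted(toy_worker, toy_evaluator))
```

The `max_rounds` cap matters in practice, otherwise a picky evaluator can burn tokens forever.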
I never saw that happen in Codex so there's a good chance that OpenClaw does something wrong. My main suspicion would be that it does not pass back thinking traces.
Gone are the days of deterministic programming, when computers simply carried out the operator’s commands because there was no other option but to close or open the relays exactly as the circuitry dictated. Welcome to the future of AI; the future we’ve been longing for and that will truly propel us forward, because AI knows and can do things better than we do.
I had this funny moment when I realized we went full circle...
"INTERCAL has many other features designed to make it even more aesthetically unpleasing to the programmer: it uses statements such as "READ OUT", "IGNORE", "FORGET", and modifiers such as "PLEASE". This last keyword provides two reasons for the program's rejection by the compiler: if "PLEASE" does not appear often enough, the program is considered insufficiently polite, and the error message says this; if it appears too often, the program could be rejected as excessively polite. Although this feature existed in the original INTERCAL compiler, it was undocumented.[7]"
Isn’t this the optimal behavior assuming that at times the service is compute-limited and that you’re paying less per token (flat fee subscription?) than some other customers? They would be strongly motivated to turn a knob to minimize tokens allocated to you to allow them to be allocated to more valuable customers.
well, I do understand the core motivation, but if the system prompt literally says “I am not budget constrained. Spend tokens liberally, think hardest, be proactive, never be lazy.” and I’m on an open pay-per-token plan on the API, that’s not what I consider optimal behavior, even in a business sense.
GPT 5.4 is really good at following precise instructions but clearly wouldn't innovate on its own (except if the instructions clearly state to innovate :))
Chatgpt is the most neutered LLM out there. Literally anything is better than that. So fucking sensitive. Looks like they modeled it after an average woke Kamala Harris voter.
Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.
Congrats on the release! Is Images 2.0 rolling out inside ChatGPT as well, or is some of the functionality still going to be API/Playground-only for a while?
OpenAI has been very generous with limit resets. Please don't turn this into a weird expectation to happen whenever something unrelated happens. It would piss me off if I were in their place and I really don't want them to stop.
The suggestion wasn't about general limit resets when there are bugs or outages, but that it would be commercially useful to let users try new models when they have already reached their weekly limits.
Sorry but why should we care if very reasonable suggestions "piss [them] off"? That sounds like a them problem. "Them" being a very wealthy business. I think OpenAI will survive this very difficult time that GP has put them through.
Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.
For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
I feel like devs generally spend someone else's money on tokens. Either their employer's, or OpenAI's when they use a codex subscription.
If I put on my schizo hat. Something they might be doing is increasing the losses on their monthly codex subscriptions, to show that the API has a higher margin than before (the codex account massively in the negative, but the API account now having huge margins).
I've never seen an OpenAI investor pitch deck. But my guess is that API margins is one of the big ones they try to sell people on since Sama talks about it on Twitter.
I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
Yeah and the increase in operating expenses is going to make managers start asking hard questions - this is good. It means eventually there will be budgets put in place - this will force OAI and Anthropic to innovate harder. Then we will see how things pan out. Ultimately a firm is not going to pay rent to these firms if the benefits don't exceed the costs.
Sometimes I wonder if innovation in the AI space has stalled and recent progress is just a product of increased compute. Competence is increasing exponentially[1] but I guess it doesn't rule it out completely. I would postulate that a radical architecture shift is needed for the singularity though
> that devs get really reliant and even addicted on coding agents
An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
On top of that, I noticed just now, after updating the macOS desktop Codex app, that the speed was again set to 'fast' by default ('about 1.5x faster with increased plan usage'). They really want you to burn more tokens.
>For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools
The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably meshy, tripo.ai, or similar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact it's not even a game engine, just a web rendering library.
FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
Excited to test 5.5 and see how it is in practice.
Not just games. I have a hobby project where I want to build my own processor. I like the circuit design part. But I'm not keen on having to write a cross-assembler.
I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
Game created by Pietro Schirano, CEO of MagicPath
Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
- Think step by step, take a deep breath. Repeat the question back before answering.
- Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
-Then write all the code. Make the game low-poly but beautiful.
- Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
- You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.
I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
See up thread for anecdotes [1].
> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.
This has allowed my workflows to float above the ups and downs of model performance.
That said, having the AI do the planning for a big request like this internally is not good outside a demo.
Because, you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
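To make the forensics point concrete, here is a minimal sketch of keeping the plan outside the model as an append-only JSONL log. This is not the tool mentioned above; `log_event`, `pending_subtasks`, and the event shapes are invented for illustration.

```python
import json
import time
from pathlib import Path
from typing import List


def log_event(log_path: Path, kind: str, **fields) -> None:
    """Append one timestamped event (plan, done, stall, ...) to the log."""
    event = {"ts": time.time(), "kind": kind, **fields}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def pending_subtasks(log_path: Path) -> List[str]:
    """Replay the log and return planned subtasks that were never marked done."""
    planned: List[str] = []
    done = set()
    for line in log_path.read_text(encoding="utf-8").splitlines():
        event = json.loads(line)
        if event["kind"] == "plan":
            planned.extend(event["subtasks"])
        elif event["kind"] == "done":
            done.add(event["subtask"])
    return [task for task in planned if task not in done]
```

After a stall you can replay the file to see exactly what was planned, what finished, and where things stopped, independent of whatever the model remembers.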
Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.
I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!
Never thought I'd say this but OpenAI is the 'open' option again.
The more interesting part of the announcement than "it's better at benchmarks":
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability for agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish was more tested than with benchmarks. From my experience Opus is still much better than GPT/Codex in this aspect, but given that OpenAI is getting material gains out of this type of performancemaxxing and they have an increasing incentive to continue doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
There's already KernelBench which tests CUDA kernel optimizations.
On the other hand all companies know that optimizing their own infrastructure / models is the critical path for "winning" against the competition, so you can bet they are serious about it.
Honestly the problem with these is how empirical it is: how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!
In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement assuming there are no quality regressions and passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod and they've led to much better performance (3x in some cases).
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
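A bare-bones version of that kind of test could look like the sketch below: the candidate only gets a speedup number if it passes every test case first. `speedup` and the sum-of-squares functions are illustrative placeholders, not an existing benchmark.

```python
import timeit
from typing import Callable, Sequence, Tuple

Case = Tuple[tuple, object]  # (arguments, expected result)


def speedup(baseline: Callable, candidate: Callable, cases: Sequence[Case], repeats: int = 5) -> float:
    """Return baseline_time / candidate_time, refusing candidates that fail any case."""
    for args, expected in cases:
        if candidate(*args) != expected:
            raise ValueError(f"candidate failed on {args!r}")

    def best_time(fn: Callable) -> float:
        # Take the minimum over several runs to reduce scheduling noise.
        return min(timeit.repeat(lambda: [fn(*args) for args, _ in cases], number=1, repeat=repeats))

    return best_time(baseline) / best_time(candidate)


def slow_sum_squares(n: int) -> int:
    """Reference implementation: explicit loop."""
    return sum(i * i for i in range(n + 1))


def fast_sum_squares(n: int) -> int:
    """Optimized implementation: closed-form formula."""
    return n * (n + 1) * (2 * n + 1) // 6


CASES = [((10,), 385), ((1000,), 333833500)]

if __name__ == "__main__":
    print(f"speedup: {speedup(slow_sum_squares, fast_sum_squares, CASES):.1f}x")
```

Run on the same hardware for every agent's submission, the ratio gives a comparable number, though a real study would need many more cases and warm-up runs.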
Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?
SOTA models get distilled to open source weights in ~6 months. So paying premium for bleeding edge performance sounds like a fair compensation for enormous capex.
This seems huge for subscription customers. Looking at the Artificial Analysis numbers, 5.5 at medium effort yields roughly the intelligence as 5.4 (xhigh) while using less than a fifth the tokens.
As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
As someone who always leaves intelligence at default, and am ok with existing models, should I be shifting gears more manually as providers sell us newer models? Is medium or lower better than free/cheaper models?
I've found myself so deeply embedded in the Claude Max subscription that I'm worried about potentially making a switch. How are people making sure they stay nimble enough not to get trapped by one company's ecosystem over another? For what it's worth, Opus 4.7 has not been a step up, and it's come with an enormously higher usage of the subscription Anthropic offers, making the entire offering doubly worse.
I have a directory of skills that I symlink to Codex/Claude/pi. I make scripts that correspond with them to do any heavy lifting, I avoid platform specific features like Claude's hooks. I also symlink/share a user AGENTS.md/CLAUDE.md
MCPs aren't as smooth, but I just set them up in each environment.
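The symlink part is easy to script. A minimal sketch, assuming each tool looks for a `skills` directory under its own config folder; the directory layout and the `link_skills` name are hypothetical, so check where your tools actually read skills from.

```python
from pathlib import Path
from typing import Iterable


def link_skills(shared: Path, tool_dirs: Iterable[Path]) -> None:
    """Point <tool_dir>/skills at one shared skills directory for every tool."""
    for tool_dir in tool_dirs:
        tool_dir.mkdir(parents=True, exist_ok=True)
        link = tool_dir / "skills"
        if link.is_symlink():
            link.unlink()  # replace a stale link
        elif link.exists():
            raise FileExistsError(f"{link} exists and is not a symlink")
        link.symlink_to(shared, target_is_directory=True)
```

Refusing to overwrite a real directory is deliberate, so an existing tool-specific skills folder never gets clobbered by accident.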
Except for history, I don’t find much that stops you from switching back and forth on the CLI. They both use tools, each has a different voice, but they both work. Have it summarize your existing history into a markdown file, and read it in with any engine.
The APIs are pretty interchangeable too. Just ask to convert from one to the other if you need to.
> Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
Yeah, this was the next step. Have RLVR make the model good. Next iteration start penalising long + correct and reward short + correct.
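As a toy illustration of "reward short + correct", here is one way such a shaped reward could be written. This is purely a guess at the general idea, not anything OpenAI has described; `shaped_reward`, `token_budget`, and `penalty` are invented for the sketch.

```python
def shaped_reward(correct: bool, tokens_used: int, token_budget: int, penalty: float = 0.5) -> float:
    """Verifiable reward with a length penalty: wrong answers get 0, long correct ones get less."""
    if not correct:
        return 0.0
    # Fraction of the budget consumed, capped at 1 so the reward stays positive.
    overuse = min(tokens_used / token_budget, 1.0)
    return 1.0 - penalty * overuse
```

With `penalty` below 1, a correct answer always beats an incorrect one, so the model is never pushed to trade correctness for brevity.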
> CyberGym 81.8%
Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.
Isn't Mythos limited to a selected group of companies/organizations Anthropic chose themselves? If the OpenAI announcement for GPT-5.5 is accurate the "trusted cyber access" just requires an open, seemingly straightforward identity verification step.
> We are expanding access to accelerate cyber defense at every level. We are making our cyber-permissive models available through Trusted Access for Cyber, starting with Codex, which includes expanded access to the advanced cybersecurity capabilities of GPT‑5.5 with fewer restrictions for verified users meeting certain trust signals at launch.
> Broad access is made possible through our investments in model safety, authenticated usage, and monitoring for impermissible use. We have been working with external experts for months to develop, test and iterate on the robustness of these safeguards. With GPT‑5.5, we are ensuring developers can secure their code with ease, while putting stronger controls around the cyber workflows most likely to cause harm by malicious actors.
> Organizations who are responsible for defending critical infrastructure can apply to access cyber-permissive models like GPT‑5.4‑Cyber, while meeting strict security requirements to use these models for securing their internal systems.
"GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.
If GPT-5.5 Pro really was Spud, and two years of pretraining culminated in one release, WOW, you cannot feel it at all from this announcement. If OpenAI wants to know why they feel like they’ve fallen behind the vibes of Anthropic, they need to look no further than their marketing department. This makes everything feel like a completely linear upgrade in every way.
Clearly they felt a big backlash when version 5 was released. Now they are afraid of another response like this. And effectively, for the user it will likely only be a small update.
How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?
I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.
Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
It's literally a distinct model with a different optimisation goal compared to normal chat. There's a ton of public information around how they work and how they're trained
I can argue that disaster started mid-4.6, when they started juggling with rate limits while hitting uptime problems. Great we have some healthy competition and waiting for the next move from Deepmind.
On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.
The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.
We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.
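The arithmetic behind "price per task is not price per token" fits in a few lines. The GPT-5.5 prices are the $5 / $30 per 1M tokens quoted from the announcement; the other model's prices and all token counts are made-up numbers, chosen only to show a 5x output-token gap.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one task given token counts and per-1M-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# GPT-5.5 at the announced API prices, with a hypothetical 20k-in / 4k-out task.
gpt55 = cost_per_task(20_000, 4_000, usd_per_m_input=5.0, usd_per_m_output=30.0)

# Hypothetical cheaper-per-token model that needs 5x the output tokens.
other = cost_per_task(20_000, 20_000, usd_per_m_input=3.0, usd_per_m_output=15.0)

if __name__ == "__main__":
    print(f"GPT-5.5: ${gpt55:.2f} per task, other model: ${other:.2f} per task")
```

With these illustrative numbers the model that is cheaper per token ends up costing more per task, which is the point about token efficiency dominating once the gap in tokens used gets large.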
yes but as far as i know gpt tokenizer is about the same as opus 4.6's, where 4.7 is seeing something in the ballpark of a 30% increase. this should still be cheaper even disregarding the concerns around 4.7 thinking burning tokens
For less than 10% bump across the benchmarks? Probably not, but if your employer is paying (which is probably what OAI is counting on) it's all good.
It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.
You are paying per token, but what you care about is token efficiency. If token efficiency has improved by as much as they claim it did (i.e. you need less tokens to complete a task successfully) all seems well.
Tbf, I have not super kept track of what is actually happening inside the "thinking" portion of recent releases. But last time I checked there still was a lot of verbosity and mistakes, that beat the actual amount of required, usable code generation by a wide margin.
Well, sort of. Imagine the case where it first scans the repo, then "intelligently" creates architecture files describing the project.
The level of intelligence will create a varying quality of summary, with varying need of deep-scans on subsequent sessions. Level of intelligence will also increase comprehension of these architecture files.
Same principle applies when designing plans for complex tasks, etc. Token amount to grasp a concept is what matters.
This happens with every new model release though. The model makes less mistakes and spends less time fixing them, resulting in a token usage reduction for the same difficulty of task. Almost any task other than straight boilerplate will benefit from this.
In the same vein, I would guess that Opus 4.7 is probably cheaper for most tasks than 4.6, even though the tokenizer uses more tokens for the same length of string.
Maybe you'll have better luck but our team just cannot use Opus 4.7.
Some say it goes off on endless tangents, others that it doesn't work enough. Personally, it acts, talks, and makes mistakes like GPT models, for a much more exorbitant price. Misses out on important edge cases, doesn't get off its ass to do more than the bare minimum I asked (I mention an error and it fixes that error and doesn't even think to see if it exists elsewhere and propose fixing it there).
I've slowly been moving to GPT5.4-xhigh with some skills to make it act a bit more like Opus 4.6, in case the latter gets discontinued in favour of Opus 4.7.
I hope the industry starts competing more on highest scores with lowest tokens like this. It's a win for everybody. It means the model is more intelligent, is more efficient to inference, and costs less for the end user.
So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.
Good job on the release notice. I appreciate that it isn't just marketing fluff, but actually includes the technical specs for those of us who care, and that it isn't focused only on coding agents.
I hope GPT 5.5 Pro is not cutting corners and neutered from the start; you've got the compute for it not to be.
Looking at the space/game/earthquake tracker examples makes me hopeful that OpenAI is going to focus a bit more on interface visual development/integration from tools like Figma. This is one area where Anthropic definitely reigns supreme.
This model is great at long horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem that would take hours with verifiable constraints, you will see how good this is:)
Chip costs strongly impact the economics of model serving.
It is entirely plausible to me that Opus 4.7 is designed to consume more tokens in order to artificially reduce the API cost/token, thereby obscuring the true operating cost of the model.
I agree though, I chose poor phrasing originally. Better to say that GB200 vs Trainium could contribute to the efficiency differential.
GPT is really great, but I wish the GPT desktop app supported MCP as well.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.
Benchmarks are favorable enough they're comparing to non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation beyond bigger model better this time?
It's behind Opus 4.7 in SWE-Bench Pro, if you care about that kind of thing. It seems on-trend, even though benchmarks are less and less meaningful for the stuff we expect from models now.
Yay. 5.4 was a frustrating model - moments of extreme intelligence (I liked it very much for code review) - but also a sort of idiocy/literalism that made it very unsuited for prompting in a vague sense. I also found its openclaw engagement wooden and frustrating. Which didn’t matter until anthropic started charging $150 a day for opus for openclaw.
Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.
Very impressive! Interesting how on all other benchmarks it seems to surpass Opus 4.7, except SWE-Bench Pro (Public). You would think that, doing so well at Cyber, it would naturally possess more abilities there. Wonder what makes up the actual difference there.
After 5.1, we haven’t seen a -codex-max model, presumably because the benefits of the special training gpt-5.1-codex-max got to improve long context work filtered into gpt-5.2-codex, making the variant no longer necessary (my personal experience accords with this). I’ve been using gpt-5.4 in Codex since it came out, it’s been great. I’ve never back-to-back tested a version against its -codex variant to figure out what the qualitative difference is (this would take a long time to get a really solid answer), but I wouldn’t be surprised if at some point the general-purpose model no longer needs whatever extra training the -codex model gets and they just stop releasing them.
I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.
> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.
Maybe more interesting is that they’ve used codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?
Seems like a continuation of the current meta where GPT models are better in GPT-like ways and Claude models are better in Claude-like ways, with the differences between each slightly narrowing with each generation. 5.5 is noticeably better to talk to, 4.7 is noticeably more precise. Etc etc.
This is frankly exciting. Outside of the politics of it all, it always feels great to wake up to a new model being released. I personally will stay awake quite long tonight if GPT-5.5 drops in Codex.
I'd really like to see improvements like these:
- Some technical proof that data is never read by open ai.
- Proof that no logs of my data or derived data is saved.
etc...
Depends on goals. For long free-form discussions I find Opus 4.7 Adaptive better/deeper than Opus 4.6 Extended. But the usual caveats apply: first week of use, and the token budget seems generous now on Max 5X.
Cursor is what I daily-drive. 4.7 has been terrible for my mostly python-driven work (whereas Opus 4.6 was literally revolutionary to me). Our frontend folks are also complaining.
I am sceptical. The generation after 4o models have become crappier and crappier. Hope this one changes the trend. 5.4 is unusable for complex coding work.
Surprised to see SWE-Bench Pro only a slight improvement (57.7% -> 58.6%) while Opus 4.7 hit 64.3%. I wonder what Anthropic is doing to achieve higher scores on this, and also what makes this test particularly hard to do well in compared to Terminal Bench (which 5.5 seemed to have a big jump in).
There's an asterisk right below that table stating that:
> *Anthropic reported signs of memorization on a subset of problems
And from the Anthropic's Opus 4.7 release page, it also states:
> SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7’s margin of improvement over Opus 4.6 holds.
... sigh. I realize there's little that can be done about this, but I just got through a real-world session determining if Opus 4.7 is meaningfully better than Opus 4.6 or GPT 5.4, and now there's another one to try things with. These benchmark results generally mean little to me in practice.
Nice to see them openly compare to Opus-4.7… but they don’t compare it against Mythos which says everything you need to know.
The LinkedIn/X influencers who hyped this as a Mythos-class model should be ashamed of themselves, but they’ll be too busy posting slop content about how “GPT-5.5 changes everything”.
I doubt this is representative of real world usage. There is a difference between a few turns on a web chatbot, vs many-turn cli usage on a real project.
It's possible that "smarter" AI won't lead to more productivity in the economy. Why?
Because software and "information technology" generally didn't increase productivity over the past 30 years.
This has been long known as Solow's productivity paradox. There's lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.
But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.
AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.
If you give AI a body... well, maybe that changes.
> "information technology" generally didn't increase productivity
Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - rest is pen and paper.
Productivity metrics were better when businesses were run on just pen and paper. Of course, there could be many confounding factors, but there are also many reasons why this could be so. Just a few hypotheses:
- Pen and paper become a limiting factor on bureaucratic BS
- Pen and paper are less distracting
- Pen and paper require more creative output from the user, as opposed to screens which are mostly consumptive
It's quite possible the use of LLMs means that we are using less effort to produce the same output. This seems good.
But exerting less effort also conditions you to be weaker, and less able to connect deeply with the brain to grind as hard as you once did. This is bad.
Which effect dominates? Difficult to say.
Of course this is absolutely possible. Ultimately there was a time when physical exertion was a thing and nobody was overweight. That isn't the case anymore, is it?
It's a hypothesis that "smarter" AI models, i.e. GPT-5.5, may not be a great boon to productivity. Given that this is the raison d'etre of AI models, and improving them, I don't see why it is any less useful than any other discussion.
Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.
(I work at OpenAI.)
Did you guys do anything about GPT's motivation? I tried to use GPT-5.4 API for my OpenClaw after the Anthropic Oauthgate, but I just couldn't drag it to do its job. I had the most hilarious dialogues along the lines of “You stopped, X would have been next.” - “Yeah, I'm sorry, I failed. I should have done X next.” - “Well, how about you just do it?” - “Yep, I really should have done it now.” - “Do X, right now, this is an instruction.” - “I didn’t. You’re right, I have failed you. There’s no apology for that.”
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
Get the actual prompt and have Claude Code / Codex try it out via curl / python requests. The full prompt will yield debugging information. You have to set a few parameters to make sure you get the full gpt-5 performance, e.g. if your reasoning budget is too low, you get gpt-4 grade performance.
IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
I've had success asking it to specifically spawn a subagent to evaluate each work iteration according to some criteria, then to keep iterating until the subagent is satisfied.
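For anyone wiring this into their own harness, here is a minimal sketch of that evaluate-and-iterate loop. `worker` and `evaluator` are hypothetical callables standing in for the main agent and the reviewing subagent; neither is an OpenClaw or Codex API, and the stubs below exist only so the example runs.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: worker(task, feedback) -> result text,
# evaluator(task, result) -> (accepted, feedback for the next round).
Worker = Callable[[str, Optional[str]], str]
Evaluator = Callable[[str, str], Tuple[bool, str]]


def iterate_until_accepted(
    task: str, worker: Worker, evaluator: Evaluator, max_rounds: int = 5
) -> Tuple[str, bool, int]:
    """Re-run the worker until the evaluator accepts or rounds run out."""
    feedback: Optional[str] = None
    result = ""
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        result = worker(task, feedback)
        accepted, feedback = evaluator(task, result)
        if accepted:
            return result, True, rounds_used
    return result, False, rounds_used


# Stub agents for illustration only: the "worker" adds one step per round,
# and the "evaluator" is satisfied once three steps are present.
def stub_worker(task: str, feedback: Optional[str]) -> str:
    done = 0 if feedback is None else int(feedback)
    return ",".join(f"step{i}" for i in range(1, done + 2))


def stub_evaluator(task: str, result: str) -> Tuple[bool, str]:
    steps = len(result.split(","))
    return steps >= 3, str(steps)


if __name__ == "__main__":
    print(iterate_until_accepted("demo", stub_worker, stub_evaluator))
```

The `max_rounds` cap matters in practice: without it, a worker that never satisfies the evaluator keeps spending tokens indefinitely.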
I’ve had great success replacing it with Kimi 2.6
I never saw that happen in Codex so there's a good chance that OpenClaw does something wrong. My main suspicion would be that it does not pass back thinking traces.
Anecdata, but I see this in Codex all the time. It takes about two rounds before it realises it's supposed to continue.
Gone are the days of deterministic programming, when computers simply carried out the operator’s commands because there was no other option but to close or open the relays exactly as the circuitry dictated. Welcome to the future of AI; the future we’ve been longing for and that will truly propel us forward, because AI knows and can do things better than we do.
I had this funny moment when I realized we went full circle...
"INTERCAL has many other features designed to make it even more aesthetically unpleasing to the programmer: it uses statements such as "READ OUT", "IGNORE", "FORGET", and modifiers such as "PLEASE". This last keyword provides two reasons for the program's rejection by the compiler: if "PLEASE" does not appear often enough, the program is considered insufficiently polite, and the error message says this; if it appears too often, the program could be rejected as excessively polite. Although this feature existed in the original INTERCAL compiler, it was undocumented.[7]"
— https://en.wikipedia.org/wiki/INTERCAL
These are orthogonal to each other.
Isn’t this the optimal behavior assuming that at times the service is compute-limited and that you’re paying less per token (flat fee subscription?) than some other customers? They would be strongly motivated to turn a knob to minimize tokens allocated to you to allow them to be allocated to more valuable customers.
well, I do understand the core motivation, but if the system prompt literally says “I am not budget constrained. Spend tokens liberally, think hardest, be proactive, never be lazy.” and I’m on an open pay-per-token plan on the API, that’s not what I consider optimal behavior, even in a business sense.
Fair, if you’re paying per token (at comparable rates to other customers) I wouldn’t expect this behavior from a competent company.
GPT 5.4 is really good at following precise instructions but clearly wouldn't innovate on its own (except if the instructions clearly state to innovate :))
Chatgpt is the most neutered LLM out there. Literally anything is better than that. So fucking sensitive. Looks like they modeled it after an average woke Kamala Harris voter.
Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.
Why would you be confused?
The UI tells you which model you're using at any given time.
Congrats on the release! Is Images 2.0 rolling out inside ChatGPT as well, or is some of the functionality still going to be API/Playground-only for a while?
Images 2.0 is already in ChatGPT.
Great, thanks for clarifying :)
Just a tip: add [translated] subtitles to the top video.
Great stuff! Congrats on the release!
Please next time start with azure foundry lol thanks!
With Anthropic, newer models often lead to quality degradation. Will you keep GPT 5.4 available for some time?
can't wait! Thanks guys. PS: when you drop a new model, it would be smart to reset weekly or at least session limits :)
OpenAI has been very generous with limit resets. Please don't turn this into a weird expectation to happen whenever something unrelated happens. It would piss me off if I were in their place and I really don't want them to stop.
The suggestion wasn't about general limit resets when there are bugs or outages, but about it being commercially useful to let users try new models when they have already reached their weekly limits.
There is absolutely nothing wrong with asking or suggesting. They are adults. I'm sure they can handle it.
Sorry but why should we care if very reasonable suggestions "piss [them] off"? That sounds like a them problem. "Them" being a very wealthy business. I think OpenAI will survive this very difficult time that GP has put them through.
https://news.ycombinator.com/newsguidelines.html
Limits were just reset two days ago.
And yet there was an outage last night
And they're having an outage right now.
This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838
And that backdoor API has GPT-5.5.
So here's a pelican: https://gist.github.com/simonw/edda1d98f7ba07fd95eeff473cb16...
I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex
It's... like no pelican I've ever seen before.
That pelican you posted yesterday from a local model looks nicer than this one.
Edit: this one has crossed legs lol
I'd like to draw people's attention to this section of this page:
https://developers.openai.com/codex/pricing?codex-usage-limi...
Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.
For API usage, GPT-5.5 is 2x the price of GPT-5.4, ~4x the price of GPT-5.1, and ~10x the price of Kimi-2.6.
Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
I feel like devs generally spend someone else's money on tokens. Either their employer's, or OpenAI's when they use a Codex subscription.
If I put on my schizo hat. Something they might be doing is increasing the losses on their monthly codex subscriptions, to show that the API has a higher margin than before (the codex account massively in the negative, but the API account now having huge margins).
I've never seen an OpenAI investor pitch deck. But my guess is that API margins are one of the big things they try to sell people on, since Sama talks about it on Twitter.
I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
Yeah, and the increase in operating expenses is going to make managers start asking hard questions - this is good. It means eventually there will be budgets put in place - this will force OAI and Anthropic to innovate harder. Then we will see how things pan out. Ultimately a firm is not going to pay rent to these firms if the benefits don't exceed the costs.
Sometimes I wonder if innovation in the AI space has stalled and recent progress is just a product of increased compute. Competence is increasing exponentially[1] but I guess it doesn't rule it out completely. I would postulate that a radical architecture shift is needed for the singularity though
[1] https://arxiv.org/html/2503.14499v1 (Source is from March 2025, so make of it what you will.)
> that devs get really reliant and even addicted on coding agents
An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
On top of that, I noticed just now, after updating the macOS desktop Codex app, that the speed was again set to 'fast' by default ('about 1.5x faster with increased plan usage'). They really want you to burn more tokens.
what's the source on that?
In the announcement webpage:
>For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
oops, thanks. i had just been looking at their api docs
A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools
The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably meshy, tripo.ai, or similar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact it's not even a game engine, just a web rendering library.
FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
Excited to test 5.5 and see how it is in practice.
> It still struggles to create shaders from scratch
Oh just like a real developer
Much respect for shader developers, it's a different way of thinking/programming
Not just games. I have a hobby project where I want to build my own processor. I like the circuit design part. But I'm not keen on having to write a cross-assembler.
An hour or so with Claude Code and https://paste.c-net.org/ParmesanControl
I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
The prompt did not specify advanced gameplay.
I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
See up thread for anecdotes [1].
> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.
This has allowed my workflows to float above the ups and downs of model performance.
That said, having the AI do the planning for a big request like this internally is not good outside a demo.
Because you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
[1] https://news.ycombinator.com/item?id=47879819
It comes across as an elaborate, sparkly motivational cat poster.
*BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
> Think Step By Step
What is this, 2023?
I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.
I personally don't think the gameplay itself is that impressive.
Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.
I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!
Never thought I'd say this but OpenAI is the 'open' option again.
Isn't it the case that cyber questions are being routed to dumber models at OpenAI?
Do you have a source for that?
Neither the release post nor the model card seems to indicate anything like this?
The more interesting part of the announcement than "it's better at benchmarks":
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability for agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish was more tested than with benchmarks. From my experience Opus is still much better than GPT/Codex in this aspect, but given that OpenAI is getting material gains out of this type of performancemaxxing and they have an increasing incentive to continue doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
There's already KernelBench which tests CUDA kernel optimizations.
On the other hand, all companies know that optimizing their own infrastructure / models is the critical path for "winning" against the competition, so you can bet they are serious about it.
Honestly the problem with these is how empirical it is: how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either - unless it's a proper controlled study!
In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and it passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod and they've led to much better performance (3x in some cases).
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
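A rough sketch of what such a harness could look like, assuming each agent hands back a plain Python callable. The function names, the best-of-N timing choice, and the toy candidates are all my own assumptions here, not an existing benchmark.

```python
import time
from typing import Callable, Dict, Sequence, Tuple

Case = Tuple[tuple, object]  # (positional args, expected result)


def passes_all(fn: Callable, cases: Sequence[Case]) -> bool:
    """A candidate only counts if it is correct on every test case."""
    return all(fn(*args) == expected for args, expected in cases)


def best_time(fn: Callable, args: tuple, repeats: int = 5) -> float:
    """Best-of-N wall-clock time in seconds, to damp scheduler noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-12)  # guard against a zero reading


def relative_speedups(
    baseline: Callable,
    candidates: Dict[str, Callable],
    cases: Sequence[Case],
    bench_args: tuple,
    repeats: int = 5,
) -> Dict[str, float]:
    """Speedup over the baseline for every candidate that passes all cases."""
    base = best_time(baseline, bench_args, repeats)
    speedups = {}
    for name, fn in candidates.items():
        if not passes_all(fn, cases):
            continue  # incorrect implementations are disqualified, not timed
        speedups[name] = base / best_time(fn, bench_args, repeats)
    return speedups


# Toy stand-ins for "agent-written" implementations of the same task.
def loop_sum(values):
    total = 0
    for v in values:
        total += v
    return total


def broken_sum(values):
    return sum(values) + 1  # fast but wrong, so it must be excluded


if __name__ == "__main__":
    cases = [(([1, 2, 3],), 6), (([],), 0)]
    data = (list(range(100_000)),)
    print(relative_speedups(loop_sum, {"builtin": sum, "broken": broken_sum}, cases, data))
```

Gating on correctness before timing is the important design choice: a speedup number from an implementation that fails a test case is meaningless.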
Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?
Look at cost per intelligence or cost per task instead of cost per token.
Such an increase tracks the company's valuation trend, which they constantly, somehow have to justify (let alone break even on costs).
SOTA models get distilled to open source weights in ~6 months. So paying premium for bleeding edge performance sounds like a fair compensation for enormous capex.
We know they cost much more than this for OpenAI. Assume prices will continue to climb until they are making money.
If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.
Do "our [superlative] and [superlative] [product] yet" and you have pretty much every product launch
I love when Apple says they’re releasing their best iPhone yet so I know the new model is better than the old ones.
"our newest and most expensive model yet"
can't wait for "our worst and dumbest model yet"
Apple should have used that one for the 2016 MacBook.
I'm here for the pelicans and I'm not leaving until I see one!
I've come to prompt pelicans and chew gum, and I'm all outta gum!
That's a true CTO right there.
I know a 10x engineer when i see one.
Ctrl+F: pelican
F5
simonw pls
This seems huge for subscription customers. Looking at the Artificial Analysis numbers, 5.5 at medium effort yields roughly the same intelligence as 5.4 (xhigh) while using less than a fifth of the tokens.
As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
As someone who always leaves intelligence at default and is OK with existing models, should I be shifting gears more manually as providers sell us newer models? Is medium or lower better than free/cheaper models?
I've found myself so deeply embedded in the Claude Max subscription that I'm worried about potentially making a switch. How are people making sure they stay nimble enough not to get trapped by one company's ecosystem over another? For what it's worth, Opus 4.7 has not been a step up, and it burns through the Anthropic subscription's usage allowance far faster, making the entire offering doubly worse.
Small tip, at least for now you can switch back to Opus 4.6, both in the ui and in Claude Code.
I have a directory of skills that I symlink to Codex/Claude/pi. I make scripts that correspond with them to do any heavy lifting, I avoid platform specific features like Claude's hooks. I also symlink/share a user AGENTS.md/CLAUDE.md
MCPs aren't as smooth, but I just set them up in each environment.
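A small sketch of the symlink setup described above, in case it helps anyone replicate it. The directory locations in the commented usage are placeholders I made up; check where each tool actually looks for skills before pointing links at it.

```python
from pathlib import Path
from typing import Iterable, List


def link_shared_dir(shared: Path, targets: Iterable[Path]) -> List[Path]:
    """Point each target path at one shared directory via a symlink.

    Existing symlinks are repointed; real files or directories are never
    overwritten. Returns the links that were created or changed.
    """
    shared = shared.resolve()
    changed: List[Path] = []
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.is_symlink():
            if target.resolve() == shared:
                continue  # already linked correctly
            target.unlink()  # stale link, replace it
        elif target.exists():
            raise FileExistsError(f"{target} exists and is not a symlink")
        target.symlink_to(shared, target_is_directory=True)
        changed.append(target)
    return changed


# Hypothetical usage; these paths are placeholders, not documented locations:
# link_shared_dir(
#     Path.home() / "agent-skills",
#     [Path.home() / ".tool_a" / "skills", Path.home() / ".tool_b" / "skills"],
# )
```

Refusing to touch anything that is not already a symlink keeps the script from clobbering a tool's own directory by accident.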
Except for history, I don’t find much that stops you from switching back and forth on the CLI. They both use tools, each has a different voice, but they both work. Have it summarize your existing history into a markdown file, and read it in with any engine.
The APIs are pretty interchangeable too. Just ask to convert from one to the other if you need to.
Is this the first time OpenAI has published comparisons to other labs?
Seems so to me - see GPT-5.4[1] and 5.2[2] announcements.
Might be a tacit admission of being behind.
[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/
> Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
Yeah, this was the next step. Have RLVR make the model good. Next iteration start penalising long + correct and reward short + correct.
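To make the "reward short + correct" idea concrete, here is one way such a reward could be shaped. This is purely an illustration: the announcement doesn't describe the training recipe, and the `alpha` and `token_budget` knobs are hypothetical.

```python
def length_aware_reward(
    correct: bool, tokens_used: int, token_budget: int, alpha: float = 0.5
) -> float:
    """Verifiable reward with a length penalty applied only to correct answers.

    Incorrect answers get 0 regardless of length, so brevity can never
    beat correctness. Correct answers get 1 minus a penalty that grows
    linearly with tokens used, capped once the budget is reached.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    if not correct:
        return 0.0
    used_fraction = min(max(tokens_used, 0) / token_budget, 1.0)
    return 1.0 - alpha * used_fraction


if __name__ == "__main__":
    # Same correct answer, different lengths: the shorter one scores higher.
    print(length_aware_reward(True, 1_000, 10_000))
    print(length_aware_reward(True, 9_000, 10_000))
    print(length_aware_reward(False, 100, 10_000))
```

Keeping `alpha` below 1 means a long correct answer still outscores any wrong one, which is the ordering you'd want before rewarding brevity at all.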
> CyberGym 81.8%
Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.
Isn't Mythos limited to a selected group of companies/organizations Anthropic chose themselves? If the OpenAI announcement for GPT-5.5 is accurate the "trusted cyber access" just requires an open, seemingly straightforward identity verification step.
https://openai.com/index/scaling-trusted-access-for-cyber-de...
"GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.
Isn't CyberGym an open benchmark, so trivial to benchmaxx anyway?
Not good for employees that are being measured by their token usage.
If GPT-5.5 Pro really was Spud, and two years of pretraining culminated in one release, WOW, you cannot feel it at all from this announcement. If OpenAI wants to know why they feel like they’ve fallen behind the vibes of Anthropic, they need to look no further than their marketing department. This makes everything feel like a completely linear upgrade in every way.
Clearly they felt a big backlash when version 5 was released. Now they are afraid of another response like this. And effectively, for the user it will likely only be a small update.
Also the naming department. You can tell that this is the AI company Microsoft chose to back because their naming scheme is as bad as .NET's.
I actually have no problem with the 5.x line... but if Pro really was an entirely new pretrain, they did a horrible job conveying that.
> It excels at ... researching online
How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?
I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.
Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
It's literally a distinct model with a different optimisation goal compared to normal chat. There's a ton of public information around how they work and how they're trained
I like that they waited for opus 4.7 to come out first so they had a few days to find the benchmarks that gpt 5.5 is better at
Well, anecdotally, 5.4 was already better than opus 4.7, so it should not have been hard.
I like that Anthropic rushed 4.7 out to get a couple days of coverage before 5.5 hit
Everything since that launch to this release has been a PR disaster for Anthropic.
I can argue that disaster started mid-4.6, when they started juggling with rate limits while hitting uptime problems. Great we have some healthy competition and waiting for the next move from Deepmind.
Pricing: $5/1M input, $30/1M output
(same input price and 20% more output price than Opus 4.7)
Yep, it's more expensive per token.
However, I do want to emphasize that this is per token, not per task.
If we look at Opus 4.7, it uses smaller tokens (1-1.35x more than Opus 4.6) and it was also trained to think longer. https://www.anthropic.com/news/claude-opus-4-7
On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.
The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.
We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.
(I work at OpenAI.)
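To put rough numbers on the per-token vs per-task point, here is a back-of-the-envelope sketch. The $30 and $25 output prices come from the pricing quoted in this thread (the latter derived from "20% more output price than Opus 4.7"); the 1M-token task size is an arbitrary assumption, the ~5x multiplier is the single-eval figure above, and input tokens are ignored entirely.

```python
def output_cost_usd(output_tokens: float, price_per_million_usd: float) -> float:
    """Output-token cost of one task; input tokens are deliberately ignored."""
    return output_tokens / 1_000_000 * price_per_million_usd


# Prices per 1M output tokens, as quoted in the thread.
GPT_55_OUTPUT_PRICE = 30.0
OPUS_47_OUTPUT_PRICE = 25.0  # $30 is described as 20% more than this

# Hypothetical task: 1M output tokens on GPT-5.5, ~5x that on Opus 4.7
# (the multiplier is the Artificial Analysis figure cited above and
# will vary a lot by task).
GPT_55_TOKENS = 1_000_000
OPUS_47_TOKENS = 5 * GPT_55_TOKENS

gpt_cost = output_cost_usd(GPT_55_TOKENS, GPT_55_OUTPUT_PRICE)
opus_cost = output_cost_usd(OPUS_47_TOKENS, OPUS_47_OUTPUT_PRICE)

if __name__ == "__main__":
    print(f"GPT-5.5:  ${gpt_cost:.2f}")
    print(f"Opus 4.7: ${opus_cost:.2f}")
    print(f"Opus/GPT cost ratio: {opus_cost / gpt_cost:.2f}x")
```

Under those assumptions the model with the lower per-token price comes out roughly 4x more expensive for the task; with a smaller token multiplier the gap shrinks, and it flips at about 1.2x.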
yes but as far as i know gpt tokenizer is about the same as opus 4.6's, where 4.7 is seeing something in the ballpark of a 30% increase. this should still be cheaper even disregarding the concerns around 4.7 thinking burning tokens
That pricing is extremely spicy, wow.
Their 'Preparedness Framework'[1] is 20 pages and looks ChatGPT generated, I don't feel prepared reading it.
https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...
Worth the 100% price increase over GPT-5.4?
For less than 10% bump across the benchmarks? Probably not, but if your employer is paying (which is probably what OAI is counting on) it's all good.
It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.
You are paying per token, but what you care about is token efficiency. If token efficiency has improved by as much as they claim it did (i.e. you need less tokens to complete a task successfully) all seems well.
Not for coding because it actually needs to read and write large files
Tbf, I have not super kept track of what is actually happening inside the "thinking" portion of recent releases. But last time I checked there still was a lot of verbosity and mistakes, that beat the actual amount of required, usable code generation by a wide margin.
Well, sort of. Imagine the case where it first scans the repo, then "intelligently" creates architecture files describing the project. The level of intelligence will create a varying quality of summary, with varying need of deep-scans on subsequent sessions. Level of intelligence will also increase comprehension of these architecture files.
Same principle applies when designing plans for complex tasks, etc. Token amount to grasp a concept is what matters.
If it uses half the tokens to complete a task, then doubling the cost is perfectly fine. But is that actually true?
This happens with every new model release though. The model makes less mistakes and spends less time fixing them, resulting in a token usage reduction for the same difficulty of task. Almost any task other than straight boilerplate will benefit from this.
In the same vein, I would guess that Opus 4.7 is probably cheaper for most tasks than 4.6, even though the tokenizer uses more tokens for the same length of string.
Maybe you'll have better luck but our team just cannot use Opus 4.7.
Some say it goes off on endless tangents, others that it doesn't work enough. Personally, it acts, talks, and makes mistakes like GPT models, for a much more exorbitant price. Misses out on important edge cases, doesn't get off its ass to do more than the bare minimum I asked (I mention an error and it fixes that error and doesn't even think to see if it exists elsewhere and propose fixing it there).
I've slowly been moving to GPT5.4-xhigh with some skills to make it act a bit more like Opus 4.6, in case the latter gets discontinued in favour of Opus 4.7.
Doesn't look like it's cheaper, better or uses fewer tokens: https://www.reddit.com/r/Anthropic/comments/1stf6fz/one_week...
YMMV, I know.
We'll find out!
I hope the industry starts competing more on highest scores with lowest tokens like this. It's a win for everybody. It means the model is more intelligent, is more efficient to inference, and costs less for the end user.
So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.
The premise of the trillion dollars in AI investments is not that it’ll be as good as it currently is but cheaper. It’s AGI or bust at this point.
Yeah, but don’t you agree that less tokens to accomplish the same goal is a sign of increasing intelligence?
Less cost to accomplish the same goal is a sign of intelligence. That's not necessarily achieved with less tokens but it may be.
Kind of? But I really care about price, speed, and quality. If it used 10x the tokens at 1/10th the price per token and the same latency, I would be neutral on it.
Kimi 2.6, for example, seems to throw more tokens at improving performance (for better or worse).
GPT-5.5 System Card:
https://deploymentsafety.openai.com/gpt-5-5
Good job on the release notice. I appreciate that it isn't just marketing fluff, but actually includes the technical specs for those of us who care and aren't focused only on coding agents.
I hope GPT 5.5 Pro is not cutting corners and neutered from the start; you've got the compute for it not to be.
Looking at the space/game/earthquake tracker examples makes me hopeful that OpenAI is going to focus a bit more on interface visual development/integration from tools like Figma. This is one area where Anthropic definitely reigns supreme.
This model is great at long horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem that would take hours with verifiable constraints, you will see how good this is:)
*I work at OAI.
Could be a great feature, can't wait to test! Tired of other models (looking at you Opus) constantly stuck mid-task lately.
For a 56.7 score on the Artificial Analysis Intelligence Index, GPT 5.5 used 22m output tokens. For a score of 57, Opus 4.7 used 111m output tokens.
The efficiency gap is enormous. Maybe it's the difference between GB200 NVL72 and an Amazon Trainium chip?
why would chip affect token quantity. this is all models.
Chip costs strongly impact the economics of model serving.
It is entirely plausible to me that Opus 4.7 is designed to consume more tokens in order to artificially reduce the API cost/token, thereby obscuring the true operating cost of the model.
I agree though, I chose poor phrasing originally. Better to say that GB200 vs Trainium could contribute to the efficiency differential.
Chips don’t impact output quality by this magnitude
True, but the qualifying the power played a large part. Most likely nuclear power for this high quality token efficiency.
GPT is really great, but I wish the GPT desktop app supported MCP as well.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.
Use codex app
Benchmarks are favorable enough they're comparing to non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation beyond bigger model better this time?
It's behind Opus 4.7 in SWE-Bench Pro, if you care about that kind of thing. It seems on-trend, even though benchmarks are less and less meaningful for the stuff we expect from models now.
Will be interesting to try.
Yay. 5.4 was a frustrating model - moments of extreme intelligence (I liked it very much for code review) - but also a sort of idiocy/literalism that made it very unsuited for prompting in a vague sense. I also found its openclaw engagement wooden and frustrating. Which didn’t matter until anthropic started charging $150 a day for opus for openclaw.
Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.
Very impressive! Interesting how it seems to surpass Opus 4.7 on all other benchmarks except SWE-Bench Pro (Public). You would think that, doing so well at Cyber, it would naturally possess more abilities there. Wonder what makes up the actual difference there.
Will we also see a GPT-5.5-Codex version of this model? Or will the same version of it be served both in the web app and in Codex?
After 5.1, we haven’t seen a -codex-max model, presumably because the benefits of the special training gpt-5.1-codex-max got to improve long context work filtered into gpt-5.2-codex, making the variant no longer necessary (my personal experience accords with this). I’ve been using gpt-5.4 in Codex since it came out, it’s been great. I’ve never back-to-back tested a version against its -codex variant to figure out what the qualitative difference is (this would take a long time to get a really solid answer), but I wouldn’t be surprised if at some point the general-purpose model no longer needs whatever extra training the -codex model gets and they just stop releasing them.
I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.
> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not fewer.
Maybe more interesting is that they’ve used Codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?
They say latency is the same as 5.4 and 5.5 is served on GB200 NVL72, so I assume 5.4 was served on Hopper.
Seems like a continuation of the current meta where GPT models are better in GPT-like ways and Claude models are better in Claude-like ways, with the differences between each slightly narrowing with each generation. 5.5 is noticeably better to talk to, 4.7 is noticeably more precise. Etc etc.
Interesting. At least for me it's pegged to 256k tokens vs the 1M for 5.4, or it's not respecting the TOML config.
I might just be following too many AI-related people on X, but omg the media blitz around 5.5 is aggressive.
Soo many unconvincing "I've had access for three weeks and omg it's amazing" takes, it actually primes me for it to be a "meh".
I prefer to see for myself, but the gradual rollout, combined with a full-on marketing campaign, is annoying.
What is the reason behind OpenAI being able to release new models very fast?
Since Feb when we got Gemini 3.1, Opus 4.6, and GPT-5.3-Codex we have seen GPT-5.4 and GPT-5.5 but only Opus 4.7 and no new Gemini model.
Both of these are pretty decent improvements.
Competition.
This is frankly exciting. Outside of the politics of it all, it always feels great to wake up to a new model being released. I personally will stay up quite late tonight if GPT-5.5 drops in Codex.
I wonder if it's the same model and they just keep adding more post-training.
The rumor was that 5.5 is a brand new pretrain. But who knows, it's 2x as expensive as 5.4, so it would check out.
Anthropic is really tiny, and Google is just being Google, their models are just to show that they're hip with what the kids are doing.
They aren't new models.
I'd really like to see improvements like these:
- Some technical proof that data is never read by OpenAI.
- Proof that no logs of my data or derived data are saved.
etc...
82.7% on Terminal Bench is crazy
Is it? There are 5 other models near ~80% and it was achieved in March... which in AI-world seems like a century ago.
https://www.tbench.ai/leaderboard/terminal-bench/2.0
So according to the benchmarks somewhere in between Opus 4.7 and Mythos
GPT 5.4 is already better than Opus 4.7 to me. But, then again, Opus 4.7 is a massive disappointment. I hope they don't discontinue 4.6.
Depends on goals. For long free-form discussions I find Opus 4.7 Adaptive better/deeper than Opus 4.6 Extended. But the usual caveats apply: first week of use, and the token budget seems generous now on Max 5X.
I’ve had a great experience using Opus 4.7 in Cursor. Works for everything, including iOS frontend.
Cursor is what I daily-drive. 4.7 has been terrible for my mostly Python-driven work (whereas Opus 4.6 was literally revolutionary to me). Our frontend folks are also complaining.
I left a comment here with this sentiment https://news.ycombinator.com/item?id=47879896
> A playable 3D dungeon arena
Where's the demo link?
I am sceptical. The generations of models after 4o have become crappier and crappier. Hope this one changes the trend. 5.4 is unusable for complex coding work.
Good timing, I had just renewed my subscription.
Surprised to see SWE-Bench Pro only a slight improvement (57.7% -> 58.6%) while Opus 4.7 hit 64.3%. I wonder what Anthropic is doing to achieve higher scores on this - and also what makes this test particularly hard to do well in compared to Terminal Bench (which 5.5 seemed to have a big jump in).
There's an asterisk right below that table stating that:
> *Anthropic reported signs of memorization on a subset of problems
And Anthropic's Opus 4.7 release page also states:
> SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7’s margin of improvement over Opus 4.6 holds.
Was 4.7 distilled off Mythos (which got 77.8%)? Interesting how Mythos got 82% on Terminal-Bench 2.0 compared to 82.7% for GPT-5.5.
Also notice how they state just for SWE-Bench Pro: "*Anthropic reported signs of memorization on a subset of problems"
Is there anywhere I can try it? (I just stopped my Pro sub.) I was wondering if there is a playground or third party so I can just test it briefly.
Cannot see it in Codex CLI.
... sigh. I realize there's little that can be done about this, but I just got through a real-world session determining if Opus 4.7 is meaningfully better than Opus 4.6 or GPT 5.4, and now there's another one to try things with. These benchmark results generally mean little to me in practice.
Anyways, still exciting to see more improvements.
How does it compare to Mythos?
finally
Nice to see them openly compare to Opus-4.7… but they don’t compare it against Mythos which says everything you need to know.
The LinkedIn/X influencers who hyped this as a Mythos-class model should be ashamed of themselves, but they’ll be too busy posting slop content about how “GPT-5.5 changes everything”.
Are there faster mini/nano versions as well?
Not this time, no.
Usually, those get released a few weeks later.
I've stopped trusting these "trust me bro" benchmarks and just started going to LM Arena and looking for the actual benchmark comparisons.
https://arena.ai/leaderboard/code
I doubt this is representative of real-world usage. There is a difference between a few turns on a web chatbot and many-turn CLI usage on a real project.
This is not any better of a benchmark
I'm still using 5.3 in Codex. Are 5.4 and 5.5 better than 5.3 in concrete ways?
The benchmarks say so, but try it out with actual tasks and be the judge.
Is this the first time OpenAI compared their new release to Anthropic models? Previously they were comparing only to GPT's own previous versions.
ARC-AGI 3 is missing from this list - given that the SOTA before 5.5 was <1%, if I recall, I wonder if this didn't make meaningful progress.
It's a silly benchmark anyways.
If anyone tried it already, how do you feel?
Numbers look too good, wondering if it is benchmaxxed or not
It's possible that "smarter" AI won't lead to more productivity in the economy. Why?
Because software and "information technology" generally didn't increase productivity over the past 30 years.
This has long been known as Solow's productivity paradox. There are lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.
But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you lazier.
AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.
If you give AI a body... well, maybe that changes.
> "information technology" generally didn't increase productivity
Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - rest is pen and paper.
Productivity metrics were better when businesses were run on just pen and paper. Of course, there could be many confounding factors, but there are also many reasons why this could be so. Just a few hypotheses:
- Pen and paper become a limiting factor on bureaucratic BS
- Pen and paper are less distracting
- Pen and paper require more creative output from the user, as opposed to screens which are mostly consumptive
etc etc
It's quite possible the use of LLMs means that we are using less effort to produce the same output. This seems good.
But exerting less effort also conditions you to be weaker, and less able to connect deeply with the brain and grind as hard as you once did. This is bad.
Which effect dominates? Difficult to say.
Of course this is absolutely possible. Ultimately there was a time when physical exertion was a thing and nobody was overweight. That isn't the case anymore, is it?
Downvoted by the AI Nazis. They are running a tight ship before the IPOs.
I downvoted it because it doesn't add anything useful to the conversation, and I don't own any AI stock.
It's a hypothesis that "smarter" AI models, i.e. GPT-5.5, may not be a great boon to productivity. Given that this is the raison d'etre of AI models, and of improving them, I don't see why it is any less useful than any other discussion.
Not rolled out to my Codex CLI yet, but some users on Reddit are claiming it's on theirs.
Next up: Google I/O on May 19?
I have to imagine they'll go to Gemini 3.5 if only for marketing reasons.
they are using ethical training weights this time!!! /j
Two hundred pages of shilling and it’s a 1% improvement in the benchmarks. They’re dead in the water.
Imagine spending 100m on some of these AI “geniuses” and this is the best they can do.
the attenuation of man nears
< 5 years until humans are buffered out of existence tbh
may the light of potentia spread forth beyond us
Great model, I have been using Codex and it's awesome. Let's see what GPT-5.5 does to it.
I just can't bear to use services from this company after what they did to the global DRAM markets.
I'm not trying to make any kind of moral statement, but the company just feels toxic to me.