Gemini 3 has been a letdown, tbh; it seems agentic coding wasn't a top priority.
I'm sticking with Codex for now and using Gemini 3 for frontend.
Have you found that Gemini is better than Codex for front end generation? I'm trying to bring some Figma screens into a small React project I have, and Codex will occasionally screw up the implementation despite the fact that I'm using the MCP server.
"Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces."
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
Gemini is still the best oracle/planner by a mile. It's just a bad agent. Give it a bundle of your repo and get it to plan your changes, then hand it off to codex to implement.
I still want something no one has, which is the ability to launch agents in different git worktrees simultaneously and check the results out on my main branch for testing when they are finished.
It adds minimal overhead for agent orchestration (it's just bash/TypeScript), since its main focus was adding enhancements to Codex: double-redundant checkpoints via git and jj (lessons learned from Codex being git reset --hard happy), something like Claude skills (just a bunch of .md files that steer it toward a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if Codex waits a long time), and blacklisted commands during yolo mode (rm -rf and git reset banned even if it by some small chance tries to run them). MIT licensed.
You can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what is going on; parallel is probably best for dealing with tests and UI.
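For the worktree-per-agent part specifically, the mechanics are small enough to sketch. This is a rough illustration only; `run_agent` is a stand-in for whatever non-interactive agent CLI you actually drive, and the task list is made up:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tasks: branch name -> prompt for the agent.
TASKS = {
    "fix-auth": "Fix the login redirect bug and add a regression test.",
    "tidy-css": "Clean up the styles for the settings page.",
}

def run_in_worktree(branch: str, prompt: str) -> str:
    path = f"../wt-{branch}"
    # Each agent gets its own checkout on its own branch.
    subprocess.run(["git", "worktree", "add", path, "-b", branch], check=True)
    # "run_agent" is a placeholder for codex/claude/etc. in non-interactive mode.
    subprocess.run(["run_agent", prompt], cwd=path, check=False)
    return branch

with ThreadPoolExecutor() as pool:
    for branch in pool.map(lambda item: run_in_worktree(*item), TASKS.items()):
        # Review each result from the main checkout when it finishes.
        print(f"{branch} done; inspect with: git diff main...{branch}")
```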
My observation has been that Codex tends to hit logical/data-driven/back-end tasks out of the park while doing weird, random nonsense with even simple UI tasks. This could be me needing to improve how I phrase my prompts, but it will be interesting to see if it's improved in that arena at all.
I rarely used Codex compared to Claude because it was extremely slow in GitHub Copilot. Like maybe 2-5x slower than Claude Sonnet. I really wish they just made their models faster rather than "better".
Very interesting to see the range of people's preferences. I would almost always prefer smart over fast; I have all my LLMs set to all-thinking-all-the-time.
It's a balance; I haven't felt like Codex provided anything that Sonnet 4.5 didn't. Why wait longer to get the same results?
Though that does bring up an interesting point. Anecdotally, Sonnet does a lot more grep-ing while Codex reads files straight up. Might be the difference in speed and maybe smarter models will do better. Once this model is on copilot, I can test it out.
GPT-5 was recently updated to make it more "thinking" and "warmer" or whatever, and now a task (semantically compare these two short files) that used to take 5 seconds and reliably produce useful, consistent output takes 90 seconds to "think" (while its thinking output makes it pretty clear there is zero thinking happening) and produces a completely differently structured output every single time, making the tool not only slower and more expensive to use, but worse at a simple task that LLMs should be very good at.
There's an option to "get a quick answer", and I hoped clicking that would revert to the previous performance; instead, it ignores that I uploaded two files and asks me to upload the files.
Literally the only real good task I've found for these dumb things, and they still found a way to fuck it up because they need to keep the weirdos and whales addicted. It's now almost easier to go back to comparing these files by eye, or to just bite the bullet and finally write a few lines of Python to actually do it right and reliably.
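(For what it's worth, the "few lines of Python" version really is short; a minimal sketch using only the standard library, though it produces a plain textual diff rather than the semantic comparison the model was being asked for:)

```python
import difflib
import sys

# Usage: python diff_files.py old.txt new.txt
a_path, b_path = sys.argv[1], sys.argv[2]
with open(a_path) as fa, open(b_path) as fb:
    diff = difflib.unified_diff(
        fa.readlines(), fb.readlines(), fromfile=a_path, tofile=b_path
    )
sys.stdout.writelines(diff)
```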
Somewhat related, after seeing the praise for codex in the Sonnet 4.5 release thread I gave it a go, and I must say, that CLI is much worse than Claude Code (even if the model is great, I’m not sure where the issue really lies between the two).
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems to be really weird. I'm not sure how exactly it works, but: (1) it uses very few tokens / fills up the context slowly (good, I guess); (2) it doesn't seem to actually internalize the contents of files you mention to it, or that it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to codebase patterns and its architecture, while codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
Sure, Claude Code has better UX, but honestly it's hard to get any good amount of usage out of the subscriptions vs what Codex offers at the same price.
With Claude I'm constantly hitting rate limits, while Codex gives me substantially more, and "slow" isn't really a problem for me as long as it keeps working.
The only complaint I have is that Codex itself is more usage-limited now (either due to the outstanding GitHub issues around tools or throttling on their end) compared to a few months ago.
The true magical moment was Codex Pro letting me run swarms of agents day in, day out without any worries about rate limits; it truly felt unlimited.
If Claude manages to release a smaller model or some way to deal with the rapidly depleting usage limits (this is the top complaint on Reddit, and they eventually just stopped allowing threads about it), it would definitely be used more.
But for now, Codex is clearly the workhorse, with Claude used side by side.
Well as I said, codex didn’t adhere to codebase standards for me and the code quality was worse (very defensive), so even after waiting longer, results weren’t there for me.
But the subscription thing is a non-issue for me as I use the API, and mostly use Claude Code synchronously, with the occasional rare background agent.
I love how programming discussions du jour have basically devolved into "really? my socks definitely smell better after using 2 scoops of last month's soap. what spin cycle are you using?"
I have been using GPT-5 High Fast in Cursor primarily over Codex, because Codex seems to take way longer and generally annoys me by doing strange CLI stuff, but hopefully I can switch to this new one. I also tried it against Gemini 3 Pro in Cursor, and it's hard to tell, but at least in some cases I felt like GPT-5 was giving better results.
I would love to see all the big players put 1% of the effort they put into model training into making the basic process of paying and signing in suck less.
Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
Agree 1,000%.
I just won't even waste my time with the Google stuff cuz I can't figure out how to pay for it.
And that's a problem everywhere at Google. Our Google Play account is suspended cuz I can't verify the company. It won't let me cuz it says I'm not the owner. I've always been the owner of my company. For 18 years. There is no one else.
Once, some error said to make sure the owner email matches your profile in Google Payments, and I was like: what is Google Payments, and where do I even begin with that? I've never paid for Google Play, so what does Payments have to do with anything?
It's totally random stuff. Get your shit together, Google. Make your products and payment systems coherent, rather than them obviously looking like they were designed by a fiefdom full of territorial managers.
> designed by a fiefdom full of territorial managers
What's harder than herding cats? Herding cats with MBAs and OKRs.
The "Owner" accounts in Google Play and Apple's App Store are so freaking annoying. The only time they make sense is for solo-founders and even then I've had issues. Now expand it to working at a larger company and it's a joke, a bad one. Oh sure, I'll just get the CEO (or other higher-up) to login and accept new agreements, that will be easy. Even more fun when you tell a client (who logged in exactly 1 time to set up the account) that they need to use a generic email (not a personal one or an employee-specific one), the ignore your suggestion, and then they can't get back in because the person who set up the account left the company. It's a mess.
Also, re "Google Payments", I tried to transfer an app from my personal/solo Google Play account to a new business one I set up for my LLC and it was like pulling teeth. They wanted me to find some payment id from the original $20 purchase I made to get access to Google Play, something I did right around when they first launched and while I still have/use the same email, Google came out with approximately 1 googol different "payment solutions" in the interim and their engineers don't care about data migrations. Finally, after many support emails, they just transferred it without me giving that code which just shows how silly the whole thing was from the start.
Can relate. My inactive google ads account all of a sudden got banned. No explanation except some generic link to their terms of service. Appealed, got automatic denial, no reason given. Have retried multiple times, same result
Google keeps changing their privacy and “don’t train on my data/code” options. When gemini-cli launched, there was a clear toggle for “don’t train on my code.” That’s now gone; it just links to a generic privacy page for me. Maybe something with my account changed, I can't figure it out. Deep in the Cloud Gemini console, there’s another setting that might control training, but it’s not clear what products it actually covers.
Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra personal subscription? I already pay for OpenAI and Anthropic’s pro/max plans and would happily pay Google too. But the only obvious option is a $250/month tier, and its documentation indicates Google can train on your code unless you find and enable the correct opt-out. If that opt-out exists in all the products, it’s not obvious where it lives or what products it applies to.
Workspace complicates it further. Google advertises that with business Workspace accounts your data isn't used for training. So I was going to try Antigravity on our codebase. At this point I know I can't trust Google, so I read the ToS carefully. They train on your prompts and source code, and there doesn't appear to be a way to pay them and opt out right now. Be careful: paying for Google Workspace does not protect you; always read the ToS.
Be careful with AI-studio and your Google Workspace accounts. They train on your prompts unless you switch it to API mode.
The result is a lot of uncertainty. I genuinely have no idea how to pay Google for Gemini without risking my code being used for training. And if I do pay, I can’t tell whether they’ll train on my prompts anyway.
The marketing for their coding products does not clearly state when they do or do not train on your prompts and code.
I had to run deep research to understand the risks with using Gemini 3 for agentic work, and I still don't feel confident that I understand the risks. I might have said some incorrect things above, but I am just so confused. I feel like I have a <75% grasp on the situation.
I don't have a lot of trust. And honestly, this feels confusing and deceptive. One could easily mistake it for a deliberate strategy to gather training data through ambiguity and dark patterns; it certainly looks like this could be Google's strategy to win the AI race. I assume this is just how it looks, and that they aren't being evil on purpose.
OpenAI, in particular, has my trust. They get it. They are carefully building the customer experience; they are product- and customer-driven from the top.
At this point I’m not convinced that Gemini 3 Pro was post-trained on data Google had permission to use, going by the myriad of issues on the Gemini CLI tracker around Google AI/Google One/Google Cloud/Google Workspaces.
https://github.com/google-gemini/gemini-cli/issues/12121
It is far too easy to accidentally end up under the wrong privacy agreement, to the point of where some workplaces are banning use of the Gemini CLI!
Couldn't agree more about the Google product offerings. Vertex AI? AI Studio? Maker Studio? Gemini? The documentation is fragmented, with redundant offerings making it confusing to determine what is what. Google Cloud billing is complicated to figure out vs OpenAI or Anthropic billing.
The sad part is that Google does offer a ChatML/OpenAI-compatible endpoint for LLM calls, and I believe in an experiment they also reduced the friction of getting an API key so you can start making calls right away, but discoverability remains an ever-present challenge with Google services.
I've just found myself using OpenRouter if we need Google models for a project, it's worth the extra 5% just not to have to deal with the utter disaster that is their product offering.
Adding to this, Google's models can only be used with GCP, while OpenAI's models can be used with Azure and Anthropic's models with AWS Bedrock, in addition to their own platforms.
I'd love to see the Gemini models become available from other providers :) or for Google to just build a simple prepaid wallet like OpenAI and Anthropic have.
Didn't realize these stipulations for the models. Looking at devops-y job descriptions over the last few months, I noticed nearly everyone has some kind of Azure requirement now (which I've mostly avoided because I don't want to end up managing someone's AD), but is OpenAI the actual reason for it?
We're just using GitHub Copilot as our primary entrypoint for all of the model families. It's the only way we can easily offer our devs some level of Claude, Gemini, and Codex all in one place.
Last night, just after Gemini 3 was released and became available for Gemini-CLI, I saw Gemini-CLI's team post that you could access Gemini 3 with either an API key OR with _Gemini AI Ultra_, so I thought: great, I'll get that!
Now you CAN NOT get the Google One stuff if your account is part of a workspace. I thought: how awful. I want to pay, but I simply can't?
Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license via the Google Workspace Admin area, great!
Turns out: you fucking can't. That's _Google AI Ultra FOR BUSINESS_ and that IS NOT supported.
So I had to get the Google One subscription on my personal account after all.
Combine that with the _pathetic_ usage limits: somehow not token-based, but a number of requests per 24-hour window (500 for Gemini 3), plus Gemini 3's incredible chattiness (it uses A LOT more requests to get something done compared to Claude), and you hit the usage limits in just 2 hours.
Careful, their ToS makes it clear they train on your Antigravity prompts (even on AI Ultra) and there is no opt-out that I can find.
And stop asking for phone numbers for "fraud prevention" when I've already given you my name, address and credit card.
The fun one for me is that I moved countries and, last I checked, there's still no way to change your phone number on ChatGPT short of making a new account, so now my account is associated with a phone number that I no longer have access to and that will eventually be reassigned to someone else.
> Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
Please give me an option for a password (or passkey) or literally anything else that doesn't require either linking with google or going through an email flow for every login
Such great case studies of how LLM coding will make all of your employees 1000x more productive at coding, design, and UX. They really are leading the way showing us into the brighter future of AI software /s
I've been using a lot of Claude and Codex recently.
One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to convolute some solution that was only convoluted because of some sentence I threw into the instructions and had completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles and I need a fast iterative loop. Codex is much worse at that because it takes like 5 minutes to validate everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write some script to verify that what it did works, and let it spin for 30-40 minutes.
I've been really impressed with Codex so far. I have been working on a flight simulator hobby project for the last 6 months and finally came to the conclusion that I need to switch from floating origin, which my physics engine assumes with the coordinate system it uses, to a true ECEF coordinate system (what underpins GPS). This involved a major rewrite of the coordinate system, the physics engine, even the graphics system and auxiliary stuff like asset loading/unloading etc. that was dependent on local X,Y,Z. It even rewrote the PD autopilot to account for the changes in the coordinate system. I gave it about a paragraph of instructions with a couple of FYIs and... it just worked! No major graphical glitches except a single issue with some minor graphical jitter, which it fixed on the first try. In total it took about 45 minutes, but I was very impressed.
I was unconvinced it had actually, fully ripped out the floating origin logic, so I had it write up a summary and then used that as a high-level guide to pick through the code, and it had, as you said, followed the instructions to the letter. Hugely impressive. In March of 2023, OpenAI's products struggled to draw a floating wireframe cube.
> Claude basically disregards your instructions (CLAUDE.md) entirely
A friend of mine tells Claude to always address him as "Mr Tinkleberry". He says he can tell Claude is not paying attention to the instructions in CLAUDE.md when it stops calling him "Mr Tinkleberry" consistently.
Yep, it's David Lee Roth's brown M&M trick https://www.smithsonianmag.com/arts-culture/why-did-van-hale...
Highly recommend adding some kind of canary like this in all LLM project instructions. I prefer my instructions to say 'always start output with a (uniquely decided by you) emoji', as it's easier to visually scan for one when reading a wall of LLM output, and I use a different emoji per project because what's life without a little whim?
> Codex will rewrite the entire V8 engine to break arithmetic.
This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.
I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.
In my AGENTS.md (which CLAUDE.md et al soft link to), I instruct them to, on phase completion, "Explicitly write that you followed these guidelines." This text always shows up on Codex and very rarely on Claude Code (TBF, Claude Code is doing it more often lately).
Thinking level medium: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...
Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
Medium has things dialed in. When both high and low are coherent but medium goes to cubism? That’s intent. Or it had a miscue on proportions vs shape placement. Either way, it’s great, sandwiched the way it is, between the other two. Did it put a comment in all of them or just the one w/ the hat?
Also, thanks for the posts— it’s hugely helpful to have a continuity of insightful perspective throughout.
Rest assured that we are better at training models than naming them ;D
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
I currently use GPT‑5.1-Codex High and have a workflow that works well with the 5-hour/weekly limits, credits, et al. If I use GPT‑5.1-Codex-Max Medium or GPT‑5.1-Codex-Max High, how will that compare cost / credits / limits wise to GPT‑5.1-Codex High? I don't think that's clear. "Reduced tokens" makes me think it'll be priced similarly / lower. But, "Max" makes me think it'll be priced higher.
It would be great to have access to this model via the chat interface, even if it was gated behind the "other models" dropdown or something.
did you address this https://github.com/openai/codex/issues/6426 ?
How much more token-efficient is this compared to 5.0? I had to use 5.0 because 5.1 was eating tokens like crazy and seemed like a slight incremental improvement, barely noticeable.
Looks like a great change! I'll take it for a spin in a moment.
I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of: research -> implementation plan -> actual implementation (based on research + plan) -> validation.
I have multiple subagents that I use for each phase that (based on subjective judgement) improve the output quality (vs keeping everything, every tool use, etc. in the "main" context window).
Codex CLI is great and I use it often, but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available; hopefully we'll get more features for managing context.
So the context window is still 400k, but the model got good at removing irrelevant context?
Codex is an outstanding product and incremental upgrades are always welcome. I'll make sure to give it a try in the coming days. Great work! :)
Will -minis come for the codex family of models? About two months ago I used 5-mini as a daily driver for a few weeks and quite liked it, it seemed capable enough on small tasks with some hand holding and the speed/price were great as well.
codex-mini was released a couple of weeks ago: https://platform.openai.com/docs/models/gpt-5.1-codex-mini
Thanks! I somehow missed that. Will check it out.
Sorry, I don't like the Max model; it feels like it needs a lot more guiding. The plans it writes, however, are better, so I tried feeding them back in (meta-prompt style) and it's working okay so far. Very large repository.
Compaction is just what Claude Code has done forever, right?
I think the point here is not that it does compaction (which Codex also already does) - but that the model was trained with examples of the Codex compaction, so it should perform better when compaction has taken place (a common source for drops in performance for earlier models).
Codex previously did only manual compaction, but yeah, maybe some extra training for compaction, too?
I am also trying to understand the difference between compaction, and what IDEs like Cursor do when they "summarize" context over long-running conversations.
Is this saying that said summarization now happens at the model level? Or are there other differences?
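For comparison, the tooling-level compaction that agent harnesses already do looks roughly like this (a minimal sketch with caller-supplied token counting and summarization functions, not OpenAI's or Cursor's actual implementation). The claim in the announcement appears to be that the new model is additionally trained to keep performing well across these compaction boundaries:

```python
def compact_if_needed(messages, count_tokens, summarize,
                      context_limit=400_000, keep_recent=20):
    """Fold older history into a summary once the context nears its limit.

    messages:     list of {"role": ..., "content": ...} dicts
    count_tokens: callable(list) -> int, e.g. a tokenizer-based counter
    summarize:    callable(list) -> str, e.g. one extra LLM call
    """
    if count_tokens(messages) < int(context_limit * 0.9):
        return messages  # still comfortably inside the window
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)
    return [{"role": "system",
             "content": f"Summary of earlier conversation:\n{summary}"}] + recent
```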
OpenAI likes to time their announcements alongside major competitor announcements to suck up some of the hype. (See for instance the announcement of GPT-4o a single day before Google's IO conference)
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
GPT-5.1 / Codex already beats Gemini 3 on SWE-Bench Verified and Terminal Bench, and this pushes the gap further. Seems like a decent improvement.
That’s how the game is played. We should be grateful for all the competition that is driving these improvements, not whinging about the realities of what companies have to do to contest each other’s position.
Gemini is eating their lunch, and OpenAI knows it.
it's really getting old
These 2 sentences right next to each other stood out to me:
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
Absolutely contradictory. The long-running tendency for Codex is why I cannot understand the hype around it: if you bother to watch what it does and read its code the approaches it takes are absolutely horrifying. It would rather rewrite a TLS library from scratch than bother to ask you if the network is available.
These things are actually fixable with prompting. Is it easy? No. Is it PEBKAC if you don't do anything to change course as it builds a TLS library? Yes, but paperclip maximized! xD
Or you can have a model with some semblance of common sense that will stop and say, "Hey, can I have access to the network to do X?"
Codex feels like a tool designed to run after all the humans are gone.
(Disclaimer: Am on the Codex team.) We're basically trying to build a teammate that can do short, iterative work with you, and then, as you build trust (and configuration), you can delegate longer tasks to it.
The "# of model-generated tokens per response" chart in [the blog introducing gpt-5-codex](https://openai.com/index/introducing-upgrades-to-codex/) shows an example of how we're improving the model good at both.
If you haven't, give Cursor's Composer model a shot. It might not be quite as good as the top models, but in my experience it's almost as good, and the lightning fast feedback is more than worth the tradeoff. You can give it a task, wait ten seconds, and evaluate the results. It's quite common for it to not be good enough, but no worse than Sonnet, and if it doesn't work you just wasted 30 seconds instead of 10 minutes.
I really would prefer them to start creating customized models.
I've vibe coded Godot games extensively.
Just about every model I've tried likes to invent imaginary functions.
I would really prefer for there to be a way for me to pick a model trained on whatever framework I need.
Reviewing AI generated code feels like editing a long book, and every now and then you notice some words are just completely made up. You then ask the AI to fix its book, and it will just add more AI generated words.
On one hand I want this to be a reality check to everyone who's trying to lay off real software engineers to replace us with AI.
On the other hand half of the stock market is held up by overhyped AI valuations. If the tide goes out too fast, and there is a mass realization that this stuff just isn't as good as it's hyped to be, it's not going to be fun for anyone.
I had this problem 2 years ago. All the models were telling me to use libraries that hadn't been invented yet.
That was annoying back then, but these days that's not so much of a problem.
You can write your program and then simply have it invent the library as well, while it's at it! ;)
I've found writing an MCP server with access to the docs cloned locally does wonders.
Add the documentation to the context window in that case, a bit of context engineering.
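A minimal sketch of that kind of docs-lookup MCP server, assuming the official Python `mcp` SDK's FastMCP helper; the docs path, file extension, and tool name here are made up for illustration:

```python
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

DOCS_DIR = Path("./godot-docs")  # hypothetical local clone of the framework docs
mcp = FastMCP("local-docs")

@mcp.tool()
def search_docs(query: str, max_hits: int = 5) -> str:
    """Return snippets from the locally cloned docs that mention the query."""
    hits = []
    needle = query.lower()
    for path in DOCS_DIR.rglob("*.rst"):
        text = path.read_text(errors="ignore")
        i = text.lower().find(needle)
        if i != -1:
            hits.append(f"{path}:\n...{text[max(0, i - 100):i + 200]}...")
            if len(hits) >= max_hits:
                break
    return "\n\n".join(hits) or "No matches."

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport, which agent CLIs can launch directly
```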
This is a tangent: Has anyone noticed that GPT-5.0 at some point started producing much faster, crappier answers, then 5.1 made it slower + better again? (Both in Thinking mode)
I did notice that, I thought maybe I’d exceeded my thinking requests
Weird how they only share three hand-picked evals, ignoring the evals where they were left in the dust, like ARC-AGI-2. This post is so misleading that I don't even know whether to trust the numbers they did share. One is just a fraction of a percentage point away from Gemini 3 Pro, which is awfully convenient for marketing and easy to hide. Very open, OpenAI.
Not really that weird. This isn't intended to be a "general" model. This is a coding model, so they showed the coding evals. The assumption would be that, relative to GPT-5.1, non-coding evals would likely regress or be similar.
Like when advertising the new airliner, most people don't care about how fast it taxis.
I've been dealing with Codex CLI for a while and I love it, but I'm wondering if my thinking is just limited. While I'm starting discussions and creating plan docs, I've never been able to ask it to do anything that takes it longer than 25 minutes or so. Usually far less. I'm having trouble imagining what I can ask it to do that would make it take hours - like, wouldn't that require putting together an absolutely massive planning doc that would take hours to put together anyway? I'd rather just move incrementally.
Easy way to get an agent to run a long time is just to get it to babysit CI/CD, tell it to iterate on it until it passes. I got Sonnet 4 to run for >6 hours that way.
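Concretely, that babysitting setup can be as dumb as a loop. A rough sketch, where `make test` stands in for your CI command and `run_agent` for whatever non-interactive agent invocation you use (both are assumptions, not any product's documented interface):

```python
import subprocess

AGENT_CMD = ["run_agent"]  # placeholder: your agent CLI in non-interactive mode
MAX_ROUNDS = 20

for round_num in range(1, MAX_ROUNDS + 1):
    ci = subprocess.run(["make", "test"], capture_output=True, text=True)
    if ci.returncode == 0:
        print(f"Tests green after {round_num - 1} fix round(s).")
        break
    prompt = (
        "The test suite is failing. Fix the code (not the tests) so it passes.\n\n"
        + ci.stdout[-4000:]
        + ci.stderr[-4000:]
    )
    subprocess.run(AGENT_CMD + [prompt], check=False)
else:
    print(f"Giving up after {MAX_ROUNDS} rounds.")
```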
Perhaps they're combining an incredibly complex product that has a lot of interactive features, a big codebase, test creation, and maybe throwing some MCP stuff in there, such as creating a ticket in Jira if a test fails?
The graph showing higher performance for fewer thinking tokens is really interesting!
It would be even more interesting to see how Sonnet and Haiku compare with that curve.
Is "compaction" a trained-in feature of the model, or just tooling around the model calls? Agents already do compaction.
> Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
Attention is quadratic, so you have to pick a cutoff for context window size. In addition, the error/noise in state space increases with longer contexts, resulting in poorer performance. So even if you're willing to take the O(n^2) slowdown of a larger context window, it still won't work.
> Attention is quadratic
Exactly. Standard multi-head attention materializes a score matrix that grows to roughly 4B entries (64K x 64K) per head for a 64K sequence, just as a starting point. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1 TB/s of memory bandwidth to stay compute-bound in practice, even with this optimization.
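Back-of-the-envelope version of that, counting score-matrix entries for a single head in a single layer (FlashAttention avoids materializing the full matrix, but the pairwise compute is still quadratic):

    # Attention scores form a seq_len x seq_len matrix per head per layer,
    # so memory for the scores alone grows quadratically with context length.
    for seq_len in (8_192, 65_536, 131_072):
        entries = seq_len * seq_len           # one head, one layer
        gib_fp16 = entries * 2 / 2**30        # 2 bytes per fp16 score
        print(f"{seq_len:>7} tokens -> {entries / 1e9:6.2f}B entries, ~{gib_fp16:6.1f} GiB in fp16")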
So there has been a lot of research in this area, and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity, and if you go fully linear you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good compromise.
MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation.
DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-linear by only computing "interesting" pairs. For example, at 128K context length this requires only 10-20% of attention pairs to be materialized.
Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention. The speedup is from a delta rule, which basically amounts to caching in a hand-wavy way.
There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well.
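To put a number on the sliding-window trade-off: the attended-pair count drops from roughly n^2/2 to about n*w for window size w, which is where both the speedup and the lost fidelity come from (the sizes below are illustrative):

    # Attended-pair counts: full causal attention vs a sliding window of size w.
    n, w = 131_072, 4_096
    full_pairs = n * (n + 1) // 2                          # causal: each token attends to all earlier ones
    window_pairs = sum(min(i + 1, w) for i in range(n))    # each token sees at most w tokens
    print(f"full: {full_pairs / 1e9:.1f}B pairs, window: {window_pairs / 1e9:.2f}B pairs "
          f"({window_pairs / full_pairs:.1%} of full)")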
> due to context-window limits
The context window is not some physical barrier but rather just the attention getting saturated. What did I get wrong here?
In theory, auto-regressive models shouldn't have a limit on context: they generate the next token conditioned on all previous tokens.
In practice, when training a model, people pick a context window so that during inference you know how much GPU memory to allocate per prompt, and can reject any prompt that exceeds that limit.
Of course performance also degrades as context gets longer, but I suspect the memory limit is the primary reason we have context window limits.
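The memory side is easy to estimate: the KV cache, which is what actually has to be reserved per request, grows linearly with the context window. The config numbers below are made up for a large-ish dense model, not any specific deployment:

    # Rough KV-cache size for one request:
    # 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value.
    # Layer/head/dim numbers are illustrative, not a real model config.
    layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
    for ctx in (32_768, 131_072, 1_000_000):
        gib = 2 * layers * kv_heads * head_dim * ctx * bytes_fp16 / 2**30
        print(f"{ctx:>9} tokens -> ~{gib:7.1f} GiB of KV cache")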
> what did i get wrong here?
You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.
Ask a frontier LLM what a context window is, it will tell you.
Parent is likely thinking of sparse attention which allows a significantly longer context to fit in memory
It's a fair question, even if it might be coming from a place of misunderstanding.
For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).
[1] It still uses a quadratic router, but the router is small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929
So this was Arctic Fox, it seems. A lot of us ended up downgrading to Codex 5.0 because the token burn was too much. I see Codex Max is a step up, which is welcome, but I'm still unsure whether they solved that GitHub issue around tool use that impacts tokens.
After being burned by 5.1, I'm going to wait and see before I upgrade back to 0.58.
Gemini 3 has been a letdown, tbh; seeing that agentic coding wasn't a top priority, I'm sticking with Codex for now and using Gemini 3 for frontend.
Have you found that Gemini is better than Codex for front end generation? I'm trying to bring some Figma screens into a small React project I have, and Codex will occasionally screw up the implementation despite the fact that I'm using the MCP server.
"Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces."
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
Codex CLI 0.59 got released (but has no changelog text)
https://github.com/openai/codex/releases/tag/rust-v0.59.0
Gemini 3 had a great 24 hour SOTA run for coding
Gemini is still the best oracle/planner by a mile. It's just a bad agent. Give it a bundle of your repo and get it to plan your changes, then hand it off to codex to implement.
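If you don't already have a bundler, the "bundle" can be as crude as concatenating every git-tracked text file into one blob to paste into the planner. A throwaway sketch, no particular bundling tool assumed:

    # Crude repo bundler: concatenate git-tracked text files into one blob
    # to hand to a planner model. Skips binaries and unreadable entries.
    import subprocess
    from pathlib import Path

    files = subprocess.run(["git", "ls-files"], capture_output=True, text=True).stdout.splitlines()
    bundled = 0
    with open("repo_bundle.txt", "w") as out:
        for f in files:
            try:
                text = Path(f).read_text()
            except (UnicodeDecodeError, OSError):
                continue  # binary file, submodule stub, or missing path
            out.write(f"\n===== {f} =====\n{text}")
            bundled += 1
    print(f"bundled {bundled} of {len(files)} tracked files into repo_bundle.txt")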
I still want something no one has, which is the ability to launch agents in different git worktrees simultaneously and check the results out on my main branch for testing when they are finished.
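The launch half of that is scriptable today. A rough sketch that fans tasks out to fresh worktrees and runs an agent in each; it assumes `codex exec` works non-interactively in your setup, and reviewing/merging the branches back is still on you:

    # Sketch: one git worktree + branch per task, agent launched in each in parallel.
    # Assumes `codex exec <prompt>` runs non-interactively; review/merge is manual.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    tasks = {
        "fix-flaky-tests": "Find and fix the flaky tests in tests/",
        "add-retry-logic": "Add retry with backoff to the HTTP client",
    }

    def run(name, prompt):
        path = f"../wt-{name}"
        subprocess.run(["git", "worktree", "add", "-b", f"agent/{name}", path], check=True)
        subprocess.run(["codex", "exec", prompt], cwd=path)
        return name

    with ThreadPoolExecutor() as pool:
        for done in pool.map(lambda kv: run(*kv), tasks.items()):
            print(f"{done} finished; check out agent/{done} to test it")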
There are lots of tools that do this; I ended up going down this rabbit hole wanting something that could just plug into Codex instead of requiring a fork:
http://github.com/agentify-sh/10x
It does minimal-overhead agent orchestration (it's just bash/TypeScript). Its main focus was adding enhancements to Codex: double-redundant checkpoints via git and jj (lessons learned from Codex being git reset --hard happy), something like Claude skills (just a bunch of MDs that steer it toward a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if Codex hangs for a long time), and a blacklist of commands during YOLO mode (rm -rf and git reset are banned even if there's only a small chance it would run them). MIT licensed.
You can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what's going on; parallel is probably best for dealing with tests and UI.
Your link is a 404.
I think I’ve described how I achieve kinda your desired workflow in a comment yesterday [0].
[0]: https://news.ycombinator.com/item?id=45970668
Ha! Very interesting how slept-on jj is.
It's been essential to my workflow as well.
I use both jj and git, and jj is great for just creating a snapshot that I can revert to in case a run fails.
I'm still exploring it to see what else I can do with it for agentic use.
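For anyone who hasn't tried it, the snapshot-before-agent pattern is only a couple of jj commands. This is how I understand it, so double-check against the jj docs before relying on it:

    # Snapshot-before-agent pattern with jj, driven from Python for illustration.
    # Commands reflect my understanding of jj; verify against its docs.
    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    sh("jj", "new", "-m", "checkpoint before agent run")     # start a fresh change
    agent = subprocess.run(["codex", "exec", "refactor the parser"], check=False)
    if agent.returncode != 0:
        sh("jj", "restore")   # discard the agent's working-copy edits (back to the parent state)
        # `jj op log` / `jj op restore <id>` give finer-grained recovery if needed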
Would this be similar to how Charlie and Jules work?
Cursor has this too
Woah, the METR results look impressive. Still looking exponential.
My observation has been that Codex tends to hit logical/data-driven/back-end tasks out of the park while doing weird, random nonsense with even simple UI tasks. This could be me needing to improve how I phrase my prompts, but it will be interesting to see if it's improved in that arena at all.
I rarely used Codex compared to Claude because it was extremely slow in GitHub Copilot, like maybe 2-5x slower than Claude Sonnet. I really wish they just made their models faster rather than "better".
OpenAI doesn't want you to use their models outside of their own products, which is why the API and integrations like Github Copilot are super slow.
Very interesting to see the range of people's preferences. I would almost always prefer smart over fast; I set all my LLMs to be all-thinking-all-the-time.
It’s a balance. I haven’t felt like Codex provided anything that Sonnet 4.5 didn’t, so why wait longer for the same results?
That does bring up an interesting point, though. Anecdotally, Sonnet does a lot more grep-ing while Codex reads files straight up. That might be the difference in speed, and maybe smarter models will do better. Once this model is on Copilot, I can test it out.
GPT-5 was recently updated to make it more "thinking" and "warmer" or whatever, and now a task (semantically compare these two short files) that used to take 5 seconds and reliably produce useful, consistent output takes 90 seconds to "think" (while its thinking output makes it pretty clear there is zero thinking happening) and produces a completely differently structured output every single time. That makes the tool not only slower and more expensive to use, but worse at a simple task that LLMs should be very good at.
There's an option to "get a quick answer," and I hoped clicking that would bring back the old behavior; instead it ignores that I uploaded two files and asks me to upload the files.
Literally the only really good task I've found for these dumb things, and they still found a way to fuck it up because they need to keep the weirdos and whales addicted. It's now almost easier to go back to comparing the files by eye, or to just bite the bullet and finally write a few lines of Python to actually do it right and reliably.
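For what it's worth, the "few lines of Python" version really is only a few lines if a plain structural diff is enough (the semantic comparison was the part the LLM was supposed to add):

    # Plain structural comparison of two files with difflib; deterministic and instant.
    # (Only a textual diff, not the semantic comparison the LLM was being asked for.)
    import difflib, sys

    a, b = sys.argv[1], sys.argv[2]
    with open(a) as fa, open(b) as fb:
        diff = difflib.unified_diff(fa.readlines(), fb.readlines(), fromfile=a, tofile=b)
    sys.stdout.writelines(diff)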
Have you tried Mistral? Definitely one of the fastest models.
My employer doesn’t offer/allow anything besides the “traditional” offerings on GitHub copilot.
500 Internal Server Error.
ditto. Also OpenAI vector stores are down right now across the board
Somewhat related, after seeing the praise for codex in the Sonnet 4.5 release thread I gave it a go, and I must say, that CLI is much worse than Claude Code (even if the model is great, I’m not sure where the issue really lies between the two).
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems really weird. I'm not sure exactly how it works, but: 1. It uses very few tokens / fills up the context slowly (good, I guess). 2. It doesn't seem to actually internalize the contents of the files you mention to it, or the ones it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to codebase patterns and its architecture, while codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
Sure, Claude Code has a better UX, but honestly it's hard to get any good amount of usage out of the subscriptions vs. what Codex offers at the same price.
With Claude I'm constantly hitting rate limits, while with Codex I get substantially more, and "slow" isn't really a problem for me as long as it keeps working.
The only complaint I have is that Codex itself is usage-limited now (either due to the outstanding GitHub issues around tools or throttling on their end) compared to a few months ago.
The true magical moment was Codex Pro letting me run swarms of agents day in, day out without any worries about rate limits; it truly felt unlimited.
If Claude manages to release a smaller model or some way to deal with the rapidly depleting usage limits (this is the top complaint on Reddit, and they eventually just stopped allowing threads about it), it would definitely get used more.
But for now Codex is clearly the workhorse, with Claude used side by side.
Well as I said, codex didn’t adhere to codebase standards for me and the code quality was worse (very defensive), so even after waiting longer, results weren’t there for me.
But the subscription thing is a non-issue for me as I use the API, and mostly use Claude Code synchronously, with the occasional rare background agent.
It’s good but Gemini 3 beats it.
Sizeable if veracious!
The new detergent now washes even whiter
I love how programming discussions du jour have basically devolved into "really? my socks definitely smell better after using 2 scoops of last month's soap. what spin cycle are you using?"
Come on folks, this is funny. They also have industrial strength laundromats to go with the detergent.
I have been using GPT 5 High Fast in Cursor primarily over Codex, because Codex seems to take way longer and generally annoy me by doing strange CLI stuff, but hopefully I can switch to this new one. I also tried it against Gemini 3 Pro in Cursor and it's hard to tell but at least in some cases I felt like GPT5 was giving better results.
All I care about is performance on the METR benchmark.
That was quick
My first thought was "they must not be seeing as many Claude Code conversions as they hoped"
Whenever one of them ships a milestone release, the rest start publishing big milestones too. I'm waiting for Opus 5 next.
So they all release before the Nvidia numbers tonight. The real question is: How well can Nvidia hide the circular deals in the books?
Here we go again....
Sigh. Time to try it again I guess. I give OpenAI way more chances than it deserves.