I write detailed specs. Multifile, with example code, in Markdown.
Then I hand them over to Claude Sonnet.
Even with hard requirements listed, I found that the generated code missed requirements, contained duplicate code, and even added unnecessary data wrangling (mapping objects into new objects of narrower types that were never needed), along with tests that fake results or work around failures just to pass.
So it turns out I'm not writing code; I'm reading lots of it.
What I knew first-hand even before Gen AI is that writing code is the easy part. Reading code, understanding it, and building a mental model of it is far more labour intensive.
So with Gen AI I need more time and effort than I did before, because I have to read a lot of code, understand it, and ensure it adheres to the mental model I have.
Hence, at the price point Anthropic offers, Gen AI is a net negative for me. I'm not vibe coding; I'm building real software that real humans depend on, and my users deserve better attention and focus from me. I'll be cancelling my subscription shortly.
Writing detailed specs and then giving them to an AI is not the optimal way to work with AI.
That's vibecoding with an extra documentation step.
Also, Sonnet is not the model you'd want to use if you're trying to vibecode complex projects. Opus or GPT-5.5 are the only ways to even attempt this.
> Therefore I need more time and effort with Gen AI than I needed before
Stop trying to use it as all-or-nothing. You can still make the decisions, call the shots, write code where AI doesn't help and then use AI to speed up parts where it does help.
That's how most non-junior engineers settle into using AI.
Ignore all of the LinkedIn and social media hype about prompting apps into existence.
I must be doing something very different from everyone else, but I write what I want and how I want it, Opus 4.7 plans it for me, and then I carefully review. Often I need to validate and check things, and sometimes I've revised the plan multiple times. Then comes implementation, which I still use Opus for, because I get a warning that my current model holds the cache, so Sonnet shouldn't implement. And honestly, I'm mostly within my Pro subscription; granted, I also have ChatGPT Plus, but I've mostly only used that as the chat/quick-reference model. It does take some time to read and understand everything, and a lot of the time I make manual edits too.
Or just don't use AI to write code. Use it as a code reviewer assistant along with your usual test-lint development cycle. Use it to help evaluate 3rd party libraries faster. Use it to research new topics. Use it to help draft RFCs and design documents. Use it as a chat buddy when working on hard problems.
I think the AI companies all stink to high heaven and the whole thing being built on copyright infringement still makes me squirm. But the latest models are stupidly smart in some cases. It's starting to feel like I really do have a sci-fi AI assistant that I can just reach for whenever I need it, either to support hard thinking or to speed up or entirely avoid drudgery and toil.
You don't have to buy into the stupid vibecoding hype to get productivity value out of the technology.
You of course don't have to use it at all. And you don't owe your money to any particular company. Heck for non-code tasks the local-capable models are great. But you can't just look at vibecoding and dismiss the entire category of technology.
I feel like I'm using Claude Opus pretty effectively and I'm honestly not running up against limits in my mid-tier subscriptions. My workflow is more "copilot" than "autopilot", in that I craft prompts for contained tasks and review nearly everything, so it's pretty light compared to people doing vibe coding.
The market-leading technology is pretty close to "good enough" for how I'm using it. I look forward to the day when LLM-assisted coding is commoditized. I could really go for an open source model based on properly licensed code.
> the day when LLM-assisted coding is commoditized
Like yesterday? LLM-assisted coding is $100/mo. It looks very commoditized when most households in the developed world pay more than that for electricity.
My definition of LLM-assisted coding is that you fully understand every change and every single line of the code. Otherwise it's vibe coding. And I believe if one is honest to this principle, it's very hard to deplete the quota of the $100 tier.
> fully understand every change and every single line of the code.
I'm probably just not being charitable enough to what you mean, but that's an absurd bar that almost nobody meets, even for fully handwritten code. Nothing would get done if they did. But again, my emphasis is on the fact that I'm probably just not being charitable to what you mean.
Well, that is how it mostly worked until recently... unless the developer copied and pasted from Stack Overflow without understanding much. Which did happen.
I also use it this way and I'm overall pretty happy with it, but it feels like they really want us to use it in "autopilot" mode. It's like they have two conflicting priorities of "make people use more tokens so we can bill them more" and "people are using more tokens than expected, our pricing structure is no longer sustainable"
(but I guess they're not really conflicting, if the "solution" involves upgrading to a higher plan)
I feel like they are making it harder to use it this way. Encouraging autonomous use is one thing, but it really feels more like they are handicapping engaged use. I suspect it reflects their own development practices and needs.
This is something I've thought about as well. The way the caps are implemented really disincentivizes engaged use. The 5-hour window especially is very awkward and disruptive. The net result is that I have to somewhat plan my day around the 5-hour window. That by itself is a powerful disincentive from using Claude. It has also caused me to use different tools for things I previously would have used Claude for. For example, for detailed plans I now use Codex rather than Claude, because I hit the limit far too fast when doing documentation work. It certainly doesn't hurt that Codex seems to be better at it, but I wouldn't even have a Codex subscription if it weren't for Claude's usage limits.
Another big one for me is that they dropped the cache TTLs. It is normal for me to come back to a session an hour later, but someone "autopilot"-ing won't have such gaps.
Not just the cache, though. Every time you stop and come back, it basically reloads the whole session. If you just let it keep going, it counts as one smooth run. You hit the wall faster for actually checking its work.
Do you have any good resources on how to work like that? I made the move from "auto complete on steroids" to "agents write most of my code". But I can't imagine running agents unchecked (and in parallel!) for any significant amount of time.
I think the culty element of AI development is really blinding a lot of these companies to what their tools are actually useful for. They're genuinely great productivity enhancers, but the boosters are constantly going on about how it's going to replace all your employees, and it's just... not good for that! And I don't mean "not yet"; I mean I don't see it ever getting there barring some major breakthrough on the order of inventing a room-temp superconductor.
I agree with you: the "replacing people" narrative is not only wrong, it's inflammatory and brand suicide for these AI companies, who don't seem to realize (or just don't care about) the kind of buzz saw of public opinion they're walking straight toward.
That said, looking at the way things work in big companies, AI has definitely made it so one senior engineer with decent opinions can outperform a mediocre PM plus four engineers who just do what they're told.
Similar here with the copilot-not-autopilot usage. I find it's the best of them all. Mostly I just use it as an occasional search engine. I've never found LLMs to be efficient at actually doing work. I do miss the days when tech docs were usable. Claude seems like a crutch for gaps in developer experience more than anything.
Same. Never hit a limit. Use it heavily for real work. Never even thought of firing off an LLM for hours of...something. Seems like a recipe for wasting my time figuring out what it did and why.
I have Max 5x and use only Claude Opus on xhigh mode. I don't use agents, or even MCPs, and stick to Claude Code.
I find it incredibly difficult to saturate my usage. I'm ending the average week at 30-ish percent, despite this thing doing an enormous amount of work for (with?) me.
Now I will say that with pro I was constantly hitting the limit -- like comically so, and single requests would push me over 100% for the session and into paying for extra usage -- and max 5x feels like far more than 5x the usage, but who knows. Anthropic is extremely squirrely about things like surge rates, and so on.
I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. Part of it is the ex-girlfriend thing, where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI just got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
I was in the same boat until the last few days, when just a handful of queries were enough to saturate my 5h session in about 30 minutes.
Recently I've gotten Qwen 3.6 27b working locally and it's pretty great, but it still doesn't match Opus; I've got to check out that new DeepSeek model sometime.
If only OpenAI spent a significant amount of money on some kind of generative software that was predominantly trained on internet comments that'd be able to do all the astroturfing for them...
Yea, I never got how people are even able to hit the weekly limits so consistently. Maybe it's because they use it for work? But in that case, you would expect the employer to cover it so idk.
>I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI got done pay $100M for some unknown podcaster and start hiring people to write this stuff online.
A lot of people are angry about the whole openclaw situation. They are especially bitter that when they attempted to justify exfiltrating the OAuth token to use for openclaw, nobody agreed that they had the right to do so, and people sided with Anthropic's position that different limits for first-party use are standard. So they create threads like this and complain about some opaque reason why Anthropic is finished (while still keeping their subscriptions, of course).
Fascinating, but a bummer that DeepSeek does not offer a DPA or an opt-out from training. This renders it unusable for my use cases, unfortunately. At least z.ai's GLM has something of a DPA in Singapore.
Honestly, it sounds like, assuming you have no ethical qualms, you could get by with a Mac or an AMD 395+ and the newest models, specifically QWEN3.5-Coder-Next. It does exactly what you describe. It maxes out around 85k context, which, if you do a good job providing guardrails, etc., is enough for a small-to-medium project.
It does seem like the sweet spot between WALL-E and the destroyed Earth in WALL-E.
This is what worries me. People become dependent on these GenAI products that are proprietary, not transparent, and need a subscription. People build on them like they're a solid foundation. Then all of a sudden the owner pulls the foundation out from under your building.
But these products are all drop-in replacements for each other. I've recently favored Codex over CC, just because the rate limits got mildly annoying. I really didn't have to change anything about my workflow in doing that.
> But these products are all drop in replacements for each other
For now. That doesn't really change the risk, that just means they are all hyper competitive right this moment, and so they are comparable. If one of them becomes king of the hill, nothing stops them from silently degrading or jacking prices.
The only shield is to not be dependent in the first place. That means keeping your skills sharp and being willing to pass on your knowledge to juniors, so they aren't dependent on these things.
Of course, many people are building their business on huge AI scaffolding. There's nothing they can do.
I'm curious - why for now? This stuff is practically commoditized. Trying to think of anything that ever successfully got back into proprietary land from there.
The thing is that AI is still more akin to a glorified autocomplete than something that can really supersede your skills. Proprietary model suppliers are constantly trying to obscure this basic underlying fact, without much success (much of the unpredictable shifts you see in proprietary AI behavior ultimately boils down to this); so it becomes far more crystal-clear when using open models that really are a pure commodity.
It doesn't look commoditized to me, it looks subsidized. It looks like everyone is trying to be "the one" and running as competitively as possible until the others fail. Commoditized would imply these services are all going to mellow into a stable state and mostly compete on price. I don't think that's happening. These aren't paper clips, they are courting governments and trying to pull the ladder up behind them. That's why both Anthropic and OpenAI are preaching doomsday and trying to build a moat with regulations.
Fair. I have high hope for local inference, feel like right now it is simply cost prohibitive to get the hardware. It will be interesting to see what happens.
It feels more and more like OpenAI/Anthropic aren't the future but Qwen, Kimi, or DeepSeek are. You can run them locally, but that isn't really the point; it's about democratization of service providers. You can run any of them on a dozen providers with different trade-offs/offerings, OR locally.
They won't ever be SOTA due to money, but "last year's SOTA" at 1/4 the cost or less may be good enough. More quantity, more flexibility, at lower edge quality. It can make sense: a 7% dumber agent TEAM vs. a single objectively superior super-agent.
That's the most exciting thing going on in that space. New workflows opening up not due to intelligence improvements but cost improvements for "good enough" intelligence.
Open Source isn't even within 50% of what the SOTA models are. Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.
I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.
It's like Photoshop vs. Gimp. Not only is the Gimp UX awful, but it didn't even offer (maybe still doesn't?) full bit depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. Gimp is entirely a no-go in a professional setting.
Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.
But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.
Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.
I think that there will come a point when open source models are "good enough" for many tasks (they probably already are for some tasks; or at least, some small number of people seem happy with them), but, as you suggest, it will likely always (for the foreseeable future at least) be the case that closed SOTA models are significantly ahead of open models, and any task which can still benefit from a smarter model (which will probably always remain some large subset of tasks) will be better done on a closed model.
The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.
Unless you are getting outside of your comfort zone and taking a month off from your $200 subscription every other month, I can't see how you can make the universal claim that the open-weights models are all 50% as good. Just today DeepSeek released a new model, so nobody knows how that will compare; a week ago it was Gemma 4, etc. I'm okay with you making a comparison, but state the model and the timeframe in which it was tested that you are basing your conclusions on.
People pirate Photoshop and Office if they don't want to pay for them, making them as "free" as GIMP. If there is a free option, people will use it. Never underestimate the cheapskates.
> Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
I'm not disagreeing per se, but if you think the benchmarks are flawed and "my real-world usage" is more reflective of model capabilities, why not write some benchmarks of your own?
You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.
> Benchmarks are toys, real world use is vastly different...Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters.
This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.
Also, this space will be (and perhaps already is, for some of us) an arms race. Sure, you can go local, but hosted will always be able to offer more, and if you want to be competitive, you'll need to be using the most capable models.
IMO it's a different and new model. We're engineers, and we're rich. It's not going to be good enough for us. But the much larger market by far is all the people who used to HAVE to work with engineers. They now have optionality; the pendulum is going to swing.
> Open Source isn't even within 50% of what the SOTA models are.
When was the last time you used any of them? A lot of people are actively using them for 9-to-5 work today; I count myself in that group. That opinion feels outdated, like it was formed a year or more ago and held onto, or based on highly quantized versions and/or small non-Thinking models.
Do you really think Qwen3.6, to take a specific example, is "50%" as good as Opus 4.7? Opus 4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole; the true difference is difficult to measure exactly, but sub-10% for their top-tier Thinking models is likely.
Their opinion is behind on LibreOffice, too. I won't defend GIMP's monstrosity, but I finished a whole dissertation, do all my regular spreadsheet work (that isn't done via R), and have created plenty of visual mockups with LibreOffice. Plus, I don't have to deal with a spammy Windows environment.
Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.
Qwen3.6 at which model size and quantization? I already think Opus 4.6 is usable but still dumb as bricks. A 20% cut from that feels like it would still be unusable. And that's not even getting into the annoyance of setting everything up to run locally and getting hardware that can run it, which basically means a MacBook M4 these days, as the x86 side is ridiculously pricey for decent performance out of models.
> Open Source isn't even within 50% of what the SOTA models are
Who said so? GLM 5.1 is 90% of Opus, at least. Some people are quite happy with Kimi 2.6 too. I haven't tried DeepSeek 4 yet, but I'm also hearing it's as good as Opus. You might be confusing open-source models with local models. It is not easy to run a 1.6T model locally, but they are not at 50% of SOTA models.
Maybe for folks who are deep into this, but it's not exactly accessible. I tried reading up on it a couple of months ago, but parsing through what hardware I needed, which model and how to configure it (model size vs. quantization), and how I'd get access to the hardware (which, for decent results in coding, runs $4k-$10k new, last I checked) involved a non-trivial barrier to entry. I was trying to do this over a long weekend and ran out of time. I'll have to look into it again, because having the local option would be great.
Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).
$10K should be enough to pay for a 512GB RAM machine which in combination with partial SSD offload for the remaining memory requirements should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends whether MoE weights have enough locality over time that the SSD offload part is ultimately a minor factor.
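To give a concrete (hypothetical) shape to that setup: llama.cpp memory-maps the GGUF file by default, so any weights beyond physical RAM get demand-paged in from SSD, which is exactly the partial-offload behavior described above. A sketch, where the model filename and thread count are illustrative placeholders, not a tested recipe:

```shell
# llama.cpp mmaps the model file by default, so weights that exceed
# physical RAM are demand-paged from SSD ("partial SSD offload").
# The model path and thread count are placeholders for your own setup.
llama-server -m ./kimi-2.6-q4_k_m.gguf --ctx-size 16384 --threads 32
```

Whether this is workable hinges on exactly the locality question above: if the active MoE experts stay warm in RAM across tokens, the SSD is a minor factor; if every token touches cold experts, the SSD becomes the bottleneck.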
(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)
Feasibility on commodity hardware would be the true milestone. Running high-end computers is the only way to get decent results at the moment, but if we can run inference on the CPUs, NPUs, and GPUs in everyday hardware, the moat should disappear.
You can already run inference on ordinary hardware but if you want workable throughput you're limited to small models, and these have very poor world-knowledge.
I've been using local AI via LM Studio ever since I canceled my Claude subscription. It's obviously slower than Claude on my M1 Studio[†], but like someone else said, I use AI more like a copilot than an autopilot. I'm pretty enthused that I can give it a small task and let it churn through it for a few minutes, while I work on something alongside – all for free with no goddamned arbitrary limits.
[†] The latest Qwen 3.6 whatever has been a noticeable improvement, and I'm not even at the point where I tweak settings like sampling, temperature, etc. No idea what that stuff does, I just use the staff picks in LM Studio and customize the system prompts.
Indeed, I feel like we are in the early-computer-equivalent phase of AI, where giant expensive hardware is still required for frontier models. In 5 years I bet there will be fully open models we'll be able to run on a few $1000 of consumer hardware with performance equivalent to Opus 4.7/4.6.
I think intelligence per compute will go up significantly in the coming years, while the cost per compute will drop significantly. No way to know for sure, so I guess we'll see
Sure, but local AI is still a black box. Models can be influenced by training-data selection, poisoning, hidden system prompts, etc. That recent WordPress supply-chain hack goes to show that the rug can still be pulled even if the software is FOSS.
I love how it's just a tacit understanding that these companies' entire MO is to carve out a territory, get everyone hooked on the good stuff and then jack up the price when they're addicted and captured -- literally the business plan of crack dealers, and it's just business as usual in the tech industry.
I was recently introduced to the term "vcware", à la shareware or vaporware, to describe these products: "Don't use that, it's vcware; enshittification is coming soon."
Not really. The hardware requirements remain indefinitely out of reach.
Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.
The memory requirements aren't that intense. You can run useful (not frontier) models on a $2-5K machine at reasonable speeds. The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.
Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.
At least some of the investors in this tech are hoping for a monopoly position. They'd like to outspend the competition to get an insurmountable lead, at which point they can set their price.
But, so far, competition remains fierce. Anthropic still has the best tools for writing code. That lead is smaller than it's ever been, though. But, honestly, Opus 4.5 is when it got Good Enough. If Anthropic suddenly increased prices beyond what I'm willing to pay, any model that gives me Opus 4.5 or better performance is good enough for the vast majority of the work I do with agents. And, there are a bunch of models at that level, now maybe including some discount Chinese models. Certainly Gemini Pro 3.1 is on par with Opus 4.5. Current Codex is better than Opus 4.5 and close to Opus 4.7 (though I won't use OpenAI because I don't trust them to be the dominant player in AI).
I often switch agents/models on the same project because I like tinkering with self-hosted and I like to keep an eye on the most efficient way to work...which models wastes less of my time on silly stuff. Switching is literally nothing; I run `gemini` or `copilot` or `hermes` instead of `claude`. There's simply no deep dependency on a specific model or agent. They're all trying to find ways to make unique features for people to build a dependence on, of course, but the top models are all so fucking smart you can just tell them to do whatever thing it is that you need done. That feature could probably be a skill, whatever it is, and the model can probably write the skill. Or, even better, it could be actual software, also written by the model, rather than a set of instructions for the model to interpret based on the current random seed.
Currently, the only consistent moat is making the best model. Anthropic makes the best model and tools for coding, but that's a pretty shallow moat...I could live with several other models for coding. I'll gladly pay a premium for the best model and tools for coding, but I also won't be devastated if I suddenly don't have Claude Code tomorrow. Even open models I can host myself are getting very close to Good Enough.
Anthropic sells due to unrelenting pressure and unachievable demand > new owner cuts costs > models become worse > new owner sells > the capitalistic cycle wins > we, the people, suffer
This is why, despite enjoying all of this, I really want to focus on locally hosted models. If we don't host the technology ourselves, we're setting ourselves up for a hard fall down the line.
Until very recently, local models have been little more than brittle toys in my experience, if you're trying to use them for coding.
But lately I've been running Pi (minimal coding agent harness) with Gemma4 and Qwen3.6 and I've been blown away by how capable and fast they are compared to other models of their size. (I'm using the biggest that can fit into 24gb, not the smaller ones.) In fact, I don't really need to reach for Claude and friends much of the time (for my use cases at least).
Claude with Sonnet medium effort just used 100% of my session limit, some extra dollars, thought for 53 minutes, and said:
API Error: Claude's response exceeded the 32000 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
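The error at least names its own knob. Per the message, raising the cap before launching would look something like this (the 64000 is just an arbitrary example value, not a recommendation):

```shell
# Raise Claude Code's per-response output token cap, as the error
# message suggests; the specific value here is illustrative.
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
claude
```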
Just copy and paste the error back to Claude and you will be able to continue. I have seen this many times over the past few months. I thought it was related to the AWS Bedrock setup I have been using, but probably not.
You're using it within their high-usage window. I hope you're aware of this; if you use it outside the high-usage window, it's supposed to use less. But it does seem a little odd that Sonnet uses so much, even on Medium.
Please. This is a toy. A novel little tech-toy. If you depend on it now for doing your job then, frankly, you deserve to have your rug pulled now and then.
If you haven't found a way to use the tool constructively, keep trying.
If you haven't tried making it work for you, that's okay, but maybe try once more? It does work and adds value. It's a non-standard and weirdly flexible tool with limitations.
...but in retrospect, seeing how you finished your comment, maybe you really want to remain angry and misinformed.
I'm on Max 5x. No limit problems, but I am definitely feeling the decline. Early stopping and being hellbent on taking shortcuts are the main culprits, closely followed by overly optimistic (stale) caching (audit your hooks!).
All mostly mitigatable by rigorous audits and steering, but man, it shouldn't have to be.
So many of my coworkers and I have been struggling with a big cognitive decline in Claude over the last two months. 4.5 was useful and 4.6 was great. I had my own little benchmark: 4.5 could just about keep track of a two-way pointer-merge loop, whereas 4.6 managed a three-way merge and the 1M-context version managed k-way. And this ability to track braids directly helped it understand real production code, make changes, be useful, etc.
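(For reference, the k-way case of that benchmark is essentially the classic merge of k sorted lists, one cursor per list; this heap-based sketch is my own reference version of the task, not the exact prompt I give the model:)

```python
import heapq

def k_way_merge(lists):
    """Merge k sorted lists by tracking one cursor per list in a min-heap."""
    # Seed the heap with each non-empty list's head: (value, list index, cursor)
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, i, j = heapq.heappop(heap)
        merged.append(value)
        if j + 1 < len(lists[i]):
            # Advance only the cursor we just consumed from
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return merged

k_way_merge([[1, 4, 7], [2, 5], [3, 6, 8]])  # → [1, 2, 3, 4, 5, 6, 7, 8]
```

The tricky part for the model isn't the heap; it's keeping the k cursors straight while reasoning about which one advances, which is exactly where the degraded versions lose the thread.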
But then, two months ago, 4.6 started getting forgetful and making very dumb decisions, and so on. Everyone started comparing notes and realising it wasn't "just them". And 4.7 isn't much better; the last few weeks we keep having to battle the automatic effort-level downgrade and so on. So much friction as you think "that was dumb" and have to go check the settings again, only to see there has been some silent downgrade.
We all miss the early days of 4.6, which just shows you can have a good, useful model. LLMs can be really powerful, but in delivering them to the mass market Anthropic throttles and downgrades them into something not useful.
My thinking is that soon deepseek reaches the more-than-good-enough 4.6+ level and everyone can get off the Claude pay-more-for-less trajectory. We don’t need much more than we’ve already had a glimpse of and now know is possible. We just need it in our control and provisioned not metered so we can depend upon it.
Of course, it sucks when companies screw up... but at the same time, they "paid everyone back" by removing limits for a while, and (more importantly to me) they were transparent about the whole thing.
I have a hard time seeing any other major AI provider being this transparent, so while I'm annoyed at Claude ... I respect how they handled it.
Yes that was one issue. It’s not the general degradation I have been talking about though, which is ongoing.
I recall reading similar tales of woe with other providers here on HN. I think the gradual dialling back of capability as capacity becomes strained as users pile on is part of the MO of all the big AI companies.
My max20 sub is sitting unused since april mostly now, codex with 5.4 (and now 5.5) even with fast mode (= double token costs) is night and day. Opus is doing convincing failures and either forgets half the important details or decides to do "pragmatic" (read: technical debt bandaids or worse) silently and claims success even with everything crashing and burning after the changes. and point out the errors it will make even more messes. Opus works really well for oneshotting greenfield scopes, but iterating on it later or doing complex integrations its just unusable and even harmfully bad.
GPT 5.4+ takes its time, unprompted considers edge cases that in fact are correct, saves me subsequent error-hunting turns, and finally delivers. Plus no "this doesn't look like malware" or "actually wait" thinking loops for minutes over a one-liner script change.
My mental model for LLMs is that I don't expect them to chew gum and walk at the same time. Cleaning code up is a different task from building new functionality.
GLM always feels like it's doing things smarter, until you actually review the code. So you still need the build/prune cycle. That's my experience anyway.
AI services are only weakly incentivized to reduce token usage. They want high token usage; it makes you pay more. They are going to keep testing where the limit is: what is the maximum token usage before you get angry? All AI companies will continue to trade places on token use and cost as costs increase. We are frogs in tepid water, pretending it's a bath and that we aren't about to be boiled.
People said this about AWS too. "Why would they save you money??". It turns out that every time they reduce prices, they make more money, because more people use their services.
AI companies have the same incentive. Make it cheaper and people will use it more, making you more money (assuming your price is still above cost). And of course they have every reason to reduce their own costs.
I am betting on the fact that people will get increasingly frustrated with closed agent lock-ins. I built and open-sourced a Cline fork, https://github.com/dirac-run/dirac, with the sole focus on token efficiency, expecting that the closed-lock-in vendors will frustrate their users enough over time. Looking for contributors.
To an extent. That economic incentive stops making sense when a) capacity is an actual constraint and b) Anthropic is not a monopoly and is subject to pressure from competitors who are more user-friendly.
That's what I am thinking, too. It sounds like a conspiracy theory, but in the end Anthropic et al. benefit from models that don't finish their jobs. I recently read about this "over-editing phenomenon". The machine is never done. It doesn't want to be.
It's like dating apps. They don't want you to find a good match, because then you cancel the subscription.
Which works fine, right up until China releases a new DeepSeek model that's 85% as capable as an Anthropic or OpenAI premium model but costs a fraction of what either of those US companies are charging.
Yesterday was a realization point for me. I gave a simple extraction task to Claude Code with a local LLM and it "whirred" and "purred" for 10 minutes. Then I submitted the same data and prompt directly to the model via the llama_cpp chat UI and the model single-shotted it in under a minute. So obviously something is wrong with the coding agent or the way it talks to the LLM.
Now I'm looking for an extremely simple open-source coding agent. Nanocoder doesn't seem to install on my Mac and it brings node_modules bloat, so no. Opencode seems not quite open source. For now, I'm doing the coding agent's work myself and using the llama_cpp web UI. Chugging along fine.
Probably a silly idea, but I'll throw it into the mix - have your current AI build one for you. You can have exactly the coding agent you want, especially if you're looking for "extremely simple".
I got annoyed enough with Anthropic's weird behavior this week to actually try this, and got something workable up and running in a few days. My case was unusual: there's no Claude Code for BeOS, or for my older/ancient Macs, so it was easier to bootstrap and stitch something together if I really wanted a coding agent on those platforms. You'll learn a lot about how models actually work in the process, and about how much crazy, ridiculous bandaid patching is happening in Claude Code. Though you might also come to appreciate some of the difficulties that the agents/harnesses have to solve. (And to be clear, I'm still using CC when I'm on a platform that supports it.)
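A coding agent really can be tiny: the whole loop is "call the model, maybe run a command, feed the output back". Here's a minimal sketch in Python against llama.cpp's llama-server (OpenAI-compatible chat endpoint); the URL, the port, and the `RUN:` tool convention are all illustrative assumptions, not any real agent's design:

```python
import json
import re
import subprocess
import urllib.request

# llama-server's default port; the "RUN:" convention is a made-up protocol.
LLAMA_URL = "http://localhost:8080/v1/chat/completions"

SYSTEM = ("You are a coding agent. To run a shell command, reply with exactly "
          "'RUN: <command>' on the first line. Otherwise answer normally.")

def parse_command(reply: str):
    """Return the shell command the model asked to run, or None."""
    m = re.match(r"RUN:\s*(.+)", reply.strip())
    return m.group(1) if m else None

def chat(messages):
    """One round trip to the local llama-server (OpenAI-compatible API)."""
    body = json.dumps({"messages": messages}).encode()
    req = urllib.request.Request(LLAMA_URL, body,
                                 {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def agent(task: str, max_turns: int = 8) -> str:
    """The whole agent loop: ask, maybe run a command, feed the output back."""
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        cmd = parse_command(reply)
        if cmd is None:
            return reply  # no tool call: the model is done
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append({"role": "user",
                         "content": f"exit={out.returncode}\n{out.stdout}{out.stderr}"})
    return "stopped after max_turns"
```

Everything else a production harness adds (context compaction, permission prompts, diff-based edits, retries) is layered on top of this loop, which is where the bandaid patching mentioned above tends to live.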
As for the llama_cpp vs Claude Code delays: I've run into that too. My theory is that the API is prioritized over Claude Code subscription traffic. The API certainly feels way faster. But you're also paying significantly more.
I've noticed that the same Claude model will make logical errors at some times but not others. Claude's performance is highly temporal. There's even a graph! https://marginlab.ai/trackers/claude-code/
I haven't seen anyone mention this publicly, but I've noticed that the same model will give wildly different results depending on the quantization. 4-bit is not the same as 8-bit and so on in compute requirements and output quality. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
I'm aware that frontier models don't work in the same way, but I've often wondered if there's a fidelity dial somewhere that's being used to change the amount of memory / resources each model takes during peak hours v. off hours. Does anyone know if that's the case?
I'm not sure that graph shows a time-based correlation. The 60% line stays inside the 95% confidence interval. Is that not just a measurement of noise?
The timeline doesn't make any sense. How can you have subscribed a couple of weeks ago, have the problem start 3 weeks ago, and yet also have things go well for the first few weeks? Was this written by GPT 5.5?
The author is not a native English speaker it seems.
They might mean "a few weeks ago"; in their mind "a couple of weeks ago" might map not to its exact English sense but to the German "Vor ein paar Wochen" ("a few weeks ago").
Rest of the prose in the article seems to support the assumption.
I've been a fan since the launch of the first Sonnet model and big props for standing up to the government, but you can sure lose that good faith fast when you piss off your paying customers with bad communication, shaky model quality and lowered usage limits.
I've seen a post like this every week for the last 2 years. Are these models actually getting worse? Or do folks just start noticing the cracks as they use them more and more?
Shameless self-plug, but, also worried about silent quality regressions, I started building a tool to track coding-agent performance over time: https://github.com/s1liconcow/repogauge
Here is a sample report that tries out the cheaper models plus the newest Kimi 2.6 against the 5.4 'gold' test cases from the repo: https://repogauge.org/sample_report.
I also cancelled my subscription. The $20 Pro plan has become completely unusable for any real work. What is especially frustrating is that Claude Chat and Claude Code now share the exact same usage limits — it makes zero sense from a product standpoint when the workflows are so different. Even the $200 Max plan got heavily nerfed. What used to easily last me a full week (or more) of solid daily use now burns out in just a few days. Combined with the quality drop and unpredictable token consumption, it simply stopped being worth it.
Looking at Anthropic's new products I think they understand they don't really have a cutting edge other than the brand.
I tried Kimi 2.6 and it's almost comparable to Opus. Anthropic dropped the ball. I hope this is a sign that we are moving towards a future where model usage is a commodity, with heavy competition on price/performance.
I've been very happy using Codex in the VScode extension. Very high quality coding and generous token limits. I've been running Claude in the CLI over the last couple of months to compare and overall I prefer Codex, but would be happy with either.
I feel like almost everyone using AI for support systems is utterly failing at the same incredibly obvious place.
The first job of any support system—both in terms of importance and chronologically—is triage. This is not a research issue and it's not an interaction issue. It's at root a classification problem and should be trained and implemented as such.
There are three broad categories of interaction: cranks, grandmas, and wtfs.
Cranks are the people opening a support chat to tell you they have vital missing information about the Kennedy Assassination, or that they want your help suing the government over their exposure to Agent Orange when they were stationed at Minot. "Unfortunately I can't help with that. We are a website that sells wholesale frozen lemonade. Good luck!"
Grandma questions are the people who can't navigate your website. (This isn't meant to be derogatory, just vivid; I have grandma questions often enough myself.) They need to be pointed toward some resource: a help page, a kb article, a settings page, whatever.
WTFs are everything else. Every weird undocumented behavior, every emergent circumstance, every invalid state, etc. These are your best customers and they should be escalated to a real human, preferably a smart one, as soon as realistically possible. They're your best customers because (a) they are investing time into fixing something that actually went wrong; (b) they will walk you through it in greater detail than a bug report, live, and help you figure it out; and (c) they are invested, which means you have an opportunity for real loyalty and word-of-mouth gains.
What most AI systems (whether LLMs or scripts) get wrong is that they treat WTFs like they're grandmas. They're spending significant money building these systems just to destroy the value they get from the most intelligent and passionate people in their customer base doing in-depth production QC/QA.
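The routing shape being described is genuinely simple; the hard, valuable part is the classifier itself. A toy sketch, where the phrase lists are placeholders standing in for a real text classifier trained on labeled historical tickets:

```python
# Toy triage router. The phrase lists are illustrative stand-ins for a
# trained classifier; only the routing structure matters here.

CRANK_PHRASES = ["kennedy", "lawsuit", "agent orange"]
GRANDMA_PHRASES = ["password", "login", "where do i", "how do i"]

def triage(ticket: str) -> str:
    t = ticket.lower()
    if any(p in t for p in CRANK_PHRASES):
        return "crank"    # polite canned decline
    if any(p in t for p in GRANDMA_PHRASES):
        return "grandma"  # link the relevant help page / KB article
    # Everything else is a WTF: escalate to a smart human, fast.
    return "wtf"
```

Note the structural point: WTF is the *default* branch, so anything weird or unrecognized escalates to a human instead of getting a canned grandma answer — exactly the inversion most deployed systems get wrong.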
This rings true. However, I have used one AI support chat that didn't behave that way. I wish I could remember the vendor, but I do remember being blown away when it said something like "that sounds like a real problem, would you like me to open a support ticket for this?" Which it then did, and subsequently a human addressed my issue.
My recent frustration with Claude is that it feels like I'm waiting on responses more. I don't have historical latency numbers to compare against, but I feel like it has been getting slower. I may be wrong, and maybe it's just spending more time thinking than it used to. My guess is Anthropic is having capacity issues. I hope I'm wrong, because I don't want to switch.
There was a really good point in this podcast episode about the speed of LLMs: they are so slow that all of the progress messages and token streaming are necessary. But the core problem is that the technology is so darn slow.
As someone who both uses and builds this technology I think this is a core UX issue we’re going to be improving for a while. At times it really feels like a choose 2+ of: slow, bad, and expensive.
I’ve noticed most of the complaints are about the Pro plan. Anecdotally I pay for the $200 Max plan and haven’t noticed anything radically different re: tokens or thinking time (availability is still a crapshoot)
I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising. Since it’s technically usable, but not really.
> I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising
It's particularly noticeable when for a long time you could work an 8-hour day in Codex on ChatGPT's $20/month plan (though they too started tightening the screws a couple of weeks back).
I can agree. ChatGPT 5.5 made this a no-brainer choice. Anthropic are idiots removing Claude Code from the Pro plan. They need to ask Claude if what they did was a natural intelligence bug! Greed kills companies, too!
> I can agree. ChatGPT 5.5 made this a no-brainer choice
The new model that came out less than 24 hours ago made this obvious? This feels like when a new video game comes out and there's 1,000 steam reviews glazing it in the first hours of release. Don't you think you should use it for longer than a day before declaring it a game changer?
I feel like Anthropic is forcing their new model (Opus 4.7) to do much less guesswork when making architectural choices; instead it prefers to defer decisions back to the user. This is likely done to mine sessions for reinforcement-learning signals, which are then used to make their future models even smarter.
On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.
On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
Same. I think one of the issues is that Claude reached a threshold where I could just rely on it being good and fix it up manually less and less, while other models hadn't reached that point yet, so with them I knew I had to fix things up or do a second pass or more. Other providers also move you to a worse model after you run out, which is key in setting expectations as well. Developers knew that was the trade-off.
I think people still hated even the worse limits, but when you start to make the model dumber, on purpose or inadvertently, that's when there's really no reason to keep using Claude anymore.
If someone wants to move off Claude, what are the alternatives? More importantly, can another system pick up from where Claude left off, or is there some internal knowledge Claude keeps in its configuration that I need to extract before canceling?
I have been trying Qwen3.5-9B-Claude-4.6 locally for a couple of days now, coming from OMLX, either via Hermes or Continue in VS Code. It's okay-ish, even performance-wise.
Yep. I think the sentiment here isn't lagging much behind the day-to-day experience of what is being offered. Kind of makes HN very useful in this regard.
We've seen this sentiment shift on HN like 20 times in the past year, too often for it to be a real reflection of service quality. Feels more like people rooting for sports teams.
The services (OpenAI, Anthropic) are not wildly changing that much. People are just using LLMs more and getting frustrated because they were told it would change the world, and then they take it out on their current patron. Give it a month and we'll be hearing how far OpenAI has fallen behind.
Same. I am getting crazy good value from Claude at work, on both scientific applications and deployment environments.
There is one caveat: you have to give the model well-thought-out constraints to guide it properly, and absolutely take the time to read all the thinking it's doing, and don't be afraid to stop the process whenever things go sideways.
People who just let Claude roam free on their repository deserve everything they end up with.
Switched to local models after quality dropped off a cliff and token consumption seemed to double. Having some success with Qwen+Crush and have been more productive.
Would love some more info on how you got any local model working with Crush. Love charmbracelet but the docs are all over the place on linking into arbitrary APIs.
Obviously the context-window settings are going to depend on what you've got set on the llama-server/llama-swap side. Running multiple models on the same server is mostly only relevant if you're using llama-swap.
TL;DR is you need to set up a provider for your local LLM server, then register at least one model on that provider, then point the large and small models that Crush actually uses at that provider/model combo. Pretty straightforward, but agreed that their docs could be better for local-LLM setups in particular.
For me, I've got llama-swap running and set up on my tailnet as a [tailscale service](https://tailscale.com/docs/features/tailscale-services) so I'm able to use my local LLMs anywhere I would use a cloud-hosted one, and I just set the provider baseurl in crush.json to my tailscale service URL and it works great.
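To make that concrete, a crush.json along roughly these lines matches the setup described above. The field names are from memory and may not match the current Crush schema exactly, and the base URL and model ID are placeholders for your own llama-server/llama-swap (or Tailscale service) endpoint:

```json
{
  "providers": {
    "local": {
      "type": "openai",
      "base_url": "http://localhost:8080/v1",
      "api_key": "unused",
      "models": [
        { "id": "qwen2.5-coder-32b", "context_window": 32768 }
      ]
    }
  },
  "models": {
    "large": { "provider": "local", "model": "qwen2.5-coder-32b" },
    "small": { "provider": "local", "model": "qwen2.5-coder-32b" }
  }
}
```

The `api_key` is a dummy value since a local llama-server typically doesn't check it; check the Crush docs for the authoritative schema before relying on this.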
I ran prompts that used up a ton of usage and got no return, just an error.
I asked support: hey, I got nothing back; I tried prompting several times, used a ton of usage, and it gave no response. I'd just like the usage back. I never got what I paid for.
Just a bot response: we don't do refunds, no exceptions. Even when they don't serve you what your plan should give you.
You still have to search Stack Overflow and sift, but I'm surprised someone doesn't just make a TL;DR-style or shortcut-expanding product out of it, where you can just pop in code for this use case or that, vs. the current product spending a few datacenters' worth of energy to give you an answer that's only correct x% of the time.
These changes fixed some of the token issues, but the token bloat is an intrinsic problem to the model, and Anthropic's solution of defaulting to xhigh reasoning for Opus 4.7 just means you'll go through tokens faster anyways.
The problem is they changed people's default settings, and if you're like me, you keep a Claude Code session open for days, maybe weeks, even a month, and just come back to it and keep going. I wouldn't be surprised if there are hundreds if not thousands of people still on these broken configurations/models.
Dear Anthropic:
Please, for the love of all things holy, NEVER change someone's defaults without INFORMING the end user first, because you will wind up with people confused, upset, and leaving your service.
I used Opus via Copilot until December and then largely switched over to Claude Code. I'm not sure what the difference is but I haven't seen any of these issues in daily use.
The usage metering is just incredibly inconsistent: sometimes 4 parallel Opus sessions for 3 hours straight on max effort use up only 70% of a session; other times 20 minutes / 3 prompts in one session completely max it out. (Max x20 plan)
Is this a bug on Anthropic's side, or is the usage metering just completely opaque and arbitrary?
It's something strange, because I never have these issues. I often run two in parallel (though not all day), and generally have something running any time I look at my laptop, to advance the steps/tasks/etc. Usually I struggle to hit 50% on my Max20.
Heck, two weeks ago I tried my hardest to hit my limit just to make use of my subscription (I sometimes feel like I'm wasting it), and I still only managed to get to 80% for the week.
I do prune my context frequently, though; each new plan is a prune, for example, because I don't trust large context windows and their degradation. My CLAUDE.md files are also somewhat trim out of this same fear, and I don't use any plugins, only a couple of MCPs (LSP).
No idea why everyone seems to be having such wildly different experiences on token usage.
I don't get it. I use Claude Code every day, what I would consider pretty heavy usage...at least as heavy as I can use it while actually paying attention to what it's producing and guiding it effectively into producing good software. I literally never run into usage limits on the $100 plan, even when the bugs related to caching, etc. were happening that led to inflated token usage.
WTF are y'all doing that chews tokens so fast? I mean, sure, I could spin up Gas Town and Beads and produce infinite busy work for the agents, but that won't make useful software, because the models don't want anything. They don't know what to build without pretty constant guidance. Left to their own devices, they do busy work. The folks who "set and forget" on AI development are producing a whole lot of code to do nothing that needed doing. And, a lot of those folks are proud of their useless million lines of code.
I'm not trying to burn as many tokens as possible; I'm trying to build good software. If you're paying attention to what you're building, there are so many points where a human is in the loop that it's unusual to run up against token limits.
Anyway, I assume that at some point they have to make enough money to pay the bills. Everything has been subsidized by investors for quite some time, and while the cost per token is going down with efficiency gains in the models/harnesses and with newer compute hardware tuned for these workloads, I think we're all still enjoying subsidized compute at the moment. I don't think Anthropic is making much profit on their plans, especially with folks who somehow run right at the edge of their token limit 24/7. And, I would guess OpenAI is running an even lossier balance sheet (they've raised more money and their prices are lower).
I dunno. I hear a lot of complaining about Claude, but it's been pretty much fine for me throughout 4.5, 4.6 and 4.7. It got Good Enough at 4.5, and it's never been less than Good Enough since. And, when I've tried alternatives, they usually proved to be not quite Good Enough for some reason, sometimes non-technical reasons (I won't use OpenAI, anymore, because I don't trust OpenAI, and Gemini is just not as good at coding as Claude).
I'm torn, because I use it in my spare time, so I've missed some of these issues; I don't use it 9 to 5, but I've built some amazing things. When the 1-million-token context dropped, that was peak Claude Code for me; it was also when I suspect their issues started. I've built some things I'd been drafting in my head for ages but never had time for, and I can review the code and refine it until it looks good.
I'm debating trying out Codex; from some people I hear it's "uncapped", from others I hear they reached limits in short spans of time.
There's also the really obnoxious "trust me bro" documentation update from OpenClaw, where they claim Anthropic is allowing OpenClaw usage again, but there's no official statement?
Dear Anthropic:
I would love to build a custom harness that just uses my Claude Code subscription. I promise I won't leave it running 24/7, 365; can you please tell me how I can do this? I don't want to chase some obscure tweet; make official blog posts or documentation pages that reflect your policies.
Can I get whitelisted for "sane use" of my Claude Code subscription? I would love this. I am not dropping $2400 in credits for something I do for fun in my free time.
It sounds like we have very similar usage/projects. Codex had been essentially uncapped (via a combination of different x-factors between Plus and Pro, plus promotions) until very recently, when they copied Anthropic's notes.
Plus is still very usable for me, though. I have not tried Claude Pro in quite a while, and if people are complaining about usage limits I know it's going to be a bad time for me. I had to move up from Claude Pro when the weekly limits were introduced, because it was too annoying to schedule my life around 5-hour windows.
I started using Codex around December, when I started to worry I was becoming too dependent on Claude and needed to encourage competition. Codex wasn't particularly competitive with Claude until 5.4, but it has grown on me.
The only thing I really care about is that whatever I'm using "just works" and doesn't hit limits, and Claude Code has been flaky as all hell on multiple fronts ever since everyone discovered it during the Pentagon flap. So I tend to reach for ChatGPT and Codex at the moment, because they will "just work" and there's a good chance Claude will not.
Claude Code now has an official telegram plugin and cron jobs and can do 80% of the things people used OpenClaw for if you just give it access to tools and run it with --dangerously-skip-permissions.
I just noticed today that it doesn't warn about approaching limits and just blows straight into billing extra tokens.
I'm pretty sure it used to warn when you got close to your 5-hour limit, but no, it happily billed extra usage. Granted, only about $10 today, but over the span of like 45 minutes. Not super pleased.
Codex is becoming such a good product. I have the $100 pro lite. I still have Claude, but at $20, and I rarely use it. Let's see if they give generous limits and, more importantly, a model that's better than 5.5. The mythos fear-mongering did not give me a good impression that they care about the average developer.
Maybe this is an unpopular opinion, but I think choosing which companies to support during this period of pre-alignment is one way to vote which direction this all goes. I'm happy to accept a slightly worse coding agent if it means I don't get exterminated someday.
Imagine vibe coding your core consumer application and associated backend…
Oh wait, I don’t have to imagine. That’s what Anthropic does. A nice preview for what is in store for those who chose to turn off their brains and turn on their AI agents.
It also seems to me that they route prompts to cheaper, dumber models that present themselves as e.g. Opus 4.7. Perhaps that's what "adaptive reasoning" is: we'll route your request to something like Qwen and say it's Opus. Sometimes I get a good model, so I've taken to asking a difficult question first; if the answer is dumb, I terminate the session and start again, and only then go on to the real prompt. But there is no guarantee the model won't be downgraded mid-session. I wish they'd just charge the real price and stop these shenanigans. It wastes so much time.
You're describing a Taravangian prompt situation (a character in a book series who wakes up with a different/random intelligence level each day and has a series of tests for himself to determine which kind of decisions he's capable of that day). https://coppermind.net/wiki/Taravangian
Welcome to the future. Anthropic is currently speedrunning it, but this is what all LLM tools are going to look like in the next few years, once they turn the enshittification corner.
I've spent thousands of dollars on API tokens in the last few months. Out of my own pocket, as an indie contractor. I used the API specifically instead of Pro/Max/Plus/Silver/Gold/Platinum/Diamond to avoid all of the mess there regarding usage resets and potential hidden routing to worse models. It worked great for months, I got a ton of shit done, shipped a bunch of features. I really began to rely on the tech. I was not happy about the cost, but the value proposition was there.
Then within the last few months everything changed and went to shit. My trust was lost. Behavior became completely inconsistent.
During the height of Claude's mental retardation (now finally acknowledged by its creators) I had an incident where CC ran a query against an unpartitioned/massive BQ table that resulted in $5,000 in extra spend, because it scanned a table that should have been daily-partitioned 30 times, at 27 TB per scan. I recall going over and over the setup and exhaustively refining confidence. After I realized this blunder, I referred to it in the same CC session: "jesus fucking christ, I flagged this issue earlier". It responded: "you did. you called out the string types and full table scans and I said 'let's do it later.' That was wrong. I should have prioritized it when you raised it". Now obviously this is MY fault. I fucked up here, because I am the operator, and the buck stops with me. But this incident really drove home that the Claude I had come to vibe with so well over the last N months was entirely gone.
We all knew it was making mistakes, becoming fully retarded. We all felt and flagged this. When Anthropic came out and said, "yeah ... you guys are using it wrong, it's a skill issue", I knew this honeymoon was over. Then recently, when they finally came out and acknowledged more of the issues (while somehow still glossing over how badly they fucked up?), it was the final nail. I'm done spending money on the Anthropic ecosystem. I signed up for OpenAI Pro at $200/mo and will continue working on my own local inference in the meantime.
The thing that irks me the most is that I was paying full price. Literally spreading my cheeks wide open and saying, "here ya go Anthropic, all yours." And for actually close to a year it was great. Looking back, I see I began using it in March of 2025. That's roughly a year of integrating it into my day-to-day, pairing with it constantly, and, as I said, shipping really well-engineered, meaningful work.
They could have just kept doing this, literally printing money. Literally: do absolutely nothing, go on vacation, profit $$$. So why did so much change? I think the issue is they were trying to optimize CC for the monthly-plan folks, the ones who are likely losing the company money, and API users became collateral damage.
Same here. A single prompt burnt all my tokens for the day in 3 minutes. What happened to Claude in the last 2 months? I was happy with what they were providing and was happy to pay whatever for it. Why did they mess with it? Why are they destroying the tool we all loved?
I hate enshittification and I hate seeing this happening to Claude Code right now.
The great de-skilling programme continues in Anthropic's casino. They want you completely dependent on gambling tokens on their slot machines, with extortionate prices, fees, and limits.
Anthropic can't even scale their own infrastructure operations, because the compute simply isn't there; yet even while losing tens of billions they can still nerf models whenever they feel like it.
Once again, local models are the answer, yet Anthropic keeps getting you addicted to their casino instead of you running your own cheaper slot machine, which saves you money.
Every time you go to Anthropic's casino, the house always wins.
I would love to just say that if you are using Claude Code, you should not be on Pro. I feel like all the people complaining are complaining that an agent can't handle the work of a developer for $20/m. Get on at least Max 5; it's a world of difference.
It's impossible to justify the jump in expense unless you are directly working on something that makes you money. Messing around on a hobby project, doing some quick research, and getting personalized notifications was a no-brainer at $200 a year.
The product keeps getting worse so I will definitely evaluate options and possibly switch if management keeps screwing up the product.
For some reason you are being downvoted but I wanted to echo your sentiment. As someone who tries to switch things up for every next task, the productivity of Claude Max is worth every penny.
And I actually read the output to fix what I don't like, and ever since Opus 4.5 I've had to do that less and less. 4.6 had issues at the beginning, but that's because you have to manually make sure you change the effort level.
I found the perfect sweet spot for my hobby development. I pay 7 euros for Gemini plus and use it for creating the architecture and technical specs. Those are fed to Sonnet on Pro that just implements the instructions. This gives plenty of space to do long sessions several times a week.
I write detailed specs. Multifile with example code. In markdown.
Then hand over to Claude Sonnet.
With hard requirements listed, I found that the generated code missed requirements, had duplicate code, or even unnecessary data-wrangling code (mapping objects into new objects of narrower types when they won't be needed), along with tests that fake results and work around failures in order to pass.
So it turns out that I'm not writing code; I'm reading lots of code.
What I know first-hand from before Gen AI is that writing code is way easier. It's reading code, understanding it, and building a mental model of it that is far more labour-intensive.
Therefore I need more time and effort with Gen AI than I needed before, because I need to read a lot of code, understand it, and ensure it adheres to the mental model I have.
Hence Gen AI at this price point which Anthropic offers is a net negative for me because I am not vibe coding, I'm building real software that real humans depend upon and my users deserve better attention and focus from me hence I'll be cancelling my subscription shortly.
Writing detailed specs and then giving them to an AI is not the optimal way to work with AI.
That's vibecoding with an extra documentation step.
Also, Sonnet is not the model you'd want to use if you're trying to vibecode complex projects. Opus or GPT-5.5 are the only ways to even attempt this.
> Therefore I need more time and effort with Gen AI than I needed before
Stop trying to use it as all-or-nothing. You can still make the decisions, call the shots, write code where AI doesn't help and then use AI to speed up parts where it does help.
That's how most non-junior engineers settle into using AI.
Ignore all of the LinkedIn and social media hype about prompting apps into existence.
I must be doing something very different from everyone else, but I write what I want and how I want it and Opus 4.7 plans it for me, then I carefully review. Oftentimes I need to validate and check things, and sometimes I've revised the plan multiple times. Then implementation, which I still use Opus for because I get a warning that my current model holds the cache, so Sonnet shouldn't implement. And honestly, I'm mostly within my Pro subscription; granted, I also have ChatGPT Plus, but I've mostly only used that as the chat/quick-reference model. But yeah, it takes some time to read and understand everything, and a lot of the time I make manual edits too.
Or just don't use AI to write code. Use it as a code reviewer assistant along with your usual test-lint development cycle. Use it to help evaluate 3rd party libraries faster. Use it to research new topics. Use it to help draft RFCs and design documents. Use it as a chat buddy when working on hard problems.
I think the AI companies all stink to high heaven and the whole thing being built on copyright infringement still makes me squirm. But the latest models are stupidly smart in some cases. It's starting to feel like I really do have a sci-fi AI assistant that I can just reach for whenever I need it, either to support hard thinking or to speed up or entirely avoid drudgery and toil.
You don't have to buy into the stupid vibecoding hype to get productivity value out of the technology.
You of course don't have to use it at all. And you don't owe your money to any particular company. Heck for non-code tasks the local-capable models are great. But you can't just look at vibecoding and dismiss the entire category of technology.
I feel like I'm using Claude Opus pretty effectively and I'm honestly not running up against limits in my mid-tier subscriptions. My workflow is more "copilot" than "autopilot", in that I craft prompts for contained tasks and review nearly everything, so it's pretty light compared to people doing vibe coding.
The market-leading technology is pretty close to "good enough" for how I'm using it. I look forward to the day when LLM-assisted coding is commoditized. I could really go for an open source model based on properly licensed code.
> the day when LLM-assisted coding is commoditized
Like yesterday? LLM-assisted coding is $100/mo. It looks very commoditized when most households in the developed world pay more than that for electricity.
My definition of LLM-assisted coding is that you fully understand every change and every single line of the code. Otherwise it's vibe coding. And I believe if one is honest to this principle, it's very hard to deplete the quota of the $100 tier.
> fully understand every change and every single line of the code.
I'm probably just not being charitable enough to what you mean, but that's an absurd bar that almost nobody conforms to even if the code is fully handwritten. Nothing would get done if they did. But again, my emphasis is on the fact that I'm probably just not being charitable to what you mean.
Well, that is how it mostly worked until recently... unless the developer copied and pasted from Stack Overflow without understanding much. Which did happen.
It's a good point. To me this really comes down to the economics of the software being written.
If it's low-stakes, then the required depth to accept the code is also low.
I also use it this way and I'm overall pretty happy with it, but it feels like they really want us to use it in "autopilot" mode. It's like they have two conflicting priorities of "make people use more tokens so we can bill them more" and "people are using more tokens than expected, our pricing structure is no longer sustainable"
(but I guess they're not really conflicting, if the "solution" involves upgrading to a higher plan)
I feel like they are making it harder to use it this way. Encouraging autonomous is one thing, but it really feels more like they are handicapping engaged use. I suspect it reflects their own development practices and needs.
This is something I've thought of as well. The way the caps are implemented, it really disincentivizes engaged use. The 5-hour window especially is very awkward and disruptive. The net result is that I have to somewhat plan my day around when the 5-hour window will affect it. That by itself is a powerful disincentive from using Claude. It has also caused me to use different tools for things I previously would have used Claude for. For example, detailed plans I use codex now rather than Claude, because I hit the limit way too fast when doing documentation work. It certainly doesn't hurt that codex seems to be better at it, but I wouldn't even have a codex subscription if it wasn't for claude's usage limits
Another big one for me is that they dropped the cache TTLs. It is normal for me to come back to a session an hour later, but someone "autopilot"-ing won't have such gaps.
Not just the cache, though. Every time you stop and come back, it basically reloads the whole session. If you just let it keep going, it counts as one smooth run. You hit the wall faster for actually checking its work.
autopilot (yolo mode) is amazing and feels great, truly delegate instead of hand-holding on every step
Do you have any good resources on how to work like that? I made the move from "auto complete on steroids" to "agents write most of my code". But I can't imagine running agents unchecked (and in parallel!) for any significant amount of time.
I would also be interested on resources on "agents write most of your code" if you can share some.
I think the culty element of AI development is really blinding a lot of these companies to what their tools are actually useful for. They're genuinely great productivity enhancers, but the boosters are constantly going on about how it's going to replace all your employees, and it's just... not good for that! And I don't mean "not yet", I mean I don't see it ever getting there barring some major breakthrough on the order of inventing a room-temp superconductor.
I agree with you, the "replacing people" narrative is not only wrong, it's inflammatory and brand suicide for these AI companies who don't seem to realize (or just don't care) the kind of buzz saw of public opinion they're walking straight towards.
That said, looking at the way things work in big companies, AI has definitely made it so one senior engineer with decent opinions can outperform a mediocre PM plus four engineers who just do what they're told.
Similar with the copilot and not autopilot usage. I find it's the best of them all. Mostly I just use it as an occasional search engine. I've never found LLMs to be efficient at actually doing work. I do miss the days when tech docs were usable. Claude seems like a crutch for gaps in developer experience more than anything.
Same. Never hit a limit. Use it heavily for real work. Never even thought of firing off an LLM for hours of...something. Seems like a recipe for wasting my time figuring out what it did and why.
I have Max 5x and use only Claude Opus on xhigh mode. I don't use agents, or even MCPs, and stick to Claude Code.
I find it incredibly difficult to saturate my usage. I'm ending the average week at 30-ish percentage, despite this thing doing an enormous amount of work for (with?) me.
Now I will say that with pro I was constantly hitting the limit -- like comically so, and single requests would push me over 100% for the session and into paying for extra usage -- and max 5x feels like far more than 5x the usage, but who knows. Anthropic is extremely squirrely about things like surge rates, and so on.
I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI just got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
I was in the same boat until the last few days, when just a handful of queries were enough to saturate my 5h session in about 30 mins.
Recently I've gotten Qwen 3.6 27b working locally and it's pretty great, but it still doesn't match Opus; I've got to check out that new Deepseek model sometime.
If only OpenAI spent a significant amount of money on some kind of generative software that was predominantly trained on internet comments that'd be able to do all the astroturfing for them...
This kind of "if only" sarcastic comment belongs on reddit from 5 years ago
Yea, I never got how people are even able to hit the weekly limits so consistently. Maybe it's because they use it for work? But in that case, you would expect the employer to cover it so idk.
>I'm super skeptical of the influx of "DAE think Opus sucks now. Let's all move to Codex!" nonsense that has flooded HN. A part of it is the ex-girlfriend thing where people are angry about something and try to force-multiply their disagreement, but some of it legitimately smells like astroturfing. Like OpenAI just got done paying $100M for some unknown podcaster and started hiring people to write this stuff online.
A lot of people are angry about the whole openclaw situation. They are especially bitter that, when they attempted to justify exfiltrating the OAuth token to use for openclaw, nobody agreed that they had the right to do so, and sided with Claude in that different limits for first-party use are standard. So they create threads like this and complain about some opaque reason why Anthropic is finished (while still keeping their subscription, of course).
I'd recommend Kimi k2.6 for your use. It is an excellent model at a fraction of the cost, and you can use Claude Code with it.
I did a 1:1 map of all my Claude Code skills, and it feels like I never left Opus.
Super happy with the results.
If you don't use a lot of quota the cheapest monthly Claude Code is $20, Kimi Code is $19, i.e. the cost difference is minuscule.
Kimi wants my phone number on signup so a no-go for me.
I was saying the same until DeepSeek v4 this morning... sorry, Kimi. The competition is intense!
Fascinating, but a bummer that DeepSeek does not offer a DPA or opt-out from training. This renders it unusable for my use cases, unfortunately. At least z.ai's GLM has something of a DPA in Singapore.
What provider do you use for Kimi?
Straight from them. I know other providers like io.net can be faster, but I like to directly support the project.
Honestly, it sounds like, assuming you have no ethical qualms, you could get by with a Mac or an AMD 395+ and the newest models, specifically QWEN3.5-Coder-Next. It does exactly what you describe. It maxes out around 85k context, which, if you do a good job providing guard rails etc., is the length of a small-to-medium project.
It does seem like the sweet spot between WALL-E and the destroyed Earth in WALL-E.
Sorry, out of the loop. Which ethical qualms are you referring to?
Using a Mac, obviously.
My guess - China.
This is what worries me. People become dependent on these GenAI products that are proprietary, not transparent, and need a subscription. People build on them like they're a solid foundation. But all of a sudden the owner just pulls the foundation out from under your building.
But these products are all drop in replacements for each other. I've recently favored Codex more than CC, just because rate limits got mildly annoying. I really didn't have to change anything about my workflow in doing that.
> But these products are all drop in replacements for each other
For now. That doesn't really change the risk, that just means they are all hyper competitive right this moment, and so they are comparable. If one of them becomes king of the hill, nothing stops them from silently degrading or jacking prices.
The only shield is to not be dependent in the first place. That means keeping your skills sharp and being willing to pass on your knowledge to juniors, so they aren't dependent on these things.
Of course, many people are building their business on huge AI scaffolding. There's nothing they can do.
I'm curious - why for now? This stuff is practically commoditized. Trying to think of anything that ever successfully got back into proprietary land from there.
The thing is that AI is still more akin to a glorified autocomplete than something that can really supersede your skills. Proprietary model suppliers are constantly trying to obscure this basic underlying fact, without much success (much of the unpredictable shifts you see in proprietary AI behavior ultimately boils down to this); so it becomes far more crystal-clear when using open models that really are a pure commodity.
It doesn't look commoditized to me, it looks subsidized. It looks like everyone is trying to be "the one" and running as competitively as possible until the others fail. Commoditized would imply these services are all going to mellow into a stable state and mostly compete on price. I don't think that's happening. These aren't paper clips, they are courting governments and trying to pull the ladder up behind them. That's why both Anthropic and OpenAI are preaching doomsday and trying to build a moat with regulations.
Fair. I have high hope for local inference, feel like right now it is simply cost prohibitive to get the hardware. It will be interesting to see what happens.
Luckily local AI is becoming more feasible every day.
It feels more and more like OpenAI/Anthoropic aren't the future but Qwen, Kimi, or Deepseek are. You can run them locally, but that isn't really the point, it is about democratization of service providers. You can run any of them on a dozen providers with different trade-offs/offerings OR locally.
They won't ever be SOTA due to money, but "last year's SOTA" at 1/4 the cost or less may be good enough. More quantity, more flexibility, at lower edge quality. It can make sense: a 7% dumber agent TEAM vs. a single objectively superior super-agent.
That's the most exciting thing going on in that space. New workflows opening up not due to intelligence improvements but cost improvements for "good enough" intelligence.
Open Source isn't even within 50% of what the SOTA models are. Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.
I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.
It's like Photoshop vs. Gimp. Not only is the Gimp UX awful, but it didn't even offer (maybe still doesn't?) full bit depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. Gimp is entirely a no-go in a professional setting.
Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.
But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.
Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.
I think that there will come a point when open source models are "good enough" for many tasks (they probably already are for some tasks; or at least, some small number of people seem happy with them), but, as you suggest, it will likely always (for the foreseeable future at least) be the case that closed SOTA models are significantly ahead of open models, and any task which can still benefit from a smarter model (which will probably always remain some large subset of tasks) will be better done on a closed model.
The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.
Unless you are getting outside of your comfort zone and taking a month off from your $200 subscription every other month, I can't see how you can make the universal claim that the open weights models are all 50% as good. Just today, DeepSeek released a new model; a week ago it was Gemma 4, so nobody knows how those will compare. I'm okay with you making a comparison, but state the model and the timeframe in which it was tested that you are basing your conclusions on.
People pirate Photoshop and Office if they don't want to pay for them, making them as "free" as GIMP. If there is a free option, people will use it. Never underestimate the cheapskates.
> Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.
I'm not disagreeing per-se but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?
You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.
There's going to be a day when we look back at $200/mo price tags and say "wow that was cheap".
The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.
Okay, but then by that logic a person making only $20k would break even at about an hour.
Are you suggesting that someone making $20k should be spending $200/mo on Claude?
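The arithmetic behind both figures in this exchange can be sanity-checked in a few lines. Assumptions (not stated upthread): roughly 21 workdays a month and 2,000 working hours a year.

```python
def breakeven_minutes_per_day(annual_salary, monthly_cost=200,
                              workdays_per_month=21,
                              working_hours_per_year=2000):
    """Minutes of saved productivity per workday needed to
    cover the subscription cost, given an annual salary."""
    hourly_rate = annual_salary / working_hours_per_year
    daily_cost = monthly_cost / workdays_per_month
    return daily_cost / hourly_rate * 60

print(round(breakeven_minutes_per_day(200_000)))  # ~6 minutes at $200k/yr
print(round(breakeven_minutes_per_day(20_000)))   # ~57 minutes at $20k/yr
```

Both numbers in the thread check out: the breakeven scales inversely with salary, which is exactly the point of the reply.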
Everyone is arguing why I'm wrong or that I should have presented more data.
You've got the real insight with this claim.
This is the way the world is moving. Open source isn't even going where the ball is being tossed. There is no leadership here.
You're spot on.
> Benchmarks are toys, real world use is vastly different...Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters.
This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.
Also, this space will (and perhaps already is for some of us) be an arms race. Sure you can go local but hosted will always be able to offer more and if you want to be competitive, you'll need to be using the most capable.
IMO it's a different and new model. We're engineers, and we're rich. It's not going to be good enough for us. But the much larger market by far is all the people who used to HAVE to work with engineers. They now have optionality; the pendulum is going to swing.
> Open Source isn't even within 50% of what the SOTA models are.
When was the last time you used any of them? Because, a lot of people are actively using them for 9-5 work today, I count myself in that group. That opinion feels outdated, like it was formed a year ago+ and held onto. Or based on highly quantized versions and or small non-Thinking models.
Do you really think Qwen3.6 for a specific example is "50%" as good as Opus4.7? Opus4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole, the true difference is difficult to exactly measure but sub-10% for their top-tier Thinking models is likely.
Their opinion is also behind on LibreOffice, too. I won't defend GIMP's monstrosity, but I finished a whole dissertation, do all my regular spreadsheet work (that isn't done via R), and have created plenty of visual mockups with LibreOffice. Plus, I don't have to deal with a spammy Windows environment.
Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.
Qwen3.6 at which model size and quantization? I already think Opus 4.6 is usable but still dumb as bricks. A 20% cut off that feels like it would still be unusable. And that's not even getting to the annoyance of setting everything up to run locally & getting HW that can run it locally which basically looks like a Macbook M4 these days as the x86 side is ridiculously pricey to get decent performance out of models.
> Why should anyone waste time on poorer results?
Because in almost no real-world project is "programming time" the limiting factor?
No, it's rate at which you can solve problems, and weaker models waste your time because they don't solve problems at the same speed.
> Open Source isn't even within 50% of what the SOTA models are
Who said so? GLM 5.1 is 90% of Opus, at least. Some people are quite happy with Kimi 2.6 too. I haven't tried Deepseek 4 yet, but I'm also hearing it is as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but these models are not at 50% of SOTA.
Maybe for folks who are deep into this, but it's not exactly accessible. I tried reading up on it a couple of months ago, but parsing through what hardware I needed, the model and how to configure it (model size vs quantization), and how I'd get access to the hardware (which for decent results in coding, new hardware runs $4k-$10k last I checked), it had a non-trivial barrier to entry. I was trying to do this over a long weekend and ran out of time. I'll have to look into it again because having the local option would be great.
Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).
Just get a decent macbook, use LM Studio or OMLX and the latest qwen model you can fit in unified ram.
Hooking up Claude Code to it is trivial with omlx.
https://github.com/jundot/omlx
> new hardware runs $4k-$10k last I checked
Starting closer to 40k if you want something that's practical. 10k can't run anything worthwhile for SDLC at useful speeds.
$10K should be enough to pay for a 512GB RAM machine which in combination with partial SSD offload for the remaining memory requirements should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends whether MoE weights have enough locality over time that the SSD offload part is ultimately a minor factor.
(If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)
You can't put "SSD offload" and "workable speed" in the same sentence.
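Whether SSD offload is workable comes down to simple memory arithmetic. A rough sketch, using illustrative numbers only (a ~1.6T-parameter MoE with ~32B active parameters per token at 4-bit quantization; both figures are assumptions for the sake of the calculation, not specs of any particular model):

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate storage for model weights in GB (decimal)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total = weight_gb(1600, 4)    # all experts: ~800 GB, exceeds 512 GB of RAM
active = weight_gb(32, 4)     # active per token: ~16 GB, fits trivially

# The portion that must spill to SSD on a 512 GB machine:
overflow = total - 512
print(total, active, overflow)  # 800.0 16.0 288.0
```

Whether that overflow kills throughput depends on how often a token's expert routing touches weights not already resident in RAM, which is exactly the locality question raised above.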
Feasibility on commodity hardware would be the true milestone. Running high-end computers is the only way to get decent results at the moment, but if we can run inference on CPUs, NPUs, and GPUs on everyday hardware, the moat should disappear.
You can already run inference on ordinary hardware but if you want workable throughput you're limited to small models, and these have very poor world-knowledge.
I've been using local AI via LM Studio ever since I canceled my Claude subscription. It's obviously slower than Claude on my M1 Studio[†], but like someone else said, I use AI more like a copilot than an autopilot. I'm pretty enthused that I can give it a small task and let it churn through it for a few minutes, while I work on something alongside – all for free with no goddamned arbitrary limits.
[†] The latest Qwen 3.6 whatever has been a noticeable improvement, and I'm not even at the point where I tweak settings like sampling, temperature, etc. No idea what that stuff does, I just use the staff picks in LM Studio and customize the system prompts.
Indeed, I feel like we are in the early-computer-equivalent phase of AI, where giant expensive hardware is still required for frontier models. In 5 years I bet there will be fully open models we'll be able to run on a few thousand dollars of consumer hardware with performance equivalent to Opus 4.7/4.6.
You'll never have the power of what they have though. Cloud capital is insane.
So you can run 1 agent locally on $1k to $3k hardware
They can run a fleet of thousands
I think intelligence per compute will go up significantly in the coming years, while the cost per compute will drop significantly. No way to know for sure, so I guess we'll see
Sure, but local AI is still a black box. They can be influenced by training data selection, poisoning, hidden system prompts, etc. That recent Wordpress supply chain hack goes to show that the rug can still be pulled even if the software is FOSS.
I love how it's just a tacit understanding that these companies' entire MO is to carve out a territory, get everyone hooked on the good stuff and then jack up the price when they're addicted and captured -- literally the business plan of crack dealers, and it's just business as usual in the tech industry.
I was recently introduced to the term "vcware", à la shareware or vaporware, to describe these products. "Don't use that, it's vcware, enshittification is coming soon."
https://en.wikipedia.org/wiki/Enshittification 101
Not really. The hardware requirements remain indefinitely out of reach.
Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.
The memory requirements aren't that intense. You can run useful (not frontier) models on a $2-5K machine at reasonable speeds. The capabilities of Qwen3.6 27B or 35B-A3B are dramatically better than what was available even a few months ago.
Practical? Maybe not (unless you highly value privacy) because you can get better models and better performance with cheap API access or even cheaper subscriptions. As you said, this may indefinitely be the case.
At least some of the investors in this tech are hoping for a monopoly position. They'd like to outspend the competition to get an insurmountable lead, at which point they can set their price.
But, so far, competition remains fierce. Anthropic still has the best tools for writing code. That lead is smaller than it's ever been, though. But, honestly, Opus 4.5 is when it got Good Enough. If Anthropic suddenly increased prices beyond what I'm willing to pay, any model that gives me Opus 4.5 or better performance is good enough for the vast majority of the work I do with agents. And, there are a bunch of models at that level, now maybe including some discount Chinese models. Certainly Gemini Pro 3.1 is on par with Opus 4.5. Current Codex is better than Opus 4.5 and close to Opus 4.7 (though I won't use OpenAI because I don't trust them to be the dominant player in AI).
I often switch agents/models on the same project because I like tinkering with self-hosted and I like to keep an eye on the most efficient way to work...which models wastes less of my time on silly stuff. Switching is literally nothing; I run `gemini` or `copilot` or `hermes` instead of `claude`. There's simply no deep dependency on a specific model or agent. They're all trying to find ways to make unique features for people to build a dependence on, of course, but the top models are all so fucking smart you can just tell them to do whatever thing it is that you need done. That feature could probably be a skill, whatever it is, and the model can probably write the skill. Or, even better, it could be actual software, also written by the model, rather than a set of instructions for the model to interpret based on the current random seed.
Currently, the only consistent moat is making the best model. Anthropic makes the best model and tools for coding, but that's a pretty shallow moat...I could live with several other models for coding. I'll gladly pay a premium for the best model and tools for coding, but I also won't be devastated if I suddenly don't have Claude Code tomorrow. Even open models I can host myself are getting very close to Good Enough.
Anthropic sells due to unrelenting pressure and unachievable demand > new owner cuts costs > models become worse > new owner sells > the capitalistic cycle wins > we, the people, suffer
The owner rug-pulls, or Broadcom buys the owner and starts squeezing.
True. That is why it is critically important to have open source and sovereign models that will be accessible to all and always on / local.
Competition (OpenAI vs Anthropic is fun to watch) and open source will get us there soon I think.
Some people are so dependent on it they can't even say it without twisting words to hide the fact that they're now stuck at zero
The sooner you cancel the sooner you become independent of them
This is why, despite enjoying all of this, I really want to focus on locally hosted models. If we don't host the technology ourselves, we're setting ourselves up for a hard fall down the line.
Until very recently, local models have been little more than brittle toys in my experience, if you're trying to use them for coding.
But lately I've been running Pi (minimal coding agent harness) with Gemma4 and Qwen3.6 and I've been blown away by how capable and fast they are compared to other models of their size. (I'm using the biggest that can fit into 24gb, not the smaller ones.) In fact, I don't really need to reach for Claude and friends much of the time (for my use cases at least).
Claude with Sonnet medium effort just used 100% of my session limit, some extra dollars, thought for 53 minutes, and said:
API Error: Claude's response exceeded the 32000 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
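The error message points at its own workaround: raising the output cap via an environment variable before starting the session. The variable name comes straight from the error text; the value below is an arbitrary example, and whether your plan or model actually supports larger outputs is an assumption worth checking against the Claude Code docs.

```shell
# Raise Claude Code's maximum output tokens for subsequent
# invocations in this shell. 64000 is an illustrative value,
# not a documented limit.
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
```

Then launch `claude` from the same shell; the cap applies to that session.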
And on the seventh day, API Error: Claude's response exceeded the 32000 output token maximum
More on the 7th minute if you’re using opus
I don't think I'd let it think more than 5 minutes without killing the process.
Just copy and paste the error back to Claude and you will be able to continue. I have seen this many times over the past few months. I thought it was related to the AWS Bedrock setup I have been using, but probably not.
Just curious, what version of Max are you on: 5x or 20x?
You're using it within their high-usage-rate window. I hope you're aware of this: if you use it outside the high-usage time window it's supposed to use less, but it does seem a little odd that Sonnet uses so much, even on Medium.
Ah so we are only supposed to use this work tool outside of work hours?
“Work tool”
Please. This is a toy. A novel little tech-toy. If you depend on it now for doing your job then, frankly, you deserve to have your rug pulled now and then.
If you haven't found a way to use the tool constructively, keep trying.
If you didn't try to use it to work for you, that's okay, but maybe try once more? It does work and adds value. It's a non-standard and weirdly flexible tool with limitations.
...but in retrospect, seeing how you finished your comment, maybe you really want to remain angry and misinformed.
No, you're supposed to make all your hours work hours. This is the way of AI.
I'm on Max x5. No limit problems, but I am definitely feeling the decline. Early stopping and being hellbent on taking shortcuts are the main culprits, closely followed by over-optimistic (stale) caching (audit your hooks!).
All mostly mitigatable by rigorous audits and steering, but man, it should not have to be.
Me and so many coworkers have been struggling with a big cognitive decline in Claude over the last two months. 4.5 was useful and 4.6 was great. I had my own little benchmark and 4.5 could just about keep track of a two way pointer merge loop whereas 4.6 managed a 3 way and the 1M context managed k-way. And this ability to track braids directly helped it understand real production code and make changes and be useful etc.
but then two months ago 4.6 started getting forgetful and making very dumb decisions and so on. Everyone started comparing notes and realising it wasn’t “just them”. And 4.7 isn’t much better and the last few weeks we keep having to battle the auto level of effort downgrade and so on. So much friction as you think “that was dumb” and have to go check the settings again and see there has been some silent downgrade.
We all miss the early days of 4.6, which just shows you can have a good, useful model. LLMs can be really powerful, but in delivering it to the mass market Anthropic throttles and downgrades it to not-useful.
My thinking is that soon DeepSeek reaches the more-than-good-enough 4.6+ level and everyone can get off the Claude pay-more-for-less trajectory. We don't need much more than what we've already had a glimpse of and now know is possible. We just need it in our control, provisioned rather than metered, so we can depend upon it.
This was a real issue, and Anthropic recently acknowledged it:
https://www.anthropic.com/engineering/april-23-postmortem
Of course, it sucks when companies screw up ... but at the same time, they "paid everyone back" by removing limits for awhile, and (more importantly to me) they were transparent about the whole thing.
I have a hard time seeing any other major AI provider being this transparent, so while I'm annoyed at Claude ... I respect how they handled it.
Yes that was one issue. It’s not the general degradation I have been talking about though, which is ongoing.
I recall reading similar tales of woe with other providers here on HN. I think the gradual dialling back of capability as capacity becomes strained as users pile on is part of the MO of all the big AI companies.
The "general degradation" is a myth. Check out https://isitnerfed.org/.
did you set your 4.7 to xhigh or max effort? anything else is basically not worth your time...
My Max20 sub has mostly sat unused since April; Codex with 5.4 (and now 5.5), even with fast mode (= double token costs), is night and day. Opus produces convincing failures: it either forgets half the important details or silently does something "pragmatic" (read: technical-debt bandaids or worse) and claims success even with everything crashing and burning after the changes. Point out the errors and it will make even more messes. Opus works really well for one-shotting greenfield scopes, but for iterating on them later or doing complex integrations it's just unusable, even harmfully bad.
GPT 5.4+ takes its time, unprompted considers edge cases that in fact are correct, saves me subsequent error-hunting turns, and finally delivers. Plus no "this doesn't look like malware" or "actually wait" thinking loops for minutes over a one-liner script change.
My mental model for LLM is I don't expect them to chew gum and walk at the same time. Cleaning code up is a different task from building new functionality.
GLM always feels like it's doing things smarter, until you actually review the code. So you still need the build/prune cycle. That's my experience anyway.
Can I get that max20 if you are not using it?
AI services are only weakly incentivized to reduce token usage. They want high token usage; it makes you pay more. They are going to continually test where the limit is, what the maximum token usage is before you get angry. All AI companies will continue to trade places on token use and cost as costs increase. We are frogs in tepid water, pretending it's a bath and that we aren't about to be boiled.
Up to a point. There is incentive when they get to the point where they literally can't serve their userbase and customers start leaving.
People said this about AWS too. "Why would they save you money??". It turns out that every time they reduce prices, they make more money, because more people use their services.
AI companies have the same incentive. Make it cheaper and people will use it more, making you more money (assuming your price is still above cost). And of course they have every reason to reduce their on costs.
jevons paradox
I am betting on the fact that people will get increasingly frustrated at closed agent lock-ins. I built (cline fork) and open-sourced https://github.com/dirac-run/dirac with the sole focus on token efficiency expecting that the closed-lock-in vendors will do enough to frustrate their users over time. Looking for contributors
To an extent. That economic incentive stops making sense when a) capacity is an actual constraint and b) Anthropic is not a monopoly and is subject to pressure from competitors who are more user-friendly.
That's what I am thinking, too. It sounds like a conspiracy theory, but in the end Anthropic et al. benefit from models that don't finish their jobs. I recently read about this "over-editing phenomenon". The machine is never done. It doesn't want to be.
It's like dating apps. They don't want you to find a good match, because then you cancel the subscription.
Which works fine, right up until China releases a new DeepSeek model that's 85% as capable as an Anthropic or OpenAI premium model but costs a fraction of what either of those US companies are charging.
Speaking of which:
https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open...
Well that's why threads like this are important to upvote. On hacker news , they're angry !
Yesterday was a realization point for me. I gave a simple extraction task to Claude Code with a local LLM and it "whirred" and "purred" for 10 minutes. Then I submitted the same data and prompt directly to the model via the llama_cpp chat UI and it single-shotted the task in under a minute. So obviously something is wrong with the coding agent or the way it talks to the LLM.
Now I'm looking for an extremely simple open-source coding agent. Nanocoder doesn't seem to install on my Mac and it brings node_modules bloat, so no. Opencode seems not quite open source. For now, I'm doing the work of the coding agent myself and using the llama_cpp web UI. Chugging along fine.
https://pi.dev/ seems popular. What's not open source about Opencode? The repo has an MIT license.
Maybe it's just my feeling. It asks to update/upgrade continuously.
Some people believe only copyleft licenses are open source. They're right on principle, wrong in (legal) practice.
They're not even right on principle: https://www.gnu.org/licenses/license-list.html
Even the FSF recognizes that non-copyleft licenses still follow the Freedoms, and therefore are still Free Software.
Been LOVING Pi so far!
Probably a silly idea, but I'll throw it into the mix - have your current AI build one for you. You can have exactly the coding agent you want, especially if you're looking for "extremely simple".
I got annoyed enough with Anthropic's weird behavior this week to actually try this, and got something workable up and running in a few days. My case was unique: there's no Claude Code for BeOS, or for my older/ancient Macs, so it was easier to bootstrap and stitch something together if I really wanted an agentic coding tool on those platforms. You'll learn a lot about how models actually work in the process, and how much crazy, ridiculous bandaid patching is happening in Claude Code. You might also come to appreciate some of the difficulties the agents/harnesses have to solve. (And to be clear, I'm still using CC when I'm on a platform that supports it.)
As for the llama_cpp vs Claude Code delays: I've run into that too. My theory is that API traffic is prioritized over Claude Code subscription traffic. The API certainly feels way faster. But you're also paying significantly more.
Just in case it didn't occur to you already, you can just build whatever coding agent you want. They're pretty simple
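To back up the "they're pretty simple" claim, here is a bare-bones sketch of the core loop, assuming a toy protocol where the model replies with either `RUN: <shell command>` or `DONE` (real agents use structured tool calls, but the shape is the same). The `llm` callable is a placeholder for whatever endpoint you use, local or hosted:

```python
# Minimal agentic loop: the model proposes shell commands, we run them and
# feed the output back until it signals completion. `llm` takes a message
# history and returns the model's next reply as a string.
import subprocess

def agent_loop(llm, task, max_turns=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = llm(history)                         # model's next message
        history.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE"):                 # model says it's finished
            return history
        if reply.startswith("RUN:"):                 # model requests a command
            cmd = reply[len("RUN:"):].strip()
            result = subprocess.run(cmd, shell=True,
                                    capture_output=True, text=True)
            # feed stdout/stderr back as the next user turn
            history.append({"role": "user",
                            "content": result.stdout + result.stderr})
    return history
```

Everything past this skeleton (context pruning, permission prompts, diff application) is bookkeeping, which is also where most of the harness complexity lives.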
You can run CC with local models, it's pretty straightforward. I've done this with vLLM + a thin shim to change the endpoint syntax.
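For anyone curious what such a shim involves: the request side is mostly reshaping JSON, since Anthropic's `/v1/messages` keeps the system prompt outside the messages array while OpenAI-style `/v1/chat/completions` puts it inside. A hedged sketch of just that step (a real shim also has to translate responses and streaming events):

```python
# Translate an Anthropic-style /v1/messages request body into an
# OpenAI-style /v1/chat/completions body that vLLM and friends accept.
def anthropic_to_openai(body: dict) -> dict:
    messages = []
    if "system" in body:                    # Anthropic keeps system separate
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body.get("messages", []))
    return {
        "model": body.get("model", "local"),
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "temperature": body.get("temperature", 1.0),
    }
```

Wrap this in any small HTTP proxy and point the client's base URL at it.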
I use both Cursor and Claude Code, and yes, the latter is noticeably slower with the same model at the same settings.
However, it's hard to justify Cursor's cost. My bill was $1,500/mo at one point, which is what encouraged me to give CC a try.
what model you used with llama_cpp?
Qwen3.6-35B quant-4 gguf
Swival is not bloated and was specifically made for local agents: https://swival.dev
pi.dev as well
I've noticed that sometimes the same Claude model will make logical errors sometimes but not other times. Claude's performance is highly temporal. There's even a graph! https://marginlab.ai/trackers/claude-code/
I haven't seen anyone mention this publicly, but I've noticed that the same model will give wildly different results depending on the quantization. 4-bit is not the same as 8-bit and so on in compute requirements and output quality. https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
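The compute-requirements side of that is easy to put numbers on with a weights-only back-of-envelope (KV cache and activations come on top):

```python
# Approximate weight-memory footprint of an N-billion-parameter model
# at a given quantization bit width (weights only, ignoring overhead).
def weights_gb(params_billion: float, bits: int) -> float:
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

# e.g. a 35B model: ~17.5 GB at 4-bit vs ~35 GB at 8-bit, so only the
# former fits alongside a KV cache in a 24 GB GPU.
```

Which is exactly why "the same model" at different quantizations is not the same model, in either resource cost or output quality.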
I'm aware that frontier models don't work in the same way, but I've often wondered if there's a fidelity dial somewhere that's being used to change the amount of memory / resources each model takes during peak hours v. off hours. Does anyone know if that's the case?
I'm not sure that graph shows a time-based correlation. The 60% line stays inside the 95% confidence interval. Is that not just a measurement of noise?
I have a simple rule: I won't pay for that stuff. First they steal all my work to feed into those models, afterwards I shall pay for it? No way!
I use AI, but only what is free-of-charge, and if that doesn't cut it, I just do it like in the good old times, by using my own brain.
The timeline doesn't make any sense. How can you subscribe a couple of weeks ago when the problem started 3 weeks ago, and yet things also went well for the first few weeks? Was this written by GPT 5.5?
The author is not a native English speaker, it seems.
They might mean "a few weeks ago": in their mind, "couple of weeks ago" may not be an exact two, but rather map to "Vor ein paar Wochen" ("a few weeks ago").
The rest of the prose in the article seems to support that assumption.
The post is handwritten with no LLMs involved.
I've been a fan since the launch of the first Sonnet model and big props for standing up to the government, but you can sure lose that good faith fast when you piss off your paying customers with bad communication, shaky model quality and lowered usage limits.
I've seen a post like this every week for the last 2 years. Are these models actually getting worse? Or do folks start noticing the cracks as they use them more and more?
Shameless self-plug, but also worried about silent quality regressions, I started building a tool to track coding agent performance over time: https://github.com/s1liconcow/repogauge
Here is a sample report that tries out the cheaper models + the newest Kimi2.6 model against the 5.4 'gold' testcases from the repo: https://repogauge.org/sample_report.
This is cool - just wanted to note https://marginlab.ai is one that has been around for a while.
Are there any tools anyone knows of to collect this kind of telemetry while using the agents, instead of running offline evals?
Running evals seems like it may be a bit too expensive for a solo dev.
I also cancelled my subscription. The $20 Pro plan has become completely unusable for any real work. What is especially frustrating is that Claude Chat and Claude Code now share the exact same usage limits; that makes zero sense from a product standpoint when the workflows are so different. Even the $200 Max plan got heavily nerfed. What used to easily last me a full week (or more) of solid daily use now burns out in just a few days. Combined with the quality drop and unpredictable token consumption, it simply stopped being worth it.
For all the drama, it's pretty clear OpenAI, Google, and Anthropic have all had to degrade some of their products because of a lack of supply.
There's really no immediate solution to this other than letting the price float or limiting users; as capacity is built out, this gets better.
Looking at Anthropic's new products I think they understand they don't really have a cutting edge other than the brand.
I tried Kimi 2.6 and it's almost comparable to Opus. Anthropic dropped the ball. I hope this is a sign that we are moving towards a future where model usage is a commodity, with heavy competition on price/performance.
How are you using kimi 2.6? I am considering their coding plan to replace my claude max 5x but I am worried about privacy and security.
I'm using it via OpenCode Go, which claims to only use Zero Data Retention providers.
How much you trust any particular provider's claim to not retain data is subjective though.
Doesn't "poor support" imply that there is some sort of support? Shouldn't it be "no support"?
You get to talk to an AI agent
I've been very happy using Codex in the VScode extension. Very high quality coding and generous token limits. I've been running Claude in the CLI over the last couple of months to compare and overall I prefer Codex, but would be happy with either.
I feel like almost everyone using AI for support systems is utterly failing at the same incredibly obvious place.
The first job of any support system—both in terms of importance and chronologically—is triage. This is not a research issue and it's not an interaction issue. It's at root a classification problem and should be trained and implemented as such.
There are three broad categories of interaction: cranks, grandmas, and wtfs.
Cranks are the people opening a support chat to tell you they have vital missing information about the Kennedy Assassination, or that they want your help suing the government over their exposure to Agent Orange when they were stationed at Minot. "Unfortunately I can't help with that. We are a website that sells wholesale frozen lemonade. Good luck!"
Grandma questions are the people who can't navigate your website. (This isn't meant to be derogatory, just vivid; I have grandma questions often enough myself.) They need to be pointed toward some resource: a help page, a kb article, a settings page, whatever.
WTFs are everything else. Every weird undocumented behavior, every emergent circumstance, every invalid state, etc. These are your best customers and they should be escalated to a real human, preferably a smart one, as soon as realistically possible. They're your best customers because (a) they are investing time into fixing something that actually went wrong; (b) they will walk you through it in greater detail than a bug report, live, and help you figure it out; and (c) they are invested, which means you have an opportunity for real loyalty and word-of-mouth gains.
What most AI systems (whether LLMs or scripts) do wrong is treat WTFs like they're grandmas. They're spending significant money building these systems only to destroy the value they get from the most intelligent and passionate people in their customer base doing in-depth production QC/QA.
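The routing policy described above is small enough to sketch. The classifier itself is stubbed out (in practice it would be a trained model or an LLM call); the category names and actions follow the comment, everything else is illustrative:

```python
# Triage-first support routing: classify, then route. The key design point
# is that WTFs bypass the bot and go straight to a human.
from enum import Enum

class Category(Enum):
    CRANK = "crank"      # off-topic; politely decline
    GRANDMA = "grandma"  # needs a pointer to existing docs
    WTF = "wtf"          # real anomaly; escalate fast

def route(message: str, classify) -> str:
    category = classify(message)     # pluggable classifier stub
    if category is Category.CRANK:
        return "polite decline"
    if category is Category.GRANDMA:
        return "link to kb article"
    return "escalate to human engineer"
```

Framing it this way makes the failure mode above concrete: a system that can only ever return "link to kb article" has no branch for its most valuable reports.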
This rings true. However I have used one AI automated support chat that didn't behave that way. I wish I could remember the vendor but I do remember being blown away when it said something like "that sounds like a real problem would you like me to open a support ticket for this?". Which it then did and subsequently a human addressed my issue.
The timeline of the first few sentences doesn't add up: how can you subscribe 2 weeks ago when the problem started 3 weeks ago?
They can't afford to care about individual customers because enterprise demand exploded and they're short on compute
My recent frustration with Claude has been it feels like I'm waiting on responses more. I don't have historical latency to compare this with, but I feel like it has been getting slower. I may be wrong, and maybe its just spending more time thinking than it used to. My guess is Anthropic is having capacity issues. I hope I'm wrong because I don't want to switch.
There was a really good point in this podcast episode about the speed of LLMs. They are so slow that all of the progress messages and token streaming are necessary. But the core problem is that the technology is so darn slow.
https://podcasts.apple.com/us/podcast/this-episode-is-a-cogn...
As someone who both uses and builds this technology I think this is a core UX issue we’re going to be improving for a while. At times it really feels like a choose 2+ of: slow, bad, and expensive.
Same, after being a long-time proponent too.
First was the CC adaptive thinking change, then 4.7. Even with `/effort max` and keeping under 20% of 1M context, the quality degradation is obvious.
I don't understand their strategy here.
I’ve noticed most of the complaints are about the Pro plan. Anecdotally I pay for the $200 Max plan and haven’t noticed anything radically different re: tokens or thinking time (availability is still a crapshoot)
I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising. Since it’s technically usable, but not really.
> I am certainly not saying people should “spend more money,” more like the Claude Code access in the Pro plan seems kind of like false advertising
It's particularly noticeable when, for a long time, you could work an 8-hour day in Codex on ChatGPT's $20/month plan (though they too started tightening the screws a couple of weeks back).
My guess is that the higher plans will be next, especially as more people upgrade to those and maximize their usage.
I can agree. ChatGPT 5.5 made this a no-brainer choice. Anthropic are idiots for removing Claude Code from the Pro plan. They need to ask Claude whether what they did was a natural-intelligence bug! Greed kills companies, too!
> I can agree. ChatGPT 5.5 made this a no-brainer choice
The new model that came out less than 24 hours ago made this obvious? This feels like when a new video game comes out and there's 1,000 steam reviews glazing it in the first hours of release. Don't you think you should use it for longer than a day before declaring it a game changer?
I feel like Anthropic is forcing their new model (Opus 4.7) to do much less guesswork when making architectural choices; instead it prefers to defer decisions back to the user. This is likely done to mine sessions for reinforcement-learning signals, which are then used to make their future models even smarter.
https://www.anthropic.com/engineering/april-23-postmortem
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode. This was the wrong tradeoff. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks. This impacted Sonnet 4.6 and Opus 4.6.
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
How does this address the point I made, specifically?
They won't even reset usage for me: https://news.ycombinator.com/item?id=47892445
And by crikey do I empathise with the poor support in this article. Nothing has soured me on Anthropic more than their attitude.
Great AI engineers. Questionable command line engineers (but highly successful.) Downright awful to their customers.
Same. I think one of the issues is that Claude reached a threshold where I could just rely on it being good and had to manually fix it up less and less, while other models hadn't reached that point yet; with them I was aware of it and knew I had to fix things up or do a second pass or more. Other providers also move you to a worse model after you run out, which is key in setting expectations as well. Developers knew that was the trade-off.
I think even with the worse limits people still hated it, but when you start to, either on purpose or inadvertently, make the model dumber, that's when there's really no reason to keep using Claude anymore.
If someone wants to move off Claude what are the alternatives? More importantly can another system pick up from where Claude left off or is there some internal knowledge Claude keeps in their configuration that I need to extract before canceling?
Opencode is a great CLI for driving coding agents.
Like 3 weeks ago Qwen3-coder was the best coding LLM to run locally. I haven’t spent time since to figure out if anything is better.
You can also power Opencode with OpenRouter which lets you pay for any LLM à la carte.
I've been trying Qwen3.5-9B-Claude-4.6 locally for a couple of days now, coming from OMLX. Either via Hermes or Continue in VS Code. It's okay-ish, even performance-wise.
[1] https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-R...
Wait, weren't there posts in the not-too-distant past where everyone was singing the praises of Claude and wondering how OpenAI would catch up?
Yep. I think the sentiment here isn't lagging too much in terms of the day to day experience of what is being offered. Kind of makes HN very useful in this regard.
Wait, are SaaSes fundamentally shifting business models, seeking to maximize the value extracted from a product at the expense of the customer over time?
Strange how things can change!
We've seen this sentiment shift on HN like 20 times in the past year, too often for it to be a real reflection of service quality. Feels more like people rooting for sports teams.
The services (OpenAI, Anthropic) are not wildly changing that much. People are just using LLMs more and getting frustrated because they were told it would change the world, and then they take it out on their current patron. Give it a month and we'll be hearing how far OpenAI has fallen behind.
Oh no, the unreliable product people pretend is the next coming of Jesus turned out to be thoroughly unreliable. Who coulda thunk it.
I have to say, this has been the opposite of my experience. If anything, I have moved over more work from ChatGPT to Claude.
Same. I am getting crazy good value from Claude at work, on both scientific applications and deployment environments.
There is one caveat, and that is you have to give the model well thought out constraints to guide it properly, and absolutely take the time to read all the thinking it's doing and not be afraid to stop the process whenever things go sideway.
People who just let Claude roam free on their repository deserve everything they end up with.
Sometimes it feels like Anthropic uses token processing as a throttling tool, to their advantage.
Switched to local models after quality dropped off a cliff and token consumption seemed to double. Having some success with Qwen+Crush and have been more productive.
Would love some more info on how you got any local model working with Crush. Love charmbracelet but the docs are all over the place on linking into arbitrary APIs.
assuming you have a locally running llama-server or llama-swap, just drop this into your crush.json with your setup details/local addresses etc:
Edit: i forgot HN doesn't do code fences. See https://pastebin.com/2rQg0r2L
Obviously the context window settings are going to depend on what you've got set on the llama-server/llama-swap side. Multiple models on the same server like I have in the config snippet above is mostly only relevant if you're using llama-swap.
TL;DR is you need to set up a provider for your local LLM server, then set at least one model on that server, then set the large and small models that crush actually uses to respond to prompts to use that provider/model combo. Pretty straightforward but agree that their docs could be better for local LLM setups in particular.
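As an illustration of that TL;DR, here is a hypothetical crush.json along those lines. The field names are my approximation of the shape, not a verified schema; check Crush's provider docs and the pastebin snippet above for the real thing:

```json
{
  "providers": {
    "local": {
      "type": "openai",
      "base_url": "http://localhost:8080/v1",
      "api_key": "unused",
      "models": [
        { "id": "qwen3-coder", "name": "Qwen3 Coder", "context_window": 32768 }
      ]
    }
  },
  "models": {
    "large": { "provider": "local", "model": "qwen3-coder" },
    "small": { "provider": "local", "model": "qwen3-coder" }
  }
}
```

The `base_url` points at whatever OpenAI-compatible endpoint your llama-server or llama-swap exposes; the `api_key` is a dummy value since local servers typically don't check it.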
For me, I've got llama-swap running and set up on my tailnet as a [tailscale service](https://tailscale.com/docs/features/tailscale-services) so I'm able to use my local LLMs anywhere I would use a cloud-hosted one, and I just set the provider baseurl in crush.json to my tailscale service URL and it works great.
I ran prompts, used up a ton of usage, and got no return; it just showed an error.
I asked support: hey, I got nothing back, I tried prompting several times, used a ton of usage, and it gave no response. I'd just like the usage back. What I paid for I never got.
Just a bot response: we don't do refunds, no exceptions. Even when they don't serve you what your plan should give you.
If all Claude does is automate mundane code, why not just make a "meta library" of said common mundane code snippets?
maybe make it so that when you start typing it completes the snippet?
Like Stack Overflow?
You still have to search Stack Overflow and sift, but I'm surprised someone doesn't just make a TLDR-style or shortcut-expanding product out of it that lets you pop in code for this use case or that, vs. the current product spending a few datacenters' worth of energy to give you an answer that's only correct x% of the time.
Yeah, session limits are kinda show stoppers.
Seems like some of the token issues may be corrected now
https://www.anthropic.com/engineering/april-23-postmortem
These changes fixed some of the token issues, but the token bloat is a problem intrinsic to the model, and Anthropic's solution of defaulting to xhigh reasoning for Opus 4.7 just means you'll go through tokens faster anyway.
I'm worried anything less than xhigh is insufficient though. What do you do?
The problem is they changed people's default settings, and if you're like me, you keep a Claude Code session open for days, maybe weeks or even a month, and just come back to it and keep going. I wouldn't be surprised if there are hundreds if not thousands of people still on these broken configurations/models.
Dear Anthropic:
Please, for the love of all things holy, NEVER change someone's defaults without INFORMING the end user first, because you will wind up with people confused, upset, and leaving your service.
I used Opus via Copilot until December and then largely switched over to Claude Code. I'm not sure what the difference is but I haven't seen any of these issues in daily use.
The usage metering is just so incredibly inconsistent: sometimes 4 parallel Opus sessions for 3 hours straight on max effort only use up 70% of a session; other times 20 minutes / 3 prompts in one session completely max it out. (Max x20 plan.) Is this just a bug on Anthropic's side, or is the usage metering completely opaque and arbitrary?
It's something strange because i never have these issues. I often run two in parallel (though not all day), and generally have something running anytime i look at my laptop to advance the steps/tasks/etc. Usually i struggle to hit 50% on my Max20.
Heck two weeks ago i tried my hardest to hit my limit just to make use of my subscription (i sometimes feel like i'm wasting it), and i still only managed to get to 80% for the week.
I generally prune my context frequently though, each new plan is a prune for example, because i don't trust large context windows and degradation. My CLAUDE.md's are also somewhat trim for this same fear and i don't use any plugins, and only a couple MCPs (LSP).
No idea why everyone seems to be having such wildly different experiences on token usage.
Same, it's a mess.
I use Claude Code with GLM, Kimi and MiniMax models. :)
I was worried about Anthropic models quality varying and about Anthropic jacking up prices.
I don't think Claude Code is the best agent orchestrator and harness in existence but it's most widely supported by plugins and skills.
Where are you getting inference from? I'm overwhelmed by the options at the moment.
I am also curious. Considering the kimi coding plan but I'm worried about data privacy and security.
Me too.
I don't get it. I use Claude Code every day, what I would consider pretty heavy usage...at least as heavy as I can use it while actually paying attention to what it's producing and guiding it effectively into producing good software. I literally never run into usage limits on the $100 plan, even when the bugs related to caching, etc. were happening that led to inflated token usage.
WTF are y'all doing that chews tokens so fast? I mean, sure, I could spin up Gas Town and Beads and produce infinite busy work for the agents, but that won't make useful software, because the models don't want anything. They don't know what to build without pretty constant guidance. Left to their own devices, they do busy work. The folks who "set and forget" on AI development are producing a whole lot of code to do nothing that needed doing. And, a lot of those folks are proud of their useless million lines of code.
I'm not trying to burn as many tokens as possible, I'm trying to build good software. If you're paying attention to what you're building, there are so many points where a human is in the loop that it's unusual to run up against token limits.
Anyway, I assume that at some point they have to make enough money to pay the bills. Everything has been subsidized by investors for quite some time, and while the cost per token is going down with efficiency gains in the models/harnesses and with newer compute hardware tuned for these workloads, I think we're all still enjoying subsidized compute at the moment. I don't think Anthropic is making much profit on their plans, especially with folks who somehow run right at the edge of their token limit 24/7. And, I would guess OpenAI is running an even lossier balance sheet (they've raised more money and their prices are lower).
I dunno. I hear a lot of complaining about Claude, but it's been pretty much fine for me throughout 4.5, 4.6 and 4.7. It got Good Enough at 4.5, and it's never been less than Good Enough since. And, when I've tried alternatives, they usually proved to be not quite Good Enough for some reason, sometimes non-technical reasons (I won't use OpenAI, anymore, because I don't trust OpenAI, and Gemini is just not as good at coding as Claude).
From the comments here it seems like people are stress prompting their builds instead of planning and reviewing.
If one model seems to be a bit off during a session I just switch to another (Opencode) and plan and review from there.
Same here. I've asked this question before. Haven't received an answer yet.
I'm torn, because I use it in my spare time, so I've missed some of these issues; I don't use it 9 to 5. But I've built some amazing things. When the 1-million-token context dropped, that was peak Claude Code for me; it was also when I suspect their issues started. I've built some things I'd been drafting in my head for ages but never had time for, and I can review the code and refine it until it looks good.
I'm debating trying out Codex; from some people I hear it's "uncapped", from others I hear they reached limits in short spans of time.
There's also the really obnoxious "trust me bro" documentation update from OpenClaw, where they claim Anthropic is allowing OpenClaw usage again, but with no official statement?
Dear Anthropic:
I would love to build a custom harness that just uses my Claude Code subscription. I promise I won't leave it running 24/7/365. Can you please tell me how I can do this? I don't want to see some obscure tweet; make official blog posts or documentation pages that reflect policies.
Can I get whitelisted for "sane use" of my Claude Code subscription? I would love this. I am not dropping $2400 in credits for something I do for fun in my free time.
It sounds like we have very similar usage/projects. codex had been essentially uncapped (via combination of different x-factors between Plus and Pro and promotions) until very recently when they copied Anthropic's notes.
Plus is still very usable for me though. I have not tried Claude Pro in quite a while and if people are complaining about usage limits I know it's going to be a bad time for me. I had to move up from Claude Pro when the weekly limits were introduced because it was too annoying to schedule my life around 5hr windows.
I started using codex around December, when I started to worry I was becoming too dependent on Claude and needed to encourage competition. codex wasn't particularly competitive with Claude until 5.4, but it has grown on me.
The only thing I really care about is that whatever I'm using "just works" and doesn't hurt limits and Claude code has been flaky as all hell on multiple fronts ever since everyone discovered it during the Pentagon flap. So I tend to reach for ChatGPT and codex at the moment because it will "just work" and there's a good chance Claude will not.
Don't forget, Openclaw was basically bought by OpenAI so there's only incentive to use it as a wedge to pry people off Anthropic.
Claude Code now has an official telegram plugin and cron jobs and can do 80% of the things people used OpenClaw for if you just give it access to tools and run it with --dangerously-skip-permissions.
The /loop command, which is supposed to be the equivalent of heartbeat.md, is EXTREMELY unreliable/shitty.
I just cancelled my Max20 plan yesterday.
Waiting 60s every time I send a message really kills the UX of Claude.
I just noticed today that it doesn't warn about approaching limits and just blows straight into billing extra tokens.
I'm pretty sure it used to warn when you got close to your 5hr limit, but no, it happily billed extra usage. Granted only about $10 today, but over the span of like 45 minutes. Not super pleased.
Codex is becoming such a good product. I have the $100 Pro Lite. I still have Claude, but only the $20 plan, and I rarely use it. Let's see if they give generous limits and, more importantly, a model that's better than 5.5. The mythos fear mongering did not give me a good impression that they care about the average developer.
Maybe this is an unpopular opinion, but I think choosing which companies to support during this period of pre-alignment is one way to vote which direction this all goes. I'm happy to accept a slightly worse coding agent if it means I don't get exterminated someday.
This sounds just like all my neighbors complaining about their internet provider.
Imagine vibe coding your core consumer application and associated backend…
Oh wait, I don’t have to imagine. That’s what Anthropic does. A nice preview for what is in store for those who chose to turn off their brains and turn on their AI agents.
It also seems to me they route prompts to cheaper, dumber models that present themselves as e.g. Opus 4.7. Perhaps that's what "adaptive reasoning" means: they route your request to something like Qwen while calling it Opus. Sometimes I get a good model, so I've taken to asking a difficult question first; if the answer is dumb, I terminate the session, start again, and only then send the real prompt. But there's no guarantee the model won't be downgraded mid-session. I wish they just charged the real price and stopped these shenanigans. It wastes so much time.
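The "hard question first" trick described above can be automated as a canary check. A minimal sketch; the canary question, expected answer, and the `ask` callback are all hypothetical here, so wire in whatever client or session wrapper you actually use:

```python
from typing import Callable

# Hypothetical canary check: probe a fresh session with a question
# whose answer you already know, and bail out if the model flubs it.
CANARY_PROMPT = "What is 17 * 23? Reply with the number only."
CANARY_ANSWER = "391"

def session_seems_smart(ask: Callable[[str], str]) -> bool:
    """Return True if the model answers the canary correctly.

    `ask` is whatever function sends a prompt to your current
    session and returns the reply text (hypothetical placeholder)."""
    reply = ask(CANARY_PROMPT).strip()
    return CANARY_ANSWER in reply

# Usage sketch: if the canary fails, kill the session and retry
# before spending tokens on the real prompt.
# if not session_seems_smart(my_client.ask):
#     my_client.restart_session()
```

Note the caveat from the comment still applies: a session that passes the canary can still be downgraded mid-session, so this only filters out sessions that start bad.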
You're describing a Taravangian prompt situation (a character in a book series who wakes up with a different/random intelligence level each day and has a series of tests for himself to determine which kind of decisions he's capable of that day). https://coppermind.net/wiki/Taravangian
Cool
Welcome to the future. Anthropic is currently speed running it, but this is what all LLM tools are going to look like in the next few years, once they turn the enshittification corner.
I've spent thousands of dollars on API tokens in the last few months. Out of my own pocket, as an indie contractor. I used the API specifically instead of Pro/Max/Plus/Silver/Gold/Platinum/Diamond to avoid all of the mess there regarding usage resets and potential hidden routing to worse models. It worked great for months, I got a ton of shit done, shipped a bunch of features. I really began to rely on the tech. I was not happy about the cost, but the value proposition was there.
Then within the last few months everything changed and went to shit. My trust was lost. Behavior became completely inconsistent.
During the height of Claude's brain-rot (now finally acknowledged by its creators) I had an incident where CC ran a query against a massive, unpartitioned BQ table that resulted in $5,000 in extra spend: it scanned, 30 times over, a table that should have been daily-partitioned. 27 TB per scan. I recall going over and over the setup, exhaustively refining it until I was confident. After I realized this blunder, I referred to it in the same CC session: "jesus fucking christ, I flagged this issue earlier". It responded, "you did. you called out the string types and full table scans and I said 'let's do it later'. That was wrong. I should have prioritized it when you raised it". Now obviously this is MY fault. I fucked up here, because I am the operator, and the buck stops with me. But this incident really drove home that the Claude I had come to vibe with so well over the last N months was entirely gone.
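The arithmetic in the incident above checks out. A tiny back-of-the-envelope helper makes the damage visible; the on-demand rate is an assumption (roughly $6.25/TB for BigQuery on-demand queries as of 2024, so verify current pricing before relying on it):

```python
# Rough estimator for BigQuery on-demand query spend.
# RATE_PER_TB is an assumption (~$6.25/TB on-demand, as of 2024);
# check current pricing before trusting the dollar figures.
RATE_PER_TB = 6.25

def scan_cost_usd(tb_per_scan: float, num_scans: int,
                  rate: float = RATE_PER_TB) -> float:
    """Cost of repeatedly full-scanning a table instead of
    pruning to the partitions a query actually needs."""
    return tb_per_scan * num_scans * rate

# 30 full scans of a 27 TB table, as in the incident above:
print(scan_cost_usd(27, 30))  # → 5062.5
# Daily partitioning with a one-day filter (~0.9 TB per scan)
# would have made the same 30 queries cost ~$169 instead.
```

A dry-run guard (e.g. the `dry_run` option on BigQuery query jobs, which reports bytes that would be processed without running the query) is the usual way to catch this class of mistake before it bills.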
We all knew it was making mistakes, getting dumber. We all felt and flagged this. When Anthropic came out and said, "yeah ... you guys are using it wrong, it's a skill issue," I knew the honeymoon was over. Then recently, when they finally came out and ack'd more of the issues (while somehow still glossing over how badly they fucked up?), it was the final nail. I'm done spending money on the Anthropic ecosystem. I signed up for OpenAI Pro at $200/mo and will continue working on my own local inference in the meantime.
The thing that irks me the most is that I was paying full price. Literally spreading my cheeks wide open and saying, "here ya go Anthropic, all yours." And for close to a year it was great. Looking back, I see I began using it in March of 2025. That's roughly a year of integrating it into my day-to-day, pairing with it constantly, and, as I said, shipping really well-engineered, meaningful work.
They could have just kept doing this: literally printing money. Do absolutely nothing, go on vacation, profit. So why did so much change? I think the issue is that they were trying to optimize CC for the monthly-plan folks, the ones who are likely losing the company money, and API users became collateral damage.
Same here. The single prompt burnt all my tokens in 3 minutes for the day. What happened to Claude in the last 2 months? I was happy with what they were providing and was happy to pay whatever for it. Why did they mess with it? Why are they destroying the tool we all loved?
I hate enshittification and I hate seeing this happening to Claude Code right now.
The great de-skilling programme continues in Anthropic's casino. They completely want you dependent on gambling tokens on their slot machines with extortionate prices, fees and limits.
Anthropic can't even scale their own infrastructure operations, because it barely exists and they don't have the compute, even while they're losing tens of billions and nerfing models whenever they feel like it.
Once again, local models are the answer, yet Anthropic keeps getting you addicted to their casino instead of letting you run your own, cheaper slot machine and keep your money.
Every time you go to Anthropic's casino, the house always wins.
I would love to just say: if you are using Claude Code, you should not be on Pro. I feel like all the people complaining are complaining that an agent can't handle the work of a developer for $20/mo. Get on at least Max 5; it's a world of difference.
It's impossible to justify the jump in expense unless you are directly working on something that makes you money. Messing around on a hobby project, doing some quick research, and getting personalized notifications was a no-brainer for $200 a year.
The product keeps getting worse so I will definitely evaluate options and possibly switch if management keeps screwing up the product.
For some reason you are being downvoted but I wanted to echo your sentiment. As someone who tries to switch things up for every next task, the productivity of Claude Max is worth every penny.
And I actually read the output to fix what I don't like, and ever since Opus 4.5, I've had to do that less and less. 4.6 had issues at the beginning, but that's because you have to manually make sure you change the effort level.
I found the perfect sweet spot for my hobby development. I pay 7 euros for Gemini Plus and use it for creating the architecture and technical specs. Those are fed to Sonnet on Pro, which just implements the instructions. This leaves plenty of room for long sessions several times a week.
True that.
Max 5, sonnet for 95% of things. I never run out of tokens in a week and I use it for ~5-6 hours a day.
I dare to call meself a senior dev, so I don't need a replacement, I need a tool.