Ok I find it funny that people compare models and are like, opus 4.7 is SOTA and is much better etc, but I have used glm 5.1 (I assume this comes form them training on both opus and codex) for things opus couldn't do and have seen it make better code, haven't tried the qwen max series but I have seen the local 122b model do smarter more correct things based on docs than opus so yes benchmarks are one thing but reality is what the modes actually do and you should learn and have the knowledge of the real strengths that models posses. It is a tool in the end you shouldn't be saying a hammer is better then a wrench even tho both would be able to drive a nail in a piece of wood.
I tried GLM and Qwen last week for a day. And some issues it could solve, while some, on surface relatively easy, task it just could not solve after a few tries, that Opus oneshotted this morning with the same prompt. It’s a single example ofcourse, but I really wanted to give it a fair try. All it had to do was create a sortable list in Magento admin. But on the other hand, GLM did oneshot a phpstorm plugin
I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.
In the same way, it’s hard to see how people who say they’re struggling are actually using it.
There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.
> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.
I wonder about this. I see two obvious possibilities (if we ignore bias):
1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.
2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model’s work leaning more and more to the model’s. When a new model is released, it produces better quality code then before, so the work improves with it, but your talent keeps deteriorating at a constant rate.
What is it that is dogma free? If one goes hardcore pyrrhonism, doubting that there is anything currently doubting as this statement is processed somehow, that is perfectly sound.
At some point the is a need to have faith in some stable enough ground to be able to walk onto.
All people think dogmatically. The only difference is what the ontological commitments and methaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. Its inescapable.
The way to develop in this space seems to be to give away free stuff, get your name out there, then make everything proprietary. I hope they still continue releasing open weights. The day no one releases open weights is a sad day for humanity. Normal people won’t own their own compute if that ever happens.
I think that's an overgeneralization. We've seen all the American models be closed and proprietary from the start. Meanwhile the non-American (especially the Chinese ones) have been open since the start. In fact they often go the opposite direction. Many Chinese models started off proprietary and then were later opened up (like many of the larger Qwen models)
> We've seen all the American models be closed and proprietary from the start.
Most*.
OpenAI, contrary to popular belief, actually used to believe in open research and (more or less) open models. GPT1 and GPT2 both were model+code releases (although GPT2 was a "staged" release), GPT3 ended up API-only.
Also the Chinese models aren't following a typical American SaaS playbook which relies on free/cheap proprietary software for early growth. They are not just publishing their weights but also their code and often even publishing papers in Open Access journals to explicitly highlight what methods and advancements were made to accomplish their results
I think they're in a win-win situation. Big AI companies would love to see local computing die in favour of the cloud because they are well aware the moment an open model that can run on non ludicrous consumer hardware appears, they're screwed. In this situation Nvidia, AMD and the like would be the only ones profiting from it - even though I'm not convinced they'd prefer going back to fighting for B2C while B2B Is so much simpler for them
If you want to run AI models at scale and with reasonably quick response, there's not many alternatives to datacenter hardware. Consumer hardware is great for repurposing existing "free" compute (including gaming PCs, pro workstations etc. at the higher end) and for basic insurance against rug pulls from the big AI vendors, but increased scale will probably still bring very real benefits.
Currently, yes. But I don't find it hard to imagine that in a while we could get reasonably light open models with a level of reasoning similar to current opus, for instance. In such a scenario how many people would opt to pay for a way more expensive cloud subscription? Especially since lots of people are already not that interested in paying for frontier models nowadays where it makes sense. Unless keep on getting a constant, never ending stream of improvements we're basically bound to get to a point where unless you really need it you are ok with the basic, cheaper local alternative you don't have to pay for monthly.
This is really just a question of product design meeting the technology.
Today, lots of integer compute happens on local devices for some purposes, and in the cloud for others.
Same is already true for matmul, lots of FLOPS being spent locally on photo and video processing, speech to text, …
No obvious reason you wouldn’t want to specialize LLM tasks similarly, especially as long-running agents increasingly take over from chatbots as the dominant interaction architecture.
I think average users are already okay with the reasoning level they'd get with current open models. But the big AI firms have pivoted their frontier models towards the enterprise: coding and research, as opposed to general chat. And scale is quite important for these uses, ordinary pro hardware is not enough.
At a consistent amount of usage, datacenters are at least an order of magnitude more hardware efficient. I'm sure Nvidia and AMD would be fine fighting for B2C if it meant volume would be 10+x.
Now, given they can't satisfy current volume, they are forced to settle for just having crazy margins.
The problem with B2C is that you need to have leverage of some kind (more demanding applications, planned obsolescence, ...) in order to get people to keep on buying your product. The average consumer may simply consider themselves satisfied with their old product they already own and only replace it when it breaks down. On the contrary, with the cloud you can keep people hooked on getting the latest product whether they need it or not, and get artificial demand from datacentres and such.
This is obviously a strategic move at a national level. Keep publishing competing free models to erode the moat western companies could have with their proprietary models. As long as the narrative serves China there will be no turn to proprietary models.
Always has been, it’s literally saas; the slight difference is that the lowest tier subscriptions at the frontier labs are basically free trials nowadays, too
I'm a little more optimistic than that. I suspect that the open-weight models we already have are going to be enough to support incremental development of new ones, using reasonably-accessible levels of compute.
The idea that every new foundation model needs to be pretrained from scratch, using warehouses of GPUs to crunch the same 50 terabytes of data from the same original dumps of Common Crawl and various Russian pirate sites, is hard to justify on an intuitive basis. I think the hard work has already been done. We just don't know how to leverage it properly yet.
I do not think it's common crawl anymore, its common crawl++ using paid human experts to generate and verify new content, weather its code or research.
I believe US is building this off the cost difference from other countries using companies like scale, outlier etc, while china has the internal population to do this
The Chinese state wants the world using their models.
People think that Chinese AI labs are just super cool bros that love sharing for free.
The don't understand it's just a state sponsored venture meant to further entrench China in global supply and logistics. China's VCs are Chinese banks and a sprinkle of "private" money. Private in quotes because technically it still belongs to the state anyway.
China doesn't have companies and government like the US. It just has government, and a thin veil of "company" that readily fool westerners.
Also many of these Chinese companies aren't just opening their weights. They are open sourcing their code AND publishing detailed research papers alongside them to reveal how they accomplished what they accomplished.
That's very different from an American SaaS model which relies of free but proprietary software for early growth
I'm not sure how local AI models are meant to "entrench China in global supply and logistics". The two areas have nothing to do with one another. You can easily run a Chinese open model on all-American hardware.
They are building a pipeline, and the goal is to get people in the door.
If you forever stand at the entrance eating the free samples, that's fine, they don't care. Other people are going through the door and you are still consuming what they feed you. Doesn't mean it's going to be bad or evil, but they are staking their territory of control.
Oh for sure, they're getting a whole lot of Chinese people and other non-Westerners through the door already - mostly, the people who are being ignored or even blocked outright by the big Western labs. That's territory we purposely abandoned, and they're going to control it by default.
Like with nuclear technology, it's not healthy for only one country to dominate AI. The cat is already out of the bag and many countries now have the ability to train and run models. Silicon Valley has bootstrapped this space. But it should be noted that they are using AI talent from all over the world and it was sort of inevitable that this technology would get around. Lots of Chinese, Indian, Russian, and Europeans are involved.
As for what comes next, it's probably going to be a bit of a race for who can do the most useful and valuable things the cheapest. If OpenAI and Anthropic don't make it, the technology will survive them. If they do, they'll be competing on quality and cost.
As for state sponsorship, a lot of things are state sponsored. Including in the US. Silicon Valley has a rich history that is rooted in massive government funding programs. There's a great documentary out there the secret history of Silicon Valley on this. Not to mention all the "cheap" gas that is currently powering data centers of course comes on the back of a long history of public funding being channeled into the oil and gas industry.
>As for state sponsorship, a lot of things are state sponsored.
You can make any comparison you want if you use adjectives rather than values. I can say that cars use a massive amount of water (all those radiators!) to try and downplay agricultural water usage. But its blatantly disingenuous.
SV is overwhelmingly private (actual constitutional private) money. To the point that you should disregard people saying otherwise, just like you would the people saying cars use massive amounts of water.
So an OPEN model that I can run on my own fucking hardware will entrench China in global supply and logistics how?
Contrary: How will the closed, proprietary models from Anthropic, "Open"AI and Co. lead us all to freedom? Freedom of what exactly? Freedom of my money?
At some point this "anti-communism" bullshit propaganda has to stop. And that moment was decades ago!
Their Plus series have existed since Qwen chat was available , as far as I remember. I can at least remember trying out their Plus model early last year.
With them comparing to Opus 4.5, I find it hard to take some of these in good faith. Opus 4.7 is new, so I don't expect that, but Opus 4.6 has been out for quite some time.
The thing is, Opus 4.5 is where the model reached Good Enough, at least for a wide variety of problems I use LLMs for. Before that, I almost never thought it was a more productive use of my time to use AI for development tasks, because it would always hallucinate something that would waste a bunch of my time. It just wasn't a good trade.
But, if for some reason everything stopped at Opus 4.5 level and we never got a better model (and 4.6/4.7 are better, if only marginally so and mostly expanding the kind of work it can do rather than making it better at making web apps), we could still do a lot of real work real fast with Opus 4.5, and software development would never go back to everyone handwriting most of the code.
A model as good as Opus 4.5 (or slightly better according to the mostly easily gamed benchmarks) at a 10th the price is probably a worthwhile proposition for a lot of people. $100 a month, or more, to get Opus 4.7 is well worth it for a western developer...the time the lower-end models waste is far more expensive than the cost of using the most expensive models. For the foreseeable future, I'll keep paying a premium for the models that waste less of my time and produce better results with less prodding.
But, also, it's wild how fast things move. Open models you can run on relatively modest hardware are competitive with frontier models of two years ago. I mean, you can run Qwen 3.6 MoE 35B A3B or the larger Gemma 4 models on normal hardware, like a beefy Macbook or a Strix Halo or any recentish 24GB/32GB GPU...not much more expensive than the average developer laptop of pre-AI times. And, it can write code. It can write decent prose (Qwen is maybe better at code, Gemma definitely has better prose), they can use tools, they have a big enough context window for real work. They aren't as good as Opus 4.5, yet.
Anyway, I use several models at this point, for security and code reviews, even if Claude Code with Opus is still obviously the best option for most software development tasks. I'll give Qwen a try, too. I like their small models, which punch well above their weight, I'll probably like the big one, too.
If money is no object, then nothing else is worth considering if it isn't Codex 5.4/Opus 4.7/SOTA. But for many to most people, value Vs. relative quality are huge levers.
Even many people on a Claude subscription aren't choosing or able to choose Opus 4.7 because of those cost/usage pressures. Often using Sonnet or an older opus, because of the value Vs. quality curve.
anecdotally the quality of output isn't significantly different, the speed seems to be what you're really paying for, and since the alternative is free I'll stick to local.
Cost may or may not be a factor in my choice of model, but knowing the capabilities and knowing they will remain consistent, reliable, and available over time is always a dominant consideration. Lately, Anthropic in particular has not been great at that.
Unfortunately, like with the release of Qwen3.6-Plus, this model also isn’t released for local use. From the linked article: “Qwen3.6-Max-Preview is the hosted proprietary model available via Alibaba Cloud Model Studio”
When Sonnet 4.6 was released, I switchmed my default from Opus to Sonnet because it was about en par with Opus 4.5. While 4.6 and 4.7 are "better", the leap is too small for most tasks for me to need it, and so reducing cost is now a valid reason to stay at that level.
If even cheaper models start reaching that level (GLM 5.1 is also close enough that I'm using it at lot), that's a big deal, and a totally valid reason to compare against Opus 4.5
Nowadays, I'm working on a realtime path tracer where you need proper understanding of microfacet reflection models, PDFs, (multiple) importance sampling, ReSTIR, etc.. Saying that mine is a somewhat specific use case.
And I use Claude, Gemini, GLM, Qwen to double check my math, my code and to get practical information to make my path tracer more efficient. Claude and Gemini failed me more than a couple of times with wrong, misleading and unnecessary information but on the other hand Qwen always gave me proper, practical and correct information. I’ve almost stopped using Claude and Gemini to not to waste my time anymore.
Claude code may shine developing web applications, backends and simple games but it's definitely not for me. And this is the story of my specific use case.
I have said similar things about someone experiencing similar things while writing some OpenGL code (some raytracing etc) that these models have very little understanding and aren't good at anything beyond basic CRUD web apps.
In my own experience, even with web app of medium scale (think Odoo kind of ERP), they are next to useless in understanding and modling domain correctly with very detailed written specs fed in (whole directory with index.md and sub sections and more detailed sections/chapters in separate markdown files with pointers in index.md) and I am not talking open weight models here - I am talking SOTA Claude Opus 4.6 and Gemini 3.1 Pro etc.
But that narrative isn't popular. I see the parallels here with the Crypto and NFT era. That was surely the future and at least my firm pays me in cypto whereas NFTs are used for rewarding bonusess.
a semester ago i was taking a machine learning exam in uni and the exam tasked us with creating a neural network using only numerical libraries (no pytorch ecc). I'm sure that there are a huge lot of examples looking all the same, but given that we were just students without a lot of prior experience we probably deviated from what it had in its training data, with more naive or weird solutions. Asking gemini 3 to refactor things or in very narrow things to help was ok, but it was quite bad at getting the general context, and spotting bugs, so much that a few times it was easier to grab the book and get the original formula right
otoh, we spotted a wrong formula regarding learning rate on wikipedia and it is now correct :) without gemini and just our intuition of "mhh this formula doesn't seem right", that definitely inflated our ego
What size of Qwen is that, though? The largest sizes are admittedly difficult to run locally (though this is an issue of current capability wrt. inference engines, not just raw hardware).
How "social" does Quen feel? The way I am using LLMs for coding makes this actually the most important aspect by now. Claude 4.6 felt like a nice knowledgeable coworker who shared his thinking while solving problems. Claude 4.7 is the difficult anti-social guy who jumps ahead instead of actually answering your questions and does not like to talk to people in general. How are Qwen's social skills?
This is not my experience at all, Qwen3.6-Plus spits out multiple paragraphs of text for the prompts I give. It wasn't like this before. Now I have to explicitly tell it not to yap so much and keep it short, concise and direct.
Everybody's out here chasing SOTA, meanwhile I'm getting all my coding done with MiniMax M2.5 in multiple parallel sessions for $10/month and never running into limits.
For serious work, the difference between spending $10/month and $100/month is not even worth considering for most professional developers. There are exceptions like students and people in very low income countries, but I’m always confused by developers with in careers where six figure salaries are normal who are going cheap on tools.
I find even the SOTA models to be far away from trustworthy for anything beyond throwaway tasks. Supervising a less-than-SOTA model to save $10 to $100 per month is not attractive to me in the least.
I have been experimenting with self hosted models for smaller throwaway tasks a lot. It’s fun, but I’m not going to waste my time with it for the real work.
You need to supervise the model anyway, because you want that code to be long-term maintainable and defect free, and AI is nowhere near strong enough to guarantee that anytime soon. Using the latest Opus for literally everything is just a huge waste of effort.
For actually serious work, it's a stark difference if your proprietary and security relevant code is sent abroad to a foreign, possibly future hostile country, or is sent to some data center around the corner. It doesn't even need to be defence related.
AFAIK all these companies have SOTA or near-SOTA models available under enterprise licenses. AI companies are not interested in your secret sauce, they are trying to capture the SDLC wholesale.
If an American company, let's say a company that writes software for power stations, would use the services of a French or Chinese AI company under such enterprise licenses, how long would you think it would take until someone in Congress, e.g., would interfere?
I find it odd that none of OpenAI models was used in comparison, but used Z GLM 5.1. Is Z (GLM 5.1) really that good? It is crushing Opus 4.5 in these benchmarks, if that is true, I would have expected to read many articles on HN on how people flocked CC and Codex to use it.
GLM 5.1 is pretty good, probably the best non-US agentic coding model currently available. But both GLM 5.0 and 5.1 have had issues with availability and performance that makes them frustrating to use. Recently GLM 5.1 was also outputting garbage thinking traces for me, but that appears to be fixed now.
In fact it is appreciated that Qwen is comparing to a peer. I myself and several eng I know are trying GLM. It's legit. Definitely not the same as Codex or Opus, but cheaper and "good enough". I basically ask GLM to solve a program, walk away 10-15 minutes, and the problem is solved.
cheaper is quite subjective, I just went to their pricing page [0] and cost saving compared to performance does not sell it well (again, personal opinion).
CC has a limited capacity for Opus, but fairly good for Sonnet. For Codex, never had issues about hitting my limits and I'm only a pro user.
Yes. GLM 5.1 is that good. I don't think it is as good as Claude was in January or February of this year, but it is similar to how Claude runs now, perhaps better because I feel like it's performance is more consistent.
GLM-5 is good, like really good. Especially if you take pricing into consideration. I paid 7$ for 3 months. And I get more usage than CC.
They have difficulty supplying their users with capacity, but in an email they pointed out that they are aware of it. During peak hours, I experience degraded performance. But I am on their lowest tier subscription, so I understand if my demand is not prioritized during those hours.
I'm using GLM 5.1 for the last two weeks as a cheaper alternative to Sonnet, and it's great - probably somewhere between Sonnet and Opus. It's pretty slow though.
GLM 5.1 is the first model I've found good enough to spring for a subscription for other than Claude and Codex.
It's not crushing Opus 4.5 in real-life use for me, but it's close enough to be near interchangeable with Sonnet for me for a lot of tasks, though some of the "savings" are eaten up by seemingly using more tokens for similar complexity tasks (I don't have enough data yet, but I've pushed ~500m tokens through it so far.
I've been using it through OpenCode Go and it does seem decent in my limited experience. I haven't done anything which I could directly compare to Opus yet though.
I did give it one task which was more complex and I was quite impressed by. I had a local setup with Tiltdev, K3S and a pnpm monorepo which was failing to run the web application dev server; GLM correctly figured out that it was a container image build cache issue after inspecting the containers etc and corrected the Tiltfile and build setup.
Most HN commenters seem to be a step behind the latest developments, and sometimes miss them entirely (Kimi K2.5 is one example). Not surprising as most people don't want to put in the effort to sift through the bullshit on Twitter to figure out the latest opinions. Many people here will still prefer the output of Opus 4.5/4.6/4.7, nowadays this mostly comes down to the aesthetic choices Anthropic has made.
Not just aesthetics though, from time to time I implement the same feature with CC and Codex just to compare results, and I yet to find Codex making better decisions or even the completeness of the feature.
For more complicated stuff, like queries or data comparison, Codex seems always behind for me.
its an SKU from OpenAI's perspective, broader goal and vision is (was) different. Look at the Claude and GLM, both were 95% committed to dev tooling: best coding models, coding harness, even their cowork is built on top of claude code
I'm not sure how this makes sense when Claude models aren't even coding specific: Haiku, Sonnet, Opus are the exact same models you'd use for chat or (with the recent Mythos) bleeding edge research.
But they detect it under the hood and apply a similar "variant", as API results are not the same than on Claude Code (that was documented before by someone).
Ok I find it funny that people compare models and are like, opus 4.7 is SOTA and is much better etc, but I have used glm 5.1 (I assume this comes form them training on both opus and codex) for things opus couldn't do and have seen it make better code, haven't tried the qwen max series but I have seen the local 122b model do smarter more correct things based on docs than opus so yes benchmarks are one thing but reality is what the modes actually do and you should learn and have the knowledge of the real strengths that models posses. It is a tool in the end you shouldn't be saying a hammer is better then a wrench even tho both would be able to drive a nail in a piece of wood.
I tried GLM and Qwen last week for a day. And some issues it could solve, while some, on surface relatively easy, task it just could not solve after a few tries, that Opus oneshotted this morning with the same prompt. It’s a single example ofcourse, but I really wanted to give it a fair try. All it had to do was create a sortable list in Magento admin. But on the other hand, GLM did oneshot a phpstorm plugin
Many people averted religion (which I can get behind with), but have never removed the dogmatic thinking that lay at its root.
As so many things these days: It's a cult.
I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I've also tried to use it for GPU programming where it absolutely sucks at, with Sonnet, Opus 4.5 and 4.6
But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"
For me it's just a tool, so I shrug.
I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.
In the same way, it’s hard to see how people who say they’re struggling are actually using it.
There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.
> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.
I wonder about this. I see two obvious possibilities (if we ignore bias):
1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.
2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model’s work leaning more and more to the model’s. When a new model is released, it produces better quality code then before, so the work improves with it, but your talent keeps deteriorating at a constant rate.
What is it that is dogma free? If one goes hardcore pyrrhonism, doubting that there is anything currently doubting as this statement is processed somehow, that is perfectly sound.
At some point the is a need to have faith in some stable enough ground to be able to walk onto.
All people think dogmatically. The only difference is what the ontological commitments and methaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever in there. Its inescapable.
Allow me to introduce you to Buddhism
Dogmatism is a spectrum and for too many people it's on the animal side of the scale.
The way to develop in this space seems to be to give away free stuff, get your name out there, then make everything proprietary. I hope they still continue releasing open weights. The day no one releases open weights is a sad day for humanity. Normal people won’t own their own compute if that ever happens.
I think that's an overgeneralization. We've seen all the American models be closed and proprietary from the start. Meanwhile the non-American (especially the Chinese ones) have been open since the start. In fact they often go the opposite direction. Many Chinese models started off proprietary and then were later opened up (like many of the larger Qwen models)
> We've seen all the American models be closed and proprietary from the start
What about Gemma and Llama and gpt-oss, not to mention lots of smaller/specialized models from Nvidia and others?
I would never argue that China isn't ahead in the open weights game, of course, but it's not like it's "all" American models by any stretch.
gpt-oss is good but I haven't heard anything about an update. It seems like one and done, to shut up people complaining about non-Open AI
> We've seen all the American models be closed and proprietary from the start.
Most*.
OpenAI, contrary to popular belief, actually used to believe in open research and (more or less) open models. GPT1 and GPT2 both were model+code releases (although GPT2 was a "staged" release), GPT3 ended up API-only.
OpenAI has released their GPT-OSS series more recently.
That's fair but those days seem so long gone now.
Also the Chinese models aren't following a typical American SaaS playbook which relies on free/cheap proprietary software for early growth. They are not just publishing their weights but also their code and often even publishing papers in Open Access journals to explicitly highlight what methods and advancements were made to accomplish their results
The Nvidia Nemotron models are recent, and of course the Gemma 4 series from Google.
I think it is in the interest of chip makers to make sure we all get local models
I think they're in a win-win situation. Big AI companies would love to see local computing die in favour of the cloud because they are well aware the moment an open model that can run on non ludicrous consumer hardware appears, they're screwed. In this situation Nvidia, AMD and the like would be the only ones profiting from it - even though I'm not convinced they'd prefer going back to fighting for B2C while B2B Is so much simpler for them
If you want to run AI models at scale and with reasonably quick response, there's not many alternatives to datacenter hardware. Consumer hardware is great for repurposing existing "free" compute (including gaming PCs, pro workstations etc. at the higher end) and for basic insurance against rug pulls from the big AI vendors, but increased scale will probably still bring very real benefits.
Currently, yes. But I don't find it hard to imagine that in a while we could get reasonably light open models with a level of reasoning similar to current opus, for instance. In such a scenario how many people would opt to pay for a way more expensive cloud subscription? Especially since lots of people are already not that interested in paying for frontier models nowadays where it makes sense. Unless keep on getting a constant, never ending stream of improvements we're basically bound to get to a point where unless you really need it you are ok with the basic, cheaper local alternative you don't have to pay for monthly.
This is really just a question of product design meeting the technology.
Today, lots of integer compute happens on local devices for some purposes, and in the cloud for others.
Same is already true for matmul, lots of FLOPS being spent locally on photo and video processing, speech to text, …
No obvious reason you wouldn’t want to specialize LLM tasks similarly, especially as long-running agents increasingly take over from chatbots as the dominant interaction architecture.
I think average users are already okay with the reasoning level they'd get with current open models. But the big AI firms have pivoted their frontier models towards the enterprise: coding and research, as opposed to general chat. And scale is quite important for these uses, ordinary pro hardware is not enough.
At a consistent amount of usage, datacenters are at least an order of magnitude more hardware efficient. I'm sure Nvidia and AMD would be fine fighting for B2C if it meant volume would be 10+x.
Now, given they can't satisfy current volume, they are forced to settle for just having crazy margins.
The problem with B2C is that you need to have leverage of some kind (more demanding applications, planned obsolescence, ...) in order to get people to keep on buying your product. The average consumer may simply consider themselves satisfied with their old product they already own and only replace it when it breaks down. On the contrary, with the cloud you can keep people hooked on getting the latest product whether they need it or not, and get artificial demand from datacentres and such.
Definitely. Many big hardware firms are directly supporting HuggingFace for this very reason.
True, chip companies have the opposite mindset, Nvidia is making their own open weights I believe
This is obviously a strategic move at a national level. Keep publishing competing free models to erode the moat western companies could have with their proprietary models. As long as the narrative serves China there will be no turn to proprietary models.
Always has been, it’s literally saas; the slight difference is that the lowest tier subscriptions at the frontier labs are basically free trials nowadays, too
Any reason for them to do this other than altruism? I don’t think this can be regulated.
Bake ads into them.
Its the new freeware model!
I'm a little more optimistic than that. I suspect that the open-weight models we already have are going to be enough to support incremental development of new ones, using reasonably-accessible levels of compute.
The idea that every new foundation model needs to be pretrained from scratch, using warehouses of GPUs to crunch the same 50 terabytes of data from the same original dumps of Common Crawl and various Russian pirate sites, is hard to justify on an intuitive basis. I think the hard work has already been done. We just don't know how to leverage it properly yet.
Change layer size and you have to retrain. Change number of layers and you have to retrain. Change tokenization and you have to retrain.
Hopefully we will find a way to make it so that making minor changes don't require a full retrain. Training how to train, as a concept, comes to mind.
I do not think it's common crawl anymore, its common crawl++ using paid human experts to generate and verify new content, weather its code or research.
I believe US is building this off the cost difference from other countries using companies like scale, outlier etc, while china has the internal population to do this
The Chinese state wants the world using their models.
People think that Chinese AI labs are just super cool bros that love sharing for free.
The don't understand it's just a state sponsored venture meant to further entrench China in global supply and logistics. China's VCs are Chinese banks and a sprinkle of "private" money. Private in quotes because technically it still belongs to the state anyway.
China doesn't have companies and government like the US. It just has government, and a thin veil of "company" that readily fool westerners.
As opposed to the US, which just has companies and a thin veil of “government”.
Also many of these Chinese companies aren't just opening their weights. They are open sourcing their code AND publishing detailed research papers alongside them to reveal how they accomplished what they accomplished.
That's very different from an American SaaS model which relies of free but proprietary software for early growth
I'm not sure how local AI models are meant to "entrench China in global supply and logistics". The two areas have nothing to do with one another. You can easily run a Chinese open model on all-American hardware.
They are building a pipeline, and the goal is to get people in the door.
If you forever stand at the entrance eating the free samples, that's fine, they don't care. Other people are going through the door and you are still consuming what they feed you. Doesn't mean it's going to be bad or evil, but they are staking their territory of control.
Oh for sure, they're getting a whole lot of Chinese people and other non-Westerners through the door already - mostly, the people who are being ignored or even blocked outright by the big Western labs. That's territory we purposely abandoned, and they're going to control it by default.
Like with nuclear technology, it's not healthy for only one country to dominate AI. The cat is already out of the bag and many countries now have the ability to train and run models. Silicon Valley has bootstrapped this space. But it should be noted that they are using AI talent from all over the world and it was sort of inevitable that this technology would get around. Lots of Chinese, Indian, Russian, and Europeans are involved.
As for what comes next, it's probably going to be a bit of a race for who can do the most useful and valuable things the cheapest. If OpenAI and Anthropic don't make it, the technology will survive them. If they do, they'll be competing on quality and cost.
As for state sponsorship, a lot of things are state sponsored. Including in the US. Silicon Valley has a rich history that is rooted in massive government funding programs. There's a great documentary out there the secret history of Silicon Valley on this. Not to mention all the "cheap" gas that is currently powering data centers of course comes on the back of a long history of public funding being channeled into the oil and gas industry.
>As for state sponsorship, a lot of things are state sponsored.
You can make any comparison you want if you use adjectives rather than values. I can say that cars use a massive amount of water (all those radiators!) to try and downplay agricultural water usage. But its blatantly disingenuous.
SV is overwhelmingly private (actual constitutional private) money. To the point that you should disregard people saying otherwise, just like you would the people saying cars use massive amounts of water.
So an OPEN model that I can run on my own fucking hardware will entrench China in global supply and logistics how?
Contrary: How will the closed, proprietary models from Anthropic, "Open"AI and Co. lead us all to freedom? Freedom of what exactly? Freedom of my money?
At some point this "anti-communism" bullshit propaganda has to stop. And that moment was decades ago!
Anything that isn't explicitly to the benefit of US interests must be against them /s
So what?
I still prefer that over US total dominance.
Let them fight it out.
I'd get a bit informed about what exactly Chinese dominance entails. Ask a few Uyghurs, Cantonese Hong Kongers, or even Tibetans.
Then decide ...
Well, isn't this what the US and really any other power in the world has always done, since forever?
Notice the pattern that Chinese providers are now:
1. Keeping models closed source.
2. Jacking up pricing. A lot. Sometimes up to 100% increase.
Huh yeah, that's truly a unique trait these Chinese companies don't share with companies in other countries.
US companies hate that trick?!
Well, they can't subsidize forever. And, it is kinda expected?
The fun thing is, you can be aware of the entire range of Qwen models that are available for local running, but not at all about their cloud models.
I knew of all the 3.5’s and the one 3.6, but only now heard about the Plus.
Their Plus series have existed since Qwen chat was available , as far as I remember. I can at least remember trying out their Plus model early last year.
With them comparing to Opus 4.5, I find it hard to take some of these in good faith. Opus 4.7 is new, so I don't expect that, but Opus 4.6 has been out for quite some time.
The thing is, Opus 4.5 is where the model reached Good Enough, at least for a wide variety of problems I use LLMs for. Before that, I almost never thought it was a more productive use of my time to use AI for development tasks, because it would always hallucinate something that would waste a bunch of my time. It just wasn't a good trade.
But, if for some reason everything stopped at Opus 4.5 level and we never got a better model (and 4.6/4.7 are better, if only marginally so and mostly expanding the kind of work it can do rather than making it better at making web apps), we could still do a lot of real work real fast with Opus 4.5, and software development would never go back to everyone handwriting most of the code.
A model as good as Opus 4.5 (or slightly better according to the mostly easily gamed benchmarks) at a 10th the price is probably a worthwhile proposition for a lot of people. $100 a month, or more, to get Opus 4.7 is well worth it for a western developer...the time the lower-end models waste is far more expensive than the cost of using the most expensive models. For the foreseeable future, I'll keep paying a premium for the models that waste less of my time and produce better results with less prodding.
But, also, it's wild how fast things move. Open models you can run on relatively modest hardware are competitive with frontier models of two years ago. I mean, you can run Qwen 3.6 MoE 35B A3B or the larger Gemma 4 models on normal hardware, like a beefy Macbook or a Strix Halo or any recentish 24GB/32GB GPU...not much more expensive than the average developer laptop of pre-AI times. And, it can write code. It can write decent prose (Qwen is maybe better at code, Gemma definitely has better prose), they can use tools, they have a big enough context window for real work. They aren't as good as Opus 4.5, yet.
Anyway, I use several models at this point, for security and code reviews, even if Claude Code with Opus is still obviously the best option for most software development tasks. I'll give Qwen a try, too. I like their small models, which punch well above their weight, I'll probably like the big one, too.
If money is no object, then nothing else is worth considering if it isn't Codex 5.4/Opus 4.7/SOTA. But for many to most people, value Vs. relative quality are huge levers.
Even many people on a Claude subscription aren't choosing or able to choose Opus 4.7 because of those cost/usage pressures. Often using Sonnet or an older opus, because of the value Vs. quality curve.
anecdotally the quality of output isn't significantly different, the speed seems to be what you're really paying for, and since the alternative is free I'll stick to local.
Codex 5.4 is not out?
Cost may or may not be a factor in my choice of model, but knowing the capabilities and knowing they will remain consistent, reliable, and available over time is always a dominant consideration. Lately, Anthropic in particular has not been great at that.
Also us weirdos with local model uses. But your point stands.
Unfortunately, like with the release of Qwen3.6-Plus, this model also isn’t released for local use. From the linked article: “Qwen3.6-Max-Preview is the hosted proprietary model available via Alibaba Cloud Model Studio”
The Max series was never available for local use, though. So this is expected.
Codex subscription is very generous at pro tiers
When Sonnet 4.6 was released, I switchmed my default from Opus to Sonnet because it was about en par with Opus 4.5. While 4.6 and 4.7 are "better", the leap is too small for most tasks for me to need it, and so reducing cost is now a valid reason to stay at that level.
If even cheaper models start reaching that level (GLM 5.1 is also close enough that I'm using it at lot), that's a big deal, and a totally valid reason to compare against Opus 4.5
Wow I couldn't disagree more.
For me, Opus 4.5 and 4.6 feel so different compared to sonnet.
Maybe I'm lazy or something but sonnet is much worse in my experience at inferring intent correctly if I've left any ambiguity.
That effect is super compounding.
Opus 4.6 performance has been so wildly inconsistent over the past couple of months, why waste the tokens?
You compare with what's most comparable.
In any case a benchmark provided by the provider is always biased, they will pick the frameworks where their model fares well. Omit the others.
Independent benchmarks are the go to.
Opus 4.6 was released in February. It can take quite some time to run all these benchmarks properly
Quite some time is a little over 2 months. I understand this is actually true right now, but it’s still a bit hard to accept.
Comparing it with Opus 4.6 is difficult, since Anthropic may ban accounts and accuse users of state-sponsored hacking.
I think its only been like 10 weeks. I meant that's forever in AI time, but not a long time in normie people time.
I think the benchmarks and numbers need to be easier to read. Those benchmarks are useless to the regular consumer.
Nowadays, I'm working on a realtime path tracer where you need proper understanding of microfacet reflection models, PDFs, (multiple) importance sampling, ReSTIR, etc.. Saying that mine is a somewhat specific use case.
And I use Claude, Gemini, GLM, Qwen to double check my math, my code and to get practical information to make my path tracer more efficient. Claude and Gemini failed me more than a couple of times with wrong, misleading and unnecessary information but on the other hand Qwen always gave me proper, practical and correct information. I’ve almost stopped using Claude and Gemini to not to waste my time anymore.
Claude code may shine developing web applications, backends and simple games but it's definitely not for me. And this is the story of my specific use case.
I have said similar things about someone experiencing similar things while writing some OpenGL code (some raytracing etc) that these models have very little understanding and aren't good at anything beyond basic CRUD web apps.
In my own experience, even with web app of medium scale (think Odoo kind of ERP), they are next to useless in understanding and modling domain correctly with very detailed written specs fed in (whole directory with index.md and sub sections and more detailed sections/chapters in separate markdown files with pointers in index.md) and I am not talking open weight models here - I am talking SOTA Claude Opus 4.6 and Gemini 3.1 Pro etc.
But that narrative isn't popular. I see the parallels here with the Crypto and NFT era. That was surely the future and at least my firm pays me in cypto whereas NFTs are used for rewarding bonusess.
a semester ago i was taking a machine learning exam in uni and the exam tasked us with creating a neural network using only numerical libraries (no pytorch ecc). I'm sure that there are a huge lot of examples looking all the same, but given that we were just students without a lot of prior experience we probably deviated from what it had in its training data, with more naive or weird solutions. Asking gemini 3 to refactor things or in very narrow things to help was ok, but it was quite bad at getting the general context, and spotting bugs, so much that a few times it was easier to grab the book and get the original formula right
otoh, we spotted a wrong formula regarding learning rate on wikipedia and it is now correct :) without gemini and just our intuition of "mhh this formula doesn't seem right", that definitely inflated our ego
Someone exactly said it better here[0] already.
[0]. https://news.ycombinator.com/item?id=47817982
What size of Qwen is that, though? The largest sizes are admittedly difficult to run locally (though this is an issue of current capability wrt. inference engines, not just raw hardware).
I'm directly using https://chat.qwen.ai (Qwen3.6-Plus) and planning to switch to Qwen Code with subscription.
You may be interested in "radiance cascades"
How "social" does Quen feel? The way I am using LLMs for coding makes this actually the most important aspect by now. Claude 4.6 felt like a nice knowledgeable coworker who shared his thinking while solving problems. Claude 4.7 is the difficult anti-social guy who jumps ahead instead of actually answering your questions and does not like to talk to people in general. How are Qwen's social skills?
Qwen feels like wise Chinese philosopher. Talks in very short elegant sentences, but does very solid work.
> Talks in very short elegant sentences
This is not my experience at all, Qwen3.6-Plus spits out multiple paragraphs of text for the prompts I give. It wasn't like this before. Now I have to explicitly tell it not to yap so much and keep it short, concise and direct.
Everybody's out here chasing SOTA, meanwhile I'm getting all my coding done with MiniMax M2.5 in multiple parallel sessions for $10/month and never running into limits.
For serious work, the difference between spending $10/month and $100/month is not even worth considering for most professional developers. There are exceptions like students and people in very low income countries, but I’m always confused by developers with in careers where six figure salaries are normal who are going cheap on tools.
I find even the SOTA models to be far away from trustworthy for anything beyond throwaway tasks. Supervising a less-than-SOTA model to save $10 to $100 per month is not attractive to me in the least.
I have been experimenting with self hosted models for smaller throwaway tasks a lot. It’s fun, but I’m not going to waste my time with it for the real work.
You need to supervise the model anyway, because you want that code to be long-term maintainable and defect free, and AI is nowhere near strong enough to guarantee that anytime soon. Using the latest Opus for literally everything is just a huge waste of effort.
Waste of effort... of Opus? If "Opus effort" is cheaper, than dev hours managing yourself more dumb/effective model, what is the point?
rich people dont concern themselves with the cost of tokens.
For actually serious work, it's a stark difference if your proprietary and security relevant code is sent abroad to a foreign, possibly future hostile country, or is sent to some data center around the corner. It doesn't even need to be defence related.
AFAIK all these companies have SOTA or near-SOTA models available under enterprise licenses. AI companies are not interested in your secret sauce, they are trying to capture the SDLC wholesale.
If an American company, let's say a company that writes software for power stations, would use the services of a French or Chinese AI company under such enterprise licenses, how long would you think it would take until someone in Congress, e.g., would interfere?
I find it odd that none of OpenAI models was used in comparison, but used Z GLM 5.1. Is Z (GLM 5.1) really that good? It is crushing Opus 4.5 in these benchmarks, if that is true, I would have expected to read many articles on HN on how people flocked CC and Codex to use it.
GLM 5.1 is pretty good, probably the best non-US agentic coding model currently available. But both GLM 5.0 and 5.1 have had issues with availability and performance that makes them frustrating to use. Recently GLM 5.1 was also outputting garbage thinking traces for me, but that appears to be fixed now.
Use them via DeepInfra instead of z.ai. No reliability issues.
https://deepinfra.com/zai-org/GLM-5.1
Looks like fp4 quantization now though? Last week was showing fp8. Hm..
Deepinfra's implementation of it is not correct. Thinking is not preserved, and they're not responding to my submitted issue about it.
I also regularly experience Deepinfra slow to an absolute crawl - I've actually gotten more consistent performance from Z.ai.
I really liked Deepinfra but something doesn't seem right over there at the moment.
Damn. Yeah, that sucks. I did play with it earlier again and it did seem to slow down.
It's frankly a bummer that there's not seemingly a better serving option for GLM 5.1 than z.AI, who seems to have reliability and cost issues.
If you only look at open models, GLM 5.1 is the best performance you can get on on the Pareto distribution
https://arena.ai/leaderboard/text?viewBy=plot&license=open-s...
In fact it is appreciated that Qwen is comparing to a peer. I myself and several eng I know are trying GLM. It's legit. Definitely not the same as Codex or Opus, but cheaper and "good enough". I basically ask GLM to solve a program, walk away 10-15 minutes, and the problem is solved.
cheaper is quite subjective, I just went to their pricing page [0] and cost saving compared to performance does not sell it well (again, personal opinion).
CC has a limited capacity for Opus, but fairly good for Sonnet. For Codex, never had issues about hitting my limits and I'm only a pro user.
https://z.ai/subscribe
Yes. GLM 5.1 is that good. I don't think it is as good as Claude was in January or February of this year, but it is similar to how Claude runs now, perhaps better because I feel like it's performance is more consistent.
GLM-5 is good, like really good. Especially if you take pricing into consideration. I paid 7$ for 3 months. And I get more usage than CC.
They have difficulty supplying their users with capacity, but in an email they pointed out that they are aware of it. During peak hours, I experience degraded performance. But I am on their lowest tier subscription, so I understand if my demand is not prioritized during those hours.
Where are you getting 3 months for $7?
I'm using GLM 5.1 for the last two weeks as a cheaper alternative to Sonnet, and it's great - probably somewhere between Sonnet and Opus. It's pretty slow though.
This is what kills it for me… The long thinking blocks can make a simple task take 30 minutes.
GLM 5.1 is the first model I've found good enough to spring for a subscription for other than Claude and Codex.
It's not crushing Opus 4.5 in real-life use for me, but it's close enough to be near interchangeable with Sonnet for me for a lot of tasks, though some of the "savings" are eaten up by seemingly using more tokens for similar complexity tasks (I don't have enough data yet, but I've pushed ~500m tokens through it so far.
I've been using it through OpenCode Go and it does seem decent in my limited experience. I haven't done anything which I could directly compare to Opus yet though.
I did give it one task which was more complex and I was quite impressed by. I had a local setup with Tiltdev, K3S and a pnpm monorepo which was failing to run the web application dev server; GLM correctly figured out that it was a container image build cache issue after inspecting the containers etc and corrected the Tiltfile and build setup.
Most HN commenters seem to be a step behind the latest developments, and sometimes miss them entirely (Kimi K2.5 is one example). Not surprising as most people don't want to put in the effort to sift through the bullshit on Twitter to figure out the latest opinions. Many people here will still prefer the output of Opus 4.5/4.6/4.7, nowadays this mostly comes down to the aesthetic choices Anthropic has made.
Not just aesthetics though, from time to time I implement the same feature with CC and Codex just to compare results, and I yet to find Codex making better decisions or even the completeness of the feature.
For more complicated stuff, like queries or data comparison, Codex seems always behind for me.
maybe they decided OpenAI has different market, hence comparing only with companies who are focusing in dev tooling: Claude, GLM
Haven’t you heard about Codex?
its an SKU from OpenAI's perspective, broader goal and vision is (was) different. Look at the Claude and GLM, both were 95% committed to dev tooling: best coding models, coding harness, even their cowork is built on top of claude code
I'm not sure how this makes sense when Claude models aren't even coding specific: Haiku, Sonnet, Opus are the exact same models you'd use for chat or (with the recent Mythos) bleeding edge research.
Anthropic models and training data is optimized for coding use cases, this is the difference.
OpenAI on the other hand has different models optimized for coding, GPT-x-codex, Anthropic doesnt have this distinction
But they detect it under the hood and apply a similar "variant", as API results are not the same than on Claude Code (that was documented before by someone).
Yeah GLM’s great for coding, code review, and tool use. Not amazing at other domains.
I use it and think its intelligence compares favorably with OpenAI and Anthropic workhorses. Its biggest weakness is its speed.
I am trying since one week to subscribe Alibaba Coding Plan (to use Qwen 3.6 Plus) but it's always out of stock.
They brag about Qwen but don't let people use it.