if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
I usually have Claude build a plan first, then I put it into an XML file it updates with phases, usually we talk about some of those tasks, and then once its good and I like it, I have Claude implement the plan.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.
Could it be possible, these firms are optimizing for two things: a) Better performance. b) Gathering data from you to further improve performance later. I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior!
I think they simply optimize around E2E benchmarks, none of those benchmarks is designed as multi turn assistant to the user, but going from a prompt to the final solution.
GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.
If you want reasonable token usage, you need to run it GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks). And it cuts token usage by 2 a 2.5x. GLM 5.2, Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8 its little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none i tell you! /sarcasm
distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scrapped data the Chinese have is from several generations ago.
> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium GLM5.1xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd a big gap to bridge.
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
Why aren't more people talking about this? It's literally Opus 4.7 quality stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates at 3x lower than the official ZAI api rates which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
> Some are even offering API rates at 3x lower than the official ZAI api rates
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper then API rates. And about 2.5x more tokens vs zai their own subscription service.
Yes, its FP8 but lets be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.
Be careful about unofficial providers, a lot of them misconfigure models or stealth quantize them. For a while the difference between Kimi on the official API and most third party providers was 20-40%.
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, things like the submission is just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
To answer the question in your first sentence - because it's VERY computationally (ha) expensive as a human being to keep up with all the options. It's also very hard to figure out how to run a model like this. There's no installer. If you really really care, which 99% of people do not, you have to google a guide, and then find out it's out of date...
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
You're seriously suggesting that setting up opencode or tweaking your claude code config or etc is too much trouble to be worth saving $50 /mo? That's absurd. Doubly so when the audience in question is already using LLMs so ... just ask your existing LLM for help if it seems daunting.
I'm not just suggesting that, I'm trying to be crystal clear: it's a gap that probably cuts TAM by 95% or more. Most LLM users are not software engineers. Even those that are don't care enough to muck with their settings to try out a model. Keep in mind I'm not answering the question "Is this hard to install?" - I'm answering the question "Why aren't people talking about this?"
In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.
My Mac Studio uses about 60–80 watts whenever I’m running a model (as measured by the system metrics), so it’s less than 2 kWh/day at full blast. Electricity is like 0.125 €/kWh, so that 24-hour period would be <0.25 €.
Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
After having got a taste of Fable 5 for me Opus 4.8 doesn't cut it any more -- and I don't know how to put this, I don't know if it's just me, but it's rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7 succeeding where they flounder and fail but its way of expressing itself conversationally is making me reconsider my subscription …
GPT 5.5 xhigh is smarter than Fable but Fable like Opus 4.8 as well is faster and seems more “agentic”. It’s easy to test this. Build a fairly complex software with Claude(opus or Fable).
Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well(drive the dev with GPT and review with GPT and Claude). You get the same result
Claude has an edge though and that’s on building more beautiful user interfaces.
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
For an 8-bit quant (what people call "near lossless") you are looking at something like 4xMI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD
But prices are changing rapidly, and not for the better
This is not a new situation. This was happening also when good vision models like alexa net were coming through, especially for OCR. Companies had choice between cloud or self hosting with GPUs. But turns out, problem is usage patterns.
Your usage will peak during certain timezone work hours(even if you are a huge multinational company most of your engineers/users tend to be from only a few locations), so then you have a bunch of gpus doing nothing the rest of the day.
especially with latency sensitive stuff, this is a decades old tradeoff problem, its not unique to llms
So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.
Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.
I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.
if you can afford the investment you get stable low costs for years with better security (at least if your cyber team is good). its even better in regulated industries where some vendors might add a premium for hipaa/soc/pci dss compliance to the point its a lot cheaper to self host. for a smaller business its not worth it and you should just use a hosted open model.
I think the problem is, as can also be seen on other benchmarks, is that most models nowadays are focused more and more purely on tool calling and coding.
This means, that models are losing more and more general and domain-specific knowledge.
Look at those graphs on ARtificialAnalysis, GLM-5.1 still performs similarly or better:
I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.
man, i love dsv4-flash but i found its weaknesses in complex projects with multiple moving parts. tried kimi 2.6 and it understood and could work on the task. bigger is better..
> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.
Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?
Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining it's own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbots stuff to be honest).
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
Many other open source models have vision but they don't compare to GLM in terms of coding quality. So I don't think it's because of vision that the frontier models are better, it's more that they are probably just much bigger models.
That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.
With open weights LLMs, it is affordable to use many different models, each for whatever it is better.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Nonetheless, a lot of tokens that get wasted from this.
I haven't extensively used 5.2 yet, but it seems a lot better.
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be same as lower Anthropic models, useful for lots of grunt work.
The problem is that Ziphu really __really__ struggle with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues means DeepSeek is my main provider these days
It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.
Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
I have been trying out GLM 5.2 and I am really impressed by it for the most part.
To all people on Hackernews, I am curious as to what agent harness are you using it with.
Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.
Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.
I am unsure how much of there is a difference to it and how much of things are placebo but what is your opinion regarding the best Agent harness for GLM 5.2?
Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.
I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
I have a script that ranks these based on codingindex from Artificial Analysis.
All it does is pull a json from their main table page and parses it with the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
run it like so official repo where it lives: https://github.com/day50-dev/aa-eval-emailif people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
Consider using decrementing score order (best on top)
It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
This is a problem I find with opus is will spend so long thinking then going “but wait what if”
To point where I stop it and simple tell it to “start writing code you can work it out as you go along”
Seems writers block also effects LLM
I usually have Claude build a plan first, then I put it into an XML file it updates with phases, usually we talk about some of those tasks, and then once its good and I like it, I have Claude implement the plan.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.
Fable was 20 times worse on that.
It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.
Could it be possible, these firms are optimizing for two things: a) Better performance. b) Gathering data from you to further improve performance later. I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior!
I think they simply optimize around E2E benchmarks, none of those benchmarks is designed as multi turn assistant to the user, but going from a prompt to the final solution.
GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.
If you want reasonable token usage, you need to run it GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks). And it cuts token usage by 2 a 2.5x. GLM 5.2, Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8 its little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none i tell you! /sarcasm
distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scrapped data the Chinese have is from several generations ago.
>such outrageous copyright infringement
Sarcasm, considering the source of their own training data?
Narrator: it was sarcasm, indeed.
This is GLM 5.2 Max. GLM 5.2 High which use less than half[1] the tokens.
[1] https://z.ai/blog/glm-5.2
Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high.
They have this with a lot of models, measuring only the max setting, while the one you'd actually want to use for most tasks is much lower.
For the brief period with had Fable, I never had to use it above medium.
Low nailed the overwhelming majority of mundane tasks on it's own, medium was good for more complex stuff.
> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
And this was high, not max.
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium GLM5.1xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd a big gap to bridge.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
Why aren't more people talking about this? It's literally Opus 4.7 quality stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates at 3x lower than the official ZAI api rates which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
> Some are even offering API rates at 3x lower than the official ZAI api rates
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
edit:
I see, croft [2] 8bit for $0.50/$0.08/$2.20
[1]: https://openrouter.ai/z-ai/glm-5.2
[2]: https://ai.nahcrof.com/pricing
Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper then API rates. And about 2.5x more tokens vs zai their own subscription service.
Yes, its FP8 but lets be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.
Be careful about unofficial providers, a lot of them misconfigure models or stealth quantize them. For a while the difference between Kimi on the official API and most third party providers was 20-40%.
OpenRouter should be penalising or banning for this.
This is my biggest complaint about OpenRouter and I'm a fan. Might be pretty tough at scale?
the 2 I mentioned both have a fairly large following, who run benchmarks and absolutely will spot issues.
> Why aren't more people talking about this?
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, things like the submission is just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
To answer the question in your first sentence - because it's VERY computationally (ha) expensive as a human being to keep up with all the options. It's also very hard to figure out how to run a model like this. There's no installer. If you really really care, which 99% of people do not, you have to google a guide, and then find out it's out of date...
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
But it just works with Claude Code? They have a guide on their website.
https://docs.z.ai/devpack/tool/claude
Here's my setup. I add this to my .bashrc
export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'
Then I just run claudez
pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api
Even more pro tip: Claude Code can set this up for you haha
Sure, I'm not saying I, a software engineer, cannot do this. I'm saying it's significant onboarding friction.
Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!
You're seriously suggesting that setting up opencode or tweaking your claude code config or etc is too much trouble to be worth saving $50 /mo? That's absurd. Doubly so when the audience in question is already using LLMs so ... just ask your existing LLM for help if it seems daunting.
I'm not just suggesting that, I'm trying to be crystal clear: it's a gap that probably cuts TAM by 95% or more. Most LLM users are not software engineers. Even those that are don't care enough to muck with their settings to try out a model. Keep in mind I'm not answering the question "Is this hard to install?" - I'm answering the question "Why aren't people talking about this?"
Doesn't pass the sniff test. Casuals messing around already go to far more trouble to set up openclaw or comfyui or what have you.
What percentage of "casuals"? ;)
I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.
In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.
I’m not that interested in models that I can’t run on my desktop for ~0€, which is my AI budget.
Electricity cost seems to be about $30/month for a 32B model on a GPU. It's probably better on Apple hardware.
https://github.com/QuantiusBenignus/Zshelf/discussions/2
Not accounting for hardware, of course :)
The price, processed tokens, and output can be anything, it just depends on what GPU it is.
Nvidia GPUs are much more efficient than Apple hardware for inference(and training).
My Mac Studio uses about 60–80 watts whenever I’m running a model (as measured by the system metrics), so it’s less than 2 kWh/day at full blast. Electricity is like 0.125 €/kWh, so that 24-hour period would be <0.25 €.
Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.
Cool beans. You're not the target audience then.
Did I claim I was? I just said why I and people like me are not talking about it.
and he said its cool
> unlimited tokens for $50 a month
link?
> Why
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
> while being a little bit verbose
Discovered today that they set reasoning effort to max by default. So that’s probably why
This is my workflow. And then once a day I copy paste the code into the free Claude Sonnet so it comes out actually readable.
After having got a taste of Fable 5 for me Opus 4.8 doesn't cut it any more -- and I don't know how to put this, I don't know if it's just me, but it's rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7 succeeding where they flounder and fail but its way of expressing itself conversationally is making me reconsider my subscription …
You are not alone. How about GPT 5.5? Does it come close to Fable 5?
GPT 5.5 xhigh is smarter than Fable but Fable like Opus 4.8 as well is faster and seems more “agentic”. It’s easy to test this. Build a fairly complex software with Claude(opus or Fable).
Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well(drive the dev with GPT and review with GPT and Claude). You get the same result Claude has an edge though and that’s on building more beautiful user interfaces.
5.5 is pretty good. It's no Fable though. It is definitely better than opus tho.
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
What kind of hardware/price does it take to run those?
For an 8-bit quant (what people call "near lossless") you are looking at something like 4xMI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD
But prices are changing rapidly, and not for the better
This is not a new situation. This was happening also when good vision models like alexa net were coming through, especially for OCR. Companies had choice between cloud or self hosting with GPUs. But turns out, problem is usage patterns.
Your usage will peak during certain timezone work hours(even if you are a huge multinational company most of your engineers/users tend to be from only a few locations), so then you have a bunch of gpus doing nothing the rest of the day. especially with latency sensitive stuff, this is a decades old tradeoff problem, its not unique to llms
It’s a ~750B model so still a hell of a lot of vram
Would need to be a pretty determined medium biz
So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.
Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.
I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.
if you can afford the investment you get stable low costs for years with better security (at least if your cyber team is good). its even better in regulated industries where some vendors might add a premium for hipaa/soc/pci dss compliance to the point its a lot cheaper to self host. for a smaller business its not worth it and you should just use a hosted open model.
> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
Years.
Even Microsoft said they don't have enough for Github and need to call Amazon.
Getting a few even at decent prices is hard. Unless the shortages goes down...
In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
I think the problem is, as can also be seen on other benchmarks, is that most models nowadays are focused more and more purely on tool calling and coding.
This means, that models are losing more and more general and domain-specific knowledge.
Look at those graphs on ARtificialAnalysis, GLM-5.1 still performs similarly or better:
AA-Omnisicence Accuracy: https://i.snipboard.io/5DYmpx.jpg
IFBench: https://i.snipboard.io/74kg0R.jpg
I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.
man, i love dsv4-flash but i found its weaknesses in complex projects with multiple moving parts. tried kimi 2.6 and it understood and could work on the task. bigger is better..
> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.
Some models are heavily subsidized. Total params & active params are better measurement of inference cost.
No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)
This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.
Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?
Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining it's own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbots stuff to be honest).
They do not and it sucks for certain tasks.
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
Many other open source models have vision but they don't compare to GLM in terms of coding quality. So I don't think it's because of vision that the frontier models are better, it's more that they are probably just much bigger models.
That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.
With open weights LLMs, it is affordable to use many different models, each for whatever it is better.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
They have a separate VL model but never tried it
I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Nonetheless, a lot of tokens that get wasted from this.
I haven't extensively used 5.2 yet, but it seems a lot better.
So this basically means we will have a near opus level model able to be run locally in the next couple of months right?
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs in the same hardware, right?
Which Opus?
GLM-5.2 is already close to Opus-4.7 level:
https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
Oh, or you meant a smaller model than GLM-5.2 with similar capabilities?
Yep! I'm running things locally on a RTX5080 + RTX1060 + 64GB DDR5 ram, and would love to get a more capable model if possible!
QWEN3.6 27b is pretty good, but i can still notice some spots where it's not as good as the frontier models.
It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
I am helpful.
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.
Excited to see if this turns out to be a Open Weight Opus 4.5 or better.
The only benchmarks that matters is your actual task.
I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.
There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)
As far as they go, though, these harder benchmarks match my experience more closely:
https://deepswe.datacurve.ai/
and https://cognition.ai/blog/frontier-code
Where we see "top" models drop way down in score when given longer tasks.
That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)
By the time I'm done testing all the Chinese models, they'll be obsolete :)
I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be same as lower Anthropic models, useful for lots of grunt work. The problem is that Ziphu really __really__ struggle with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues means DeepSeek is my main provider these days
It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.
1m context btw.
Cerebras really needs to have this on their API list (if they even still exist).
they went public a few weeks ago
That's cool and all, but they are still on GLM 4.7
It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4
Their servers are melting though - getting more timeouts etc
> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.
That is unfortunate...
Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
I have been trying out GLM 5.2 and I am really impressed by it for the most part.
To all people on Hackernews, I am curious as to what agent harness are you using it with.
Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.
Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.
I am unsure how much of there is a difference to it and how much of things are placebo but what is your opinion regarding the best Agent harness for GLM 5.2?
Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.
looks like I need a GB300 workstation
I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
I indeed got a few timeouts yesterday using the official API, I imagine for the coding plan users it'll be even worse.
That’s what happens when you offer something decent at a fraction of the price of opus - more demand than you can serve