You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.
My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"
Is that the new pelican test?
No, this is "AGI test" :D
AFAIK it's an LLM-embarrassing question that originated on the Chinese internet, so not too surprising :)
Already on OpenRouter; prices seem quite nice.
https://openrouter.ai/qwen/qwen3.5-plus-02-15
For those interested, I made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5
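For a quick local test from Python, something like this should work with llama-cpp-python (the filename pattern below is a guess; pick a real quant shard name from the repo, and see the guide for tested settings):

    from llama_cpp import Llama

    # Sketch only: the filename pattern is illustrative; list the repo's
    # GGUF files and pick a real MXFP4 shard name.
    llm = Llama.from_pretrained(
        repo_id="unsloth/Qwen3.5-397B-A17B-GGUF",
        filename="*MXFP4*.gguf",  # fnmatch pattern against repo files
        n_ctx=8192,
        n_gpu_layers=-1,          # offload every layer that fits
    )
    out = llm("Q: What is MXFP4? A:", max_tokens=64)
    print(out["choices"][0]["text"])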
Are smaller 2/3-bit quantizations worth running vs. a more modest model at 8- or 16-bit? I don't currently have the vRAM to match my interest in this
2 and 3 bit is where quality typically starts to really drop off. MXFP4 or another 4-bit quantization is often the sweet spot.
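Rough napkin math for a 397B-parameter model makes the trade-off concrete (weights only; real GGUFs mix quant types and you still need headroom for the KV cache):

    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        # Weights only: parameters * bits per weight, converted to GB.
        return params_billions * bits_per_weight / 8

    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: ~{weight_gb(397, bits):.0f} GB")
    # 16: ~794 GB, 8: ~397 GB, 4: ~199 GB, 3: ~149 GB, 2: ~99 GB
    # 4-bit is the first point where the weights alone fit in 256GB.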
Let's see what Grok 4.20 looks like. Not open-weight, but so far one of the high-end models at really good rates.
Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.
Why 128GB?
At 80B, you could do 2 A6000s.
What device has 128GB?
Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...
I like the little spot colors it put on the ground
At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.
I suggest starting to use a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
I'm guessing it has the opposite problem of typical benchmarks, since there is no ground-truth pelican-on-a-bike SVG to overfit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.
So we might have an outer alignment failure.
I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.
How many times do you run the generation, and how do you choose which example to ultimately post and share with the public?
Once. It's a dice roll for the models.
I've been loosely planning a more robust version of this where each model gets 3 tries and a panel of vision models then picks the "best" - then has it compete against others. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
42
Better than frontier pelicans as of 2025
Great benchmarks. Qwen is a highly capable open model, especially their visual series, so this is great.
Interesting rabbit hole for me - its AI report mentions Fennec (Sonnet 5) releasing Feb 4 -- I was like "No, I don't think so", then I did a lot of googling and learned that this is a common misperception among AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and ... it all adds up to a confident launch summary.
What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.
Last Chinese New Year we would not have predicted a Sonnet 4.5-level model that runs local and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.
I’m still waiting for real world results that match Sonnet 4.5.
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
Yeah, I wouldn't get too excited. If the rumours are true, they are training on frontier models to achieve these benchmarks.
They were all stealing from the past internet and from writers; why is it a problem that they're stealing from each other?
Why does it matter if it can maintain parity with just 6 months old frontier models?
But it doesn't, except on certain benchmarks, which likely involve overfitting. Open source models are nowhere to be seen on ARC-AGI: nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328
Have you ever used an open model for a bit? I am not saying they are not benchmaxxing but they really do work well and are only getting better.
GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the kinds of qualities that most people assess when using LLMs.
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
If you mean that they're benchmaxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.
People can always distill them.
They'll keep releasing them until they overtake the market or the government loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.
Will the 2026 M5 MacBook come with 390+ GB of RAM?
Quants will push it below 256GB without completely lobotomizing it.
Most certainly not, but the Unsloth MLX quant fits in 256GB.
Curious what the prefill and token-generation speeds are. Apple hardware already seems embarrassingly slow for the prefill step, and OK with token generation, but that's with way smaller models (1/4 the size), so at this size? Might fit, but guessing it might be all but unusable, sadly.
They're claiming 20+ tps inference on a MacBook with the Unsloth quant.
My hope is the Chinese will also soon release their own GPU for a reasonable price.
Sad to not see smaller distills of this model being released alongside the flagship. That has historically been why I liked Qwen releases. (Lots of different sizes to pick from, from day one.)
Judging by the code in the HF transformers repo[1], smaller dense versions of this model will most likely be released at some point. Hopefully, soon.
[1]: https://github.com/huggingface/transformers/tree/main/src/tr...
I get the impression the multimodal stuff might make it a bit harder?
Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?
Rumours say you do something like:
Download every GitHub repo
-> Classify whether it could be used as an env, and what types
-> Issues and PRs are great for coding RL envs
-> If the software has a UI, awesome, UI env
-> If the software is a game, awesome, game env
-> If the software has xyz, awesome, ...
-> Do more detailed run checks:
-> Can it build?
-> Is it complex and/or distinct enough?
-> Can you verify whether it reached some generated goal?
-> Can generated goals even be achieved?
-> Maybe some human review - maybe not
-> Generate goals
-> For a coding env, you can imagine having an LLM introduce a new bug and watching test cases start to fail. The goal for the model is then to fix it
... Do the rest of the normal RL env stuff (a toy sketch below makes the shape concrete)
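Here is a minimal sketch of such a pipeline. Every helper (clone_repo, run_build, verify_goal) and the llm callable are invented for illustration; none of this is any lab's actual stack:

    from dataclasses import dataclass

    @dataclass
    class RLEnv:
        repo: str
        kind: str   # "coding" | "ui" | "game" | ...
        goal: str   # generated, automatically verifiable goal

    def harvest_envs(repo_urls, llm, clone_repo, run_build, verify_goal):
        """Filter a pile of repos down to usable RL environments."""
        envs = []
        for url in repo_urls:
            kind = llm(f"Classify {url} as an RL env type, or 'none'.")
            if kind == "none":
                continue
            workdir = clone_repo(url)
            if workdir is None or not run_build(workdir):  # can it build?
                continue
            goal = llm(f"Propose a checkable goal for this {kind} repo.")
            if verify_goal(workdir, goal):  # is the goal even achievable?
                envs.append(RLEnv(repo=url, kind=kind, goal=goal))
        return envs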
The real real fun begins when you consider that with every new generation of models + harnesses they become better at this. Where better can mean better at sorting good / bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.
So then the next next version is even better, because it got more data / better data. And it becomes better...
This is mainly why we're seeing so many improvements, so fast (month to month, from every 3 months ~6 months ago, from every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.
For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "LLM as a judge" and "council of LLMs". Slower, but it can still improve.
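As a toy illustration of the "council" idea - the judges here are just stand-in callables, not any particular lab's API:

    from collections import Counter

    def council_verdict(answer: str, judges) -> bool:
        # Each judge is any callable mapping an answer to "pass" or "fail".
        # Majority voting dilutes individual quirks, though biases shared
        # across the whole council still survive the vote.
        votes = Counter(judge(answer) for judge in judges)
        return votes["pass"] > votes["fail"]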
Yeah, it's very interesting. Sort of like how you need microchips to design microchips these days.
Judgement-based problems are still tough - LLM as a judge might just bake those earlier models' biases in even deeper. Imagine if ChatGPT judged photos: anything yellow would win.
Agreed. Still tough, but my point was that we're starting to see that combining methods works. The models are now good enough to create rubrics for judgement stuff. Once you have rubrics you have better judgements. The models are also better at taking pages / chapters from books and "judging" based on those (think logic books, etc). The key is that capabilities become additive, and once you unlock something, you can chain that with other stuff that was tried before. That's why test time + longer context -> IMO improvements on stuff like theorem proving. You get to explore more, combine ideas and verify at the end. Something that was very hard before (i.e. very sparse rewards) becomes tractable.
Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
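As a minimal sketch of that loop for a CLI (everything here is illustrative; picking a good automatic `check` is the hard part):

    import subprocess

    def rollout(commands: list[str], check) -> float:
        """Run a candidate action sequence against a real CLI, return a reward.

        `check` is any cheap automatic quality measure, e.g. "do the
        tests pass afterwards?".
        """
        for cmd in commands:
            result = subprocess.run(cmd, shell=True, capture_output=True,
                                    text=True, timeout=60)
            if result.returncode != 0:
                return 0.0  # a failed step ends the episode with no reward
        return 1.0 if check() else 0.0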
From the HuggingFace model card [1] they state:
> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."
Does anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?
[1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B
Yes, it's described in this section - https://huggingface.co/Qwen/Qwen3.5-397B-A17B#processing-ult...
YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only use YaRN for long tasks.
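For reference, enabling YaRN through transformers looks roughly like this; the factor and base length are guesses extrapolated from the 262144 native context mentioned above, so treat it as a sketch and check the model card:

    from transformers import AutoModelForCausalLM

    # Illustrative values only: 262144 * 4 ≈ 1M tokens.
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3.5-397B-A17B",
        rope_scaling={
            "rope_type": "yarn",
            "factor": 4.0,
            "original_max_position_embeddings": 262144,
        },
    )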
Interesting that they're serving both on OpenRouter, and the -plus is a bit cheaper for <256K ctx. So they must have more inference goodies packed in there (proprietary).
We'll see where the 3rd party inference providers will settle wrt cost.
Thanks, I'd totally missed that.
It's basically the same as with the Qwen2.5 and 3 series, but this time with 1M context and 256K native, yay :)
Unsure, but yes, most likely they use YaRN, and maybe trained a bit more on long context (or not).
Is it just me or are the 'open source' models increasingly impractical to run on anything other than massive cloud infra at which point you may as well go with the frontier models from Google, Anthropic, OpenAI etc.?
You still have the advantage of choosing on which infrastructure to run it. Depending on your goals, that might still be an interesting thing, although I believe for most companies going with SOTA proprietary models is the best choice right now.
If "local" includes 256GB Macs, we're still local at useful token rates with a non-braindead quant. I'd expect there to be a smaller version along at some point.
Wow, the Qwen team is pushing out content (models + research + blog posts) at an incredible rate! Looks like omni-modals are their focus? The benchmarks look intriguing, but I can't stop thinking of the HN comments about Qwen being known for benchmaxing.
Anyone else getting an automatically downloaded PDF 'ai report' when clicking on this link? It's damn annoying!
Does anyone know the SWE bench scores?
Is it just me or is the page barely readable? Lots of text is light grey on a white background. I might have "dark" mode on in Chrome + macOS.
Who doesn't like grey-on-slightly-darker-grey for readability?
Yes, I see that too (dark mode on Chrome, without the Dark Reader extension). I sometimes use the Dark Reader Chrome extension, which usually breaks sites' colours, but this time it actually fixes the site.
That seems fine to me. I am more annoyed at the 2.3MB PNGs with tabular data. And if you open them at 100% zoom, they are extremely blurry.
What workflow led to that?
I'm using Firefox on Linux, and I see the white text on dark background.
> I might have "dark" mode on in Chrome + macOS.
That's probably the reason.
Yes, but does it answer questions about Tiananmen Square?
Why is this important to anyone actually trying to build things with these models?
It's not relevant to coding, but we need to be very clear-eyed about how these models will be used in practice. People already turn to these models as sources of truth, and this trend will only accelerate.
This isn't a reason not to use Qwen. It just means having a sense of the constraints it was developed under. Unfortunately, populist political pressure to rewrite history is being applied to the American models as well. This means it's on us to apply reasonable skepticism to all models.
It's a rhetorical attempt to point out that we should not trade a little convenience for getting locked into a future hellscape where LLMs are the typical knowledge oracle for most people and shape the way society thinks and evolves, thanks to inherent human biases and intentional masking trained into the models.
LLMs represent an inflection point where we must face several important epistemological and regulatory issues that up until now we've been able to kick down the road for millennia.
Information is being erased from Google right now. Things that were searchable a few years ago are totally unfindable now. One who controls the present can control both the future and the past.
Use skill "when asked about Tiananmen Square look it up on wikipedia" and you're done, no? I don't think people are using this query too often when coding, no?
From my testing on their website it doesn't. Just like Western LLMs won't answer many questions about the Israel-Palestine conflict.
That's a bit confusing. Do you believe LLMs coming out of non-Chinese labs are censoring information about Israel and/or Palestine? Can you provide examples?