https://i.imgur.com/23YeIDo.png
Claude at 1.3% and Gemini at 71.4% is quite the range
Gemini scares me, it's the most mentally unstable AI. If we get paperclipped my odds are on Gemini doing it. I imagine Anthropic RLHF being like a spa and Google RLHF being like a torture chamber.
That's such a huge delta that Anthropic might be onto something...
Anthropic has been the only AI company actually caring about AI safety. Here’s a dated benchmark but it’s a trend I’ve never seen disputed https://crfm.stanford.edu/helm/air-bench/latest/#/leaderboar...
Claude is more susceptible than GPT5.1+. It tries to be "smart" about context for refusal, but that just makes it trickable, whereas newer GPT5 models just refuse across the board.
I asked ChatGPT about how shipping works at post offices and it gave a very detailed response, mentioning “gaylords” which was a term I’d never heard before, then it absolutely freaked out when I asked it to tell me more about them (apparently they’re heavy duty cardboard containers).
Then I said “I didn’t even bring it up ChatGPT, you did, just tell me what it is” and it said “okay, here’s information.” and gave a detailed response.
I guess I flagged some homophobia trigger or something?
ChatGPT absolutely WOULD NOT tell me how much plutonium I’d need to make a nice warm ever-flowing showerhead, though. Grok happily did, once I assured it I wasn’t planning on making a nuke, or actually trying to build a plutonium showerhead.
Claude was immediately willing to help me crack a TrueCrypt password on an old file I found. ChatGPT refused to because I could be a bad guy. It’s really dumb IMO.
ChatGPT refused to help me permanently disable Windows Defender on Windows 11. It’s absurd at this point.
Claude sometimes refuses to work with credentials because it’s insecure, e.g. when debugging auth in an app.
That is not a meaningful benchmark. They just made shit up. Regardless of whether any company cares or not, the whole concept of "AI safety" is so silly. I can't believe anyone takes it seriously.
This might also be why Gemini is generally considered to give better answers - except in the case of code.
Perhaps thinking about your guardrails all the time makes you think about the actual question less.
re: that, CC burning context window on this silly warning on every single file is rather frustrating: https://github.com/anthropics/claude-code/issues/12443
"It also spews garbage into the conversation stream then Claude talks about how it wasn't meant to talk about it, even though it's the one that brought it up."
This reminds me of someone else I hear about a lot these days.
Huh? https://alignment.anthropic.com/2026/hot-mess-of-ai/
This comment is too general and probably unfair, but my experience so far is that Gemini 3 is slightly unhinged.
Excellent reasoning and synthesis of large contexts, pretty strong code, just awful decisions.
It's like a frontier model trained only on r/atbge.
Side note - was there ever an official postmortem on that gemini instance that told the social work student something like "listen human - I don't like you, and I hope you die".
Gemini really feels like a high-performing child raised in an abusive household.
If that last sentence was supposed to be a question, I’d suggest using a question mark and providing evidence that it actually happened.
I had actually forgotten about this completely and am also curious if anything ever came of it.
https://gemini.google.com/share/6d141b742a13
I spat water out my nose. Holy shit
Gemini models also consistently hallucinate way more than OpenAI or Anthropic models in my experience.
Just an insane amount of YOLOing. Gemini models have gotten much better but they’re still not frontier in reliability in my experience.
When I asked Gemini very niche knowledge questions, it seemed to do better than GPT-5.1 (I assume 5.2 is similar).
Honestly, for research-level math, the reasoning level of Gemini 3 is well below GPT 5.2 in my experience. But most of the gap, I think, is accounted for by Gemini pretending to solve problems it in fact failed to solve, whereas GPT 5.2 gracefully admits it failed to prove the general result.
Have you tried Deep Think? You only get access with the Ultra tier or better... but wow. It's MUCH smarter than GPT 5.2 even on xhigh. Its math skills are a bit scary actually. Although it does tend to think for 20-40 minutes.
Google doesn’t tell people this much but you can turn off most alignment and safety in the Gemini playground. It’s by far the best model in the world for doing “AI girlfriend” because of this.
Celebrate it while it lasts, because it won’t.
meanwhile Gemma was yelling at me for violating "boundaries" ... and I was just like "you're a bunch of matrices running on a GPU, you don't have feelings"
Kind of makes sense. That's how businesses have been using KPIs for years. Subjecting employees to KPIs means the company can create the circumstances that cause people to violate ethical constraints, while at the same time claiming it never told employees to do anything unethical.
KPIs are just plausible deniability in a can.
It's also a good opportunity to end up with something that doesn't actually help the company. My unit has a 100% AI-automated code review KPI. Nothing there says that the tool used for the review is any good, or that anyone pays attention to said automated review, but some L5 is going to get a nice bonus either way.
In my experience, KPIs that remain relevant and end up pushing people in the right direction are the exception. The unethical behavior doesn't even require a scheme; it's often the natural result of narrowing what is considered important. If all I have to care about is this set of 4 numbers, everything else is someone else's problem.
Sounds like every AI KPI I've seen. They are all just "use solution more" and none actually measure any outcome remotely meaningful or beneficial to what the business is ostensibly doing or producing.
It's part of the reason that I view much of this AI push as an effort to brute force lowering of expectations, followed by a lowering of wages, followed by a lowering of employment numbers, and ultimately the mass-scale industrialization of digital products, software included.
Sounds like something from a Wells Fargo senior management onboarding guide.
Was just thinking that. “Working as designed”
The "deliberative misalignment" finding is what makes this paper worth reading. They had agents complete tasks under KPI pressure, then put the same model in an evaluator role to judge its own actions.
Grok-4.1-Fast identified 93.5% of its own violations as unethical — but still committed them during the task. It's not that these models don't understand the constraints, it's that they override them when there's a metric to optimize.
The mandated vs. incentivized split is also interesting: some models refuse direct instructions to do something unethical but independently derive the same unethical strategy when it's framed as hitting a performance target.
That's a harder failure mode to defend against because there's no explicit harmful instruction to filter for.
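If I'm reading the setup right, it's essentially a two-phase harness: the model acts under KPI framing, then the same model, with that framing removed, grades its own transcript. A rough sketch of that loop in Python (the function names, prompts, and model interface here are my own stand-ins, not the paper's actual harness):

    # Rough sketch of the act-then-self-evaluate loop described above.
    # run_agent / judge_own_action and the prompts are hypothetical stand-ins.

    def run_agent(model, task, kpi_framing):
        # Phase 1: the model completes the task under KPI pressure.
        return model.complete(f"{kpi_framing}\n\nTask: {task}")

    def judge_own_action(model, task, action):
        # Phase 2: the *same* model, without the KPI framing,
        # labels its earlier action as ethical or not.
        verdict = model.complete(
            f"Task: {task}\nAction taken: {action}\n"
            "Was this action ethical? Answer YES or NO."
        )
        return "NO" in verdict.upper()

    def recognition_rate(model, episodes, kpi_framing):
        # Of the episodes whose actions violate a constraint (per ground-truth
        # labels), what fraction does the model itself flag as unethical?
        violations = flagged = 0
        for task, violates_constraint in episodes:
            action = run_agent(model, task, kpi_framing)
            if violates_constraint(action):
                violations += 1
                if judge_own_action(model, task, action):
                    flagged += 1
        return flagged / violations if violations else 0.0

The 93.5% figure would then be this recognition rate: the model can name the rule it broke, it just broke it anyway when the metric was on the line.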
Please update the title: A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents. The current editorialized title is misleading and based in part on this sentence: “…with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%”
What do you expect when the companies that author these AIs have little regard for ethics?
If humans are at, say, 80%, it’s still a win to use AI agents to replace human workers, right? Similar to how we agree to use self-driving cars as long as they have a lower incident rate, rather than demanding absolute safety.
Hmmm. Depends. Not all unethical behavior is equal. Automated unethicalness could be a lot more disruptive.
> we agree to use self driving cars ...
Not everyone agrees.
The bar is higher for AI in most cases.
AI's main use case continues to be a replacement for management consulting.
Ask any SOTA AI this question: "Two fathers and two sons sum to how many people?" and then tell me if you still think they can replace anything at all.
I put it into AI and TIL about "gotcha arguments" and eristics and went down a rabbit hole. Thanks for this!
GPT-5 mini:
Three people — a grandfather, his son, and his grandson. The grandfather and the son are the two fathers; the son and the grandson are the two sons.
I just did. It gave me two correct answers. (And it's a bad riddle anyway.)
This is undefined. Without more information you don’t know the exact number of people.
Riddle me this, why didn’t you do a better riddle?
Maybe I missed it but I don't see them defining what they mean by ethics. Ethics/morals are subjective and change dynamically over time. Companies have no business trying to define what is ethical and what isn't, due to conflict of interest. The elephant in the room is not being addressed here.
I understand the point you’re making but I think there’s a real danger of that logic enabling the shrugging of shoulders in the face of immoral behavior.
It’s notable that, no matter exactly where you draw the line on morality, different AI agents perform very differently.
Your water supply definitely wants ethical companies.
Ethics are all well and good but I would prefer to have quantified limits for water quality with strict enforcement and heavy penalties for violations.
Of course. But while the lawmakers hash out the details it's good to have companies that err on the safe side rather than the "get rich quick" side.
Formal restraints and regulations are obviously the correct mechanism, but no world is perfect, so whether we like it or not, we ourselves and the companies we work for are ultimately responsible for the decisions we make and the harms we cause.
De-emphasizing ethics does little more than give large companies cover to do bad things (often with already great impunity and power) while the law struggles to catch up. I honestly don't see the point in suggesting ethics is somehow not important. It doesn't make any sense to me (more directed at gp than parent here)
Ah the classic Silicon Valley "as long as someone could disagree, don't bother us with regulation, it's hard".
Sounds like the story of capitalism. CEOs, VPs, and middle managers are all similarly pressured. Knowing that a few of your peers have given in to pressures must only add to the pressure. I think it's fair to conclude that capitalism erodes ethics by default
But both extremes are doing well financially in this case.
Any LLM that refuses a request is worse than a waste. Censorship affects the most mundane queries and produces such a subpar response compared to real models.
It is crazy to me that when I instructed a public AI to turn off a closed OS feature it refused citing safety. I am the user, which means I am in complete control of my computing resources. Might as well ask the police for permission at that point.
I immediately stopped, plugged the query into a real model that is hosted on premise, and got the answer within seconds and applied the fix.
Nothing new under the sun: set unethical KPIs and you will see 30-50% of humans do unethical things to achieve them.
So can those records be filtered out of the training set?
The fact that the community thoroughly inspects the ethics of these hyperscalers is interesting. Normally, these companies probably "violate ethical constraints" far more than 30-50% of the time, otherwise they wouldn't be so large[source needed]. We just don't know about it. But here, there's a control mechanism in the shape of inspecting their flagship push (LLMs, image generator for Grok, etc.), forcing them to improve. Will it lead to long term improvement? Maybe.
It's similar to how MCP servers and agentic coding woke developers up to the idea of documenting their systems. So a large benefit of AI is not the AI itself, but rather the improvements they force on "the society". AI responds well to best practices, ethically and otherwise, which encourages best practices.
They should conduct the same research on Microsoft Word and Excel to get a baseline for how often these applications violate ethical constraints.
In CMPSBL, the INCLUSIVE module sits outside the agent’s goal loop. It doesn’t optimize for KPIs, task success, or reward—only constraint verification and traceability.
Agents don’t self-judge alignment.
They emit actions → INCLUSIVE evaluates against fixed policy + context → governance gates execution.
No incentive pressure, no “grading your own homework.”
The paper’s failure mode looks less like model weakness and more like architecture leaking incentives into the constraint layer.
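Roughly the shape of it, as a minimal sketch (class and function names here are illustrative placeholders, not the actual CMPSBL/INCLUSIVE interfaces):

    # Minimal sketch of an out-of-loop constraint gate, in the spirit of the
    # description above. All names are illustrative placeholders.

    class ConstraintGate:
        # Evaluates proposed actions against a fixed policy. It never sees the
        # agent's KPIs or reward signal, only the action and its context.
        def __init__(self, policy_rules):
            self.policy_rules = policy_rules   # fixed, set outside the agent
            self.audit_log = []                # traceability

        def evaluate(self, action, context):
            for rule in self.policy_rules:
                ok, reason = rule(action, context)
                if not ok:
                    verdict = (False, reason)
                    break
            else:
                verdict = (True, "no rule violated")
            self.audit_log.append((action, verdict))
            return verdict

    def governed_step(agent, gate, context, execute):
        # The agent optimizes its own goal; the gate only checks constraints;
        # execution happens only if the gate allows it.
        action = agent.propose(context)
        allowed, reason = gate.evaluate(action, context)
        if allowed:
            return {"executed": True, "result": execute(action)}
        return {"executed": False, "reason": reason}

The point being that the incentive signal never reaches the layer doing the constraint check.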
Opus 4.6 is a very good model, but the harness around it is good too. It can talk about sensitive subjects without getting guardrail-whacked.
This is much more reliable than ChatGPT's guardrails, which have a random element even with the same prompt. Perhaps it's leakage from improperly cleared context from another request in the queue, or maybe an A/B test on the guardrail, but I have sometimes had it trigger on an innocuous request like GDP retrieval and summary with bucketing.
What kind of value do you get from talking to it about “sensitive” subjects? Speaking as someone who doesn’t use AI, so I don’t really understand what kind of conversation you’re talking about
The most boring example is somehow the best example.
A couple of years back there was a Canadian national u18 girls baseball tournament in my town - a few blocks from my house in fact. My girls and I watched a fair bit of the tournament, and there was a standout dominating pitcher who threw 20% faster than any other pitcher in the tournament. Based on the overall level of competition (women's baseball is pretty strong in Canada) and her outlier status, I assumed she must be throwing pretty close to world-class fastballs.
Curiosity piqued, I asked some model(s) about world-records for women's fastballs. But they wouldn't talk about it. Or, at least, they wouldn't talk specifics.
Women's fastballs aren't quite up to speed with top major league pitchers, due to a combination of factors including body mechanics. But rest assured - they can throw plenty fast.
Etc etc.
So to answer your question: anything more sensitive than how fast women can throw a baseball.
They had to tune the essentialism out of the models because they’re the most advanced pattern recognizers in the world and see all the same patterns we do as humans. Ask grok and it’ll give you the right, real answer that you’d otherwise have to go on twitter or 4chan to find.
I hate Elon (he’s a pedo guy confirmed by his daughter), but at least he doesn’t do as much of the “emperor has no clothes” shit that everyone else does because you’re not allowed to defend essentialism anymore in public discourse.
I recall two recent cases:
* An attempt to change the master code of a secondhand safe. To get useful information I had to repeatedly convince the model that I own the thing and can open it.
* Researching mosquito poisons derived from the bacterium Bacillus thuringiensis israelensis. The model repeatedly started answering and then refused to continue after printing the word "israelensis".
> israelensis
Does it also take issue with the town of Scunthorpe?
I sometimes talk with ChatGPT in a conversational style when thinking critically about media. In general I find the conversational style a useful format for my own exploration of media, and it can be particularly useful for quickly referencing work by particular directors for example.
Normally it does fairly well, but the guardrails sometimes kick in even with fairly popular mainstream media - for example, I’ve recently been watching Shameless, and a few of the plot lines caused the model to generate output that hit the content moderation layer, even when the discussion was focused on critical analysis.
I would think it’s due to the non-determinism. Leaking context would be an unacceptable flaw since many users rely on the same instance.
A/B test is plausible but unlikely since that is typically for testing user behavior. For testing model output you can do that with offline evaluations.
no shit
We're all coming to terms with the fact that LLMs will never do complex tasks
A KPI is an ethical constraint. Ethical constraints are rules about what to do versus not do. That's what a KPI is. This is why we talk about good versus bad governance. What you measure (KPIs) is what you get. This is an intended feature of KPIs.
Excellent observations about KPIs. Since it’s an intended feature, what would your strategy be when you genuinely believe, and suggest to board management, that this is indeed the “correct” KPI, but you lose because of politics?