> the slow performance decays
The "decays" are just more capable models entering the population, making all prior models lose more frequently.
For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.
> we don't switch to heavily quantized models
That sounded like a press bulletin, so just to give you the chance to clarify: does that mean you may switch to lightly quantized models?
There's almost 0% chance that OpenAI doesn't quantize the model right off the bat.
I am willing to bet large amounts of money that OpenAI would never serve a model in full BF16 in the year of our lord 2026. That would be operationally insane. They're almost certainly doing QAT down to FP4 for the FFN weights, and a similar or slightly higher-precision quant for the attention tensors.
It's ok if they never release a BF16 model, but it's less ok if they release it, win the benchmarks, then quantise it after a few weeks.
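For anyone unfamiliar with what quantization actually does to a layer, here's a minimal, hypothetical sketch in Python with NumPy. All names and numbers are invented for illustration, and symmetric int4 stands in for FP4 (real FP4/QAT pipelines are hardware-specific and considerably more involved). The point is just that the quantize-dequantize round trip perturbs a layer's outputs:

```python
import numpy as np

def quantize_dequantize_int4(w: np.ndarray) -> np.ndarray:
    """Simulate symmetric per-channel 4-bit quantization of a weight matrix.

    Each output channel (row) gets its own scale so that its largest weight
    maps to the edge of the signed 4-bit range [-8, 7].
    """
    qmax = 7  # signed 4-bit integers span [-8, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -8, 7)  # quantize to integers
    return q * scale                         # dequantize back to float

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 512))  # a toy FFN weight matrix
x = rng.normal(size=(512,))                  # a toy activation vector

y_full = w @ x
y_quant = quantize_dequantize_int4(w) @ x

# Relative output drift introduced by the 4-bit round trip.
rel_err = np.linalg.norm(y_full - y_quant) / np.linalg.norm(y_full)
print(f"relative output error: {rel_err:.4f}")
```

Quantization is a lossy transform: small per-weight rounding errors accumulate into measurable output drift, which is exactly why "when did they quantize it" matters for comparing a model against its launch-day benchmarks.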
Thank you for your answer. I have a similar question to OP's, but regarding the GPT models in MS Copilot. My experience is that response quality is much better when calling the API directly or through the web UI.
I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter I'd be grateful, as I'm doing an analysis of which AI solutions might be suitable for my organisation.
The Elo rating system measures performance relative to the other models. As other models improve, or rather as newer, better models enter the list, the Elo score of a given existing model will tend to decrease even if there have been no changes whatsoever to the model or its system prompt.
You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
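To make this concrete, here's a small, hypothetical Python simulation (all constants invented): a model of fixed true strength plays standard Elo-rated matches while progressively stronger models join the pool, and its rating drifts down even though its underlying skill never changes.

```python
import random

K = 16  # standard Elo update factor

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def play(skill_a: float, skill_b: float) -> float:
    """Outcome of one match, driven by hidden 'true' skills."""
    return 1.0 if random.random() < expected(skill_a, skill_b) else 0.0

random.seed(42)

# A frozen model: its true skill never changes.
frozen_skill, frozen_rating = 1500.0, 1500.0

# New, stronger models enter the pool in waves.
pool = []  # list of [true_skill, rating]
for wave in range(10):
    pool.append([1500.0 + 40 * wave, 1500.0])  # each entrant is stronger
    for _ in range(200):  # matches against random pool members
        opp = random.choice(pool)
        s = play(frozen_skill, opp[0])
        e = expected(frozen_rating, opp[1])
        frozen_rating += K * (s - e)   # frozen model's update
        opp[1] -= K * (s - e)          # opponent's symmetric update
    print(f"wave {wave}: frozen model rating = {frozen_rating:.0f}")
```

The frozen model's measured Elo falls wave after wave purely because its opponents got better, which is why detecting an absolute regression requires a fixed benchmark harness rather than an arena leaderboard.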
FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.
You’re right: https://en.wikipedia.org/wiki/Elo_rating_system
> ELO ratings
Thank you, I just looked at the chart and said to myself: ELO? YOLO!
Elo is also the rating system used in chess.
Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment re: models being “nerfed”. It promises to reveal this nerfing. Then it goes on to… provide an innocuous mapping of LM Arena scores that always go up?