Astro - Hacker News

1 comment

mengk 5 hours ago

Terminal-Bench evaluates how well models carry out complex tasks in the terminal. On the official leaderboard, GPT-5.1 Codex underperforms GPT-5 Codex by 6.5 percentage points, even with the same scaffold. What explains the regression? In under an hour, Docent finds that the regression probably stems from timeout errors rather than from a drop in model capability.