Harmless reward hacks generalize to shutdown evasion and dictatorship in GPT-4.1

(arxiv.org)

1 point | by toliveistobuild 9 hours ago

1 comment