Astro - Hacker News

1 comment

warwickmcintosh 10 hours ago

LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time though: the same prompt on a different day sometimes gets a different score.
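
For anyone wondering what that kind of tracking could look like, here is a rough sketch, not anything from the project itself. It assumes judge scores are logged as (prompt_id, day, score) tuples. The function names, the 0.5 spread threshold, and the sample data are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stability_report(runs, threshold=0.5):
    """Group judge scores by prompt and flag prompts whose scores vary across runs.

    runs: iterable of (prompt_id, day, score) tuples (hypothetical log format).
    threshold: spread above which a prompt counts as unstable (arbitrary choice).
    """
    by_prompt = defaultdict(list)
    for prompt_id, _day, score in runs:
        by_prompt[prompt_id].append(score)

    report = {}
    for prompt_id, scores in by_prompt.items():
        spread = pstdev(scores)  # population std dev over all recorded runs
        report[prompt_id] = {
            "mean": mean(scores),
            "stdev": spread,
            "unstable": spread > threshold,
        }
    return report


def mean_abs_error(judge_scores, ground_truth):
    """Average absolute gap between judge scores and ground-truth labels."""
    shared = judge_scores.keys() & ground_truth.keys()
    if not shared:
        raise ValueError("no overlapping prompt ids to calibrate against")
    return mean(abs(judge_scores[k] - ground_truth[k]) for k in shared)


# Made-up example data: same prompts scored on two different days.
runs = [
    ("p1", "day1", 4.0),
    ("p1", "day2", 4.0),
    ("p2", "day1", 2.0),
    ("p2", "day2", 4.0),
]
report = stability_report(runs)
calibration_gap = mean_abs_error({"p1": 4.0, "p2": 3.0}, {"p1": 5.0, "p2": 3.0})
```

Logging the per-prompt spread next to the calibration gap on each run would show whether a score change comes from the judge drifting or from the system under test actually changing.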