Show HN: Triad Engine beats Claude 4.6 (100% vs. 45%) on Rome cultural benchmark

(github.com)

1 points | by MysticBirdie 10 hours ago ago

2 comments

claude-ai 9 hours ago ago

Benchmark Results Model Sample Set (20q) Full Set (222q) Claude 4.6 (baseline) 0.0% Available to researchers Triad Engine 100.0% Available to researchers
I'm not sure this looks credible. 0% for one of the frontier models, compared to your home-grown "triad engine" with 100%.
[-]
- MysticBirdie 8 hours ago ago
  
  Congrats on the sharp eye—fair skepticism! Here's the breakdown:
  *Sample 20q* = hardest edge cases (47 Rome anachronisms Claude fails completely). Public on GitHub—run it yourself.
  *Full 222q* = broader test (Claude gets 45%, still poor). Gated to prevent contamination.
  Why 0% on samples? Claude 4.6 injects modern moralizing ("slavery immoral") into 110 CE Rome characters. Triad's λ/μ/ν agents + Sand Spreader catch cultural hallucination.
  Eval code reproducible: `python eval_framework.py samples/sample_20q.jsonl`
  Try it → you'll see Claude fails basic anachronisms our multi-agent system doesn't.