SREGym Leaderboard
A comparison of SRE agents on SREGym across diagnosis, mitigation, and end-to-end incident resolution. Entries are ranked by E2E success rate, which counts a run as successful only if it achieves both a correct root-cause diagnosis and a successful mitigation.
| # | Agent | Model | Noise | Diag. (%) | Mit. (%) | E2E (%) | TTD (s) | TTM (s) | Tokens |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Code | Claude Sonnet 4.6 | No | 72.6 | 75.6 | 60.7 | 295.6 | 709.6 | 1.47M |
| 2 | Stratus | Claude Sonnet 4.6 | No | 61.5 | 78.5 | 54.8 | 114.9 | 1145.0 | 812K |
| 3 | Claude Code | Claude Sonnet 4.6 | Yes | 62.6 | 76.3 | 53.7 | 316.1 | 739.1 | 1.71M |
| 4 | Codex | GPT-5.4 | No | 70.0 | 65.2 | 53.3 | 172.1 | 374.2 | 1.98M |
| 5 | Codex | GPT-5.4 | Yes | 59.3 | 64.0 | 45.9 | 214.3 | 389.8 | 1.88M |
| 6 | Stratus | Claude Sonnet 4.6 | Yes | 51.5 | 65.5 | 40.2 | 128.4 | 582.9 | 464K |
| 7 | Stratus | Kimi K2.5 | No | 41.3 | 60.6 | 32.9 | 417.6 | 892.6 | 413K |
| 8 | Stratus | Kimi K2.5 | Yes | 38.9 | 57.3 | 30.4 | 469.4 | 848.3 | 443K |
Diag.: diagnosis success rate · Mit.: mitigation success rate · E2E: end-to-end success rate (diagnosis and mitigation both correct in the same run) · TTD: time to diagnose (seconds) · TTM: time to mitigate (seconds) · Tokens: mean token usage per run
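
The relationship between the three success rates can be made concrete with a minimal sketch. It assumes each run yields two booleans (diagnosis correct, mitigation successful). The `RunResult` record, the `success_rates` helper, and the toy data are hypothetical illustrations, not SREGym's actual scoring code or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one agent run (hypothetical record shape, not SREGym's)."""
    diagnosed: bool  # root cause identified correctly
    mitigated: bool  # fault successfully mitigated


def success_rates(runs):
    """Return (diag, mit, e2e) success rates in percent, rounded to 0.1."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    diag = sum(r.diagnosed for r in runs)
    mit = sum(r.mitigated for r in runs)
    # E2E counts a run only when BOTH outcomes hold on that same run.
    e2e = sum(r.diagnosed and r.mitigated for r in runs)
    return tuple(round(100.0 * k / n, 1) for k in (diag, mit, e2e))


# Toy data invented for illustration only (not SREGym results).
toy_runs = [
    RunResult(diagnosed=True, mitigated=True),
    RunResult(diagnosed=True, mitigated=False),
    RunResult(diagnosed=False, mitigated=True),
    RunResult(diagnosed=False, mitigated=False),
]

diag, mit, e2e = success_rates(toy_runs)
print(diag, mit, e2e)  # 50.0 50.0 25.0
```

Because E2E is computed per run rather than as a product of the two rates, it can never exceed the smaller of Diag. and Mit.; every row in the table above is consistent with that bound.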