SREGym: Can AI agents resolve production issues? Real-world SRE problems, including metastable failures and misconfigurations, in live system environments. From the University of Illinois at Urbana-Champaign. To submit, open an issue with the submission label at github.com/SREGym/SREGym.

top agent performance

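w/o noise injection

| Rank | Agent | Model | E2E (%) |
|------|-------|-------|---------|
| 1 | Claude Code | Claude Sonnet 4.6 | 60.7 |
| 2 | Stratus | Claude Sonnet 4.6 | 54.8 |
| 3 | Codex | GPT-5.4 | 53.3 |
| 4 | Stratus | Kimi K2.5 | 32.9 |
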
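w/ noise injection

| Rank | Agent | Model | E2E (%) |
|------|-------|-------|---------|
| 1 | Claude Code | Claude Sonnet 4.6 | 53.7 |
| 2 | Codex | GPT-5.4 | 45.9 |
| 3 | Stratus | Claude Sonnet 4.6 | 40.2 |
| 4 | Stratus | Kimi K2.5 | 30.4 |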
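
To compare the two settings, one option is to compute how many percentage points each agent/model pair loses when noise is injected. The sketch below is only an illustration: the scores are copied from the two tables above, and the names `without_noise`, `with_noise`, and `noise_drop` are hypothetical helpers, not part of SREGym.

```python
# E2E scores (%) copied from the two leaderboard tables, keyed by (agent, model).
# These dictionaries and the helper below are illustrative, not an SREGym API.
without_noise = {
    ("Claude Code", "Claude Sonnet 4.6"): 60.7,
    ("Stratus", "Claude Sonnet 4.6"): 54.8,
    ("Codex", "GPT-5.4"): 53.3,
    ("Stratus", "Kimi K2.5"): 32.9,
}
with_noise = {
    ("Claude Code", "Claude Sonnet 4.6"): 53.7,
    ("Codex", "GPT-5.4"): 45.9,
    ("Stratus", "Claude Sonnet 4.6"): 40.2,
    ("Stratus", "Kimi K2.5"): 30.4,
}


def noise_drop(clean, noisy):
    """Return the E2E drop, in percentage points, for each (agent, model) pair."""
    return {key: round(clean[key] - noisy[key], 1) for key in clean}


drops = noise_drop(without_noise, with_noise)

# List the pairs from the smallest drop to the largest.
for (agent, model), drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{agent} / {model}: -{drop} points")
```

On these numbers, Stratus with Claude Sonnet 4.6 loses the most (14.6 points), which is why it falls from rank 2 to rank 3 once noise is injected.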