SREGym: Can AI agents resolve production issues? SREGym covers real-world SRE problems, including metastable failures, misconfigurations, and more, in live system environments. From the University of Illinois at Urbana-Champaign. To submit, open an issue with the submission label at github.com/SREGym/SREGym.

top agent performance

Rank  Agent        Model              E2E (%)
1     Claude Code  Claude Sonnet 4.6  60.7
2     Stratus      Claude Sonnet 4.6  54.8
3     Claude Code  Claude Sonnet 4.6  53.7
4     Codex        GPT-5.4            53.3
5     Codex        GPT-5.4            45.9
6     Stratus      Claude Sonnet 4.6  40.2
7     Stratus      Kimi K2.5          32.9
8     Stratus      Kimi K2.5          30.4