SREGym: Can AI agents resolve production issues?

SREGym poses real-world SRE problems, including metastable failures, misconfigurations, and many more, in live system environments. From the University of Illinois at Urbana-Champaign. To submit, open an issue with the submission label at github.com/SREGym/SREGym.
Top agent performance

| Rank | Agent | Model | E2E (%) |
|---|---|---|---|
| 1 | Claude Code | Claude Sonnet 4.6 | 60.7 |
| 2 | Stratus | Claude Sonnet 4.6 | 54.8 |
| 3 | Claude Code | Claude Sonnet 4.6 | 53.7 |
| 4 | Codex | GPT-5.4 | 53.3 |
| 5 | Codex | GPT-5.4 | 45.9 |
| 6 | Stratus | Claude Sonnet 4.6 | 40.2 |
| 7 | Stratus | Kimi K2.5 | 32.9 |
| 8 | Stratus | Kimi K2.5 | 30.4 |