Introduction
SREGym is a unified platform for designing, developing, and evaluating AI agents for Site Reliability Engineering (SRE).
SRE Problems
Problems in SREGym consist of three components: an application, a fault, and an oracle. When evaluating a problem, SREGym first deploys the application specified in the problem. After deployment, the fault is injected into the system to cause the incident. Then, SREGym begins evaluating the agent and uses the oracle as the ground truth for the problem's solution.
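The evaluation flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Problem`, `evaluate`, and the toy "system" dictionary are hypothetical names invented for this example and are not SREGym's actual API, which deploys real applications to a Kubernetes cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical types for illustration only; not SREGym's actual classes.
System = Dict[str, str]  # toy stand-in for cluster state: service name -> status


@dataclass
class Problem:
    deploy_application: Callable[[], System]  # returns a freshly deployed system
    inject_fault: Callable[[System], None]    # mutates the system to cause the incident
    oracle: Callable[[System, str], bool]     # ground truth used to judge the agent's answer


def evaluate(problem: Problem, agent: Callable[[System], str]) -> bool:
    system = problem.deploy_application()  # 1. deploy the application
    problem.inject_fault(system)           # 2. inject the fault
    answer = agent(system)                 # 3. let the agent investigate
    return problem.oracle(system, answer)  # 4. compare the answer with the oracle


def deploy() -> System:
    return {"frontend": "running", "cart": "running"}


def crash_cart(system: System) -> None:
    system["cart"] = "crashed"


def cart_oracle(system: System, answer: str) -> bool:
    return answer == "cart"


def naive_agent(system: System) -> str:
    # Report the first service that is not running (empty string if none).
    for name, status in system.items():
        if status != "running":
            return name
    return ""


toy_problem = Problem(deploy, crash_cart, cart_oracle)
print(evaluate(toy_problem, naive_agent))
```

Keeping the oracle separate from the fault means the same fault could, in principle, be paired with different success criteria, such as diagnosis only versus full mitigation; whether SREGym does this is not stated here.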
See our registry for a complete list of problems.
SREGym is built to be extensible, and we always welcome new contributions. See Contributing to get started.
Getting Started
- Installation - Install SREGym and set up your development environment
- Cluster Setup - Set up your Kubernetes cluster (emulated or full)
- Quick Start - Run your first agent
- Troubleshooting - Solutions to common problems
Using SREGym
- MCP Tools - Complete reference for MCP tools available in SREGym
- Running Your Own Agent - Guide to registering and running custom agents
Development
- Development Guide - Testing tools and development workflow
- Contributing - How to contribute to SREGym
