Introduction

SREGym is a unified platform for designing, developing, and evaluating AI agents for Site Reliability Engineering (SRE).

SRE Problems

Problems in SREGym consist of three components: an application, a fault, and an oracle. When evaluating a problem, SREGym first deploys the application specified in the problem. After deployment, the fault is injected into the system to cause the incident. Then, SREGym begins evaluating the agent and uses the oracle as the ground truth for the problem's solution.
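The lifecycle above can be sketched in a few lines of Python. This is only an illustration of the deploy, inject, evaluate ordering and is not SREGym's actual API: the `Problem` class, the `evaluate` function, the agent callable, and the application and fault names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    """Hypothetical container for the three components of a problem."""
    application: str               # name of the application to deploy
    fault: str                     # name of the fault to inject
    oracle: Callable[[str], bool]  # True if the agent's answer matches the ground truth


def evaluate(problem: Problem, agent: Callable[[str], str]) -> Tuple[bool, List[str]]:
    """Run the phases in order, recording each one in a log."""
    log: List[str] = []
    log.append(f"deploy:{problem.application}")  # 1. deploy the application
    log.append(f"inject:{problem.fault}")        # 2. inject the fault to cause the incident
    answer = agent(problem.application)          # 3. the agent investigates and answers
    log.append("evaluate")
    return problem.oracle(answer), log           # the oracle judges the answer


# Illustrative problem and a trivial agent that always gives the same diagnosis.
problem = Problem(
    application="demo-app",
    fault="pod-failure",
    oracle=lambda answer: answer == "pod-failure",
)
passed, log = evaluate(problem, agent=lambda app: "pod-failure")
```

Modeling the oracle as a callable keeps the ground truth separate from the agent, so the same problem can be used to score different agents.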

See our registry for a complete list of problems.

SREGym is built to be extensible, and we welcome new contributions. See Contributing to get started.

Getting Started

Installation - Install SREGym and set up your development environment
Cluster Setup - Set up your Kubernetes cluster (emulated or full)
Quick Start - Run your first agent
Troubleshooting - Guide to resolving common problems

Using SREGym

MCP Tools - Complete reference for MCP tools available in SREGym
Running Your Own Agent - Guide to registering and running custom agents
LLM Backend - High-level overview of how SREGym uses and configures language model backends

Development

Development Guide - Testing tools and development workflow
Contributing - How to contribute to SREGym