AI Evals Overview
Overview of Statsig AI Evals for evaluating prompts and models with offline and online graders, currently available in private beta for AI applications.
AI Evals is currently in beta. Statsig is not accepting new beta customers at this time.
What are AI evals?
Statsig AI Evals has three core components for iterating on and serving LLM apps in production.
- Prompts: A prompt in Statsig represents an LLM prompt and its associated config (model provider, model, temperature, and similar settings). It typically corresponds to one task for the LLM (for example, "Classify this ticket to a triage queue" or "Summarize this text"). You can version prompts, choose which version is live in Statsig, and retrieve prompts in production using the Statsig server SDKs. Prompts can serve as the control plane for your LLM apps even if you don't use the rest of the Evals product suite.
- Offline Evals: Offline evals provide quick, automated grading of model outputs on a fixed test set. They surface wins and regressions before any real users are exposed. For example, you can compare a new support bot's replies to human-curated answers to decide if the bot is ready to ship. You can also grade output without a reference dataset (for example, by using an LLM to validate an English-to-French translation).
- Online Evals: Online evals let you grade model output in production on real-world use cases. You can run the "live" version of a prompt and also shadow-run "candidate" versions without exposing users to them. Grading works directly on the model output and does not require a ground truth to compare against.
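As an illustration of the offline flow described above, here is a minimal sketch in plain Python. It does not use the Statsig SDK or reflect how Statsig implements evals: `PromptVersion`, `EvalCase`, `run_offline_eval`, and `fake_model` are hypothetical names invented for this example, and `fake_model` is a stub standing in for a real LLM call so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical stand-in for a versioned prompt and its config."""
    name: str
    template: str
    model: str
    temperature: float


@dataclass(frozen=True)
class EvalCase:
    """One row of a fixed test set: an input and a human-curated answer."""
    input_text: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Heuristic grader: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_eval(
    prompt: PromptVersion,
    test_set: Sequence[EvalCase],
    call_model: Callable[[PromptVersion, str], str],
    grader: Callable[[str, str], float],
) -> float:
    """Run the prompt on every test case and return the mean grade."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    scores = [
        grader(call_model(prompt, case.input_text), case.expected)
        for case in test_set
    ]
    return sum(scores) / len(scores)


def fake_model(prompt: PromptVersion, input_text: str) -> str:
    """Stub for an LLM call (assumption: replace with a real client)."""
    text = input_text.lower()
    if "outage" in text:
        return "High"
    # Only the v2 stub recognizes refunds, to simulate an improved prompt.
    if prompt.name == "triage-v2" and "refund" in text:
        return "Medium"
    return "Low"


TEST_SET = [
    EvalCase("Full outage in production", "High"),
    EvalCase("Refund request for last invoice", "Medium"),
    EvalCase("Typo on the pricing page", "Low"),
    EvalCase("Question about dark mode", "Low"),
]

live = PromptVersion("triage-v1", "Classify this ticket: {ticket}", "model-a", 0.0)
candidate = PromptVersion("triage-v2", "Classify this ticket by urgency: {ticket}", "model-a", 0.0)

live_score = run_offline_eval(live, TEST_SET, fake_model, exact_match)
candidate_score = run_offline_eval(candidate, TEST_SET, fake_model, exact_match)
print(f"live={live_score:.2f} candidate={candidate_score:.2f}")
```

Because the test set is fixed, the two scores are directly comparable, which is what lets a regression show up before any user sees the candidate version.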
Gates, Experiments, and Analytics
The standard Statsig product capabilities are also available for use with AI Evals. For example, you can target an LLM feature at users who meet specific criteria using a Feature Gate, or roll out a new prompt version as an Experiment to understand its impact on metrics.
LLM as a judge
Some grading can use heuristics (for example, checking if the AI-generated output matches the ideal answer when the output is as simple as High, Medium, or Low). Other grading can't use heuristics: for example, deciding whether "Your ticket has been escalated" and "This ticket has been escalated" mean the same thing.
LLM-as-a-judge lets you evaluate AI outputs at scale without requiring many human reviewers. It mimics how a human assesses quality: not perfect, but fast, consistent, and useful for comparing different versions of your model or prompt. For the escalated-ticket example above, you could write an LLM-as-a-judge prompt such as "Score how close this answer is to the ideal one on a scale of 0 to 1.0".
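The contrast between a heuristic grader and an LLM judge can be sketched as follows. This is an illustrative outline, not Statsig's grader implementation: `heuristic_grade`, `llm_judge_grade`, `parse_score`, and `fake_judge` are hypothetical names, and `fake_judge` returns a canned reply in place of a real judge model, so the 0.9 it produces is a placeholder rather than a real measurement.

```python
import re
from typing import Callable

# Judge prompt built around the example wording from the text above.
JUDGE_TEMPLATE = (
    "Score how close this answer is to the ideal one on a scale of 0 to 1.0.\n"
    "Ideal answer: {ideal}\n"
    "Answer: {answer}\n"
    "Reply with only the number."
)


def heuristic_grade(answer: str, ideal: str) -> float:
    """Exact match after normalization; suits labels like High/Medium/Low."""
    return 1.0 if answer.strip().lower() == ideal.strip().lower() else 0.0


def parse_score(reply: str) -> float:
    """Take the first number in the judge's reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no numeric score in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def llm_judge_grade(
    answer: str, ideal: str, call_judge: Callable[[str], str]
) -> float:
    """Ask a judge model to score the answer against the ideal one."""
    prompt = JUDGE_TEMPLATE.format(ideal=ideal, answer=answer)
    return parse_score(call_judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stub judge with a canned reply (assumption: swap in a real LLM call)."""
    return "0.9"


ideal = "Your ticket has been escalated"
answer = "This ticket has been escalated"

# The heuristic sees two different strings, so it scores the answer 0.0.
print(heuristic_grade(answer, ideal))
# The judge path returns whatever score the judge model assigns.
print(llm_judge_grade(answer, ideal, fake_judge))
```

Passing the judge in as a callable keeps the grading logic testable with a stub and makes it straightforward to change the judge model later. Clamping in `parse_score` guards against a judge that replies outside the requested range.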