Offline Evals
What are Offline Evals?
Offline evals offer quick, automated grading of model outputs on a fixed test set. They catch wins and regressions early, before any real users are exposed. For example, compare a new support bot's replies to gold (human-curated) answers to decide whether it is good enough to ship.
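As a minimal sketch, that grading step could look like the following. The `get_bot_reply` function and the exact-match metric are hypothetical stand-ins for illustration; real evals often use a similarity metric or an LLM judge instead.

```python
# Minimal sketch: grade a support bot's replies against gold answers.
# `get_bot_reply` is a hypothetical stand-in for your own model call.
test_set = [
    {"question": "How do I reset my password?",
     "gold": "Use the 'Forgot password' link on the sign-in page."},
    {"question": "Can I change my billing email?",
     "gold": "Yes, update it under Settings > Billing."},
]

def exact_match(output: str, gold: str) -> float:
    # Crude metric for illustration; swap in whatever grader fits your task.
    return 1.0 if output.strip().lower() == gold.strip().lower() else 0.0

def run_eval(get_bot_reply) -> float:
    # Average score across the fixed test set.
    scores = [exact_match(get_bot_reply(row["question"]), row["gold"]) for row in test_set]
    return sum(scores) / len(scores)
```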
Steps to do this on Statsig:
- Create a Prompt. This contains the prompt for your task (e.g. classify tickets as high, medium, or low urgency based on ticket text).
- Upload a sample dataset with example inputs and ideal answers (e.g. Ticket 1 text, High; Ticket 2 text, Low).
- Run your AI on that dataset to produce outputs (e.g. classify each ticket in the example set).
- Grade or score the outputs. You can do this by comparing the ideal answer in the dataset with the output your AI generated.
- Create multiple versions of your prompt. Compare scores across versions and promote the best one to Live (see the sketch after this list).
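To make the run-grade-compare loop concrete, here is a rough sketch in Python. The `run_prompt` helper, the dataset rows, and the prompt versions are all assumptions for illustration; this is the shape of the workflow, not the Statsig SDK.

```python
# Rough sketch of the offline-eval loop: run each prompt version over the
# same labeled dataset, score each output against the ideal answer, and
# pick the best-scoring version to promote to Live.
# `run_prompt(prompt, text)` is a hypothetical wrapper around your model call.
dataset = [
    {"input": "Site is down for all our users!", "ideal": "High"},
    {"input": "Typo on the pricing page.", "ideal": "Low"},
]

prompt_versions = {
    "v1": "Classify the ticket as High, Medium, or Low urgency:",
    "v2": "You are a support triager. Label urgency (High/Medium/Low):",
}

def score_version(prompt: str, run_prompt) -> float:
    hits = 0
    for row in dataset:
        output = run_prompt(prompt, row["input"])                      # run your AI on the example
        hits += int(output.strip().lower() == row["ideal"].lower())    # grade vs. ideal answer
    return hits / len(dataset)

def pick_best(run_prompt) -> str:
    # Compare scores across versions; the winner is the candidate for Live.
    scores = {name: score_version(p, run_prompt) for name, p in prompt_versions.items()}
    return max(scores, key=scores.get)
```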