Offline Evals

Run offline AI evaluations in Statsig to grade model outputs against fixed test sets and catch regressions before exposing changes to real users.

What are offline evals

Offline evals provide quick, automated grading of model outputs on a fixed test set. They catch wins and regressions before any real users are exposed. For example, compare a new support bot's replies to human-curated answers to decide if the bot is ready to ship.

Steps to run offline evals on Statsig:

  1. Create a Prompt that contains the instruction for your task (for example, "Classify tickets as high, medium, or low urgency based on ticket text").
  2. Upload a sample dataset with example inputs and ideal answers (for example, Ticket1 text paired with High; Ticket2 text paired with Low).
  3. Run your AI on the dataset to produce output (for example, classify each ticket).
  4. Grade or score the outputs by comparing the ideal answers in the dataset with the AI-generated output.
  5. Create multiple versions of your prompts, compare scores across versions, and promote the best one to Live.
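
The steps above can be sketched locally as a small Python loop. This is only an illustration under stated assumptions, not Statsig's SDK or API: the dataset rows, the `classify_v1` and `classify_v2` stand-ins for LLM calls, and the `run_eval` helper are all hypothetical names invented for this example.

```python
from typing import Callable

# Hypothetical dataset: (input, ideal answer) pairs, as in step 2.
DATASET = [
    ("Checkout is down for all customers", "high"),
    ("Please update my billing address", "low"),
    ("Export to CSV fails intermittently", "medium"),
]


def classify_v1(ticket: str) -> str:
    """Stand-in for an LLM call with prompt version 1 (always answers 'low')."""
    return "low"


def classify_v2(ticket: str) -> str:
    """Stand-in for an LLM call with prompt version 2 (simple keyword rules)."""
    text = ticket.lower()
    if "down" in text:
        return "high"
    if "fails" in text:
        return "medium"
    return "low"


def run_eval(model: Callable[[str], str]) -> float:
    """Steps 3 and 4: run the model on every row and return the exact-match rate."""
    hits = sum(1 for ticket, ideal in DATASET if model(ticket) == ideal)
    return hits / len(DATASET)


# Step 5: score each version and pick the one with the highest score.
scores = {"v1": run_eval(classify_v1), "v2": run_eval(classify_v2)}
best_version = max(scores, key=scores.get)
```

In Statsig, the prompt versions, dataset, and grader are configured in the product rather than in code; the sketch only shows the shape of the loop.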

Create and analyze an offline eval in 10 minutes

1. Create a Prompt within Statsig

This captures the instruction you provide to an LLM to accomplish your task. Use the Statsig Node or Python Server Core SDKs to retrieve this prompt within your app. You can create multiple versions of the prompt as you iterate and choose which one is "live" (retrieved by the SDK).

Statsig prompt editor listing live and candidate versions with messages

2. Create a dataset you can use to evaluate LLM completions for your prompt

For a translation prompt, for example, this might be a list of words alongside known-good French translations. You can enter small lists manually or upload a CSV.

Dataset creation table with translation pairs for offline evaluation
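
As a rough sketch of preparing such a CSV, the snippet below writes a few translation pairs with Python's standard `csv` module. The column names `input` and `expected_output`, the use of English source words, and the sample rows are assumptions made for illustration; check the upload dialog for the columns Statsig actually expects.

```python
import csv
import io

# Hypothetical rows: the column names are assumptions, not a documented schema.
ROWS = [
    {"input": "cat", "expected_output": "chat"},
    {"input": "dog", "expected_output": "chien"},
    {"input": "house", "expected_output": "maison"},
]


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["input", "expected_output"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(ROWS)
```

Writing `csv_text` to a file gives you something to upload; building it in memory first makes it easy to verify the rows round-trip through `csv.DictReader`.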

3. Create a grader that grades LLM completions for your prompt

Configure a grader that compares the LLM completion text with the reference output. Use one of the built-in string evaluators, or configure an LLM-as-a-judge evaluator that mimics a human's grading rubric.

Grader configuration form comparing model output against reference answers
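
To make the idea of a string evaluator concrete, here is a minimal sketch of two common comparisons: a normalized exact match and a substring check. These functions are hypothetical and are not Statsig's built-in evaluators; the normalization rule (lowercasing and collapsing whitespace) is also an assumption chosen for the example.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(completion: str, reference: str) -> float:
    """Score 1.0 when the normalized completion equals the normalized reference."""
    return 1.0 if normalize(completion) == normalize(reference) else 0.0


def contains(completion: str, reference: str) -> float:
    """Score 1.0 when the normalized reference appears inside the completion."""
    return 1.0 if normalize(reference) in normalize(completion) else 0.0
```

An exact match is strict and suits short, closed-form answers such as labels. A substring check tolerates extra words around the answer, at the cost of accepting completions that merely mention it. For open-ended output where neither fits, an LLM-as-a-judge grader is the usual alternative.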

4. Run evaluation

Run an evaluation on a version of the prompt. Results should appear within a few minutes and look like the table below. Click any row of the dataset to see the details of the evaluation for that row.

Offline evaluation results table showing prompt version scores

You can categorize your dataset and break scores out by category.

Category breakdown chart splitting evaluation scores by dataset segments
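
The category breakdown amounts to grouping per-row scores and averaging within each group. The sketch below shows that computation on made-up data; the `category` and `score` field names and the `mean_score_by_category` helper are hypothetical and do not reflect Statsig's export format.

```python
from collections import defaultdict

# Hypothetical graded rows: one score per dataset row, tagged with a category.
RESULTS = [
    {"category": "animals", "score": 1.0},
    {"category": "animals", "score": 0.0},
    {"category": "food", "score": 1.0},
    {"category": "food", "score": 1.0},
]


def mean_score_by_category(results):
    """Return the average score for each category."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [score sum, row count]
    for row in results:
        bucket = totals[row["category"]]
        bucket[0] += row["score"]
        bucket[1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}


by_category = mean_score_by_category(RESULTS)
```

A per-category view like this can surface a weak segment that an overall average would hide.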

If you have scores for multiple versions, you can compare them to see what changed between versions.

Comparison view charting multiple prompt versions across graders
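
One way to reason about a version comparison is to line up per-row scores from two versions and list which rows improved and which regressed. The following sketch assumes you have such per-row scores keyed by a row identifier; the row IDs, the scores, and the `compare_versions` helper are hypothetical.

```python
# Hypothetical per-row scores for two prompt versions, keyed by row ID.
V1_SCORES = {"row-1": 1.0, "row-2": 0.0, "row-3": 1.0}
V2_SCORES = {"row-1": 1.0, "row-2": 1.0, "row-3": 0.0}


def compare_versions(baseline, candidate):
    """Summarize row-level changes between a baseline and a candidate version."""
    shared = sorted(baseline.keys() & candidate.keys())  # rows scored in both
    improved = [r for r in shared if candidate[r] > baseline[r]]
    regressed = [r for r in shared if candidate[r] < baseline[r]]
    mean_delta = sum(candidate[r] - baseline[r] for r in shared) / len(shared)
    return {"improved": improved, "regressed": regressed, "mean_delta": mean_delta}


summary = compare_versions(V1_SCORES, V2_SCORES)
```

In this made-up data the average score is unchanged, yet one row improved and another regressed, which is why row-level comparison is worth checking alongside the aggregate.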
