Offline Evals

What are Offline Evals

Offline evals provide quick, automated grading of model outputs against a fixed test set. They surface wins and regressions early, before any real users are exposed. For example, you might compare a new support bot's replies to gold (human-curated) answers to decide whether it is good enough to ship.
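
Conceptually, an offline eval reduces to scoring each model output against its gold answer and aggregating the results. The sketch below illustrates this with a simple exact-match grader; the dataset and function names are illustrative, not part of the Statsig API.

```python
# Minimal sketch of offline-eval scoring: compare model outputs to gold answers.
# The dataset and helpers here are illustrative, not the Statsig API.

dataset = [
    {"input": "How do I reset my password?", "ideal": "Use the 'Forgot password' link on the sign-in page."},
    {"input": "Can I get a refund?",         "ideal": "Refunds are available within 30 days of purchase."},
]

def grade(output: str, ideal: str) -> float:
    """Exact-match grader; real evals often use similarity scoring or an LLM judge."""
    return 1.0 if output.strip().lower() == ideal.strip().lower() else 0.0

def run_eval(model_fn, dataset) -> float:
    """Run the model on every example in the fixed test set and return the average score."""
    scores = [grade(model_fn(row["input"]), row["ideal"]) for row in dataset]
    return sum(scores) / len(scores)
```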

Steps to do this on Statsig:

  1. Create a Prompt. This contains the prompt for your task (e.g. classify tickets as high, medium, or low urgency based on the ticket text).
  2. Upload a sample dataset with example inputs and ideal answers (e.g. Ticket 1 text, High; Ticket 2 text, Low).
  3. Run your AI on that dataset to produce outputs (e.g. classify each ticket in the dataset).
  4. Grade or score the outputs, for example by comparing the ideal answer in the dataset with the output your AI generated.
  5. Create multiple versions of your prompt, compare scores across versions, and promote the best one to Live (see the sketch after this list).
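
To make the steps concrete, the sketch below walks a tiny ticket-urgency dataset through steps 2 through 5: run each prompt version over the dataset, grade by exact match against the ideal label, and promote whichever version scores highest. The prompt texts, the classify_ticket helper, and the dataset are hypothetical; on Statsig these steps are performed through the product rather than hand-written code like this.

```python
# Illustrative end-to-end sketch of steps 2-5 (hypothetical helpers, not the Statsig API).

dataset = [  # step 2: example inputs paired with ideal answers
    {"ticket": "Site is down, customers cannot check out!", "ideal": "High"},
    {"ticket": "Typo on the pricing page footer.",          "ideal": "Low"},
]

prompt_versions = {  # step 5: multiple versions of the same prompt
    "v1": "Classify the ticket as High, Medium, or Low urgency.",
    "v2": "You are a support triage expert. Label the ticket's urgency: High, Medium, or Low.",
}

def classify_ticket(prompt: str, ticket: str) -> str:
    """Stand-in for a real LLM call (prompt + ticket would be sent to your model).
    A trivial keyword heuristic keeps the sketch runnable."""
    return "High" if "down" in ticket.lower() else "Low"

def score_version(prompt: str) -> float:
    # Steps 3-4: run the model on each example and grade by exact match against the ideal answer.
    scores = [
        1.0 if classify_ticket(prompt, row["ticket"]).strip() == row["ideal"] else 0.0
        for row in dataset
    ]
    return sum(scores) / len(scores)

# Step 5: compare scores across versions and promote the best one to Live.
results = {name: score_version(prompt) for name, prompt in prompt_versions.items()}
best = max(results, key=results.get)
print(f"Best prompt version: {best} (scores: {results})")
```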