Getting Started
Create/analyze an offline eval in 10 minutes
(Coming soon: How to start an online eval in 15 minutes)
1. Create a Prompt within Statsig
A Prompt captures the instruction you provide to an LLM to accomplish your task, for example translating English words into French. You can then use the Statsig Node or Python Server Core SDKs to retrieve this prompt within your app and use it.
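As a minimal sketch, retrieving the prompt with the Python Server Core SDK might look like the snippet below. The prompt-retrieval call (`get_prompt`) and the prompt name are assumptions for illustration only; check the SDK reference for the exact API in your SDK version.

```python
# Minimal sketch: fetch a prompt from Statsig with the Python Server Core SDK.
# NOTE: get_prompt and "translate_to_french" are assumed names for illustration;
# consult the SDK reference for the exact prompt-retrieval API.
from statsig_python_core import Statsig, StatsigUser

statsig = Statsig("server-secret-key")
statsig.initialize().wait()

user = StatsigUser(user_id="eval-runner")
prompt = statsig.get_prompt(user, "translate_to_french")  # assumed method name
print(prompt)
```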
2. Create a dataset you can use to evaluate LLM completions for your prompt
For the translation example above, this might be a list of English words alongside known-good French translations. Small datasets can be entered by hand, or you can upload a CSV.
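As an illustration, a tiny CSV for the French-translation example might look like the following. The column names (input, reference_output, category) are assumptions; match them to whatever schema your dataset uses.

```csv
input,reference_output,category
hello,bonjour,greetings
goodbye,au revoir,greetings
cat,chat,animals
dog,chien,animals
```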
3. Create a grader that will grade LLM completions for your prompt
Configure a grader that compares the LLM completion text with the reference output. You can use one of the out-of-the-box string evaluators, or even configure an LLM-as-a-Judge evaluator that mimics a human's grading rubric.
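For intuition, here is a sketch of the kind of comparison an exact-match string grader performs. Built-in graders are configured in the Statsig console rather than written by hand; this is illustrative only.

```python
# Illustrative only: the comparison an exact-match string grader performs.
def exact_match_grade(completion: str, reference: str) -> float:
    """Return 1.0 when the completion matches the reference translation."""
    return 1.0 if completion.strip().lower() == reference.strip().lower() else 0.0

assert exact_match_grade("Bonjour", "bonjour") == 1.0
assert exact_match_grade("Salut", "bonjour") == 0.0
```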
4. Run an evaluation
Run an evaluation on a version of the prompt. Results should appear within a few minutes. You can click into any row of the dataset to see more detail about the evaluation for that row.
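Conceptually, an offline eval run does something like the sketch below: generate a completion for every dataset row with the prompt version under test, grade it, and aggregate the scores. Statsig runs this for you; generate_completion here is a placeholder for your own LLM call.

```python
from statistics import mean

def run_eval(prompt_template, dataset, grader, generate_completion):
    """Grade one prompt version against every row of the dataset."""
    rows = []
    for row in dataset:
        completion = generate_completion(prompt_template, row["input"])
        rows.append({
            **row,
            "completion": completion,
            "score": grader(completion, row["reference_output"]),
        })
    return {"overall_score": mean(r["score"] for r in rows), "rows": rows}
```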
You can categorize your dataset, and break scores out by category.
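The category breakdown amounts to grouping the graded rows by their category label and averaging each group, roughly as sketched below (field names follow the illustrative dataset above).

```python
from collections import defaultdict
from statistics import mean

def scores_by_category(rows):
    """Average the per-row scores within each dataset category."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row.get("category", "uncategorized")].append(row["score"])
    return {category: mean(scores) for category, scores in buckets.items()}
```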
If you have scores for multiple versions, you can compare them to see what changed between versions.
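A version comparison is then just the per-category difference between two runs' score summaries. The sketch below assumes each summary is a dict mapping category to average score.

```python
def version_delta(scores_v1: dict, scores_v2: dict) -> dict:
    """Score change per category between two prompt versions (v2 minus v1)."""
    categories = set(scores_v1) | set(scores_v2)
    return {c: scores_v2.get(c, 0.0) - scores_v1.get(c, 0.0) for c in categories}

# Example: greetings improved, animals unchanged.
print(version_delta({"greetings": 0.5, "animals": 0.5},
                    {"greetings": 0.75, "animals": 0.5}))
```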