
Prompt Management (AI Evals)

info

Prompts are currently in beta

What is a Prompt in Statsig?

A Prompt represents an LLM prompt or task in Statsig, along with its configuration. Prompts are similar to Dynamic Configs, and let you evaluate and roll out prompt changes in production without deploying code.

With Prompts, you can:

  • Manage your prompt configuration outside of your application code, and update the model, parameters, or prompt text at runtime (see the sketch after this list).
  • Collaborate and iterate on prompts with teammates who have access to Statsig, while benefiting from Statsig's production change-control processes and versioning.
  • Add configuration for a new model or model provider and progressively shift production traffic to it while comparing cost, user satisfaction, or any other metric of interest.
  • Support advanced use cases such as
    • retrieval-augmented generation (RAG) and
    • evaluation in production.
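
Because a Prompt behaves like a Dynamic Config, you can read its values at runtime with the Statsig server SDK. The sketch below (Python) assumes a hypothetical Prompt named support_bot_prompt with model, temperature, and system_prompt fields; the exact surface for Prompts may differ while the feature is in beta.

```python
# A minimal sketch of reading prompt configuration at runtime.
# Assumptions: Prompts are read like Dynamic Configs via the Statsig Python
# server SDK, and "support_bot_prompt" plus the keys "model", "temperature",
# and "system_prompt" are hypothetical examples.
from statsig import statsig, StatsigUser

statsig.initialize("server-secret-key")

user = StatsigUser("user-123")
prompt_config = statsig.get_config(user, "support_bot_prompt")

model = prompt_config.get("model", "gpt-4o-mini")
temperature = prompt_config.get("temperature", 0.2)
system_prompt = prompt_config.get("system_prompt", "You are a helpful support agent.")

# Pass these values to whatever LLM client your app already uses; updating
# the Prompt in the Statsig console changes them without a code deploy.
```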

What are Offline Evals?

Offline evals offer quick, automated grading of model outputs on a fixed test set. They catch wins and regressions early, before any real users are exposed. For example, compare a new support-bot's replies to gold (human-curated) answers to decide whether it is good enough to ship.

Steps to do this in Statsig:

  1. Create a Prompt. This contains the prompt for your task (e.g. classify tickets as High, Medium, or Low urgency based on the ticket text).
  2. Upload a sample dataset with example inputs and ideal answers (e.g. ticket 1 text, High; ticket 2 text, Low).
  3. Run your AI on that dataset to produce outputs (e.g. classify each ticket in the example).
  4. Grade or score the outputs. You can do this by comparing the ideal answer in the dataset with the output your AI generated (see the sketch after this list).
  5. Create multiple versions of your prompt, compare scores across versions, and promote the best one to Live.
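
As a rough illustration of steps 3 and 4, the sketch below runs a model function over a small dataset and grades with an exact-match heuristic. The dataset, the model_fn callable, and the classify_ticket helper are hypothetical placeholders for your own code.

```python
# A minimal offline-eval loop: run a model over a fixed dataset and grade.
from typing import Callable

# Hypothetical example dataset of inputs and ideal answers.
dataset = [
    {"input": "Site is down, losing revenue every minute!", "ideal": "High"},
    {"input": "Typo on the pricing page footer.", "ideal": "Low"},
]

def grade(output: str, ideal: str) -> float:
    # Heuristic grader: exact match works when the answer space is small
    # (High / Medium / Low). Use an LLM-as-a-judge for free-form outputs.
    return 1.0 if output.strip().lower() == ideal.strip().lower() else 0.0

def run_eval(model_fn: Callable[[str], str]) -> float:
    # Average score across the dataset for one prompt version.
    scores = [grade(model_fn(row["input"]), row["ideal"]) for row in dataset]
    return sum(scores) / len(scores)

# Hypothetical usage: classify_ticket(text, version=...) is your own model call.
# score_v1 = run_eval(lambda text: classify_ticket(text, version="v1"))
# score_v2 = run_eval(lambda text: classify_ticket(text, version="v2"))
```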

What are Online Evals?

Online evals let you grade your model output in production on real-world use cases. You run the "live" version of a prompt, and can also shadow-run "candidate" versions of a prompt without exposing users to them. Grading works directly on the model output, and has to work without a ground truth to compare against.

Steps to do this in Statsig:

  1. Create a Prompt. This contains the prompt for your task (e.g. summarize ticket content without including email addresses or credit card numbers in the summary). Then create a v2 prompt that improves on it.
  2. In your app, produce model output using both the v1 and v2 prompts. The output from v1 is rendered to the user; the outputs from v1 and v2 are graded by an LLM-as-a-judge.
  3. The grades from v1 and v2 are logged back to Statsig and can be compared there (see the sketch after this list).
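
A minimal sketch of this flow is below, assuming hypothetical summarize() and judge() helpers for your model call and your LLM-as-a-judge grader. The statsig.log_event() call follows the Python server SDK's documented pattern, though the event name and metadata keys here are illustrative.

```python
# A minimal online-eval sketch: render the live version, shadow-run the
# candidate, grade both, and log the grades back to Statsig.
# Assumes statsig.initialize() has already been called, and that summarize()
# and judge() are hypothetical helpers defined elsewhere in your app.
from statsig import statsig, StatsigUser
from statsig.statsig_event import StatsigEvent

def handle_ticket(user: StatsigUser, ticket_text: str) -> str:
    live_summary = summarize(ticket_text, prompt_version="v1")       # shown to the user
    candidate_summary = summarize(ticket_text, prompt_version="v2")  # shadow run only

    for version, summary in [("v1", live_summary), ("v2", candidate_summary)]:
        score = judge(summary)  # LLM-as-a-judge grade, e.g. 0-5
        statsig.log_event(StatsigEvent(
            user,
            "prompt_eval_grade",
            value=score,
            metadata={"prompt_version": version},
        ))

    return live_summary  # only the live version reaches the user
```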

LLM as a Judge

Some grading can use heuristics (e.g. checking whether the AI-generated output matches the ideal answer in the dataset when the output is as simple as High, Medium, or Low). Some grading can't: you might be trying to decide whether "Your ticket has been escalated" and "This ticket has been escalated" mean the same thing. LLM-as-a-judge lets you quickly and cheaply evaluate AI outputs at scale without needing many human reviewers. It mimics how a human would assess quality, and while not perfect, it's fast, consistent, and good enough to compare different versions of your model or prompt. In this example, we could write an LLM-as-a-judge prompt: "Score how close this answer is to the ideal one (0–5)".
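
As a rough sketch, an LLM-as-a-judge grader for this example could look like the following (using the OpenAI Python SDK; the judge prompt wording, model choice, and 0–5 scale are illustrative, not a Statsig-provided grader):

```python
# A minimal LLM-as-a-judge sketch: ask a model to score a candidate answer
# against an ideal one on a 0-5 scale.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Score how close the candidate answer is to the ideal one on a 0-5 scale, "
    "where 5 means they convey the same meaning. Reply with only the number.\n\n"
    "Ideal answer: {ideal}\nCandidate answer: {candidate}"
)

def judge(candidate: str, ideal: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(ideal=ideal, candidate=candidate),
        }],
    )
    return int(response.choices[0].message.content.strip())

# judge("Your ticket has been escalated", "This ticket has been escalated")
# would likely return 5, since the two answers mean the same thing.
```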