AI Configs

Info: AI Configs are currently in beta.

What is an AI Config?

An AI Config is a way to represent an LLM prompt or task in Statsig. AI Configs are similar to Dynamic Configs and let you evaluate and roll out prompts in production without deploying code.

With AI Configs, you can:

  • Manage your prompt configuration outside of your application code, and update the model, configuration, or prompt at runtime (see the sketch after this list).
  • Let teammates who have access to Statsig collaborate and iterate on prompts, while benefiting from Statsig's production change control processes and versioning.
  • Add configuration for a new model or model provider and progressively shift production traffic to it while comparing cost, user satisfaction, or any other metric of interest.
  • Support advanced use cases such as retrieval-augmented generation (RAG) and evaluation in production.
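
As a rough illustration, the runtime lookup could look like the minimal Python sketch below. It assumes an AI Config named ticket_urgency_classifier with prompt and model keys, fetched with the server SDK's get_config call (the same pattern Dynamic Configs use); the config name, the keys, and the OpenAI call are illustrative, and the beta AI Config API may differ.

```python
# Minimal sketch: read prompt and model settings from Statsig at runtime.
# The AI Config name "ticket_urgency_classifier" and its "prompt"/"model"
# keys are hypothetical; the beta AI Config API may differ from the
# Dynamic Config-style call shown here.
from openai import OpenAI
from statsig import statsig, StatsigUser

statsig.initialize("server-secret-key")  # your Statsig server secret
llm = OpenAI()                           # reads OPENAI_API_KEY from the environment

def classify_ticket(user_id: str, ticket_text: str) -> str:
    user = StatsigUser(user_id)
    config = statsig.get_config(user, "ticket_urgency_classifier")

    # Prompt and model live in Statsig, so they can change without a deploy.
    prompt = config.get("prompt", "Classify this ticket as High, Medium, or Low urgency.")
    model = config.get("model", "gpt-4o-mini")

    response = llm.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": ticket_text},
        ],
    )
    return response.choices[0].message.content.strip()
```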

What are Offline Evals?

Offline evals offer quick, automated grading of model outputs on a fixed test set. They catch wins and regressions early, before any real users are exposed. For example, you can compare a new support bot's replies to gold (human-curated) answers to decide whether it is good enough to ship.

Steps to do this on Statsig:

  1. Create an AI Config. This contains the prompt for your task (e.g., classify tickets as high, medium, or low urgency based on ticket text).
  2. Upload a sample dataset with example inputs and ideal answers (e.g., ticket 1 text, High; ticket 2 text, Low).
  3. Run your AI on that dataset to produce outputs (e.g., classify each ticket in this example).
  4. Grade or score the outputs. You can do this by comparing the ideal answer in the dataset with the output your AI generated (a sketch of steps 3 and 4 follows this list).
  5. Create multiple versions of your prompt. Compare scores across versions and promote the best one to Live.
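
As a sketch of steps 3 and 4, the snippet below runs the hypothetical classify_ticket helper from the earlier example over a tiny labeled dataset and grades the outputs with a simple exact-match check. The dataset and scoring loop are purely illustrative and are not a Statsig API; in practice you would upload the dataset to Statsig and record the scores there.

```python
# Sketch of steps 3-4: run the prompt over a small labeled dataset and score
# outputs with an exact-match heuristic. Uses the classify_ticket helper from
# the sketch above; the dataset below is made up for illustration.
dataset = [
    {"input": "The site is down and customers cannot check out!", "ideal": "High"},
    {"input": "Typo on the pricing page footer.", "ideal": "Low"},
]

def run_offline_eval(examples) -> float:
    correct = 0
    for example in examples:
        output = classify_ticket("offline-eval", example["input"])  # step 3: produce outputs
        if output.strip().lower() == example["ideal"].lower():      # step 4: grade vs. ideal answer
            correct += 1
    return correct / len(examples)

print(f"Exact-match score: {run_offline_eval(dataset):.2f}")
```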

LLM as a Judge

Some grading can use heuristics (e.g., check whether the AI-generated output matches the ideal answer in the dataset when the output is as simple as High, Medium, or Low). Some grading can't: you might be trying to decide whether "Your ticket has been escalated" and "This ticket has been escalated" mean the same thing. LLM-as-a-judge lets you quickly and cheaply evaluate AI outputs at scale without needing large numbers of human reviewers. It mimics how a human would assess quality, and while not perfect, it is fast, consistent, and good enough to compare different versions of your model or prompt. In this example, we could write an LLM-as-a-judge prompt such as "Score how close this answer is to the ideal one (0–5)"; a sketch follows below.
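
Here is a minimal sketch of that judge, assuming the OpenAI Python client as the judging model. The judge prompt wording and the 0–5 scale come from the example above; any LLM provider could be substituted, and in practice you would also handle non-numeric replies.

```python
# Minimal LLM-as-a-judge sketch: ask a model to score how close a candidate
# answer is to the ideal answer on a 0-5 scale. The judge prompt and model
# choice are illustrative, not a Statsig API.
from openai import OpenAI

llm = OpenAI()

JUDGE_PROMPT = (
    "Score how close the candidate answer is to the ideal answer on a scale "
    "of 0 (completely different) to 5 (same meaning). Reply with the number only."
)

def judge(ideal: str, candidate: str, model: str = "gpt-4o-mini") -> int:
    response = llm.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Ideal: {ideal}\nCandidate: {candidate}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# The two escalation messages from the example above should score as equivalent.
print(judge("Your ticket has been escalated", "This ticket has been escalated"))
```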