Running A/B Tests
Run experiments on Azure OpenAI prompts, models, and parameters with Statsig, including variant configuration, exposure logging, and result analysis.
The Azure AI SDK lets you run A/B tests to measure the effectiveness of different models and parameters. Using Statsig's stats engine, you gain real-time insights into model performance across metrics such as cost, accuracy, and latency. You can experiment with configurations including model type, prompt settings, and response parameters, then make data-driven decisions to improve your application.
Example: Test GPT-4o vs. GPT-4o mini
Step 1: Create configs
Create two dynamic configs, one named gpt-4o and the other named gpt-4o-mini. In the Value section of each config, add the endpoint, key, and other default parameters.

These serve as the base deployment configs for the tests and let you modify parameters dynamically after launch.
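As a rough illustration of what such a config value might hold, the sketch below models the two configs as plain JavaScript objects and checks that each carries the two fields this step names, endpoint and key. The `completion_defaults` grouping, the placeholder URLs, and the `missingFields` helper are assumptions made for this example, not a confirmed SDK schema.

```javascript
// Hypothetical config values. Only "endpoint" and "key" come from this guide;
// the "completion_defaults" grouping and all values are illustrative assumptions.
const gpt4oConfig = {
  endpoint: "https://example.invalid/gpt-4o", // placeholder, not a real endpoint
  key: "<YOUR_API_KEY>",
  completion_defaults: { temperature: 1, max_tokens: 256, frequency_penalty: 0 },
};

const gpt4oMiniConfig = {
  endpoint: "https://example.invalid/gpt-4o-mini", // placeholder
  key: "<YOUR_API_KEY>",
  completion_defaults: { temperature: 1, max_tokens: 256, frequency_penalty: 0 },
};

// Hypothetical helper: lists the required fields that are missing or empty.
function missingFields(config, required = ["endpoint", "key"]) {
  return required.filter(
    (field) => typeof config[field] !== "string" || config[field].length === 0,
  );
}

console.log(missingFields(gpt4oConfig));
console.log(missingFields(gpt4oMiniConfig));
```

Keeping both configs in the same shape means the experiment only has to switch the config name, not the code that reads it.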
Step 2: Create some metrics to track
This example uses a latency metric to show how to create metrics in Statsig.
Navigate to the Metrics Catalog page at https://console.statsig.com/metrics/metrics_catalog and click Create.

Now, in the Metric Definition section, choose:
| Property | Value |
|---|---|
| Metric Type | Aggregation |
| ID Type | User ID |
| Aggregation Using | Events |
| Aggregation Type | Average |
| Rollup Mode | Total Experiment |
| Event | usage |
| Average Using | Metadata => latency_ms |
This creates a metric that averages latency across all usage events coming from chat completions.
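To make the aggregation concrete, here is a minimal sketch of the arithmetic this metric describes: take the events named usage, read latency_ms from their metadata, and average the values. The event object shape (`eventName`, `metadata`) and the sample numbers are assumptions for illustration. Statsig computes the real metric for you.

```javascript
// Averages metadata.latency_ms over events named "usage".
// The event shape used here is an assumption for illustration only.
function averageLatencyMs(events) {
  const latencies = events
    .filter((event) => event.eventName === "usage")
    .map((event) => Number(event.metadata?.latency_ms))
    .filter((ms) => Number.isFinite(ms)); // skip events without a usable value
  if (latencies.length === 0) return null;
  return latencies.reduce((sum, ms) => sum + ms, 0) / latencies.length;
}

// Made-up sample events.
const sampleEvents = [
  { eventName: "usage", metadata: { latency_ms: 420 } },
  { eventName: "usage", metadata: { latency_ms: 380 } },
  { eventName: "other", metadata: { latency_ms: 9999 } }, // ignored: not a usage event
  { eventName: "usage", metadata: {} }, // ignored: no latency recorded
];

console.log(averageLatencyMs(sampleEvents)); // 400
```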

Step 3: Create an experiment
Create a new experiment in the Statsig console at https://console.statsig.com/experiments.

On the Setup page, add the metrics you created in Step 2 to the Primary Metrics field.

Step 4: Set up the variations
Create the control and test variants for the experiment. For this example, split them 50/50.
In the Groups and Parameters section, click Add Parameter and name the parameter model_name, with String type.

In each group, set model_name to the name of one config from Step 1, for example gpt-4o for Control and gpt-4o-mini for Test.
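One way to picture the setup in this step is as two groups of equal size, each carrying its own model_name value. The sketch below also shows a simple deterministic assignment, so the same user always lands in the same group. This is only an illustration of the idea: the `groups` structure, the FNV-1a hash, and `assignGroup` are assumptions for this example and are not Statsig's allocation logic, which the SDK handles for you.

```javascript
// Illustrative model of the two experiment groups (not Statsig's data format).
const groups = [
  { name: "Control", size: 50, parameters: { model_name: "gpt-4o" } },
  { name: "Test", size: 50, parameters: { model_name: "gpt-4o-mini" } },
];

// 32-bit FNV-1a string hash, used only to get a stable number per user.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Maps a user to a bucket in 0..99, then walks the groups by cumulative size.
function assignGroup(userID, experimentName) {
  const bucket = fnv1a(`${experimentName}:${userID}`) % 100;
  let upperBound = 0;
  for (const group of groups) {
    upperBound += group.size;
    if (bucket < upperBound) return group;
  }
  return groups[groups.length - 1];
}

const group = assignGroup("user-42", "model_experiment_gpt4o_vs_gpt4o-mini");
console.log(`${group.name} -> ${group.parameters.model_name}`);
```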

Step 5: Save and start the experiment
Select Save at the bottom of the page. A Start button appears at the top of the experiment page. Select it to begin the allocation process.
Step 6: Write code
The code below:
- Fetches the experiment configuration from the server for a given user. Pass the userID from your client application or from your database. The example generates a random one for testing.
- Gets the config name from the experiment variant (control or test).
- Creates a model client using the fetched config.
- Uses the model client to complete text.
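// Assumes AzureAI and Statsig are already imported from the Statsig SDK
// and that statsigServerKey holds your Statsig server secret key.
async function testExperiments() {
  await AzureAI.initialize(statsigServerKey);
  try {
    // Fetch the experiment assignment for this user.
    const experiment = Statsig.getExperimentSync(
      { userID: Math.random().toString() }, // replace with a real, stable userID
      "model_experiment_gpt4o_vs_gpt4o-mini",
    );

    // model_name holds the dynamic config name; fall back to gpt-4o if unset.
    const configName = experiment.get("model_name", "gpt-4o");
    console.log(`Using model: ${configName}`);

    // Create a model client from the selected config and request a completion.
    const modelClient = AzureAI.getModelClient(configName);
    const result = await modelClient.complete([
      { role: "user", content: "Recite the first 10 digits of pi." },
    ]);
    result.choices.forEach((choice) => {
      console.log(choice.message.content);
    });
  } finally {
    // Shut down the SDK even if the request fails.
    await AzureAI.shutdown();
  }
}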
Step 7: Run the experiment and verify results
Let the experiment run for several days, then compare the latency of gpt-4o and gpt-4o-mini in the Statsig console and choose the model that best suits your needs.
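The console reports this comparison for you, including statistical significance. As a rough sketch of the underlying idea only, the snippet below compares the mean latency of two groups and reports the absolute and relative difference. The function names and sample numbers are made up for illustration, and the sketch does not attempt the significance testing that Statsig's stats engine performs.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Compares mean latency between the control and test groups.
function compareLatency(controlMs, testMs) {
  const controlMean = mean(controlMs);
  const testMean = mean(testMs);
  return {
    controlMean,
    testMean,
    absoluteDeltaMs: testMean - controlMean,
    relativeDeltaPct: ((testMean - controlMean) / controlMean) * 100,
  };
}

// Made-up latency samples in milliseconds.
const summary = compareLatency([800, 1000, 1200], [400, 500, 600]);
console.log(summary);
// { controlMean: 1000, testMean: 500, absoluteDeltaMs: -500, relativeDeltaPct: -50 }
```

A negative delta here means the test group responded faster than the control group. Whether that difference is trustworthy depends on sample size and variance, which is why the experiment should run long enough for the console to report a confident result.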
This is a simple experiment to test models against each other. You can also adjust other parameters such as temperature, frequency_penalty, and max_tokens by modifying the config, without updating code.
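To illustrate why no code change is needed, the sketch below builds request options from whatever defaults the config currently holds, so editing the config changes the options on the next call. The `completion_defaults` field name and the `buildCompletionOptions` helper are assumptions for this example, not part of the SDK.

```javascript
// Hypothetical helper: merges config-supplied defaults with per-call overrides.
// Per-call overrides win over config defaults.
function buildCompletionOptions(config, overrides = {}) {
  return { ...(config.completion_defaults ?? {}), ...overrides };
}

// The same code path, fed two versions of a config (values are made up).
const configBeforeEdit = {
  completion_defaults: { temperature: 1, max_tokens: 256 },
};
const configAfterEdit = {
  completion_defaults: { temperature: 0.2, max_tokens: 256, frequency_penalty: 0.5 },
};

console.log(buildCompletionOptions(configBeforeEdit));
console.log(buildCompletionOptions(configAfterEdit));
console.log(buildCompletionOptions(configAfterEdit, { max_tokens: 64 }));
```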