Reliability FAQs

Frequently asked questions about Statsig reliability, including SDK failover behavior, multi-region architecture, SLAs, and customer-side mitigations.

Integrating your product with Statsig means taking a dependency on Statsig, and Statsig takes reliability seriously. The answers below address common questions about evaluating that risk. If your question isn't listed here, reach out on Slack.

What does Statsig do to stay highly available?

Statsig actively tracks internal Service Level Objectives (SLOs) for availability to maintain high uptime.
Measures Statsig takes to ensure service reliability:
- Statsig absorbs traffic bursts with autoscaling and over-provisioned resources.
- Mechanisms are in place to mitigate unintended or malicious traffic spikes, including DDoS attacks.
- Statsig deploys services in multiple regions. If a region goes down, Statsig routes traffic to the remaining healthy regions.
- Statsig applies a GitOps approach (code review, validation, CI/CD) to all infrastructure changes to reduce the risk of human error.
- A 24/7 engineering on-call rotation handles customer-facing alerts and issues.
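To illustrate the regional failover idea, here is a minimal Python sketch of region selection: prefer the usual region, and fall back to the first healthy one when it is down. The region names, the `Region` type, and `pick_region` are hypothetical. This FAQ doesn't describe how Statsig actually detects unhealthy regions or shifts traffic; in practice this is often handled at the DNS or load-balancer layer rather than in application code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Region:
    name: str      # hypothetical region identifier, e.g. "us-east"
    healthy: bool  # result of the most recent health check


def pick_region(regions: list[Region], preferred: str) -> Region:
    """Return the preferred region if it is healthy, else the first healthy one."""
    for region in regions:
        if region.name == preferred and region.healthy:
            return region
    # Preferred region is down (or unknown): fail over in list order.
    for region in regions:
        if region.healthy:
            return region
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    regions = [Region("us-east", False), Region("us-west", True), Region("eu-west", True)]
    print(pick_region(regions, "us-east").name)  # us-west
```

Listing regions in preference order keeps failover deterministic, so traffic from a failed region lands on a predictable fallback instead of spreading randomly.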

Does Statsig use any caching to help with latency?

Yes. Statsig uses a combination of caching solutions, depending on the problem. For console and API requests, Statsig caches most data at the region or host level.
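A host-level cache typically keeps a recently loaded value for a bounded time so that repeated requests skip the backing store. The Python sketch below shows one common form, a time-to-live (TTL) cache. `TTLCache` and `get_or_load` are hypothetical names, and the TTL is a parameter you would tune; Statsig's actual caching layers and expiry rules aren't described in this FAQ.

```python
from __future__ import annotations

import time
from typing import Any, Callable


class TTLCache:
    """Keeps each loaded value for ttl_seconds before reloading it (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[str, tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, load: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # fresh hit: skip the backing store entirely
        value = load()  # miss or expired: read from the source of truth
        self._entries[key] = (now + self._ttl, value)
        return value
```

For example, `TTLCache(ttl_seconds=30).get_or_load("project-config", load_from_db)` would hit the (hypothetical) `load_from_db` at most once every 30 seconds per key. This version isn't thread-safe; a shared host cache would need a lock or a concurrent map.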

What else does Statsig do to make sure the service is resilient?

Statsig designs its SDKs to keep working when API requests fail.
Client SDKs:
- When the user can reach the Statsig server, the SDKs use the latest values from it.
- If the server is unreachable, the SDKs use a cached value from a previous session, if available.
- If no cached value exists, the SDKs fall back to default values set in your code, so users receive the default experience.
- The SDKs automatically retry failed event requests if Statsig event servers are unreachable. Client SDKs also persist failed log requests to local storage and retry in subsequent sessions.
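The client-side behavior above amounts to a fallback chain for values plus a persistent queue for events. This Python sketch models both under stated assumptions: `fetch` and `send` stand in for network calls that raise `OSError` when Statsig is unreachable, and a plain `dict` stands in for local storage that survives between sessions. All class and method names are hypothetical and are not the Statsig client SDK API.

```python
import json
from typing import Callable


class ClientValueResolver:
    """Hypothetical sketch: server values -> cached values -> in-code defaults."""

    def __init__(self, fetch: Callable[[], dict], storage: dict):
        self._fetch = fetch      # network call; raises OSError when unreachable
        self._storage = storage  # stands in for storage that survives sessions
        self._values: dict = {}

    def initialize(self) -> str:
        """Load values and report which source was used."""
        try:
            self._values = self._fetch()
            self._storage["cached_values"] = json.dumps(self._values)
            return "network"
        except OSError:
            cached = self._storage.get("cached_values")
            if cached is not None:
                self._values = json.loads(cached)  # values from a previous session
                return "cache"
            self._values = {}
            return "default"

    def get_gate(self, name: str, default: bool) -> bool:
        # Missing values fall back to the default written in application code.
        return bool(self._values.get(name, default))


class EventQueue:
    """Hypothetical sketch: retry event uploads, persist failures for the next session."""

    def __init__(self, send: Callable[[list], None], storage: dict):
        self._send = send  # network call; raises OSError when unreachable
        self._storage = storage
        # Reload events that could not be delivered in an earlier session.
        self._pending: list = json.loads(storage.get("pending_events", "[]"))

    def log(self, event: dict) -> None:
        self._pending.append(event)

    def flush(self, attempts: int = 3) -> bool:
        if not self._pending:
            return True
        for _ in range(attempts):
            try:
                self._send(list(self._pending))
            except OSError:
                continue  # a production client would back off between attempts
            self._pending.clear()
            self._storage.pop("pending_events", None)
            return True
        # Still failing: persist so a later session can retry.
        self._storage["pending_events"] = json.dumps(self._pending)
        return False
```

Catching `OSError` covers Python's connection failures (`ConnectionError` and `urllib.error.URLError` both subclass it); a real client would also treat timeouts and 5xx responses as retryable.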
Server SDKs:
- Server SDKs store rules for gates and experiments in memory, so evaluation continues even if Statsig is down.
- You can bootstrap your server SDKs with rule values from a previous session if Statsig is down when your server starts. Use a Server Data Store to plug a storage provider into the Statsig SDK to store your rule values.
- The SDKs automatically retry failed event requests if Statsig event servers are unreachable.
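On the server side, the key property is that gate checks read rules from memory, so a Statsig outage stops rule updates but not evaluation. The Python sketch below assumes a hypothetical `DataStore` interface with `get`/`set`, a toy rule format (`{"gate": {"allow": [user IDs]}}`), and a `download` callable that raises `OSError` when Statsig is down. Real Statsig rules and the SDK's Server Data Store interface are richer than this; the names here are illustrative only.

```python
import json
from typing import Callable, Optional, Protocol


class DataStore(Protocol):
    """Hypothetical storage interface; the real SDK interface may differ."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Toy DataStore; a real deployment might back this with Redis or a database."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class ServerRules:
    RULES_KEY = "rulesets"  # hypothetical storage key

    def __init__(self, download: Callable[[], str], store: DataStore):
        self._download = download  # returns rules JSON; raises OSError if Statsig is down
        self._store = store
        self._rules: dict = {}  # evaluation reads only from this in-memory copy

    def sync(self) -> str:
        """Refresh rules, falling back to memory, then to the data store."""
        try:
            payload = self._download()
        except OSError:
            if self._rules:
                return "kept_in_memory"  # keep evaluating with the last known rules
            payload = self._store.get(self.RULES_KEY)
            if payload is None:
                return "no_rules"  # every check falls back to its default
            self._rules = json.loads(payload)
            return "data_store"
        self._rules = json.loads(payload)
        self._store.set(self.RULES_KEY, payload)  # persist for the next cold start
        return "network"

    def check_gate(self, user_id: str, gate: str) -> bool:
        # No network call here: a toy allowlist rule evaluated from memory.
        rule = self._rules.get(gate)
        return rule is not None and user_id in rule.get("allow", [])
```

On a cold start with Statsig unreachable, `sync()` loads the rules a previous process persisted, which mirrors the bootstrap option described above; with nothing persisted, every check returns `False`, playing the role of the in-code default.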

What kind of automated testing does Statsig do?

- Unit and integration tests run on every pull request.
- Continuous CI/CD runs the unit and integration test suites.
- Synthetic tests for Console and API use cases mimic customer requests.
- Stress tests detect performance issues.
- Continuous SDK tests run on every pull request and on a schedule.

What does Statsig do to protect runtime code?

Statsig uses GitHub and DockerHub for code and binary storage, and tracks the entire CI/CD process from source code to production deployment with traceable versioning and binary verification.
