On this page

Athena Connection

Connect Amazon Athena to Statsig Warehouse Native, including IAM roles, S3 staging buckets, workgroups, query result locations, and required permissions.

Athena Warehouse Native overview

To set up a connection with Athena, Statsig needs the following:

  • An S3 Bucket
  • A Glue Database for staging
  • An S3 Query Result Location
  • Athena Access Permissions (using an AWS Role or an AWS User)

Set up Statsig staging structure

  1. Create or choose an S3 Bucket. Statsig uses a subfolder inside this S3 Bucket to store all staging data. Statsig has write-access to ONLY this scoped subfolder of this S3 Bucket (specifically labeled Statsig S3 Folder in your Data Connection settings in the Statsig Console).
  2. Create or choose a Glue Database (can be default). Statsig uses this as a staging Database to create and manage tables. Statsig can drop/create tables within ONLY this staging Database.
  3. Choose an S3 Query Result Location folder within the S3 Bucket. This S3 location acts as the Output Location for SELECT queries run in your Athena Warehouse. You can specify this location either:
    • Explicitly as an S3 location (ex: s3://my_bucket/my_query_results_folder/)
    • OR as part of a setting within an Athena Workgroup
      • NOTE that your workgroup must have the 'Query result location' field populated accordingly
  4. Add this information, along with your AWS Region, to your Data Connection settings in the Statsig Console.

Grant permissions to Statsig

Grant the following permissions for Statsig from your AWS console:

  • READ on any tables and data you are using for experimentation
  • USAGE and WRITE on a Statsig-specific schema used to materialize temp tables and results. This enables caching and incremental loads. You specify which Glue Database and S3 Bucket to use, and Statsig creates a Statsig S3 Subfolder for staging operations.
  1. Create an AWS IAM Policy to house the required access permissions

    This policy contains the permissions required for Statsig to access your warehouse. You can return to this policy later and edit it as needed.

    • In your AWS IAM Dashboard, select the Policies page under the Access Management tab
    • Click 'Create policy'
    • Switch the Policy Editor type from 'Visual' to 'JSON'
    • Copy and paste the below JSON template block
    • Replace the placeholders with your setup information and the Statsig S3 Folder (specified in your Statsig project's Data Connection settings)
    • Specify all S3 locations and Glue Databases of any read-only assignment/metric data
    • Remove the descriptor comments
    json
    {
       "Version": "2012-10-17",
       "Statement": [
          // Allow Statsig to recognize your staging S3 Bucket
          {
             "Effect": "Allow",
             "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
             ],
             "Resource": "arn:aws:s3:::__S3_BUCKET__"
          },
          // Allow Statsig to read events/exposures data from your S3 Buckets
          {
             "Effect": "Allow",
             "Action": "s3:GetObject",
             "Resource": "arn:aws:s3:::__PATH_TO_YOUR_READONLY_DATA__/*"
          },
          // Allow Statsig to read events/exposures tables from your Glue Databases
          {
             "Effect": "Allow",
             "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions"
             ],
             "Resource": [
                "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:database/__YOUR_READONLY_DATABASE__",
                "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:table/__YOUR_READONLY_DATABASE__/*"
             ]
          },
          // Allow Statsig to read/write/use data and tables in an isolated staging S3 subfolder
          {
             "Effect": "Allow",
             "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "athena:StartQueryExecution",
                "athena:GetQueryResults",
                "athena:GetQueryExecution",
                "athena:StopQueryExecution",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:UpdatePartition",
                "glue:DeletePartition",
                "glue:BatchCreatePartition",
                "glue:BatchDeletePartition",
                "glue:GetDatabase",
                "glue:GetDatabases"
             ],
             "Resource": [
                "arn:aws:s3:::__S3_BUCKET__/__PATH_TO_S3_QUERY_RESULTS_FOLDER__/*",
                "arn:aws:s3:::__S3_BUCKET__/__STATSIG_S3_FOLDER__/*",
                "arn:aws:athena:__REGION__:__YOUR_AWS_ACCOUNT_ID__:workgroup/__WORKGROUP_NAME__",
                "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:catalog",
                "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:database/__GLUE_STAGING_DATABASE__",
                "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:table/__GLUE_STAGING_DATABASE__/*"
             ]
          }
       ]
    }
    
  2. Create an IAM Role or IAM User:

With an IAM Role, Statsig assumes your IAM Role through a Statsig Service Account. The Statsig Account ID for this service account is in your Data Connection settings in the Statsig Console. Statsig runs queries on behalf of this IAM Role.

Optionally, you can add an External ID condition for added security (AWS External ID Docs). Statsig generates this External ID, which you can view in your Data Connection settings in the Statsig Console.
  • In your AWS IAM Dashboard, select the Roles page under the Access Management tab
  • Click 'Create role'
  • Choose 'AWS account' as the Trusted entity type
  • Choose 'Another AWS account' from the options, and copy the Statsig Account ID from your Statsig console
  • Optionally, require use of an External ID for connections to this role (also specified in your Statsig console)
  • Continue to next step of setup, and select your IAM Permissions Policy from earlier
  • Name, review, and create; your Trust Policy JSON should follow the format below:
    json
    {
       "Version": "2012-10-17",
       "Statement": [
          {
             "Effect": "Allow",
             "Action": "sts:AssumeRole",
             "Principal": {
                "AWS": "__STATSIG_ACCOUNT_ID__"
             },
             "Condition": {
                "StringEquals": {
                   "sts:ExternalId": "__ROLE_EXTERNAL_ID__"
                }
             }
          }
       ]
    }
    
  • Add the ARN for this IAM Role to your Data Connection settings in the Statsig Console

Set up reading data from your events/exposures tables

  1. Give Statsig read-access to your Glue Database containing any tables you need Statsig to read from. Do this by adding the following to your AWS IAM Permissions Policy:
    plaintext
    {
       "Effect": "Allow",
       "Action": [
          "glue:GetTable",
          "glue:GetTables",
          "glue:GetDatabase",
          "glue:GetDatabases",
          "glue:GetPartition",
          "glue:GetPartitions"
       ],
       "Resource": [
          "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:database/__YOUR_READONLY_DATABASE__",
          "arn:aws:glue:__REGION__:__YOUR_AWS_ACCOUNT_ID__:table/__YOUR_READONLY_DATABASE__/*"
       ]
    }
    
  2. Give Statsig read-access to your S3 Bucket locations of the tables you need Statsig to read from. Add this to your AWS IAM Permissions Policy:
    plaintext
    {
       "Effect": "Allow",
       "Action": "s3:GetObject",
       "Resource": "arn:aws:s3:::__PATH_TO_YOUR_READONLY_DATA__/*"
    }
    
  3. In Statsig, when setting up Metric or Assignment Sources, select from these tables using "database"."table" format.
  4. Repeat for any additional tables, or when you need to read a new table from Statsig.

If your data warehouse is IP protected, add Statsig IP ranges to your allow list as part of your setup steps.

Additional Athena resources

S3 bucket encryption guide

Statsig supports encryption for all accessed S3 Buckets. To allow Statsig to access encrypted S3 Buckets:

  1. From the AWS Key Management Service console, create a new KMS Key using the below cryptographic configuration settings: (AWS SSE KMS Docs)
    • Key Type: Symmetric
    • Key Usage: Encrypt and Decrypt
    • Advanced Options
      • Key Material Origin = KMS
      • Regionality = Single-Region Key
  2. From the Key Policy tab of your newly created KMS Key, find the Key Administrators box. Click Add, and select the AWS IAM Role/User you provided to Statsig as an administrator.
  3. Navigate to your S3 Bucket. From your S3 Bucket Properties tab, find the Default Encryption box. Click Edit, and select the below default encryption settings:
    • Encryption Type: SSE-KMS
    • AWS KMS Key: Enter AWS KMS Key ARN
      • (enter your newly created KMS Key ARN in the box)
    • Bucket Key: Enable

Statsig staging architecture details

Statsig creates all staging tables within the Glue Staging Database that you create and provide. Statsig creates all tables as table-type ICEBERG, except the forwarded exposures and events tables, which are regular EXTERNAL Athena tables.

Statsig stores all staging data within the Statsig S3 Folder of your S3 Bucket. Statsig manages the structure within this subfolder. The folder structure is as follows:

  • S3 Bucket (you create and provide Statsig the name)
    • Statsig S3 Folder (Statsig creates and names, name provided in the Data Connection settings in your Statsig Console)
      • Experiment Folder(s) (Statsig creates a subfolder for each unique experiment you run, named experiment-<ID>)
        • Staging Tables Folders (Statsig puts data for each created staging table in its own subfolder, named intuitively after the data they contain)
          • metadata/
            • Iceberg Table Metadata
          • data/
            • Table Data Files (stored as Parquet)
      • statsig_forwarded_exposures/
        • Forwarded Exposures Table Subfolder (named in the Data Connection settings in your Statsig Console)
          • Dates in YYYY-MM-DD format (This table is partitioned by date, and there is a subfolder for each date that has exposure data sent through Statsig)
            • Forwarded Exposure Data
      • statsig_forwarded_events/
        • Forwarded Events Table Subfolder (named in the Data Connection settings in your Statsig Console)
          • Dates in YYYY-MM-DD format
            • Forwarded Event Data

What IP addresses does Statsig use to access data warehouses?

Go to the FAQ for the list of IP addresses.

Was this helpful?