Redshift (Deprecated)

Import event and metric data into Statsig from Amazon Redshift on a schedule, including authentication, queries, and column-to-event mappings.

Deprecated

This feature is deprecated and has been superseded by a replacement. Avoid using it for new work.

This solution is still functional, but can be manual and time consuming to set up with minimal error handling. Check out the Data Warehouse Ingestion solution instead.

How Redshift integration works

There are 2 ways to integrate with Redshift: using a data connector, or ingesting events and metrics to Statsig through S3.

Using a data connector

To ingest events from Redshift, you can use the Census integration.To export events to Redshift, you can use the Fivetran integration.

Direct ingestion

S3 imports use a custom setup flow. Contact Statsig through Slack or through your support contact to set up this integration. The steps below describe how to set up this integration. There are 3 main steps:

Create a pipeline to write your metric, event, and (optionally) signal data to an S3 bucket in parquet format.
Create an IAM user with read and list access on that bucket and send that user's Key/Secret to Statsig. Statsig securely stores these in a keystore service.
Schedule ingestion through a signals dataset or through the mark_data_ready API.

Set up a data pipeline to S3

Filesystem format

Statsig expects data in your S3 bucket in parquet format.

To allow for daily uploads, set up your bucket with the following folders:

events/ for events data
metrics/ for metrics data
signals/ for signal flags after you finish uploading data for a day. You can omit this folder and use the mark_data_ready API instead, but you must use one or the other.

Statsig recommends writing folders by date partitions for easier debugging, for example storing daily data in folders with ISO-formatted names (YYYY-MM-DD).

Data format

Confirm your data conforms to the following schemas.

** Events **

plaintext

| Column         | Description                                                                                                       | Rules                                                                                                   |
| -------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| timestamp      | UNIX timestamp of the event                                                                                       | UTC timestamp                                                                                           |
| event_name     | The name of the event                                                                                             | String under 128 characters, using `_` for spaces                                                       |
| event_value    | A string representing the value of a current event. Can represent a 'dimension' or a 'value'                      | Read as string format; Statsig converts numeric values into value                                       |
| event_metadata | A dictionary<string, string> in the form of a JSON string, containing named metadata for the event                | String format                                                                                           |
| user           | A JSON object representing the user this event was logged for; see below                                          | Escaped JSON string including the keys 'custom' and 'customIDs'. A userID or customID must be provided. |
| timeuuid       | A unique UUID or timeUUID used for deduping. If omitted, Statsig generates one but it won't be effective for deduping | UUID format                                                                                         |

Go to Statsig User Object for available fields. An example user object:

plaintext

{
  userID: "12345",
  customIDs: {
    stableID: "<device_id_here>",
    ...
  }
  email: "12345@gmail.com",
  userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.40 Safari/537.36",
  ip: "192.168.1.101",
  country: "US",
  locale: "en_US",
  appVersion: "1.0.1",
  systemName: "Android",
  systemVersion: "15.4",
  browserName: "Chrome",
  browserVersion: "45.0",
  custom: {
    new_user: "false",
    age: "22"
    ...
  },
}

** Metrics **

Include all of metric_value, numerator, and denominator. Write cast(null as double) for numerator and denominator if you're omitting them (or for metric_value if sending numerator/denominator).

Column	Description	Rules
unit_id	The unique user identifier this metric is for. This might not necessarily be a user_id - it could be a custom_id of some kind	String format. Make sure this is in the same format as your logged unit_ids
id_type	The id_type the unit_id represents.	String format. Must be a valid id_type. The default Statsig types are user_id/stable_id, but you may have generated custom id_types. Make sure this matches (case sensitive) a customID in your project, or you won't get experiment results
date	Date of the daily metric	Read as string format; can be written as ISO date. Statsig's dates are calculated in PST - Statsig loads custom metrics to whatever date you use here
metric_name	The name of the metric	String format. Not null. Length < 128 characters
metric_value	A numeric value for the metric	Double format. Metric value, or both of numerator/denominator need to be provided for Statsig to process the metric. See details below
numerator	Numerator for metric calculation	Double format. Required for ratio metrics. If present along with a denominator in any record, Statsig treats the metric as a ratio and calculates it only for users with non-null denominators.
denominator	Denominator for metric calculation	Double format. See above

Set up and provide credentials

Navigate to your IAM console on AWS.
Go to Users -> Add User.
Select the Access key - Programmatic access credential type.
Attach a policy that grants Read and List access to the appropriate bucket. Scope this policy so the user only has access to the intended data. Example policy:

plaintext

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "<BUCKET_ARN>"
        }
    ]
}

Next, modify your bucket access policy (under permissions on your S3 bucket's page) to allow this user to access objects. Example policy:

plaintext

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "statement1",
			"Effect": "Allow",
			"Principal": {
				"AWS": "<IAM_ARN>"
			},
			"Action": "s3:ListBucket",
			"Resource": "<BUCKET_ARN>"
		},
		{
			"Sid": "statement2",
			"Effect": "Allow",
			"Principal": {
				"AWS": "<IAM_ARN>"
			},
			"Action": "s3:GetObject",
			"Resource": "<BUCKET_ARN>/*"
		}
	]
}

To confirm your credentials are sufficient, add any data to your metrics folder and run the following code in PySpark with the IAM user credentials:

plaintext

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", '<IAM_USER_ACCESS_KEY>')
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", '<IAM_USER_SECRET>')
spark.read.parquet("s3://<BUCKET_NAME>/metrics/*",inferSchema=True).show()

Scheduling

You might stream events to your tables or run multiple ETLs that point to your metrics table. Because of this timing uncertainty, Statsig relies on you to signal when your metrics and events for a given day are done.

To signal completion, write a dataset with the single column finished_date, which contains all dates of data written to Statsig. For example, after writing data for 2022-06-22, insert a record with finished_date of 2022-06-22 to trigger ingestion of data up to and including 2022-06-22.

Unlike Snowflake, Statsig skips dates for S3. If your latest finished date is 2022-06-22 and you insert 2022-07-01, Statsig ingests all data as of 2022-07-01 and infers that intermediate dates (for example, 2022-06-25) have data loaded.

Alternatively, you can use the mark_data_ready API and send a timestamp indicating that all data before that timestamp has finished loading into S3.

Statsig processes events in PST. When you mark data ready for 2022-06-20, Statsig processes events from 2022-06-20T00:00 PST to 2022-06-20T23:59 PST. Account for this PST processing window when scheduling your signals.

Was this helpful?