Snowplow
Snowplow Overview
Snowplow is an open source event analytics platform. A commercial entity offers Snowplow as a SaaS product and also maintains the code for the open source product. The general architecture overview of Snowplow is on GitHub and has more detail on the basics of how it works and how it is set up.
In June of 2019, we switched from sending Snowplow events to a third party to sending them to infrastructure managed by GitLab, documented on this page. From the perspective of the data team, not much changed from the third-party implementation: events are sent through the collector and enricher and dumped to S3.
Snowplow - adding a new app_id
When a new application should be tracked by Snowplow, there are a few things to consider. Choosing the right app_id and collector URL should be done in coordination with the data team. The collector URL will stay the same: snowplow.trx.gitlab.net. Any app_id is fine if there are no other concerns, for example around enabling tracking on CustomersPortal staging as well.
Note: Any unexpected events (with a wrong app_id) are normally dropped.
The only model with app_id filtering is snowplow_base_events. That model flows downstream to the page view models:
```mermaid
flowchart LR
    SPBE[snowplow_base_events] --> SPWE[snowplow_web_events]
    SPWE --> snowplow_page_views
    snowplow_page_views --> common_prep.prep_snowplow_page_views_all
    common_prep.prep_snowplow_page_views_all --> common.fct_behaviour_website_page_views
    snowplow_page_views --> legacy.snowplow_page_views_*
```
In order to add a new app_id to snowplow_base_events (and the downstream page view models), the snowplow:app_ids variable in the dbt package must be updated. Those values are set in the dbt_project.yml file. As an example, here is an issue to update the variable.
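A minimal sketch of how such a variable typically drives the filter in a dbt model; the upstream ref name here is hypothetical, and the actual model SQL lives in the analytics project:

```sql
-- Hypothetical excerpt of snowplow_base_events: keep only events whose
-- app_id appears in the snowplow:app_ids variable from dbt_project.yml.
SELECT *
FROM {{ ref('snowplow_unnested_events') }}  -- hypothetical upstream model
WHERE app_id IN (
    {% for app_id in var('snowplow:app_ids') %}
    '{{ app_id }}'{% if not loop.last %},{% endif %}
    {% endfor %}
)
```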
GitLab Implementation
The original design document for moving our Snowplow infrastructure from a third-party hosting service to first-party infrastructure is in the Infrastructure design library. It was written before the build was started and contains many of the assumptions and design decisions.
Snowplow is built with Terraform on AWS and is documented in the gitlab-com-infrastructure project.
For a detailed walk-through of our setup, watch this GitLab Unfiltered internal video.
S3
Enriched events are stored in TSV format in the bucket s3://gitlab-com-snowplow-events/output/.
Bad events are stored as JSON in s3://gitlab-com-snowplow-events/enriched-bad/.
For both buckets, there are paths that follow a date format of /YYYY/MM/DD/HH/<data>.
Data Warehousing
Snowpipe
Once events are available in S3, we ingest them into the data warehouse using Snowpipe. This is a feature of our Snowflake Data Warehouse. An Amazon SQS event queue was set up for the good and bad event paths.
To run properly, Snowpipe needs a “stage” in Snowflake and a table to write to.
The good and bad S3 paths each have their own stage within Snowflake. These are named gitlab_events and gitlab_bad_events, respectively. They are owned by the LOADER role.
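A minimal sketch of the good-events stage, assuming role-based credentials (the actual statements, credentials, and names are managed as part of the infrastructure setup):

```sql
-- Sketch only: a storage integration could be used instead of AWS_ROLE.
CREATE STAGE gitlab_events
  URL = 's3://gitlab-com-snowplow-events/output/'
  CREDENTIALS = (AWS_ROLE = '<iam-role-arn>');
```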
The create table statements for the good and bad events are as follows:
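A minimal sketch, assuming the good-events table mirrors the canonical Snowplow enriched event model (roughly 130 columns, abbreviated here) and that bad events land as raw JSON:

```sql
-- Abbreviated sketch: the real good-events table declares every column of
-- the canonical Snowplow enriched event model.
CREATE TABLE gitlab_events (
    app_id              varchar,
    platform            varchar,
    etl_tstamp          timestamp,
    collector_tstamp    timestamp,
    dvce_created_tstamp timestamp,
    event               varchar,
    event_id            varchar,
    -- ...remaining canonical columns omitted from this sketch...
    contexts            varchar
);

-- Bad events are JSON, so a single VARIANT column suffices.
CREATE TABLE gitlab_bad_events (
    jsontext variant
);
```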
Since TSV is not as straightforward as CSV, a custom file format was created with the following statement:
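A sketch, assuming the format is named snowplow_tsv:

```sql
-- TSV is handled as Snowflake's CSV type with a tab delimiter.
CREATE FILE FORMAT snowplow_tsv
  TYPE = 'CSV'
  FIELD_DELIMITER = '\t'
  -- Tolerate rows whose field count doesn't match the table exactly.
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE;
```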
The actual pipe for good events was created using:
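A sketch, assuming the stage, table, and file format names used above:

```sql
CREATE PIPE gitlab_events_pipe AUTO_INGEST = TRUE AS
  COPY INTO gitlab_events
  FROM @gitlab_events
  -- Columns are matched by position, so the table must mirror the TSV
  -- layout exactly; JSON could instead land in a single VARIANT column.
  FILE_FORMAT = (FORMAT_NAME = 'snowplow_tsv');
```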
This highlights the fact that JSON would be a better format, which could be a future iteration of the infrastructure.
The bad event pipe is created as follows:
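A sketch, under the same naming assumptions:

```sql
-- Bad events are JSON, so they load straight into the VARIANT column.
CREATE PIPE gitlab_bad_events_pipe AUTO_INGEST = TRUE AS
  COPY INTO gitlab_bad_events
  FROM @gitlab_bad_events
  FILE_FORMAT = (TYPE = 'JSON');
```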
To view pipes:
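The administration statements below are standard Snowflake commands; gitlab_events_pipe stands in for whichever pipe is being operated on.

```sql
SHOW PIPES;
```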
To describe a pipe:
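```sql
DESCRIBE PIPE gitlab_events_pipe;
```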
To pause a running pipe:
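```sql
ALTER PIPE gitlab_events_pipe SET PIPE_EXECUTION_PAUSED = TRUE;
```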
To force resume a pipe:
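```sql
SELECT SYSTEM$PIPE_FORCE_RESUME('gitlab_events_pipe');
```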
To check the status of a pipe:
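```sql
SELECT SYSTEM$PIPE_STATUS('gitlab_events_pipe');
```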
To force a refresh of the stage so that snowpipe picks up older events:
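```sql
-- Replays files staged within the last 7 days.
ALTER PIPE gitlab_events_pipe REFRESH;
```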
dbt
To materialize data from the RAW database to PROD for querying, we have implemented a partitioning strategy within dbt. By default, the snowplow models and the Fishtown snowplow package will write to a schema scoped to the current month in the PREP database. For July 2019, the schema would be snowplow_2019_07.
Within each monthly partition, all of the base models and the models generated by the package are written for all events whose derived timestamp matches the partition date. Different monthly partitions can be generated by passing in variables to dbt at run time:
```
--vars '{"year": "2019", "month": "01", "part": "2019_01"}'
```
Backfills are done via Airflow. The dbt_snowplow_backfill DAG will generate a task for each month from July 2018 to the current month.
Do Not Track
Our Snowplow tracking configuration and particular implementations respect the Do Not Track (DNT) header whenever it is present in a user's browser.
Duo data redaction
We only keep Duo free-form feedback for 60 days in Snowflake. This is managed by the duo_data_redaction DAG, which runs daily, removing the contents of the extendedFeedback attribute in the contexts column for all feedback response Snowplow events in RAW and PREP. This timeline allows our full-refresh process to complete, updating all downstream data, within 90 days for compliance.
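As an illustration only, the redaction amounts to an update along these lines; the actual logic lives in the Airflow DAG, and the table name and event filter below are hypothetical:

```sql
-- Hypothetical sketch: blank the extendedFeedback attribute inside the
-- contexts JSON string for feedback events older than 60 days.
UPDATE gitlab_events
SET contexts = REGEXP_REPLACE(
        contexts,
        '"extendedFeedback":\\s*"[^"]*"',
        '"extendedFeedback": ""'
    )
WHERE contexts LIKE '%extendedFeedback%'  -- hypothetical event filter
  AND collector_tstamp < DATEADD('day', -60, CURRENT_TIMESTAMP());
```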