Snowplow is part of GitLab.com's telemetry-gathering infrastructure. You can read more about this here.
GitLab.com needs its own in-house implementation of Snowplow. Our current Snowplow service provider, while functional, does not provide the data security we require for the user-identifiable data we may collect.
Risks include ensuring we can stand up a functioning Snowplow implementation before our vendor contract expires. We also need to consider how flexible and scalable our implementation must be. This adds another environment that requires ongoing time and attention to keep it reliable and performant.
The Snowplow pipeline consists (primarily) of the following stages:
Snowplow's own documentation lives in their primary repo.
Notes from discussion and considerations:
Future considerations that may be added later:
Since this is a new set of systems, we can test it before we begin to rely on it. Ideally we could send live events to both the old and new Snowplow collectors and verify that collection and enrichment work in each.
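As a minimal sketch of that dual-write verification (the collector hostnames below are placeholders, not real GitLab endpoints), the same test event could be sent to both collectors via the Snowplow tracker protocol's `/i` pixel endpoint:

```python
# Hypothetical dual-write check: build the same test event against the old
# (vendor) and new (in-house) collectors. Hostnames are placeholders.
import urllib.parse

COLLECTORS = [
    "https://old-vendor-collector.example.com",   # current vendor (placeholder)
    "https://new-inhouse-collector.example.com",  # new in-house deployment (placeholder)
]

def build_pixel_url(collector_base, event_params):
    """Build a GET request URL for the collector's /i pixel endpoint."""
    query = urllib.parse.urlencode(event_params)
    return f"{collector_base}/i?{query}"

# A minimal structured event: 'e=se' marks a structured event in the
# Snowplow tracker protocol; se_ca/se_ac are its category and action.
event = {"e": "se", "se_ca": "testing", "se_ac": "dual_write", "p": "srv"}

for base in COLLECTORS:
    url = build_pixel_url(base, event)
    # In a real check you would issue the request and compare status codes,
    # then confirm the event appears in each pipeline's enriched output:
    # status = urllib.request.urlopen(url).status  # expect 200 from both
    print(url)
```

A real verification would go further than status codes, comparing the enriched events downstream in both pipelines.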
Both GitLab.com and self-managed installs could ship events to the Snowplow collector.
Deploying and managing the infrastructure is automated using Terraform, in a new environment in the current Terraform repository.
EC2 Auto Scaling Groups will allow the number of nodes to scale to meet capacity demands.
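As an illustrative sketch only (resource names, instance type, and capacity values are placeholders, not the actual configuration), an Auto Scaling Group for collector nodes might be declared in Terraform like this:

```hcl
# Illustrative only: names, AMI variable, and sizes are placeholders.
resource "aws_launch_configuration" "snowplow_collector" {
  name_prefix   = "snowplow-collector-"
  image_id      = var.collector_ami_id
  instance_type = "t3.medium"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "snowplow_collector" {
  name                 = "snowplow-collector"
  launch_configuration = aws_launch_configuration.snowplow_collector.name
  min_size             = 2
  max_size             = 10
  vpc_zone_identifier  = var.private_subnet_ids

  tag {
    key                 = "Name"
    value               = "snowplow-collector"
    propagate_at_launch = true
  }
}
```

Scaling policies (for example, tracking average CPU or request count) would be attached to this group to grow and shrink capacity automatically.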
Metrics for basic operation are required for troubleshooting and alerting. A CloudWatch exporter will provide metrics for the Snowplow infrastructure to be alerted on and graphed. Since this exporter could later be leveraged for other AWS infrastructure, it should not be an integral part of the Snowplow environment; because it references CloudWatch metrics via the AWS API, it can live in GCP near our Prometheus systems.
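As a sketch of what such an exporter scrape configuration could look like (the namespaces, dimensions, and statistics below are examples, not the production metric list), the Prometheus `cloudwatch_exporter` is configured with a YAML file along these lines:

```yaml
# Illustrative cloudwatch_exporter config: the metrics listed here are
# examples of what a Snowplow deployment on AWS might expose, not the
# actual production scrape list.
region: us-east-1
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [AutoScalingGroupName]
    aws_statistics: [Average]
  - aws_namespace: AWS/Kinesis
    aws_metric_name: GetRecords.IteratorAgeMilliseconds
    aws_dimensions: [StreamName]
    aws_statistics: [Maximum]
```

Running the exporter in GCP with read-only AWS API credentials keeps the monitoring path independent of the Snowplow environment itself.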
An external check of the collector endpoint (for example, a blackbox probe of the collector port) would also be a good idea.