
Data Infrastructure

Airflow · Data Image Project · GitLab Data Utils Project · Python Style Guide

System Diagram

graph LR
  subgraph "Snowflake Data Warehouse"
    subgraph "Analytics DB"
      Analytics_Sensitive[Analytics Sensitive Schema]
      Analytics_Staging[Analytics Staging Schema]
      Analytics[Analytics Schema]
    end
    SnowflakeRaw[Raw DB] -- dbt --> Analytics_Sensitive
    SnowflakeRaw[Raw DB] -- dbt --> Analytics_Staging
    SnowflakeRaw[Raw DB] -- dbt --> Analytics
    SnowflakeRaw -- dbt snapshots --> SnowflakeRaw
    SnowflakeRaw -- Roles Snapshot --> SnowflakeRaw
    SnowflakeRaw -- Snowplow Event Sample --> SnowflakeRaw
  end
  subgraph "Periscope"
    Analytics -- Queries --> Periscope_data(Periscope Data)
    Analytics_Staging -- Queries --> Periscope_data(Periscope Data)
  end
  subgraph "Stitch"
    StitchData(Stitch) --> SnowflakeRaw
  end
  subgraph "Fivetran"
    FivetranSource(Fivetran) --> SnowflakeRaw
  end
  subgraph "Airflow"
    subgraph "Other DAGs"
      AirflowDAGs(Other DAGs) --> SnowflakeRaw
    end
    subgraph "Postgres Pipeline"
      PostgresPipeline(Postgres Pipeline) --> SnowflakeRaw
    end
    subgraph "Sheetload"
      Sheet_Load(Sheetload) --> SnowflakeRaw
    end
  end
  subgraph "Internal Data Sources"
    CloudSQL[Airflow CloudSQL] --> StitchData
    CI_Stats[CI Stats CloudSQL] --> PostgresPipeline
    GitLab_Profiler[GitLab Profiler CloudSQL] --> PostgresPipeline
    YAML(YAML Extracts) --> AirflowDAGs
    Engineering(Engineering Commit Extracts) --> AirflowDAGs
    subgraph "Internal Azure VM"
      VersionDB[Version DB] --> PostgresPipeline
      CustomersDB[Customers DB] --> PostgresPipeline
      LicenseDB[License DB] --> PostgresPipeline
    end
  end
  subgraph "AWS S3"
    Greenhouse_S3(Greenhouse Bucket) --> Sheet_Load
    Snowplow(Snowplow Bucket) -- Snowpipe --> SnowflakeRaw
  end
  subgraph "Third-Party Data Sources"
    Bamboo(BambooHR) --> AirflowDAGs
    Greenhouse(Greenhouse) --> Greenhouse_S3
    Zuora(Zuora) --> StitchData
    Zendesk(Zendesk) --> StitchData
    SalesForce(SalesForce) --> StitchData
    Netsuite(Netsuite) --> FivetranSource
  end
  subgraph "GitLab Dotcom"
    GitLab_dotcom[Postgres] --> PostgresPipeline
    GitLab_snowplow(Snowplow Events) -- Snowplow Infra --> Snowplow
  end
  subgraph "Google Sheets"
    Google_Sheets(Google Sheets) --> Sheet_Load
  end

Airflow

We use Airflow for all orchestration.

In Production

All DAGs are created using the KubernetesPodOperator, so the Airflow pod itself has minimal dependencies and doesn't need to be restarted unless a major infrastructure change takes place.

There are 4 containers running in the current Airflow deployment as defined in the deployment.yml:

  1. A sidecar container checks the repo activity feed for any merges to master. If there has been one, the sidecar reclones the repo so that Airflow runs the freshest DAGs.
  2. The Airflow scheduler
  3. The Airflow webserver
  4. A cloudsql proxy that allows Airflow to connect to our cloudsql instance
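
To see these containers on the live deployment, you can inspect the pod with kubectl. This is a minimal sketch; the label selector and pod name are assumptions, so check the actual deployment.yml for the real values:

# find the Airflow pod (label selector is illustrative)
kubectl get pods -l app=airflow
# list the containers running inside that pod
kubectl get pod <airflow-pod-name> -o jsonpath='{.spec.containers[*].name}'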

Kubernetes Setup

We run in the gitlab-analysis project in GCP. Airflow runs in the data-ops cluster. Within this cluster there are 2 nodepools: highmem-pool and scd-1. Almost every job runs in the highmem-pool nodepool.

The scd-1 nodepool is labeled pgp=scd and also has a taint of scd=true. For a job to be scheduled in this pool, a task must have a nodeAffinity for the pool and a toleration that matches the taint. See this MR where we added the affinity and toleration for the Slowly-Changing Dimensions task for our postgres pipeline jobs.
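
You can verify the label and taint on that nodepool directly with kubectl; a minimal sketch using the values above:

# nodes in the scd-1 pool carry the pgp=scd label
kubectl get nodes -l pgp=scd
# the scd=true taint shows up in the node description
kubectl describe node <scd-node-name> | grep -i taint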

Handling Failed Jobs

There should never be more than one failed DAG run visible for any DAG at one time. For incremental jobs that rely on the execution_date, such as the extract from the gitlab.com database, any failed DAGs need to have their task instances cleared so that they can be rerun once the fix has been applied.
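
Clearing task instances can be done from the Airflow UI or from the CLI in the scheduler pod. A minimal sketch of the CLI form, with an illustrative date range:

# clear the failed task instances so the scheduler reruns them once the fix is in
airflow clear gitlab_com_db_extract -s 2019-10-30 -e 2019-11-04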

For jobs that are not dependent on the execution_date, the job should be rerun manually once the fix is applied and the failed DAG run(s) should be deleted. If there is a failed DAG run for a DAG, it should mean that the current state of that DAG is broken and needs to be fixed.

This will make it easier to glance at the list of DAGs in Airflow and immediately know what needs attention and what doesn't.

Backfills

If incremental runs are missed for a given DAG or there is missing data in a table, there are two ways to do a backfill. If the table is small and a backfill would be relatively quick, dropping the table and doing a full sync is an option. However, when a DAG is stopped due to upstream errors and a large number of tables are affected, a full sync may not be feasible.

In the latter case, it is better to run the backfill command in the airflow scheduler pod container. The command is:

airflow backfill gitlab_com_db_extract -s 2019-10-30 -e 2019-11-04 --delay_on_limit 30 --reset_dagruns

This will clear any DAGruns and task instances that already exist for the given time frame while also generating any new DAGruns that don't exist for the time frame. The Airflow documentation for the CLI details what the flags are.
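
To get a shell in the scheduler container where this command is run, something like the following works; the pod and container names here are assumptions, so check the deployment for the real ones:

# exec into the scheduler container of the Airflow deployment pod
kubectl exec -it <airflow-pod-name> -c scheduler -- /bin/bash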

In Merge Requests

To facilitate easier use of Airflow locally while still properly testing that our DAGs run in Kubernetes, we use docker-compose to spin up local Airflow instances that can run their DAGs in Kubernetes using the KubernetesPodOperator.

The flow from code change to testing in Airflow should look like this (this assumes there is already a DAG for that task); a condensed transcript of the commands follows the list:

  1. Commit and push your code to the remote branch.
  2. Run make init-airflow to spin up the postgres db container and initialize the Airflow tables; it will also create a generic Admin user. You will get an error if Docker is not running.
  3. Run make airflow to spin up Airflow and attach a shell to one of the containers.
  4. Open a web browser and navigate to localhost:8080 to see your own local webserver. A generic Admin user is automatically created for you in MR airflow instances with the username and password set to admin.
  5. In the airflow shell, run a command to trigger the DAG/Task you want to test, for example airflow run snowflake_load snowflake-load 2019-01-01 (as configured in the docker-compose file, all kube pods will be created in the testing namespace). Or if you want to run an entire DAG (for instance the dbt DAG to test the branching logic), the command would be something like airflow backfill dbt -s 2019-01-01T00:00:00 -e 2019-01-01T00:00:00.
  6. Once the job is finished, you can navigate to the DAG/Task instance to review the logs.
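
Condensed, the workflow above looks roughly like this (the DAG names and dates are the examples used above):

make init-airflow   # start the postgres container and initialize the Airflow tables
make airflow        # start Airflow and attach a shell to one of the containers
# inside the Airflow shell: run a single task...
airflow run snowflake_load snowflake-load 2019-01-01
# ...or an entire DAG for one execution date
airflow backfill dbt -s 2019-01-01T00:00:00 -e 2019-01-01T00:00:00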

There is also a make help command that describes what commands exist and what they do.

Some gotchas:

Video Walk Throughs

Project variables

Our current implementation uses the following project variables:

The following flavors are defined:

The following variables are set at the job level depending on the running environment and should not be set in the project settings.

Common Airflow and Kubernetes Tasks

Tips

Connecting to the Kubernetes Airflow Cluster:

  1. Install Kubectl

  2. Connect it to the data team cluster by running -> gcloud container clusters get-credentials data-ops --zone us-west1-a --project gitlab-analysis

  3. Run kubectl get pods and make sure it returns successfully

  4. ALL OF YOUR COMMANDS TOUCH PRODUCTION, THERE IS CURRENTLY NO TESTING ENVIRONMENT IN K8S. The canonical way to test is to use the local docker-compose setup.

Access Airflow Webserver UI

Updating Airflow

View Resources

View Persistent Volumes

Restart Deployment and Pods

Access Shell with Pod

Updating Secrets

Stopping a Running DAG

Running specific tasks within a DAG

Docker Images

Data Image

The data_image directory contains everything needed for building and pushing the data-image. If a binary needs to be installed, it should be done in the Dockerfile directly; Python packages should be added to the requirements.txt file and pinned to a confirmed working version.
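
A minimal sketch of building the image locally to verify a newly pinned dependency before pushing; the tag and package below are illustrative, and CI handles the real build and push:

# build the image from the data_image directory
docker build -t data-image:test data_image/
# confirm the pinned package imports inside the image
docker run --rm data-image:test python -c "import pandas; print(pandas.__version__)"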

Airflow Image

The airflow_image dir contains everything needed to build and push not only the airflow-image but also the corresponding k8s deployment manifests. The only manual work that needs to be done for a fresh deployment is setting up an airflow secret. The required secrets can be found in airflow_image/manifests/secret.template.yaml.

The default Airflow instance is the production instance; it uses the airflow postgres db. The testing instance uses the airflow_testing db.

The default instance logs are stored in gs://gitlab-airflow/prod; the testing instance logs are stored in gs://gitlab-airflow/testing.
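
The logs can be browsed with gsutil; a minimal sketch, where the path layout under the bucket is an assumption based on Airflow's default remote-logging structure:

# list log folders for the production instance
gsutil ls gs://gitlab-airflow/prod
# read a specific task log (dag, task, execution date, and try number are placeholders)
gsutil cat "gs://gitlab-airflow/prod/<dag_id>/<task_id>/<execution_date>/1.log"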

dbt Image

The dbt_image directory contains everything needed for building and pushing the dbt-image. If a binary needs to be installed, it should be done in the Dockerfile directly; Python packages should be added to the requirements.txt file and pinned to a confirmed working version. As this image is used by Data Analysts, there should not be much more than dbt in the image.

Python Housekeeping

There are multiple make commands and CI jobs designed to help keep the repo's python clean and maintainable. The following commands in the Makefile will help analyze the repo:

Accessing peered VPCs

Some of the GitLab specific ELTs connect to databases which are in peered GCP projects, such as the usage ping. To allow connections, a few actions have been taken:

  1. The Kubernetes cluster where the runner executes has been setup to use IP aliasing, so each pod gets a real routable IP within GCP.
  2. A VPC peering relationship has been established between the two projects and their networks.
  3. A firewall rule has been created in the upstream project to allow access from the runner Kubernetes cluster's pod subnet.
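
A hedged gcloud sketch of what those three steps look like; every name, network, port, and CIDR below is a placeholder rather than the actual project value:

# 1. the runner cluster is created with IP aliasing so pods get routable IPs
gcloud container clusters create <runner-cluster> --enable-ip-alias --zone us-west1-a
# 2. peer the two projects' networks (a matching peering is needed on the other side)
gcloud compute networks peerings create <peering-name> --network <local-network> \
  --peer-project <upstream-project> --peer-network <upstream-network>
# 3. allow ingress from the runner cluster's pod subnet in the upstream project
gcloud compute firewall-rules create <rule-name> --project <upstream-project> \
  --network <upstream-network> --allow tcp:5432 --source-ranges <pod-subnet-cidr>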

Extractor Documentation

Updating the Runner

We execute our CI jobs in the gitlab-data group with Kubernetes via the gitlab-analysis GCP project. We have a group runner set up to share across all repos.

If a new group runner token needs to be associated, or if we need to update the runner image, these are the basic steps. Note: since the release of Helm 3, it is recommended that all of these commands be run in the Cloud Shell console in GCP. Navigate to the deployment for the runner (currently gitlab-data-gitlab-runner) and use the kubectl dropdown to enter the shell.

To get things installed

brew install kubernetes-helm

gcloud components install kubectl

To get the credentials

gcloud container clusters get-credentials bizops-runner --zone us-west1-a --project gitlab-analysis

To see the helm releases

helm list

To get the chart values for a specific release

helm get values <release_name>

Prep commands

helm init --client-only

helm repo add gitlab https://charts.gitlab.io

helm repo update

To delete a release

helm del --purge <release_name>

To install a release

helm install --namespace <namespace> --name <release_name> -f values.yaml <chart_name>

Example for updating the runner version or group token

gcloud components update # might not have to do in cloud shell
helm list
helm get values gitlab-data
helm get values gitlab-data > values.yaml
nano values.yaml # Update values
helm repo list
helm repo add gitlab https://charts.gitlab.io
helm list
helm del --purge gitlab-data
helm install --namespace gitlab-data --name gitlab-data -f values.yaml gitlab/gitlab-runner