
Data Infrastructure

Related pages: Airflow, Data Image Project, GitLab Data Utils Project, Python Style Guide

System Diagram

graph LR
  subgraph "Snowflake Data Warehouse"
    subgraph "Analytics DB"
      Analytics_Sensitive[Analytics Sensitive Schema]
      Analytics_Staging[Analytics Staging Schema]
      Analytics[Analytics Schema]
    end
    SnowflakeRaw[Raw DB] -- dbt --> Analytics_Sensitive
    SnowflakeRaw -- dbt --> Analytics_Staging
    SnowflakeRaw -- dbt --> Analytics
    SnowflakeRaw -- dbt snapshots --> SnowflakeRaw
    SnowflakeRaw -- Roles Snapshot --> SnowflakeRaw
    SnowflakeRaw -- Snowplow Event Sample --> SnowflakeRaw
  end

  subgraph "Sisense"
    Analytics -- Queries --> Sisense_data(Sisense)
    Analytics_Staging -- Queries --> Sisense_data
  end

  subgraph "Stitch"
    StitchData(Stitch) --> SnowflakeRaw
  end

  subgraph "Fivetran"
    FivetranSource(Fivetran) --> SnowflakeRaw
  end

  subgraph "Airflow"
    subgraph "Other DAGs"
      AirflowDAGs(Other DAGs) --> SnowflakeRaw
    end
    subgraph "Postgres Pipeline"
      PostgresPipeline(Postgres Pipeline) --> SnowflakeRaw
    end
    subgraph "Sheetload"
      Sheet_Load(Sheetload) --> SnowflakeRaw
    end
  end

  subgraph "Internal Data Sources"
    CloudSQL[Airflow CloudSQL] --> StitchData
    CI_Stats[CI Stats CloudSQL] --> PostgresPipeline
    GitLab_Profiler[GitLab Profiler CloudSQL] --> PostgresPipeline
    YAML(YAML Extracts) --> AirflowDAGs
    Engineering(Engineering Commit Extracts) --> AirflowDAGs
    PoP_MRs(Part of Product Merge Request Extracts) --> AirflowDAGs
    Handbook_MRs(Handbook Merge Request Extracts) --> AirflowDAGs
    subgraph "Internal Azure VM"
      VersionDB[Version DB] --> PostgresPipeline
      CustomersDB[Customers DB] --> PostgresPipeline
      LicenseDB[License DB] --> PostgresPipeline
    end
  end

  subgraph "AWS S3"
    Greenhouse_S3(Greenhouse Bucket) --> Sheet_Load
    Snowplow(Snowplow Bucket) -- Snowpipe --> SnowflakeRaw
  end

  subgraph "Third-Party Data Sources"
    Bamboo(BambooHR) --> AirflowDAGs
    Greenhouse(Greenhouse) --> Greenhouse_S3
    Zuora(Zuora) --> StitchData
    Zendesk(Zendesk) --> StitchData
    SalesForce(SalesForce) --> StitchData
    Netsuite(Netsuite) --> FivetranSource
  end

  subgraph "GitLab Dotcom"
    GitLab_dotcom[Postgres] --> PostgresPipeline
    GitLab_snowplow(Snowplow Events) -- Snowplow Infra --> Snowplow
  end

  subgraph "Google Sheets"
    Google_Sheets(Google Sheets) --> Sheet_Load
  end

Airflow

We use Airflow for all orchestration.

In Production

All DAGs are created using the KubernetesPodOperator, so the Airflow pod itself has minimal dependencies and doesn't need to be restarted unless a major infrastructure change takes place.

There are 4 containers running in the current Airflow deployment as defined in the deployment.yml:

  1. A sidecar container that checks the repo activity feed for merges to master; if there has been one, it reclones the repo so that Airflow runs the freshest DAGs.
  2. The Airflow scheduler
  3. The Airflow webserver
  4. A cloudsql proxy that allows Airflow to connect to our cloudsql instance
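
To verify which containers are running, the deployment spec can be queried directly. A quick check, assuming the deployment is named airflow-deployment as referenced in the Common Airflow and Kubernetes Tasks section below:

kubectl get deployment airflow-deployment -o jsonpath='{.spec.template.spec.containers[*].name}'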

Kubernetes Setup

We run in the gitlab-analysis project in Google Cloud Platform (GCP). Airflow runs in the data-ops cluster.

Nodepools

Within this cluster there are 2 nodepools: highmem-pool and scd-1. Almost every job runs in the highmem-pool nodepool.

The scd-1 nodepool is labeled pgp=scd and also has a taint of scd=true. For a job to be scheduled in this pool, its task must have a nodeAffinity for the pool and a toleration that matches the taint. See this MR where we added the affinity and toleration for the Slowly-Changing Dimensions task of our postgres pipeline jobs.
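
For reference, a sketch of how a pool like scd-1 could be provisioned with that label and taint via gcloud; the NoSchedule effect is an assumption, and the label and taint values are the only details taken from the paragraph above:

gcloud container node-pools create scd-1 \
  --cluster=data-ops \
  --zone=us-west1-a \
  --node-labels=pgp=scd \
  --node-taints=scd=true:NoSchedule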

DNS

To enable the URL airflow.gitlabdata.com to point to our cluster, a static IP was provisioned in the us-west1 region using the command gcloud compute addresses create airflow-west --region=us-west1. The IP generated was 35.233.169.210. This is available by running gcloud compute addresses list. Note that the static IP must be a regional and not global IP for the TLS steps to work.
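
The relevant commands in one place; the describe call is an optional verification step:

gcloud compute addresses create airflow-west --region=us-west1
gcloud compute addresses list
gcloud compute addresses describe airflow-west --region=us-west1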

This was mapped to the domain in Route 53 by the infrastructure team, documented in this issue.

TLS

A certificate was generated for the airflow.gitlabdata.com domain via this infrastructure issue. This certificate was saved to a kubernetes secret by running the command kubectl create secret tls airflow-tls --cert=airflow.gitlabdata.com.chained.crt --key=airflow.gitlabdata.com.key - these are the actual filenames for the chained certificate and key. This created the secret airflow-tls. The certificate files (site, chain, chained (site+chain), and key) are also stored in the Data Team Secure vault in 1Password.
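
For reference as a copy-pasteable block, with a get call added to confirm the secret exists:

kubectl create secret tls airflow-tls --cert=airflow.gitlabdata.com.chained.crt --key=airflow.gitlabdata.com.key
kubectl get secret airflow-tls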

We decided to use the NGINX Ingress Controller since it has excellent built-ins for redirecting and enforcing TLS. NGINX was installed into the cluster with the command helm install airflownginx stable/nginx-ingress --values nginx_values.yaml. The NGINX values file defines the load balancer IP address, which is set to the static address generated in the previous section. The values passed into the install command are expanded in the controller-deployment.yaml file.

If NGINX needs to be deleted and reinstalled, that can be done via helm delete airflownginx.
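
The delete-and-reinstall cycle, consolidated; for value-only changes, helm upgrade airflownginx stable/nginx-ingress --values nginx_values.yaml should also work, though the delete/reinstall flow is what we have documented:

helm delete airflownginx
helm install airflownginx stable/nginx-ingress --values nginx_values.yaml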

The ingress definition was also updated with these settings:

Although not strictly necessary, we found it cleaner to delete the ingress when applying changes. This can be done via the UI in GCP or via the command kubectl delete ingress airflow-ingress. Applying the new configuration is done via the command kubectl apply -f ingress.yaml.
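
In sequence, with a final get call to verify the new ingress:

kubectl delete ingress airflow-ingress
kubectl apply -f ingress.yaml
kubectl get ingress airflow-ingress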

GCP IAM

It is recommended for Engineers to have the following permissions for GCP:

Handling Failed Jobs

There should never be more than one failed DAG run visible for any DAG at one time. For incremental jobs that rely on the execution_date, such as the extract from the gitlab.com database, any failed DAGs need to have their task instances cleared so that they can be rerun once the fix has been applied.

For jobs that are not dependent on execution_date, the job should be rerun manually once the fix is applied and the failed DAGrun(s) should be deleted. A failed DAGrun for a DAG should mean that the current state of that DAG is broken and needs to be fixed.

This will make it easier to glance at the list of DAGs in Airflow and immediately know what needs attention and what doesn't.
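
Clearing task instances can be done from the Airflow UI, but a CLI sketch run in the scheduler pod looks like the following; this assumes the Airflow 1.x clear command, and the DAG name and date range are illustrative:

# Clear only the failed task instances so they rerun once the fix is deployed
airflow clear gitlab_com_db_extract -s 2019-10-30 -e 2019-11-04 --only_failed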

Backfills

If incremental runs are missed for a given DAG or there is missing data in a table, there are two ways to do a backfill. If the table is small and a backfill would be relatively quick, dropping the table and doing a full sync is an option. However, when a DAG is stopped due to upstream errors and a large number of tables are affected, a full sync may not be possible.

In the latter case, it is better to run the backfill command in the airflow scheduler pod container. The command is:

airflow backfill gitlab_com_db_extract -s 2019-10-30 -e 2019-11-04 --delay_on_limit 30 --reset_dagruns

This will clear any DAGruns and task instances that already exist for the given time frame, while also generating any new DAGruns that don't exist for the time frame. The Airflow documentation for the CLI details the available flags.

If DAG runs already exist for the timeframe that is being backfilled, the tasks may run concurrently when running the command above. If the DAGs need to run serially:

In Merge Requests

To facilitate easier use of Airflow locally while still properly testing our DAGs running in Kubernetes, we use docker-compose to spin up local Airflow instances that can then run their DAGs in Kubernetes using the KubernetesPodOperator.

The flow from code change to testing in Airflow should look like this (this assumes there is already a DAG for that task):

  1. Commit and push your code to the remote branch.
  2. Run make init-airflow to spin up the postgres db container and init the Airflow tables; it will also create a generic Admin user. You will get an error if Docker is not running.
  3. Run make airflow to spin up Airflow and attach a shell to one of the containers.
  4. Open a web browser and navigate to localhost:8080 to see your own local webserver. A generic Admin user is automatically created for you in MR airflow instances with the username and password both set to admin.
  5. In the airflow shell, run a command to trigger the DAG/task you want to test, for example airflow run snowflake_load snowflake-load 2019-01-01 (as configured in the docker-compose file, all kube pods will be created in the testing namespace). If you want to run an entire DAG (for instance the dbt DAG, to test the branching logic), the command would be something like airflow backfill dbt -s 2019-01-01T00:00:00 -e 2019-01-01T00:00:00.
  6. Once the job is finished, you can navigate to the DAG/Task instance to review the logs.
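
The whole local flow, consolidated from the steps above:

make init-airflow    # postgres container + Airflow metadata tables + generic Admin user
make airflow         # spin up Airflow and attach a shell to a container
# inside the attached airflow shell:
airflow run snowflake_load snowflake-load 2019-01-01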

There is also a make help command that describes what commands exist and what they do.

Some gotchas:

Video Walkthroughs

Project variables

Our current implementation uses the following project variables:

The following flavors are defined:

The following variables are set at the job level depending on the running environment and should not be set in the project settings.

Common Airflow and Kubernetes Tasks

Tips

Debugging Airflow Problems

Sometimes things break and it's not clear what's happening. Here are some common things to check.

Connecting to the Kubernetes Airflow Cluster:

  1. Install Kubectl

  2. Connect it to the data team cluster by running gcloud container clusters get-credentials data-ops --zone us-west1-a --project gitlab-analysis

  3. Run kubectl get pods and make sure it returns successfully

  4. ALL OF YOUR COMMANDS TOUCH PRODUCTION, THERE IS CURRENTLY NO TESTING ENVIRONMENT IN K8S. The canonical way to test is to use the local docker-compose setup.

Access Airflow Webserver UI

Updating Airflow

View Resources

View Persistent Volumes

Restart Deployment and Pods

The resource manifests for kubernetes live in airflow-image/manifests/. To create or update these resources in kubernetes, first run kubectl delete deployment airflow-deployment and then run kubectl apply -f <manifest-file.yaml>. Because we are using a persistent volume that can only be claimed by one pod at a time, we can't use the usual kubectl apply -f for modifications. A fresh deployment must be set up each time.
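
A typical cycle; the manifest filename here is illustrative, so use whichever file in airflow-image/manifests/ you changed:

kubectl delete deployment airflow-deployment
kubectl apply -f airflow-image/manifests/deployment.yaml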

Access Shell with Pod

Updating Secrets

Stopping a Running DAG

Running specific tasks within a DAG

Setting/Changing an Airflow variable

Changing/Debugging Airflow Metrics

Managing Persistent Volume Claim and its Data

The persistent volume claim for Airflow is defined in the persistent_volume.yaml manifest. It is used in the deployment for logs and writes to the /usr/local/airflow/logs directory. If the persistent volume claim is full, there are two solutions:

Deleting Data
Increase Claim Size

Alternatively, you can update the persistent_volume.yaml definition in the project. However, redeploying this may delete the data already in the claim. This has not been tested yet.
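
If patching the live claim instead, a sketch assuming a hypothetical claim name airflow-logs and target size of 20Gi; note that in-place expansion also requires the underlying StorageClass to have allowVolumeExpansion enabled:

# Grow the claim in place rather than redeploying the manifest
kubectl patch pvc airflow-logs -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'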

Docker

Compose

We use Docker Compose to define and run our different images and applications. These are activated by users via the Makefile.

Environment Variables

There are a few environment variables you will need to properly use the Makefile and Docker Compose.

Images

Data

The data_image directory contains everything needed for building and pushing the data-image. If a binary needs to be installed, it should be done in the Dockerfile directly; python packages should be added to the requirements.txt file and pinned to a confirmed working version.

Airflow

The airflow_image directory contains everything needed to build and push not only the airflow-image but also the corresponding k8s deployment manifests. The only manual work that needs to be done for a fresh deployment is setting up an airflow secret. The required secrets can be found in airflow_image/manifests/secret.template.yaml.

The default Airflow instance is the production instance; it uses the airflow postgres db. The testing instance uses the airflow_testing db.

The default instance logs are stored in gs://gitlab-airflow/prod; the testing instance logs are stored in gs://gitlab-airflow/testing.

dbt

The dbt_image directory contains everything needed for building and pushing the dbt-image. If a binary needs to be installed, it should be done in the Dockerfile directly; python packages should be added to the requirements.txt file and pinned to a confirmed working version. As this image is used by Data Analysts, there should not be much more than dbt in the image.

Python Housekeeping

There are multiple make commands and CI jobs designed to help keep the repo's Python code clean and maintainable. The following commands in the Makefile will help analyze the repo:

Accessing peered VPCs

Some of the GitLab specific ELTs connect to databases which are in peered GCP projects, such as the usage ping. To allow connections, a few actions have been taken:

  1. The Kubernetes cluster where the runner executes has been set up to use IP aliasing, so each pod gets a real, routable IP within GCP.
  2. A VPC peering relationship has been established between the two projects and their networks.
  3. A firewall rule has been created in the upstream project to allow access from the runner Kubernetes cluster's pod subnet.
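
A sketch of what steps 2 and 3 could look like with gcloud; every name, port, and range here is an illustrative assumption rather than the actual values used:

# Step 2: peer the two networks (repeat from the other project's side)
gcloud compute networks peerings create data-peering \
  --network=runner-network \
  --peer-project=upstream-project \
  --peer-network=upstream-network
# Step 3: open the upstream firewall to the runner cluster's pod subnet
gcloud compute firewall-rules create allow-runner-pods \
  --project=upstream-project \
  --network=upstream-network \
  --allow=tcp:5432 \
  --source-ranges=10.0.0.0/14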

Extractor Documentation

Updating the Runner

We execute our CI jobs in the gitlab-data group with Kubernetes via the gitlab-analysis GCP project. We have a group runner set up to share across all repos.

These are the basic steps for the case where a new group runner token needs to be associated, or where we need to update the runner image. Note: since the release of helm 3, it is recommended that all of these commands be run in the Cloud Shell console in GCP. Navigate to the deployment for the runner (currently gitlab-data-gitlab-runner) and use the kubectl dropdown to enter the shell.

To get things installed

brew install kubernetes-helm

gcloud components install kubectl

To get the credentials

gcloud container clusters get-credentials bizops-runner --zone us-west1-a --project gitlab-analysis

To see the helm releases

helm list

To get the chart values for a specific release

helm get values <release_name>

Prep commands

helm init --client-only

helm repo add gitlab https://charts.gitlab.io

helm repo update

To delete a release

helm del --purge <release_name>

To install a release

helm install --namespace <namespace> --name <release_name> -f values.yaml <chart_name>

Example for updating the runner version or group token

gcloud components update # might not have to do in cloud shell
helm list
helm get values gitlab-data
helm get values gitlab-data > values.yml
nano values.yml # Update values
helm repo list
helm repo add gitlab https://charts.gitlab.io
helm list
helm del --purge gitlab-data
helm install --namespace gitlab-data --name gitlab-data -f values.yml gitlab/gitlab-runner

Manually updating Version and License DBs

Version DB

  1. Create an issue in the GitLab Infrastructure Team project similar to this issue
  2. Include the link to the runbook
  3. Assign to Devin Sylva
  4. Devin or someone on the infra team will upload to a GCS bucket in the gitlab-internal GCP project.
    • Ensure you have access to this project. Make an access request if you don't.
  5. On the command line via gsutil, authenticate with both the gitlab-analysis and gitlab-internal GCP projects. It seems that gsutil authenticates using gcloud, so if authentication with gcloud is working, gsutil should work as well. To double-check that authentication is working, attempt to run gsutil ls gs://tmurphy-temp and gsutil ls gs://version-db-dmp. If either fails, it may mean that more permissions are needed. Double-check that you can see the folders in the Google Cloud console: tmurphy-temp in the gitlab-internal project, and version-db-dmp in the gitlab-analysis project.
  6. Copy the export to the version-db-dmp bucket with the command gsutil cp gs://tmurphy-temp/data-2020-02-11.gz gs://version-db-dmp/data-2020-02-11.gz replacing tmurphy-temp with the bucket in gitlab-internal and adjusting the date as necessary.
  7. Navigate to the version-db-dump CloudSQL instance
  8. Click over to Databases and delete versiondb
  9. Once that is done, on the same screen click Create database and name it versiondb. Having the exact name is important for permissioning.
  10. Once that is done, navigate back to Overview and select Import at the top of the screen.
    • Select the version-db-dmp bucket and find the file you copied over.
    • Select versiondb as the database to import it into.
  11. Click Import. This will take a while.
  12. Once this is done, confirm that the Version DB Dump Stitch job is Active. If it is, then it should automatically pick up the new data.
  13. Turn the job off once the data is up to date to save on rows.

License DB

  1. Create an issue in the GitLab Infrastructure Team project requesting an extract from the License Database. Issue 8580 is where the first extract was done.
  2. Due to CloudSQL requirements, the command for the extract should be pg_dump -Fp --no-owner --no-acl license_gitlab_com_production | sed -E 's/(DROP|CREATE|COMMENT ON) EXTENSION/-- \1 EXTENSION/g' > ${DUMPFILE}
  3. Tag Jose Finotto and Gerardo Herzig to request the extract be run.
  4. If the extract is done by someone from OnGres then most likely they will need to send the file over to you via Slack. It is < 200MB so it should transfer fine.
  5. Once you have the file, upload to the license-db-dump GCS bucket in the gitlab-analysis project.
  6. Navigate to the license-db-restore CloudSQL instance
  7. Click over to Databases and delete license
  8. Once that is done, on the same screen click Create database and name it license. Having the exact name is important for permissioning.
  9. Once that is done, navigate back to Overview and select Import at the top of the screen.
    • Select the license-db-dump bucket and find the file you copied over.
    • Select license as the database to import it into.
  10. Click Import. This will take a while.
  11. Once this is done, confirm that the License DB Dump job is Active. If it is then it should automatically pick up the new data.
  12. Turn the job off once the data is up to date to save on rows.

Gitlab Data Utilities

This is a project for centralizing handy functions we use across the team. The project lives at https://gitlab.com/gitlab-data/gitlab-data-utils.

Cutting a release

  1. Tag the commit with git tag v#.#.#
  2. Push the tag git push origin --tags
  3. Cut a release with a command similar to this:

     curl --header 'Content-Type: application/json' --header "PRIVATE-TOKEN: <your-private-token>" \
      --data '{ "name": "Gitlab Data Utilities v0.0.1", "tag_name": "v0.0.1", "description": "Initial tagged release"}' \
      --request POST https://gitlab.com/api/v4/projects/12846518/releases
    
  4. Update the docker image as was done in this merge request