To edit the system diagram, go to the Lucidchart document and make the desired changes. If the changes are contained in the red box, they should be automatically reflected in GitLab. If the viewing area needs to be changed, a new link will need to be generated (following the Lucidchart documentation on publishing) and an MR made to update the link used above to pull in the image.
We use Airflow for all orchestration. All DAGs are created using the KubernetesPodOperator, so the airflow pod itself has minimal dependencies and doesn't need to be restarted unless a major infrastructure change takes place.
There are 4 containers running in the current Airflow deployment as defined in the deployment.yml:
We run in the gitlab-analysis project in Google Cloud Platform (GCP). Airflow runs in the data-ops cluster. Within this cluster there are 4 nodepools: highmem-pool, production-task-pool, testing-pool, and sdc-1. Each nodepool has a dedicated use for ease of monitoring and resource management.
1) highmem-pool - used to run the Airflow server, scheduler, and network components. Autoscales from 1-2 nodes.
2) production-task-pool - used to run most production Airflow tasks except SCD tasks. Autoscales from 1-5 nodes.
3) sdc-1 - used to run production SCD extractions. Autoscales from 1-3 nodes.
4) testing-pool - a pool that does not usually have a running node, but is used to run engineers' locally-launched Airflow tasks. Autoscales from 0-1 nodes.
All nodepools except the highmem-pool have labels and taints to manage which nodepool launches which Airflow task. For a task to be scheduled in a nodepool, it must have nodeAffinity for the pool and a toleration that matches the pool's taint. See this MR where we added the affinity and toleration for the Slowly-Changing Dimensions task for our postgres pipeline jobs.
To enable the URL airflow.gitlabdata.com to point to our cluster, a static IP was provisioned in the us-west1 region using the command gcloud compute addresses create airflow-west --region=us-west1. The IP generated was 35.233.169.210. This is available by running gcloud compute addresses list. Note that the static IP must be a regional and not global IP for the TLS steps to work.
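For reference, the provisioning and verification commands described above, together:
# Provision the regional static IP (must be regional, not global, for the TLS steps)
gcloud compute addresses create airflow-west --region=us-west1
# Confirm the address that was created
gcloud compute addresses list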
This was mapped to the domain in Route 53 by the infrastructure team, documented in this issue.
A certificate was generated for the airflow.gitlabdata.com domain via this infrastructure issue.
This certificate was saved to a kubernetes secret by running the command kubectl create secret tls airflow-tls --cert=airflow.gitlabdata.com.chained.crt --key=airflow.gitlabdata.com.key - these are the actual filenames for the chained certificate and key. This created the secret airflow-tls.
The certificate files (site, chain, chained (site+chain), and key) are also stored in the Data Team Secure vault in 1password.
We decided to use the NGINX Ingress Controller since it has excellent built-ins for redirecting and enforcing TLS.
NGINX was installed into the cluster with the command helm install airflownginx stable/nginx-ingress --values nginx_values.yaml. The NGINX values file defines what the load balancer IP address is; the load balancer IP is set to the address generated in the previous section. The values passed into the install command are expanded in the controller-deployment.yaml file. If NGINX needs to be deleted and reinstalled, that can be done via helm delete airflownginx.
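Putting the install and reinstall commands together:
# Install the NGINX ingress controller with our values file
helm install airflownginx stable/nginx-ingress --values nginx_values.yaml
# If it needs to be removed before reinstalling
helm delete airflownginx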
The ingress definition was also updated with these settings:
* TLS uses the airflow-tls kubernetes secret
* Traffic is routed to the airflow-webserver service and not the default backend
Although not strictly necessary, it is cleaner to delete the ingress when applying changes. This can be done via the UI in GCP or via the command kubectl delete ingress airflow-ingress. Applying the new configuration is done via the command kubectl apply -f ingress.yaml.
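The full reapply sequence described above looks like:
# Delete the existing ingress, then apply the updated definition
kubectl delete ingress airflow-ingress
kubectl apply -f ingress.yaml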
Airflow logs of all kinds can be found in the Logs Explorer of Google Cloud. This application can be a little bit overwhelming with all of the different logs it collects, so here are some tips:
* Make sure the Log fields view is open. This view is critical in filtering logs down to exactly what you're interested in, whether it be the webserver or a specific DAG run. To open the view, click PAGE LAYOUT in the top right and make sure Log fields is checked. The Log fields view shows up just to the left of the actual logs.
* Within the Log fields view, you can filter to logs from specific processes. The container_name is a great filter to use. Most of the containers have to do with the airflow deployment itself. If looking for a specific DAG run, the container_name is base, which is shared by all DAGs, so more filtering will need to be done. The pod name is unique for each DAG run, so the pod name can be filtered on if exploring a specific DAG run's logs.
It is recommended for Engineers to have the following permissions for GCP:
For developing with Airflow locally, Engineers will also need a service account. These credentials should point to a service account provisioned by your manager. The account should follow the same pattern as your email, i.e. tmurphy@gitlab-analysis.iam.gserviceaccount.com. Recommended permissions are:
* Cloud SQL Client
* Kubernetes Engine Developer
* Storage Object Creator
* Storage Object Viewer
The service account key should be downloaded as JSON and sent to the user for secure storage on their computer.
Create Service Account
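Once the service account exists, one way to create and download the JSON key is with the gcloud CLI. This is a sketch using the example account name above; the key filename is illustrative:
# Create a JSON key for the provisioned service account and save it locally
gcloud iam service-accounts keys create gcloud_service_creds.json \
  --iam-account=tmurphy@gitlab-analysis.iam.gserviceaccount.com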
There should never be more than one failed DAG run visible for any DAG at one time. For incremental jobs that rely on the execution_date, such as the extract from the gitlab.com database, any failed DAGs need to have their task instances cleared so that they can be rerun once the fix has been applied.
For jobs that are not dependent on execution_date, the job should be rerun manually when the fix is applied and the failed DAGrun(s) should be deleted. If there is a failed DAGrun for a DAG, it should mean that the current state of that DAG is broken and needs to be fixed.
This will make it easier to glance at the list of DAGs in Airflow and immediately know what needs attention and what doesn't.
If incremental runs are missed for a given DAG or there is missing data in a table, there are two ways to do a backfill. If the table is small and a backfill would be relatively quick then dropping the table and doing a full sync is an option. However, for times when a DAG is stopped due to upstream errors, this may not be possible when there are a large number of tables.
In the latter case, it is better to run the backfill command in the airflow scheduler pod container. The command is:
airflow backfill gitlab_com_db_extract -s 2019-10-30T00:00:00 -e 2019-11-04T12:00:00 --delay_on_limit 30 --reset_dagruns
This will clear any DAGruns and task instances that already exist for the given time frame while also generating any new DAGruns that don't exist for the time frame. The Airflow documentation for the CLI details what the flags are.
Consider using the --task_regex flag to filter the tasks triggered. This is useful in the case of gitlab.com incremental loads, where "pgp-extract" can be used to skip the downstream dbt tasks.
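For example, combining the backfill command above with the --task_regex flag so that only the extract tasks are triggered might look like:
airflow backfill gitlab_com_db_extract -s 2019-10-30T00:00:00 -e 2019-11-04T12:00:00 \
  --delay_on_limit 30 --reset_dagruns --task_regex "pgp-extract"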
If DAG runs already exist for the timeframe that is being backfilled, the tasks may run concurrently when running the command above. If the DAGs need to run serially:
* Delete the existing DAG runs via Actions –> Delete.
* Go to Browse –> Task Instances and use the search bar and sorting to select the task instances that need to be run again. With those task instances selected, go to Actions –> Clear to clear the state of those task instances.
* Run the backfill command without --reset_dagruns, because there are no dagruns to reset. This should make the backfill run serially.
To facilitate easier use of Airflow locally while still properly testing that our DAGs run in Kubernetes, we use docker-compose to spin up local Airflow instances that then have the ability to run their DAG in Kubernetes using the KubernetesPodOperator. See the Docker section to ensure you have the proper environment variables configured.
The flow from code change to testing in Airflow should look like this (this assumes there is already a DAG for that task):
* Run make init-airflow to spin up the postgres db container and init the Airflow tables; it will also create a generic Admin user. You will get an error if Docker is not running.
* Run make airflow to spin up Airflow and attach a shell to one of the containers.
* Navigate to localhost:8080 to see your own local webserver. A generic Admin user is automatically created for you in MR airflow instances with the username and password set to admin.
* To run a specific task, the command would be something like airflow run snowflake_load snowflake-load 2019-01-01 (as configured in the docker-compose file, all kube pods will be created in the testing namespace). Or if you want to run an entire DAG (for instance the dbt DAG to test the branching logic), the command would be something like airflow backfill dbt -s 2019-01-01T00:00:00 -e 2019-01-01T00:00:00.
There is also a make help command that describes what commands exist and what they do.
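Putting the flow above together (the task and DAG names are the examples from this section):
make init-airflow    # start the postgres container and init the Airflow tables
make airflow         # spin up Airflow and attach a shell to one of the containers
# from the attached shell, run a single task...
airflow run snowflake_load snowflake-load 2019-01-01
# ...or an entire DAG
airflow backfill dbt -s 2019-01-01T00:00:00 -e 2019-01-01T00:00:00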
Some gotchas:
* ERROR: Version in “./docker-compose.yml” is unsupported.
* Make your python file executable with chmod +x your_python_file.py. This will avoid permission denied errors.
* Airflow secrets are defined in kube_secrets.py. This is the source of truth for which secrets Airflow uses; the actual secret value isn't stored in this file, just the pointers.
* Run docker pull <image_name> to force a fresh pull of the latest images.
* FileNotFoundError: [Errno 2] No such file or directory: '/Users/(user)/google-cloud-sdk/bin/gcloud': '/Users/(user)/google-cloud-sdk/bin/gcloud'
This is because the default install location for the Google Cloud SDK on a Mac is now the above, but on Linux and in containers it is installed to /usr/lib/google-cloud-sdk/bin/gcloud. This value is passed to the container from the ~/.kube/config file. To correct this error, all you need to do is edit your ~/.kube/config and update the command path parameter to where it will be in the container: /usr/lib/google-cloud-sdk/bin/gcloud. That file gets updated every time you install the SDK or run this command: gcloud container clusters get-credentials data-ops.
See the related issue for more info.
Our current implementation uses the following project variables:
The following flavors are defined:
* The LOAD flavor is used by the Extract & Load process
* The TRANSFORM flavor is used by the Transform process
* The TEST flavor is for testing using Snowflake
* The PERMISSION flavor is for the permission bot
* The SYSADMIN flavor is for housekeeping tasks (like setting up review instances). This flavor doesn't define SNOWFLAKE_SYSADMIN_DATABASE and SNOWFLAKE_SYSADMIN_WAREHOUSE.
The following variables are set at the job level depending on the running environment and should not be set in the project settings.
Metrics are scraped from the /admin/metrics endpoint on a regular interval.
The function gitlabdata.orchestration_utils.push_to_xcom_file has been created for writing metrics to XCom. This function takes a JSON object and writes it to the XCom file, and should only be called once per task. To use a value in an XCom as a metric, the metric must be a first-class member of the JSON object that is written to the XCom file. For example, {"record_count": 5, "other_record_count": 6} would work if you want to use record_count and other_record_count as metrics. To expose every first-class member as a metric, all can be used. Once the configuration is changed, for the metric to be visible, the airflow image will have to be rebuilt and redeployed as explained in the "Restart Deployment and Pods" section.
Airflow alerting rules live in the rules/airflow.yml file. A new alert can be added by adding to the rules list. The expr element is a PromQL expression that should return 1 when alerting and 0 otherwise. The for element defines how long the expression must evaluate to 1 before the alert is actually triggered. Labels should include a severity. The severity is currently defined system-wide at GitLab, so it should be low severity for Airflow rules unless the GitLab infrastructure member on call should pay attention to it. The team label ultimately determines which Slack channel receives the alerts. The team label should be set to data-analytics for all alerts for the data team, because data-analytics is the name of the team set up in the runbook service catalog.
Alerts are sent to the data-prom-alerts Slack channel and should be investigated and addressed by the team member on triage.
It can be handy to alias kubectl as kbc.
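A minimal sketch of that alias, for a bash or zsh profile:
# alias kubectl to kbc
alias kbc=kubectl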
Sometimes things break and it's not clear what's happening. Here are some common things to check.
* Check for failing pipelines in the analytics or data-image projects.
* Run kubectl get pods and see if a pod labeled airflow-deployment comes back. Try to exec into the pod.
* Run kubectl get pods -A and double check if there are any old or stale pods which may be causing a bottleneck in the namespace.
If <task> had an event of type Pending and the task never starts:
* Connect to the data team cluster by running gcloud container clusters get-credentials data-ops --zone us-west1-a --project gitlab-analysis
* Run kubectl get pods and make sure it returns successfully
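Those two checks, together:
# Point kubectl at the data team cluster, then confirm it responds
gcloud container clusters get-credentials data-ops --zone us-west1-a --project gitlab-analysis
kubectl get pods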
ALL OF YOUR COMMANDS TOUCH PRODUCTION; THERE IS CURRENTLY NO TESTING ENVIRONMENT IN KUBERNETES. The canonical way to test is to use the local docker-compose setup.
* To access the webserver of a deployment directly, run kubectl port-forward deployment/airflow-deployment 1234:8080. You can now navigate to localhost:1234 in a browser and it will take you to the webserver for the instance you port-forwarded to. Note: we no longer need to do this, as we now have a stable URL to access the webserver.
To upgrade the Airflow version:
* Update the version in airflow_image/Dockerfile; the line looks like ARG AIRFLOW_VERSION=<version_number>
* Redeploy; see the Restart Deployment and Pods section below for specific directions for this step.
* exec into one of the containers in the pod and run airflow upgradedb (see the sketch below)
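A sketch of that last step, assuming the scheduler container from the exec example later on this page:
# exec into the scheduler container, then run the schema migration
kubectl exec -ti <pod-name> -c scheduler /bin/bash
airflow upgradedb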
* Run kubectl get all to display any pods, deployments, replicasets, etc.
* Use the kubectl get pods command to see a list of all pods in your current namespace.
* To see persistent volumes and persistent volume claims, run kubectl get pv and kubectl get pvc respectively. The command to get persistent volumes will show all volumes regardless of namespace, as persistent volumes don't belong to namespaces. Persistent volume claims, however, do belong to certain namespaces and therefore will only display ones within the namespace of your current context.
The resource manifests for Kubernetes live in airflow_image/manifests/. To create or update these resources in Kubernetes, first run kubectl delete deployment airflow-deployment and then run kubectl apply -f <manifest-file.yaml>. Because we are using a persistent volume that can only be claimed by one pod at a time, we can't use the usual kubectl apply -f for modifications. A fresh deployment must be set up each time, as shown below.
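Using the deployment manifest as the example file, the flow is:
# a fresh deployment each time: delete, then re-apply the manifest
kubectl delete deployment airflow-deployment
kubectl apply -f airflow_image/manifests/deployment.yaml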
To restart the deployment and pods, run kubectl delete deployment airflow-deployment. This will wipe out any and all pods (including ones being run by Airflow, so be careful). Then, from the data-image repository root folder, run kubectl apply -f airflow_image/manifests/deployment.yaml to send the manifest back up to Kubernetes and respawn the pods.
To get into a shell that exists in a kube pod, use the command kubectl exec -ti <pod-name> -c <container-name> /bin/bash. This will drop you into a shell within the pod and container that you chose. This can be useful if you want to run airflow commands directly within a shell instead of trying to do it through the webserver UI.
kubectl exec -ti airflow-deployment-56658758-ssswj -c scheduler /bin/bash
is an example command to access that pod and the container named scheduler. The container names are listed in airflow_image/manifests/deployment.yaml. This information is also available if you do kubectl describe <pod>, though it is harder to read.
docker-compose exec airflow_scheduler bash
docker-compose run airflow_scheduler bash
Things you might do once you're in a shell:
* Run a specific task with airflow run <dag> <task_name> <execution_date> -f -A, for example airflow run dbt dbt-full-refresh 05-02T15:52:00+00:00 -f -A. The -f flag forces it to rerun even if there was already a success or failure for that task_run; the -A flag forces it to ignore dependencies (aka doesn't care that it wasn't branched to upstream).
* To edit the Airflow secrets in production, run kubectl edit secret airflow -o yaml; this will open the secret in a text editor and you can edit it from there. New secrets must be base64 encoded; the easiest way to do this is to use echo -n <secret> | base64 -. There are some null values in the secret file when you edit it; for the file to save successfully you must change the null values to "", otherwise it won't save properly.
* To edit the secrets for the testing environment, run kubectl edit secret airflow -o yaml --namespace testing. This command follows the same guidelines as those described above for production. If you don't add new secrets to the testing environment, the DAGs that use them will not run when testing.
* To run a task manually, exec into a pod with kubectl exec -ti <pod_name> -c <webserver|scheduler> /bin/bash. In most cases, airflow run <dag_id> <task_id> <execution_date> will be sufficient. For a dbt full-refresh, a few more flags are required: airflow run dbt dbt-full-refresh <execution_date> -f -A. The -f flag forces the task to run even if it is already marked as a success or failure. The -A flag tells it to run regardless of any dependencies it should have.
* Variables can be read in a DAG by importing from airflow.models import Variable and then getting the variable value with Variable.get(variable_name). To add a variable in the webserver UI, click the + button, type in the desired key and value, check the box if the value should be encrypted, and press save. To edit a variable, click on the edit icon next to the variable of interest, change what you would like, and press save.
* To temporarily skip the BambooHR extract test for specific tables, set the BAMBOOHR_SKIP_TEST variable. The value should be a comma-separated list of the names of the tables you would like to temporarily skip testing. Once the BambooHR extract has succeeded, please either remove the variable or change the value to an empty string.
The persistent volume claim for Airflow is defined in the persistent_volume.yaml manifest. It is used in the deployment for logs and writes to the /usr/local/airflow/logs directory. If the persistent volume claim is full, there are two solutions:
1) Delete older files from the logs directory.
2) Increase the size of the persistent volume. The persistent volume claim is named pvc-<GUID>; the GCE disk that will need to be updated is the pdName under gcePersistentDisk. After resizing the disk in GCP, update storage, which is under spec: capacity: in the volume definition.
Alternatively, you can update the persistent_volume.yaml definition in the project. However, redeploying this may delete the data already in the claim. This has not been tested yet.
We run Meltano in its own Kubernetes cluster, in the meltano namespace. The primary project is https://gitlab.com/gitlab-data/meltano-gitlab, which is a fork of https://github.com/Mashey/gitlab-meltano.
The Kubernetes cluster is running in GCP as meltano-gitlab.
The Meltano UI is not exposed to the internet. To view the logs, we have to look at the Kubernetes container logs. They can be found under the "LOGS" tab, or on the overview page by selecting the meltano-gitlab cluster under Workloads.
To update what extractors are being used, update the meltano.yml file in the main project. Add a git tag after the change is merged and update the gitlab-app.yml kubernetes manifest to point to the new image.
Meltano uses Airflow internally, and we use Cloud SQL as the metadata database (the meltano-gitlab database).
# Connect to the Kubernetes cluster from local (prerequisite: the Google Cloud SDK is installed). If the command doesn't work, go to GCP, select the cluster under Kubernetes Engine, and select "Connect to cluster"; it will reveal the latest command.
gcloud container clusters get-credentials meltano-gitlab --zone us-west1-a --project gitlab-analysis
Many Kubernetes commands are similar to what we use for Airflow, except the flag --namespace=meltano or -n=meltano is used. For example:
# Editing Secrets
$ kubectl edit secrets tap-secrets --namespace=meltano
$ kubectl edit secrets admin-sdk --namespace=meltano
$ kubectl edit secrets cloud-sql --namespace=meltano
# Exec into a container
$ kubectl exec -ti meltano-gitlab-85bf9f958b-bbffl -c gitlab /bin/bash --namespace=meltano
# Applying the updated manifest does NOT require the namespace
$ kubectl apply -f ./gitlab-app.yaml
# Delete the deployment of meltano namespace
kubectl delete deployment meltano-gitlab --namespace=meltano
To add new tables or fields to be pulled by the postgres pipeline, the manifest file for the specific source database will need to be changed. These manifest files are in a folder here. To add a new table, add an item to the tables list. The import_db value should match the other items in that manifest. The import_query is executed directly against the source database. Where possible, make the import query incremental by adding a WHERE clause such as:
WHERE updated_at BETWEEN '{EXECUTION_DATE}'::timestamp - interval '{HOURS} hours'
AND '{EXECUTION_DATE}'::timestamp
The export_schema value should match the other items in the manifest. The export_table value should match the name of the table being imported. The export_table_primary_key should be set to the primary key of that specific table in the postgres database. If the import query is not incremental, then add advanced_metadata: true, which creates another column that differentiates when data was loaded.
After the manifest changes are made, create a merge request using the add_manifest_tables template, then follow the instructions there.
For further details on the technical implementation, see the README here.
When testing postgres pipeline (pgp) locally in Airflow, there are a few things to keep in mind:
The gitlab.com read replica database used for data pulls into Snowflake occasionally has problems with replication lag. This means the database is behind in applying changes to the replica from the primary database. This can kill queries from pgp extracts and cause significant delays. The amount of lag can be monitored by checking Thanos.
The Infrastructure team has an alert set up to post to the #alerts Slack channel when the replication lag is over 3 hours. The name of this alert is PostgreSQL_ReplicationLagTooLarge_ArchiveReplica. Monitoring the #alerts channel for this alert name can help proactively address high lag situations.
The data engineering team maintains a Dangerfile in the main GitLab project here with the purpose of alerting the @gitlab-data/engineers group about any changes to the gitlab.com source schema definition file. Being notified about source schema changes is essential to avoiding errors in the extraction process from the GitLab.com database, since extraction is based on running a series of select statements. The data engineer on triage for any given day is the DRI for investigating schema changes as well as creating issues for any needed action from the data team.
We use Docker compose to define and run our different images and applications. These are activated by users via the Makefile.
There are a few environment variables you will need to properly use the Makefile and Docker Compose.
* GIT_BRANCH - typically a feature branch like 2112-my-feature
* KUBECONFIG - defines where Kubernetes credentials are stored on your local machine, e.g. KUBECONFIG="/Users/tmurphy/.kube/config"
* GOOGLE_APPLICATION_CREDENTIALS - CloudSQL credentials for connecting to GCP. Typically points to a JSON file, e.g. GOOGLE_APPLICATION_CREDENTIALS="/Users/tmurphy/Projects/GitLab/gcloud_service_creds.json"
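For example, these might be exported in a shell profile (the paths shown are the illustrative ones above):
export GIT_BRANCH="2112-my-feature"
export KUBECONFIG="/Users/tmurphy/.kube/config"
export GOOGLE_APPLICATION_CREDENTIALS="/Users/tmurphy/Projects/GitLab/gcloud_service_creds.json"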
The data_image directory contains everything needed for building and pushing the data-image. If a binary needs to be installed, it should be done in the Dockerfile directly; python packages should be added to the requirements.txt file and pinned to a confirmed working version.
The airflow_image directory contains everything needed to build and push not only the airflow-image but also the corresponding Kubernetes deployment manifests. The only manual work that needs to be done for a fresh deployment is setting up an airflow secret. The required secrets can be found in airflow_image/manifests/secret.template.yaml.
The default airflow instance is the production instance; it uses the airflow postgres db. The testing instance uses the airflow_testing db.
The default instance logs are stored in gs://gitlab-airflow/prod; the testing instance logs are stored in gs://gitlab-airflow/testing.
The dbt_image directory contains everything needed for building and pushing the dbt-image. If a binary needs to be installed, it should be done in the Dockerfile directly; python packages should be added to the requirements.txt file and pinned to a confirmed working version. As this image is used by Data Analysts, there should not be much more than dbt in the image.
Production images are only built when a git tag is pushed on the master branch. Our current release flow is as follows:
# Either on the master branch or in a feature branch run:
git tag v<sem_ver>
# Push the tag
git push origin $(git describe --tags --abbrev=0)
The build pipelines will automatically run. Make a new MR anywhere these images are referenced to update the tag.
There are multiple make commands and CI jobs designed to help keep the repo's python clean and maintainable. The following commands in the Makefile will help analyze the repo:
* make lint will run the black python linter and update files (this is not just a check).
* make pylint will run the pylint checker but will NOT check for code formatting, as we use black for this. It will check for duplicated code, possible errors, warnings, etc.: general things to increase code quality. It ignores the DAGs dir, as those are not expected to follow general code standards.
* make radon will test relevant python code for cyclomatic complexity and show functions or modules with a score of B or lower.
* make xenon will run a complexity check that returns a non-zero exit code if the threshold isn't met. It ignores the shared_modules and transform repos until they get deleted/deprecated or updated at a later date.
Some of the GitLab specific ELTs connect to databases which are in peered GCP projects, such as the usage ping. To allow connections, a few actions have been taken:
We execute our CI jobs in the gitlab-data group with Kubernetes via the gitlab-analysis GCP project. We have a group runner set up to share across all repos.
In the case where a new group runner token needs to be associated, or we need to update the runner image, these are the basic steps. Note: since the release of helm 3, it is recommended that all of these commands be run in the Cloud Shell console in GCP. Navigate to the deployment for the runner (currently gitlab-data-gitlab-runner) and use the kubectl dropdown to enter the shell.
To get the credentials
gcloud container clusters get-credentials bizops-runner --zone us-west1-a --project gitlab-analysis
To see the helm releases
helm list --namespace <namespace>
To get the chart values for a specific release
helm get values --namespace <namespace> <chartname>
Prep commands
helm repo add <chart> <url>
helm repo update
To delete the runner
helm delete --namespace <namespace> <chartname>
To install the runner with the Helm chart
helm install --namespace <namespace> --name <chartname> -f <valuesfile> <chartrepo/name>
Example for updating the runner version or group token
helm list --namespace gitlab-data
helm get values --namespace gitlab-data gitlab-runner
helm get values --namespace gitlab-data gitlab-runner > values.yaml
helm repo add gitlab https://charts.gitlab.io
helm repo update
helm delete --namespace gitlab-data gitlab-runner
helm install --namespace gitlab-data --name gitlab-runner -f values.yaml gitlab/gitlab-runner
Our YAML configuration is as follows:
affinity: {}
checkInterval: 30
concurrent: 10
gitlabUrl: https://gitlab.com
hostAliases: []
imagePullPolicy: IfNotPresent
metrics:
enabled: true
nodeSelector: {}
podAnnotations: {}
podLabels: {}
rbac:
clusterWideAccess: false
create: false
podSecurityPolicy:
enabled: false
resourceNames:
- gitlab-runner
serviceAccountName: gitlab-data-gitlab-runner
resources: {}
runnerRegistrationToken: <token found in https://gitlab.com/groups/gitlab-data/-/settings/ci_cd>
runners:
builds: {}
cache: {}
helpers: {}
image: ubuntu:16.04
outputLimit: 4096
pollTimeout: 180
privileged: false
services: {}
tags: analytics,housekeeping
securityContext:
fsGroup: 65533
runAsUser: 100
There's a once-daily CI job that executes in the version project as well as the license project that runs the database export version script or license script and exports CSV files to a GCS bucket. These files are named gitlab-version-{table_name}-{monday_of_week} or gitlab-license-{table_name}-{monday_of_week} respectively.
The GCS bucket where the version DB CSV files are being exported is set up in Snowflake as the stage raw.version_db.version_dump. The GCS bucket where the license DB files are exported is set up as the stage raw.license_db.license_dump. This means that from Snowflake we can list all of the files, copy the data in the files into tables, and even delete files.
The CSV files are not self-describing. They do not have a column header to know what column is in which position. Because of this, the tables in RAW need to have columns in order that exactly match what order the CSVs are in. In order to easily create the tables, this bash script was created. In order to generate the create statements:
* Run do_create_tables.sh and it should print out all of the create table statements.
The CSV files are being loaded daily from the Snowflake stage with Snowflake tasks. The tasks were generated by running SQL such as:
create or replace task users_load_task
WAREHOUSE = LOADING
SCHEDULE = '1440 minute'
AS
COPY INTO users
from @raw.version_db.version_dump/gitlab-version-users-
file_format = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY='"')
with the LOADER role. These tasks run daily and load only new or updated files. To see the task definitions, you can run show tasks as the LOADER role with the version_db or license_db schema as part of the context.
The tasks are being monitored with a dbt test that makes sure each task has run successfully in the last day. The version test and license test can be found in the internal data-tests project here and here respectively. In order to diagnose problems with the tasks, you can query the raw.snowflake.task_history_view and inspect the error_message column on failed tasks.
If a full refresh is required due to a failing test leading to an SLO breach, take time to investigate the root cause of the failing test. Make an incident and document any findings before triggering the full refresh. It may also be useful to copy the data to another temporary table to allow the full refresh to continue unimpeded.
For data sources extracted by Stitch (see extraction table on the platform page), the recommended way to do a full refresh is as follows:
1) Clone the existing schema:
create schema clone_<DATA_SCHEMA> clone <DATA_SCHEMA>;
2) Truncate the cloned tables. The truncate statements can be generated with (replace <DATA_SCHEMA> appropriately):
SELECT
'truncate table if exists clone_<DATA_SCHEMA>.' || LOWER(table_name) || ';'
FROM "RAW"."INFORMATION_SCHEMA".tables
WHERE LOWER(table_schema) = '<DATA_SCHEMA>';
3) Create a new Stitch integration named Clone <Name of Old Integration>. For example, for Salesforce, the name of the integration would be Clone SalesForce Stitch. Authenticate the integration and set up the extraction to match the extraction settings of the current Stitch job, loading into the clone_<DATA_SCHEMA> schema.
4) Once the extraction is complete, perform some sanity check SQL queries to make sure the data looks as expected. Some Stitch data may arrive after the extraction job is complete, so make sure to leave some time to ensure all data from Stitch has arrived.
5) Swap the tables. The swap statements can be generated with (replace <DATA_SCHEMA> appropriately):
SELECT
'ALTER TABLE <DATA_SCHEMA>.' || LOWER(table_name) || ' SWAP WITH clone_<DATA_SCHEMA>.' || LOWER(table_name) || ';'
FROM "RAW"."INFORMATION_SCHEMA".tables
WHERE LOWER(table_schema) = '<DATA_SCHEMA>';
6) Drop the clone schema with drop schema clone_<DATA_SCHEMA> cascade;.
Use the dbt_full_refresh DAG to force dbt to rebuild the entire incremental model from scratch. Update DBT_MODEL_TO_FULL_REFRESH with the name of the model(s) to refresh, following dbt model selection syntax. For example, to refresh version models, the value would be sources.version staging.version. To refresh gitlab_dotcom models, the value would be sources.gitlab_dotcom staging.gitlab_dotcom.
The dbt command that is run behind the scenes is:
dbt run --profiles-dir profile --target prod --models DBT_MODEL_TO_FULL_REFRESH --full-refresh
This is a project for centralizing handy functions we use across the team. The project lives at https://gitlab.com/gitlab-data/gitlab-data-utils
* Bump the minor version, i.e. the 1 in 0.1.0. Currently, updating only the minor version is supported.
* From gitlab-data-utils, run make release; this will make sure you don't have any pending changes and will push a new tag to the GitLab repo. This in turn will run the pipeline to publish to pypi.
* Create the corresponding release through the GitLab API, for example:
curl --header 'Content-Type: application/json' --header "PRIVATE-TOKEN: <your-private-token>" \
--data '{ "name": "GitLab Data Utilities v0.0.1", "tag_name": "v0.0.1", "description": "Initial tagged release"}' \
--request POST https://gitlab.com/api/v4/projects/12846518/releases
* Bump the dbt-image version following the instructions in the Creating New Images section
* Update packages.yml
* Update index.html by doing the following: run make dbt-docs, then update the content from <body> to the bottom of the file and from </head> to the top of the file
* Update the data-tests project to update the image version and require-dbt-version. See this MR for an example