The Data team has implemented the following triage schedule to take advantage of native timezones:
| UTC Day | Data Analyst | Data Engineer |
| --- | --- | --- |
A team member who is off, on vacation, or working on a high-priority project is responsible for finding coverage and communicating to the team who is taking over; this should be updated on the Data Team's Google Calendar. To avoid putting the Monday workload on the same shoulders every week, the Data Engineers will rotate triage days in good collaboration on an ad-hoc basis.
Having dedicated triagers on the team helps address the bystander effect. The schedule establishes clear daily ownership, but is not an on-call position. Through clear ownership, we create room for everyone else on the team to spend most of the day on deep work. The triager is encouraged to plan their day for the kind of work that can be accomplished successfully alongside this additional demand on their time.
Data triagers are the first responders to requests and problems for the Data team.
Many issues that come into the data team project from other GitLab team members need additional info and/or context in order to be understood, estimated, and prioritized. It is the triager's priority to ask those questions and to surface issues sooner, rather than later.
Note: The Data Analyst triager should:

- Create an issue in the Data Team project; tasks and duties are stated in the issue template.
- Read the FAQ and common issues.
Parts of triage are assisted by the GitLab Triage Bot, which is set up in the Analytics project. The bot runs every hour and takes actions based on a set of rules defined in a policies file. The GitLab Triage README contains all documentation for the formatting and definition of rules.

Changes to the triage bot policy file should be tested in the MR by running the `dry-run:triage` CI job and inspecting the log output. This CI job is a dry run, meaning it will not actually take any actions in the project, but will print out what would happen if the policy were actually executed.
In order to get better and more efficient at daily triage, we wrap up the work at the end of the day. The following information is provided by the Data Analyst and the Data Engineer each day:
A triage roundup takes place at the end of every milestone, run by the data leadership team to consolidate the milestone's triage efforts. Please bear in mind the purpose of the information provided, so that it is useful and improves triage.
One of the most important data sources, and one that changes regularly, is the GitLab.com database. To avoid breaking daily operations, changes to the database need to be tracked and checked. Any change to the GitLab.com database is made via the db/structure.sql file. When db/structure.sql is changed, the Danger Bot notifies the Data Team by applying labels to the MR.
`Data Warehouse::Impact Check` is added by the Danger Bot as a call to action for the Data Team. The following actions are performed by the Data Team Triager on MRs labeled `Data Warehouse::Impact Check`:

- If the change does not impact the Data Warehouse, relabel the MR `Data Warehouse::Not Impacted`.
- If the change does have an impact, create an issue in the GitLab Data Team project, assigned to the correct DRI and linked to the original MR.
Determination matrix:**

| Change | Call to action needed* |
| --- | --- |
| New table created | :x: |
| Field name altered | :white_check_mark: |
| Field datatype altered | :question: |

\* We are not loading all tables and columns by default. If new tables or columns are added, we will only load them if there is a specific business request. Any change to the current structure that could potentially break operations needs to be assessed.

\** The determination matrix is not exhaustive. Every MR should be checked carefully.
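The matrix can be read as a simple lookup. The sketch below is illustrative only: the change keys and the helper function are hypothetical, not part of any Data Team tooling, and the values mirror the matrix above.

```python
# Illustrative lookup mirroring the determination matrix above.
# The keys and the helper function are hypothetical examples.
CALL_TO_ACTION = {
    "new table created": False,       # :x: - new tables are only loaded on business request
    "field name altered": True,       # :white_check_mark: - can break existing loads
    "field datatype altered": None,   # :question: - needs manual assessment
}

def needs_call_to_action(change: str):
    """Return True/False, or None when the MR needs manual assessment."""
    return CALL_TO_ACTION.get(change)
```

As the footnotes state, the matrix is not exhaustive, so an unknown change type returns `None` here, i.e. "check the MR carefully".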
In a scenario where the cloned GitLab.com Postgres database is not accessible, the Airflow task log shows the error below:

```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: the database system is starting up\n b'FATAL: the database system is starting up\n'
```
Follow the steps mentioned below.
Pause the DAG `gitlab_com_scd_db_sync`. The reason is to keep the alerting down and not use unwanted resources.
```
Firing 1 - GitLab Job has failed
The GitLab job "clone" resource "zlonk.datalytics.dailyx" has failed.
:chart: View Prometheus graph
:label: Labels:
  Alertname: JobFailed
  Alert_type: symptom
  Env: gprd
  Environment: gprd
  Fqdn: blackbox-01-inf-gprd.c.gitlab-production.internal
  Job: clone
  Monitor: default
  Provider: gcp
  Region: us-east
  Resource: zlonk.datalytics.dailyx
  Severity: s3
  Shard: default
  Stage: main
  Tier: db
  Type: zlonk.postgres
```
Ping the `@sre-oncall` Slack handle to look into the issue, and also raise an incident request using `incident declare`. This will create a production incident issue for the SRE on-call team to act upon.

cc `@gitlab-data/engineers` for broader visibility of the incident.

Once the `@sre-oncall` person or someone from the DBRE team confirms the database is available again, try re-running one of the failed tasks by clearing it alone to validate the stability of the connection.

For `gitlab_com_db_incremental_backfill`, clear the failed task so that it gets picked up for a run, as these tasks run only once in a 24-hour window. In case we have missed the whole schedule, we re-trigger the DAG itself.
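Re-running a failed task once the database is reachable again is essentially a retry with backoff. As a generic illustration only (Airflow handles retries itself via task-level `retries` settings; this helper is not part of our DAGs):

```python
import time

def run_with_backoff(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `task` until it succeeds, waiting exponentially longer between attempts.

    Re-raises the last error when all attempts fail. In the clone-database
    scenario, the error being caught would be psycopg2.OperationalError.
    """
    for attempt in range(retries):
        try:
            return task()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

This mirrors the manual process above: a task cleared while the database is still starting up simply fails again, so wait and retry rather than hammering the connection.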
When Service Ping fails while generating metrics, we are informed via either:

- the Trusted Data dashboard, or
- the Airflow log; generally, the error is stored in the `RAW.SAAS_USAGE_PING.INSTANCE_SQL_ERRORS` table.

Follow the instructions from the link error-handling-for-sql-based-service-ping in order to fix the issue.
It can happen that a table in Stitch for the Zuora data pipeline needs to be reset in order to backfill it completely (e.g. new columns added in the source, a technical error, etc.). Currently, the Zuora Stitch integration does not provide a table-level reset, so we would have to reset all the tables in the integration, which results in extra costs and risks.

Instead, the steps below can be followed to perform a table-level reset; we have done this successfully. In this example we use the Zuora `subscription` table, but the same approach can be applied to any other table in the Stitch Zuora data pipeline.
```sql
ALTER TABLE "RAW"."ZUORA_STITCH"."SUBSCRIPTION" RENAME TO "RAW"."ZUORA_STITCH"."SUBSCRIPTION_20210903";
```
While setting up the new integration, set the extraction frequency to 30 minutes and the extraction start date to 1 January 2012 to ensure all data gets pulled through.
Run the newly created integration manually and wait for it to complete and show as successful on the home page. Once done, pause the new integration, because we don't want any misaligned data while we follow the next steps.
In the newly created table `"RAW"."ZUORASUBSCRIPTION"."SUBSCRIPTION"`, cross-check that the number of rows shown as loaded in the integration UI in Stitch matches the number of rows loaded in the table.
Move the newly loaded data to the `ZUORA_STITCH` schema, because the new integration creates the table in the `ZUORASUBSCRIPTION` schema as stated above.
```sql
CREATE TABLE "RAW"."ZUORA_STITCH"."SUBSCRIPTION" CLONE "RAW"."ZUORASUBSCRIPTION"."SUBSCRIPTION";
```

**Note:** Check whether the primary key is present in the table after the clone. If not, look up the primary key in the [Stitch Zuora documentation](https://www.stitchdata.com/docs/integrations/saas/zuora#subscription) and add constraints on those columns.
```sql
select count(*) from "RAW"."ZUORA_STITCH"."SUBSCRIPTION_20210903" where deleted = 'FALSE';
select count(*) from "RAW"."ZUORA_STITCH"."SUBSCRIPTION";
```
```sql
DROP SCHEMA "RAW"."ZUORASUBSCRIPTION" CASCADE;
```
This is to ensure that the error previously observed on the table is gone and data is being populated in the table again. Check for duplicate ids caused by the two different extractors, to ensure the data is being populated correctly.
```sql
select id, count(*)
from "RAW"."ZUORA_STITCH"."SUBSCRIPTION"
group by id
having count(*) > 1;
```

**Note:** Refer to the [issue](https://gitlab.com/gitlab-data/analytics/-/issues/10065#note_668365681) for more information.
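The same duplicate check can also be done client-side once the rows are fetched. A small Python sketch (the helper is hypothetical and assumes rows are dicts with an `id` key):

```python
from collections import Counter

def duplicate_ids(rows):
    """Return the ids appearing more than once,
    equivalent to the GROUP BY ... HAVING count(*) > 1 query."""
    counts = Counter(row["id"] for row in rows)
    return sorted(id_ for id_, n in counts.items() if n > 1)
```

Any id returned here indicates the two extractors loaded overlapping data and the backfill needs another look.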
Is Data Triage 24/7 support, or a shift where we need to provide support for 24 hours?
No. We work our normal working hours and perform the list of tasks mentioned for the triage day in the Triage Template.
If any issue is found, do we directly jump to fixing it in production, or take it as part of the incident and solve it within the defined time?
On the triage day, the data team member on duty will look for all failures, questions, or errors in:

This includes all failures since the last person signed off, and the triager will create an issue for each failure between that sign-off and their own. If a data pipeline has broken and a delay in getting data loaded or refreshed is expected, the concerned team has to be notified using the [Triage Template](https://gitlab.com/gitlab-data/analytics/-/issues/new?
Is there an ETA for the different kinds of issues?
If the pipeline is broken, it needs to be fixed. Currently we are working on defining SLOs for our data assets; for our data extraction pipelines, there is a comprehensive overview here.
If I work my normal hours on the triage day (i.e. until 11 AM US time), what happens when the pipeline breaks after my normal hours and there is a delay in data availability?
The benefit of our distributed presence is that we have a wide coverage of hours. If the person on triage is ahead of US timezones, we have the advantage of solving issues in a timely manner. The downside is that we do not have full coverage of US timezones that day. This is a point of attention for the future.
In this section, we list common issues and their resolutions.
**Airflow Task failure**

Background: This extract relies on a copy (replication) database of the GitLab.com environment. It is highly likely that the root cause is a high replication lag.

More information on the setup here.

Possible steps, resolutions, and actions:

- Check for replication lag
- Pause the DAG if needed
- Check for data gaps
- Perform backfilling
- Reschedule the DAG

Note: The GitLab.com data source is a very important and commonly used data source. Please inform and update business stakeholders accordingly.
**Sheetload - Column '#REF!' is not recognised**

Background: This is an issue with Google Sheets when data is being imported from a second sheet using Google Sheets' import function. Occasionally the connection between the sheets stops working and the sheet needs to be refreshed.

More information on the setup here.

Possible steps, resolutions, and actions:

- In general, you should just need to open the failing Google sheet and confirm the data has been re-populated.
- If you do not have access to the sheet, contact `@gitlab-data/engineers` and confirm whether anyone else does.
To filter out lines that contain known terms, a negative-lookahead regex can be used:

```
^(?!.*(<First term to find>|<Second term to find>)).*$
```

e.g. for cleaning up Airflow logs:

```
^(?!.*(Failure in test|Database error)).*$
```
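The pattern matches only lines that contain neither term. A quick Python check (the sample log lines are invented for illustration):

```python
import re

# Keep only lines that contain neither "Failure in test" nor "Database error".
pattern = re.compile(r"^(?!.*(Failure in test|Database error)).*$")

log_lines = [
    "INFO task started",
    "ERROR Failure in test dim_users",
    "ERROR Database error: connection reset",
    "INFO task finished",
]

kept = [line for line in log_lines if pattern.match(line)]
```

The negative lookahead `(?!...)` fails the match as soon as either term appears anywhere in the line, so the two ERROR lines are dropped and only the INFO lines remain.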