GitLab has a robust and vibrant Data Program which includes a Central Data Team and many Functional Analytics Teams. GitLab's total team member count is growing as well, and we need to uplevel our triaging process to keep up with that growth.
Steps to uplevel the triaging process:
Triage Group Name | Triage Slack Alias | Triage Group Members |
---|---|---|
Go To Market Analytics Triage | @GTMAnalyticsTriage | Sales Strategy & Analytics Team, Marketing Strategy and Performance Team, Business Insights and Analytics Team, GTM Data Fusion Team |
R&D Analytics Triage | @R&DAnalyticsTriage | Product Analysis Team, R&D Data Fusion Team |
People Analytics Triage | @PeopleAnalyticsTriage | People Group Analytics Team, G&A Data Fusion Team |
Engineering Analytics Triage | @EngineeringAnalyticsTriage | Engineering Analytics Team, G&A Data Fusion Team |
Data Platform Triage | @DataPlatformTriage | Data Platform Team |
Data Collaboration Triage | @DataCollaborationTriage | Data Collaboration Team |
Slack Channel Alias | Analytics Team | Triage Rotation (Sunday to Friday) |
---|---|---|
@GTMAnalyticsTriage | Sales Strategy and Analytics | @melia, @Noel Figuera |
@GTMAnalyticsTriage | Marketing Strategy and Performance | @David Egan, @rkohnke |
@GTMAnalyticsTriage | Business Insights and Analytics | @Vikas Agrawal, @Jay Zhang, @Kelly Chen, @charan karthikeyan, @Jay Stemmer |
@GTMAnalyticsTriage | GTM Data Fusion | @Peter Empey, @Michelle Cooper, @Jean Peguero |
@R&DAnalyticsTriage | Product Analysis | @Carolyn Braza, @Dave Peterson, @Emma Neuberger, @Nicole Galang, @Neil Raisinghani |
@R&DAnalyticsTriage | R&D Data Fusion | @Tim Poole, @Chris Sharp, @snalamaru, @Miles Russell |
@PeopleAnalyticsTriage | People Analytics | @Adrian Pérez |
@PeopleAnalyticsTriage | G&A Data Fusion | @Peter Empey |
@EngineeringAnalyticsTriage | Engineering Analytics | @lily |
@EngineeringAnalyticsTriage | Engineering Data Fusion | @Peter Empey |
@DataPlatformTriage | Data Platform | @vprakash, @Paul Armstrong, @Radovan Bacovic, @Rigerta Demiri, @jstark |
@DataCollaborationTriage | Data Collaboration | @Ken, @Ken, @Trang Nguyen, @Trang Nguyen |
A team member who is off, on vacation, or working on a high-priority project is responsible for finding coverage and communicating to the team who is taking over their coverage; this should be updated on the Data Program's Google Calendar. To avoid putting the Monday workload on the same shoulders every week, the Data Platform Team will rotate or exchange triage days from time to time, in good collaboration and on an ad-hoc basis.
Having dedicated triagers on the team helps address the bystander effect. The schedule communicates clear daily ownership, but it is not an on-call position. Through clear ownership, we create room for everyone else on the team to spend most of the day on deep work. The triager is encouraged to plan their day around the kind of work that can still be accomplished successfully with this additional demand on their time.
Data triagers are the first responders to requests and problems for the Data Program. This includes `dbt-test` and `dbt-run` errors; issues created from these errors should use the AE Triage Errors issue template.
Data Analysts:
- Respond to #data questions and assign to right Triage group
- Respond to new unassigned issues in our project and assign to the right team
Data Engineers:
- Respond to Data infrastructural incidents
- Respond to broken data pipelines
- Respond to data freshness issues
- Respond to broken DAGs (except for dbt issues, see next)
Analytics Engineers:
- Respond to errors in dbt models
- Respond to dbt test failures
For issue triaging, the triager should add the appropriate labels to the issue to put it into the respective project's workflow. It is not the responsibility of the triager to validate the issue or determine root causes or solutions. Rather, for the Data Team project for example, the triager should add the appropriate scoped team label (e.g. `Team:GTM`) and the scoped `Workflow::triage` label. From there, those issues are then put into the respective team's workflow.
Note for the Triager:
- The Central Data Team triager will create an issue in the Data Team project. Tasks and duties are stated in the issue template.
- Read the FAQ and common issues.
The Data Team follows the incident definition from Engineering: Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status.
Service degradation or outages in data can be seen as:
This means the following events (list not exhaustive) are likely to be incidents:
Depending on the nature and impact of the incident, a severity needs to be determined. Currently we don't have a decision matrix in place. To determine the severity, take the following aspects into consideration:
The incident is communicated in the `#data` Slack channel, followed by a short description, ETA and a link to the incident. The right GitLab team members are tagged.
Parts of triage are assisted by the GitLab Triage Bot, which is set up in the Analytics project. The bot runs every hour and takes actions based on a set of rules defined in a policies file. The GitLab Triage README contains all documentation for the formatting and definition of rules.
Changes to the triage bot policy file should be tested in the MR by running the `dry-run:triage` CI job and inspecting the log output. This CI job is a dry run, meaning it will not actually take any actions in the project, but it will print out what would happen if the policy were actually executed.
In order to get better and be more efficient in daily triage, we wrap up the work by the end of the day. The following information is provided by the Data Analyst and Data Engineer each day:
A triage roundup will take place at the end of every milestone, led by the data leadership team, to consolidate the milestone's triage efforts. Please bear in mind the purpose of the information provided: to make it useful and to improve triage.
One of the most important data sources, and one that changes regularly, is the GitLab.com database. In order not to break daily operations, changes to the database need to be tracked and checked. Any change to the GitLab.com database is made via the `db/structure.sql` file. The Danger Bot notifies the Data Team of changes to `db/structure.sql` by applying labels to the MR.
The label `Data Warehouse::Impact Check` is added by the Danger Bot as a call to action for the Data Team. For each MR labeled `Data Warehouse::Impact Check`, the following actions are performed by the Data Team Triager:
- The impact of the change in the MR will be judged.
- If the MR contains the label `group::product intelligence` along with `Data Warehouse::Impact Check`, there are a couple of checks that need to be done:
  - The Data Team should ensure the change will not break the Service Ping extraction process.
  - Get the SQL statement from the original MR (a typical example is gitlab-org/gitlab/merge_requests/75504) and execute it on Snowflake - usually, it is just a `SELECT` SQL statement (see the example below).
  - If the changes in the SQL file are not causing a break in the operation, the label will be changed to `Data Warehouse::Not Impacted`.
  - If the changes in the SQL file are causing a break in the operation:
    - The label will be changed to `Data Warehouse::Impacted`.
    - An issue is created in the GitLab Data Team project, assigned to the correct DRI and linked to the original MR.
- If the MR does not contain the label `group::product intelligence` and it concerns changes to the SQL structure:
  - If the MR will not cause a break in the operation, the label will be changed to `Data Warehouse::Not Impacted`.
  - If the MR will cause a break in the operation:
    - The label will be changed to `Data Warehouse::Impacted`.
    - An issue is created in the GitLab Data Team project, assigned to the correct DRI and linked to the original MR.
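As an illustration of the Snowflake check above, a minimal sketch is shown below. The schema, table, column and metric are assumptions standing in for whatever the MR under review actually touches; in practice you would run the (adapted) `SELECT` statement taken from the original MR.

```sql
-- Hypothetical impact check: run the metric's SELECT against the warehouse copy of the
-- GitLab.com database. Schema, table and column names below are placeholders.
SELECT COUNT(DISTINCT user_id) AS metric_value
FROM "RAW"."TAP_POSTGRES"."GITLAB_DB_ISSUES"
WHERE created_at >= DATEADD('day', -28, CURRENT_DATE());
```

If the statement errors on a renamed or removed column, the change is likely to be impacting.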
Determination matrix**:
Change | Call to action needed* |
---|---|
New table created | :x: |
Table deleted | :white_check_mark: |
Table renamed | :white_check_mark: |
Field added | :x: |
Field removed | :white_check_mark: |
Field name altered | :white_check_mark: |
Field datatype altered | :question: |
Constraints changed | :question: |
*We are not loading all the tables and columns by default. Thus, if new tables or columns are added, we will only load them if there is a specific business request. Any change to the current structure that causes a potential break of operation needs to be evaluated (a quick check is sketched below).
** The determination matrix is not exhaustive. Every MR should be checked carefully.
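When judging a change against the matrix, it can also help to confirm whether the affected table or column is loaded into the warehouse at all. A minimal sketch, assuming the GitLab.com copy lands in the `RAW.TAP_POSTGRES` schema (the table and column names are placeholders):

```sql
-- Check whether the column touched by the MR currently exists in the raw layer.
SELECT column_name, data_type
FROM "RAW".INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'TAP_POSTGRES'
  AND table_name   = 'GITLAB_DB_ISSUES'
  AND column_name  = 'SOME_RENAMED_COLUMN';
```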
In this section we list common issues and their resolutions.
When the cloned GitLab Postgres database is not accessible, the Airflow task log shows the error below.
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: the database system is starting up\n
b'FATAL: the database system is starting up\n'
Follow the steps mentioned below.
- Pause the DAGs `gitlab_com_data_reconciliation_extract_load`, `gitlab_com_db_extract`, `gitlab_com_db_incremental_backfill` and `gitlab_com_scd_db_sync`. The reason behind this is to keep the alerting down and not use unwanted resources.
- An alert similar to the following may be raised for the failing clone job:

Firing 1 - GitLab Job has failed
The GitLab job "clone" resource "zlonk.datalytics.dailyx" has failed.
:chart: View Prometheus graph
:label: Labels:
Alertname: JobFailed
Alert_type: symptom
Env: gprd
Environment: gprd
Fqdn: blackbox-01-inf-gprd.c.gitlab-production.internal
Job: clone
Monitor: default
Provider: gcp
Region: us-east
Resource: zlonk.datalytics.dailyx
Severity: s3
Shard: default
Stage: main
Tier: db
Type: zlonk.postgres
- Reach out to the `@sre-oncall` Slack handle to look into the issue; also raise an incident request using `incident declare`. This will create a production incident issue for the SRE on-call team to act upon. Also cc `@gitlab-data/engineers` for broader visibility of the incident.
- Once confirmation is received from the `@sre-oncall` person or someone from the DBRE team, try re-running one of the failed tasks by clearing that one alone to validate the stability of the connection.
- For the DAGs `gitlab_com_scd_db_sync`, `gitlab_com_data_reconciliation_extract_load` and `gitlab_com_db_incremental_backfill`, clear the failed tasks so that they get picked up for a run, as these tasks run only once in a 24-hour window. In case we have missed the whole schedule, we re-trigger the DAG itself.

When Service Ping fails while it generates metrics, we should be informed either via the Trusted Data dashboard or the Airflow log - generally, the error log is stored in the `RAW.SAAS_USAGE_PING.INSTANCE_SQL_ERRORS` table. Follow the instructions from the link error-handling-for-sql-based-service-ping in order to fix the issue.
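To get a first look at the recorded errors, a simple query against that table is usually enough; a minimal sketch (no assumptions about specific columns):

```sql
-- Inspect the most recently recorded Service Ping SQL errors.
SELECT *
FROM "RAW"."SAAS_USAGE_PING"."INSTANCE_SQL_ERRORS"
LIMIT 100;
```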
It can happen that a table in the Stitch Zuora data pipeline needs to be reset in order to backfill it completely (e.g. new columns added in the source, a technical error, etc.). Currently, the Zuora Stitch integration does not provide a table-level reset, and thus we would have to perform a reset of all the tables in the integration. This would result in extra costs and risks.
The steps below can be followed instead; we have successfully used them to perform a table-level reset.
In this example we use the Zuora `subscription` table, but this could be applied to any other table in the Stitch Zuora data pipeline.
ALTER TABLE "RAW"."ZUORA_STITCH"."SUBSCRIPTION" RENAME TO "RAW"."ZUORA_STITCH"."SUBSCRIPTION_20210903";
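Optionally, verify that the rename worked and that the original table name is now free; a small sketch:

```sql
-- List tables in the schema to confirm only the renamed copy is present.
SHOW TABLES LIKE 'SUBSCRIPTION%' IN SCHEMA "RAW"."ZUORA_STITCH";
```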
Set up a new Zuora integration in Stitch. While setting it up, set the extraction frequency to 30 minutes and the extraction start date to 1 Jan 2012 to ensure all data gets pulled through.
Run the newly created integration manually and wait for it to complete; it should show as successful on the Stitch home page. Once done, pause the new integration, because we don't want any misaligned data while we follow the next steps.
For the newly created table `"RAW"."ZUORASUBSCRIPTION"."SUBSCRIPTION"`, cross-check that the number of rows shown as loaded in the Stitch integration UI matches the number of rows loaded in the table.
Move the newly loaded data to the `ZUORA_STITCH` schema, because the new integration creates the table in the `ZUORASUBSCRIPTION` schema, as stated above.
CREATE TABLE "RAW"."ZUORA_STITCH"."SUBSCRIPTION" CLONE "RAW"."ZUORASUBSCRIPTION"."SUBSCRIPTION";
**Note:** Check whether the primary key is present in the table after the clone; if not, look up the primary key in the [link](https://www.stitchdata.com/docs/integrations/saas/zuora#subscription) and add the constraints on those columns (see the sketch below).
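If the primary key did not carry over, it can be re-added with a constraint statement. A minimal sketch, assuming the key column is `ID` (verify against the linked Stitch documentation):

```sql
-- Re-add the primary key constraint after the clone; the key column ID is an assumption.
ALTER TABLE "RAW"."ZUORA_STITCH"."SUBSCRIPTION" ADD PRIMARY KEY (ID);
```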
select count(*) from "RAW"."ZUORA_STITCH"."SUBSCRIPTION_20210903" where deleted = 'FALSE';
select count(*) from "RAW"."ZUORA_STITCH"."SUBSCRIPTION" ;
DROP SCHEMA "RAW"."ZUORASUBSCRIPTION" CASCADE ;
This is to ensure that the error previously observed on the table is gone and that data is being populated in the table. Check for duplicate ids caused by the two different extractors, to ensure the data is being populated in the table correctly:
select id, count(*) from "RAW"."ZUORA_STITCH"."SUBSCRIPTION"
group by id
having count(*) > 1;

**Note:** Refer to the [MR (internal link)](https://gitlab.com/gitlab-data/analytics/-/issues/10065#note_668365681) for more information.
See the source contact spreadsheet for who to contact if there are external errors related to a source.
DAG / Task | Background | More information | Possible steps, resolution and actions | Note |
---|---|---|---|---|
DAG `gitlab_com_db_extract`, Task `gitlab-com-dbt-incremental-source-freshness` | This extract relies on a copy (replication) database of the GitLab.com environment. It is highly likely that high replication lag is the root cause. | More information on the setup here (internal link). | - Check for replication lag - Pause the DAG if needed - Check for data gaps - Perform backfilling - Reschedule the DAG | The GitLab.com data source is a very important and commonly used data source. Please inform and update business stakeholders accordingly. |
DAG `sheetload`, Task `dbt-sheetload` | This is an issue with Google Sheets when data is being imported from a second sheet using Google Sheets' import function. Occasionally the connection between the sheets stops working and the sheet needs to be refreshed. | More information on the setup here. | - In general you should just need to open the Google Sheet which is failing and confirm the data has been re-populated. - If you do not have access to the sheet, contact `@gitlab-data/engineers` and confirm whether anyone else does. | |
When you get an error for the model `version_usage_data_unpacked` and the error looks like:
[2022-01-26 11:56:32,233] INFO - b'\x1b[33mDatabase Error in model version_usage_data_unpacked (models/legacy/version/xf/version_usage_data_unpacked.sql)\x1b[0m\n'
[2022-01-26 11:56:32,233] INFO - b' 000904 (42000): SQL compilation error: error line 241 at position 12\n'
[2022-01-26 11:56:32,233] INFO - b" invalid identifier '{metrics_name}'\n"
[2022-01-26 11:56:32,233] INFO - b' compiled SQL at target/compiled/gitlab_snowflake/models/legacy/version/xf/version_usage_data_unpacked.sql\n'
[2022-01-26 11:56:32,234] INFO - b'\n'
The root cause of this issue is that new metrics have been introduced in an upstream model, and this model (along with the model `version_usage_data_unpacked_intermediate`) tries to pivot values to columns. Without a full refresh, this will not happen in the regular pipeline.
A full refresh is required, as per the instructions from dbt models full refresh.
An example of this failure is issue #11524 (internal link).
Sometimes the Zuora Revenue source system, as part of a certain release, modifies the source table definition by adding or removing columns (once or twice a year). Whenever this type of change happens, it triggers a failure in the loading task of the DAG `zuora_revenue_load_snow`.
For example, 3 additional columns were added to the table `BI3_RC_POB`, which led to the error message below.
[2022-03-21, 13:26:48 UTC] INFO - sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 100080 (22000): 01a31326-0403-d02c-0000-289d37f10d4e: Number of columns in file (102) does not match that of the corresponding table (99), use file format option error_on_column_count_mismatch=false to ignore this error
[2022-03-21, 13:26:48 UTC] INFO - File 'RAW_DB/staging/BI3_RC_POB/BI3_RC_POB_12.csv', line 3, character 1
[2022-03-21, 13:26:48 UTC] INFO - Row 1 starts at line 2, column "BI3_RC_POB"[102]
[2022-03-21, 13:26:48 UTC] INFO - If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
When this kind of failure happens, we need to alter (or create or replace) the target table and add the additional columns to the table definition in the `RAW.ZUORA_REVENUE` schema. It is good practice to create or replace, because the addition of a new column requires a full refresh of the data anyway.
Below is a set of steps that will guide you through resolving this.
Step 1:- Download the first file from storage to local to view the additional columns.
gsutil cp gs://zuora_revpro_gitlab/RAW_DB/staging/BI3_RC_POB/BI3_RC_POB_1.csv .
For any other file, only the file name and folder name need to be modified. For example, for the table `BI3_RC_BILL`, the above command changes to `gsutil cp gs://zuora_revpro_gitlab/RAW_DB/staging/BI3_RC_BILL/BI3_RC_BILL_1.csv .`
Step 2:- Once the file is downloaded, look at the header of the file using the command below:
head -1 BI3_RC_POB_1.csv
Step 3:- Pull out the current table definition from Snowflake using the query below:
USE ROLE LOADER;
USE DATABASE RAW;
USE SCHEMA ZUORA_REVENUE;
select get_ddl('table','BI3_RC_POB');
Prepare the `CREATE OR REPLACE TABLE` query by comparing the file header with the table definition, and add the missing columns to the new table definition. The missing columns should be at the end and their data type should be varchar. Deploy the modified SQL (see the sketch below).
Note:- If the table doesn't exist, create the table with all the columns named in the header, with datatype varchar.
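A minimal sketch of what the deployed statement could look like. The existing columns come from the `get_ddl()` output above, and the trailing column names are placeholders for whatever the new file header actually contains:

```sql
-- Recreate the target table with the extra columns appended at the end as varchar.
-- Replace the placeholder names with the real columns from get_ddl() and the file header.
CREATE OR REPLACE TABLE "RAW"."ZUORA_REVENUE"."BI3_RC_POB" (
    EXISTING_COLUMN_1 VARCHAR,
    -- ... remaining existing columns from get_ddl('table','BI3_RC_POB') ...
    EXISTING_COLUMN_N VARCHAR,
    NEW_COLUMN_1      VARCHAR,
    NEW_COLUMN_2      VARCHAR,
    NEW_COLUMN_3      VARCHAR
);
```

Remember that `CREATE OR REPLACE` drops the existing data, so a full refresh of the table is needed afterwards, as noted above.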
Step 4:- Copy the log file from the processed folder back to the staging folder of the table.
gsutil cp gs://zuora_revpro_gitlab/RAW_DB/processed/22-03-2022/BI3_RC_POB/BI3_RC_POB_22-03-2022.log gs://zuora_revpro_gitlab/RAW_DB/staging/BI3_RC_POB/
The points to consider in this command are the date in the path and the log file name. If the failure happened on `23-03-2022`, the path becomes `gs://zuora_revpro_gitlab/RAW_DB/processed/23-03-2022/BI3_RC_POB/BI3_RC_POB_23-03-2022.log`.
Validate in GCS storage that the log file is present for the respective table.
Step 5:- Re-run the task from Airflow by clearing the task.
Is Data Triage 24/7 support, or a shift where we need to provide support for 24 hours?
We work our normal working hours and perform the list of tasks mentioned for the triage day in the Triage Template (internal link).
If an issue is found, do we directly jump to fix it in production, or take it as part of the incident and solve it within the defined time?
On the Triage day the data team member present will look for all the failures, questions or errors in:
This includes all the failures since the last person signed off; the triager will create an issue for all failures from then until they sign off. If any data pipeline has broken and a delay in getting data loaded or refreshed is expected, the concerned team has to be notified using the [Triage Template (internal link)](https://gitlab.com/gitlab-data/analytics/-/issues/new?
Is there an ETA for the different kinds of issues?
If the pipeline is broken it needs to be fixed; currently we are working on defining SLOs for our data assets. For our data extraction pipelines, there is a comprehensive overview here.
If I work my normal hours on triage day (i.e. until 11 AM US time), what happens when the pipeline breaks after my normal hours and there is a delay in data availability?
The benefit of our global presence is that we have wide coverage of hours. If the person on Triage is ahead of US time zones, we have the advantage of solving issues in a timely manner. The downside is that we do not have full coverage for US time zones that day. This is an attention point for the future.
To filter out log lines that contain specific terms, a negative lookahead regex can be used:
`^(?!.*(<First term to find>|<Second term to find>)).*$`
e.g. for cleaning up Airflow logs:
`^(?!.*(Failure in test|Database error)).*$`