This guide establishes our standards for SQL and are enforced by the SQLFluff linter and by code review. The target code changes that this stile guide apply to are those made using dbt.
If you are not on the Data Team or you are developing SQL outside of dbt, keep in mind that the linting tools may be more difficult to apply, but you are welcome to follow the guidance given within this guide.
We expect that people will use the style presented in this guide during there development. Enforcement is still expected to happen at the time of review as the CI pipeline is only run manually. At a latter date the pipeline will run with every change and eventually set to be mandatory for the pipeline to pass.
The previous style guide used a mixed indentation style that the linter does not support so every model will likely fail the first time it is checked. The intention is that models will be updated to the new style as they are worked on normally as not all of the rules can automatically enforced (such as explicit column aliasing). The intention of working updating them in the course of other development is to keep the changes manageable.
SQLFLuff is a SQL linter that works with templating tools like dbt. We use it to define the basic structure and style of the SQL that we write and move the review of that structure and style into the hands of the authors. SQLFluff is included within the dbt development environment and it uses the dbt templating engine during the linting process. It can be used with the following command:
$ sqlfluff lint models/path/to/file/file-to-lint.sql
A dbt command can also be used to get a list of files to lint:
$ sqlfluff lint $(dbt list --model model_name --output path)
If you are writing SQL that is not templated using dbt then you can install and use SQLFluff directly as it is a stand alone python package.
$ pip install sqlfluff $ sqlfluff lint path/to/file/file-to-lint.sql
SQLFluff includes a
fix command that will apply fixes to rule violations when possible. Not all rule violations are automatically fixable; therefore, you are encouraged to run the
lint command after using the
fix command to ensure that all rule violations have been resolved.
The configuration file that the Data Team uses can be found in the GitLab Data Team repository.
Do not optimize for fewer lines of code, new lines are cheap but brain time is expensive.
Familiarize yourself with the DRY Principal. Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places.
Be consistent. Even if you are not sure of the best way to do something do it the same way throughout your code, it will be easier to read and make changes if they are needed.
Be explicit. Defining something explicitly will ensure that it works the way you expect and it is easier for the next person, which may be you, when you are explicit in SQL.
No tabs should be used - only spaces. Your editor should be setup to convert tabs to spaces - see our onboarding template for more details.
Wrap long lines of code, between 80 and 100, to a new line.
Do not use the
USING command in joins because it produces inaccurate results in Snowflake. Create an account to view the forum discussion on this topic.
AS operator when aliasing a column or table.
DATEDIFF to inline additions
date_column + interval_column. The function is more explicit and will work for a wider variety of date parts.
<>. This is because
!= is more common in other programming languages and reads like "not equal" which is how we're more likely to speak.
LOWER(column) LIKE '%match%' to
column ILIKE '%Match%'. This lowers the chance of stray capital letters leading to an unexpected result.
HAVINGwhen either would suffice.
TODOcomments, create new issues for improvement
An ambiguous field name such as
type should always be prefixed by what it is identifying or naming:
-- Preferred SELECT id AS account_id, name AS account_name, type AS account_type, ... -- vs -- Not Preferred SELECT id, name, type, ...
All field names should be snake-cased:
-- Preferred SELECT dvcecreatedtstamp AS device_created_timestamp ... -- vs -- Not Preferred SELECT dvcecreatedtstamp AS DeviceCreatedTimestamp ...
Boolean field names should start with
-- Preferred SELECT deleted AS is_deleted, sla AS has_sla ... -- vs -- Not Preferred SELECT deleted, sla, ...
_atand should always be in UTC.
When truncating dates name the column in accordance with the truncation.
SELECT original_at, -- 2020-01-15 12:15:00.00 original_date, -- 2020-01-15 DATE_TRUNC('month',original_date) AS original_month -- 2020-01-01 ...
monthas a column name.
always qualify each column in the SELECT statement with the table name / alias for easy navigation
-- Preferred SELECT budget_forecast.account_id, date_details.fiscal_year, date_details.fiscal_quarter, date_details.fiscal_quarter_name, cost_category.cost_category_level_1, cost_category.cost_category_level_2 FROM budget_forecast_cogs_opex AS budget_forecast LEFT JOIN date_details ON date_details.first_day_of_month = budget_forecast.accounting_period LEFT JOIN cost_category ON budget_forecast.unique_account_name = cost_category.unique_account_name -- vs -- Not Preferred SELECT a.account_id, b.fiscal_year, b.fiscal_quarter, b.fiscal_quarter_name, c.cost_category_level_1, c.cost_category_level_2 FROM budget_forecast_cogs_opex a LEFT JOIN date_details b ON b.first_day_of_month = a.accounting_period LEFT JOIN cost_category c ON b.unique_account_name = c.unique_account_name
Only use double quotes when necessary, such as columns that contain special characters or are case sensitive.
-- Preferred SELECT "First_Name_&_" AS first_name, ... -- vs -- Not Preferred SELECT FIRST_NAME AS first_name, ...
Prefer accessing JSON using the bracket syntax.
-- Preferred SELECT data_by_row['id']::bigint as id_value ... -- vs -- Not Preferred SELECT data_by_row:"id"::bigint as id_value ...
Prefer explicit join statements.
-- Preferred SELECT * FROM first_table INNER JOIN second_table ... -- vs -- Not Preferred SELECT * FROM first_table, second_table ...
Prefer CTEs over sub-queries as CTEs make SQL more readable and are more performant:
-- Preferred WITH important_list AS ( SELECT DISTINCT specific_column FROM other_table WHERE specific_column != 'foo' ) SELECT primary_table.column_1, primary_table.column_2 FROM primary_table INNER JOIN important_list ON primary_table.column_3 = important_list.specific_column -- vs -- Not Preferred SELECT primary_table.column_1, primary_table.column_2 FROM primary_table WHERE primary_table.column_3 IN ( SELECT DISTINCT specific_column FROM other_table WHERE specific_column != 'foo')
replace_sfdc_account_id_with_master_record_idand prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joins.
The exception to this is for timestamps. Prefer
TIME. Note that the default for
TIMESTAMP_NTZ which does not include a time zone.
IFF to a single line
-- Preferred SELECT IFF(column_1 = 'foo', column_2,column_3) AS logic_switch, ... -- vs -- Not Preferred SELECT CASE WHEN column_1 = 'foo' THEN column_2 ELSE column_3 END AS logic_switch, ...
IFF to selecting a boolean statement:
-- Preferred SELECT IFF(amount < 10,TRUE,FALSE) AS is_less_than_ten, ... -- vs -- Not Preferred SELECT (amount < 10) AS is_less_than_ten, ...
Prefer simplifying repetitive
CASE statements where possible:
-- Preferred SELECT CASE field_id WHEN 1 THEN 'date' WHEN 2 THEN 'integer' WHEN 3 THEN 'currency' WHEN 4 THEN 'boolean' WHEN 5 THEN 'variant' WHEN 6 THEN 'text' END AS field_type, ... -- vs -- Not Preferred SELECT CASE WHEN field_id = 1 THEN 'date' WHEN field_id = 2 THEN 'integer' WHEN field_id = 3 THEN 'currency' WHEN field_id = 4 THEN 'boolean' WHEN field_id = 5 THEN 'variant' WHEN field_id = 6 THEN 'text' END AS field_type, ...
Prefer the explicit date function over
date_part, but prefer
DAYOFWEEK(created_at) > DATE_PART(dayofweek, 'created_at') > EXTRACT(dow FROM created_at)
DATEDIFFfunction as the function will only return whole interval results.
This example code has been processed though SQLFluff linter and had the style guide applied.
WITH my_data AS ( SELECT my_data.* FROM prod.my_data_with_a_long_table_name AS my_data INNER JOIN prod.other_thing WHERE my_data.filter = 'my_filter' ), some_cte AS ( SELECT DISTINCT id AS other_id, other_field_1, other_field_2, date_field_at, data_by_row, field_4, field_5, LAG( other_field_2 ) OVER (PARTITION BY other_id, other_field_1 ORDER BY 5) AS previous_other_field_2 FROM prod.my_other_data ), /* This is a very long comment: It is good practice to leave comments in code to explain complex logic in CTEs or business logic which may not be intuitive to someone who does not have intimate knowledge of the data source. This can help new users familiarize themselves with the code quickly. */ final AS ( SELECT -- This is a singel line comment my_data.field_1 AS detailed_field_1, my_data.field_2 AS detailed_field_2, my_data.detailed_field_3, DATE_TRUNC('month', some_cte.date_field_at) AS date_field_month, some_cte.data_by_row['id']::NUMBER AS id_field, IFF(my_data.detailed_field_3 > my_data.field_2, TRUE, FALSE) AS is_boolean, CASE WHEN my_data.cancellation_date IS NULL AND my_data.expiration_date IS NOT NULL THEN my_data.expiration_date WHEN my_data.cancellation_date IS NULL THEN my_data.start_date + 7 -- There is a reason for this number ELSE my_data.cancellation_date END AS adjusted_cancellation_date, COUNT(*) AS number_of_records, SUM(some_cte.field_4) AS field_4_sum, MAX(some_cte.field_5) AS field_5_max FROM my_data LEFT JOIN some_cte ON my_data.id = some_cte.id WHERE my_data.field_1 = 'abc' AND (my_data.field_2 = 'def' OR my_data.field_2 = 'ghi') GROUP BY 1, 2, 3, 4, 5, 6 HAVING COUNT(*) > 1 ORDER BY 8 DESC ) SELECT * FROM final