This guide establishes our standards for SQL, which are enforced by the SQLFluff linter and by code review. This style guide applies primarily to code changes made using dbt.
If you are not on the Data Team or are developing SQL outside of dbt, keep in mind that the linting tools may be more difficult to apply, but you are still welcome to follow the guidance given in this guide.
We expect people to use the style presented in this guide during their development. Enforcement still happens at the time of review, as the CI pipeline is currently only run manually. At a later date the pipeline will run with every change, and eventually it will be mandatory for the pipeline to pass.
The previous style guide used a mixed indentation style that the linter does not support, so every model will likely fail the first time it is checked. The intention is that models will be updated to the new style as they are worked on in the normal course of development, since not all of the rules can be enforced automatically (such as explicit column aliasing). Updating models during other development keeps the changes manageable.
SQLFluff is a SQL linter that works with templating tools like dbt. We use it to define the basic structure and style of the SQL that we write and to move the review of that structure and style into the hands of the authors. SQLFluff is included within the dbt development environment, and it uses the dbt templating engine during the linting process. It can be used with the following command:
$ sqlfluff lint models/path/to/file/file-to-lint.sql
A dbt command can also be used to get a list of files to lint:
$ sqlfluff lint $(dbt list --model model_name --output path)
If you are writing SQL that is not templated using dbt, you can install and use SQLFluff directly, as it is a standalone Python package.
$ pip install sqlfluff
$ sqlfluff lint path/to/file/file-to-lint.sql
SQLFluff includes a fix command that will apply fixes to rule violations when possible. Not all rule violations are automatically fixable; therefore, you are encouraged to run the lint command after using the fix command to ensure that all rule violations have been resolved.
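For example, a typical workflow is to fix a model and then re-lint it to catch anything that could not be fixed automatically:

$ sqlfluff fix models/path/to/file/file-to-lint.sql
$ sqlfluff lint models/path/to/file/file-to-lint.sql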
The configuration file that the Data Team uses can be found in the GitLab Data Team repository.
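For illustration only, a minimal .sqlfluff configuration for a dbt project on Snowflake might look like the sketch below; the repository's actual configuration file is authoritative.

[sqlfluff]
templater = dbt
dialect = snowflake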
Do not optimize for fewer lines of code; new lines are cheap, but brain time is expensive.
Familiarize yourself with the DRY principle (Don't Repeat Yourself). Leverage CTEs, jinja and macros in dbt, and snippets in Sisense. If you type the same line twice, it needs to be maintained in two places.
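As a sketch of the DRY principle in dbt, repeated logic can be pulled into a macro; the macro name, cast, and model reference here are hypothetical:

-- macros/cents_to_dollars.sql (hypothetical macro)
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::NUMBER(16, 2)
{% endmacro %}

-- Usage in a model, instead of repeating the cast everywhere:
SELECT {{ cents_to_dollars('amount_in_cents') }} AS amount_in_dollars
FROM {{ ref('payments') }}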
Be consistent. Even if you are not sure of the best way to do something, do it the same way throughout your code; it will be easier to read and to change if needed.
Be explicit. Defining something explicitly ensures that it works the way you expect, and being explicit in SQL makes things easier for the next person, who may be you.
No tabs should be used - only spaces. Your editor should be set up to convert tabs to spaces - see our onboarding template for more details.
Wrap long lines of code, between 80 and 100 characters, to a new line.
Do not use the USING command in joins because it produces inaccurate results in Snowflake. (A Snowflake community account is required to view the forum discussion on this topic.)
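For example, spell out the join condition with ON instead; the table and column names below are illustrative:

-- Preferred
SELECT orders.*
FROM orders
INNER JOIN customers
    ON orders.customer_id = customers.customer_id

-- vs

-- Not Preferred
SELECT orders.*
FROM orders
INNER JOIN customers USING (customer_id)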
Understand the difference between the following related statements and use them appropriately:

UNION ALL and UNION
LIKE and ILIKE
NOT and ! and <>
DATE_PART() and DATE_TRUNC()
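For instance, DATE_PART() returns a number while DATE_TRUNC() returns a truncated date; the values shown are illustrative:

SELECT
    DATE_PART('month', '2020-01-15'::DATE) AS month_number, -- 1
    DATE_TRUNC('month', '2020-01-15'::DATE) AS month_start -- 2020-01-01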
Use the AS operator when aliasing a column or table.
Prefer DATEDIFF to inline additions (date_column + interval_column). The function is more explicit and will work for a wider variety of date parts.
Prefer != to <>. This is because != is more common in other programming languages and reads like "not equal", which is how we're more likely to speak.
Prefer LOWER(column) LIKE '%match%' to column ILIKE '%Match%'. This lowers the chance of stray capital letters leading to an unexpected result.
Prefer WHERE to HAVING when either would suffice.

When making single-line comments, use the -- syntax. When making multi-line comments, use the /* */ syntax. Instead of leaving TODO comments, create new issues for improvement.

An ambiguous field name such as id, name, or type should always be prefixed by what it is identifying or naming:
-- Preferred
SELECT
    id AS account_id,
    name AS account_name,
    type AS account_type,
    ...

-- vs

-- Not Preferred
SELECT
    id,
    name,
    type,
    ...
All field names should be snake-cased:
-- Preferred
SELECT
    dvcecreatedtstamp AS device_created_timestamp
    ...

-- vs

-- Not Preferred
SELECT
    dvcecreatedtstamp AS DeviceCreatedTimestamp
    ...
Boolean field names should start with has_, is_, or does_:
-- Preferred
SELECT
    deleted AS is_deleted,
    sla AS has_sla
    ...

-- vs

-- Not Preferred
SELECT
    deleted,
    sla,
    ...
Timestamps should end with _at and should always be in UTC. Dates should end with _date. When truncating dates, name the column in accordance with the truncation:
SELECT
    original_at, -- 2020-01-15 12:15:00.00
    original_date, -- 2020-01-15
    DATE_TRUNC('month', original_date) AS original_month -- 2020-01-01
    ...
Avoid keywords like date or month as a column name.

Always qualify each column in the SELECT statement with the table name / alias for easy navigation:
-- Preferred
SELECT
    budget_forecast.account_id,
    date_details.fiscal_year,
    date_details.fiscal_quarter,
    date_details.fiscal_quarter_name,
    cost_category.cost_category_level_1,
    cost_category.cost_category_level_2
FROM budget_forecast_cogs_opex AS budget_forecast
LEFT JOIN date_details
    ON date_details.first_day_of_month = budget_forecast.accounting_period
LEFT JOIN cost_category
    ON budget_forecast.unique_account_name = cost_category.unique_account_name

-- vs

-- Not Preferred
SELECT
    a.account_id,
    b.fiscal_year,
    b.fiscal_quarter,
    b.fiscal_quarter_name,
    c.cost_category_level_1,
    c.cost_category_level_2
FROM budget_forecast_cogs_opex a
LEFT JOIN date_details b
    ON b.first_day_of_month = a.accounting_period
LEFT JOIN cost_category c
    ON b.unique_account_name = c.unique_account_name
Only use double quotes when necessary, such as for columns that contain special characters or are case sensitive:
-- Preferred
SELECT
    "First_Name_&_" AS first_name,
    ...

-- vs

-- Not Preferred
SELECT
    "FIRST_NAME" AS first_name,
    ...
Prefer accessing JSON using the bracket syntax.
-- Preferred
SELECT
    data_by_row['id']::bigint AS id_value
    ...

-- vs

-- Not Preferred
SELECT
    data_by_row:"id"::bigint AS id_value
    ...
Prefer explicit join statements.
-- Preferred
SELECT *
FROM first_table
INNER JOIN second_table
...

-- vs

-- Not Preferred
SELECT *
FROM first_table,
    second_table
...
Also prefer explicit join statements for LATERAL FLATTEN. However, it should be noted that the current code-base does not consistently adhere to this practice.
-- Preferred
SELECT
    data.value,
    source.uploaded_at
FROM source
INNER JOIN LATERAL FLATTEN(input => source.jsontext['data']) AS data
...

-- vs

-- Not Preferred
SELECT
    data.value,
    source.uploaded_at
FROM source,
    LATERAL FLATTEN(input => source.jsontext['data']) AS data
...
Prefer CTEs over sub-queries as CTEs make SQL more readable and are more performant:
-- Preferred
WITH important_list AS (
    SELECT DISTINCT
        specific_column
    FROM other_table
    WHERE specific_column != 'foo'
)

SELECT
    primary_table.column_1,
    primary_table.column_2
FROM primary_table
INNER JOIN important_list
    ON primary_table.column_3 = important_list.specific_column

-- vs

-- Not Preferred
SELECT
    primary_table.column_1,
    primary_table.column_2
FROM primary_table
WHERE primary_table.column_3 IN (
    SELECT DISTINCT specific_column
    FROM other_table
    WHERE specific_column != 'foo'
)
Avoid long CTE names like replace_sfdc_account_id_with_master_record_id; prefer a shorter name with a comment in the CTE. This will help avoid table aliasing in joins.

Use the default data types, not their aliases:

NUMBER instead of DECIMAL, NUMERIC, INTEGER, BIGINT, etc.
FLOAT instead of DOUBLE, REAL, etc.
VARCHAR instead of STRING, TEXT, etc.
TIMESTAMP instead of DATETIME
The exception to this is for timestamps. Prefer TIMESTAMP to TIME. Note that the default for TIMESTAMP is TIMESTAMP_NTZ, which does not include a time zone.
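A minimal illustration, assuming the Snowflake default of TIMESTAMP_NTZ:

SELECT '2020-01-15 12:15:00.000'::TIMESTAMP AS created_at
-- Stored as TIMESTAMP_NTZ: no time zone information is retained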
Prefer IFNULL to NVL.
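For example (the column name is illustrative):

SELECT IFNULL(phone_number, 'Unknown') AS phone_number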
Prefer IFF to a single-line CASE statement:
-- Preferred
SELECT
    IFF(column_1 = 'foo', column_2, column_3) AS logic_switch,
    ...

-- vs

-- Not Preferred
SELECT
    CASE
        WHEN column_1 = 'foo' THEN column_2
        ELSE column_3
    END AS logic_switch,
    ...
Prefer IFF to selecting a boolean statement:
-- Preferred
SELECT
    IFF(amount < 10, TRUE, FALSE) AS is_less_than_ten,
    ...

-- vs

-- Not Preferred
SELECT
    (amount < 10) AS is_less_than_ten,
    ...
Prefer simplifying repetitive CASE statements where possible:
-- Preferred
SELECT
    CASE field_id
        WHEN 1 THEN 'date'
        WHEN 2 THEN 'integer'
        WHEN 3 THEN 'currency'
        WHEN 4 THEN 'boolean'
        WHEN 5 THEN 'variant'
        WHEN 6 THEN 'text'
    END AS field_type,
    ...

-- vs

-- Not Preferred
SELECT
    CASE
        WHEN field_id = 1 THEN 'date'
        WHEN field_id = 2 THEN 'integer'
        WHEN field_id = 3 THEN 'currency'
        WHEN field_id = 4 THEN 'boolean'
        WHEN field_id = 5 THEN 'variant'
        WHEN field_id = 6 THEN 'text'
    END AS field_type,
    ...
Prefer the explicit date function over date_part, but prefer date_part over extract:

DAYOFWEEK(created_at) > DATE_PART('dayofweek', created_at) > EXTRACT(dow FROM created_at)
Be careful using the DATEDIFF function, as it will only return whole interval results: it counts the number of interval boundaries crossed, not the elapsed time.
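For example, a two-hour gap that crosses midnight still counts as one full day:

SELECT DATEDIFF(
    'day',
    '2020-01-01 23:00:00'::TIMESTAMP,
    '2020-01-02 01:00:00'::TIMESTAMP
) AS day_count -- Returns 1, even though only two hours have elapsed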
The following example code has been processed through the SQLFluff linter and has had the style guide applied.
WITH my_data AS (

    SELECT my_data.*
    FROM prod.my_data_with_a_long_table_name AS my_data
    INNER JOIN prod.other_thing
        ON my_data.id = other_thing.id
    WHERE my_data.filter = 'my_filter'

),

some_cte AS (

    SELECT DISTINCT
        id AS other_id,
        other_field_1,
        other_field_2,
        date_field_at,
        data_by_row,
        field_4,
        field_5,
        LAG(
            other_field_2
        ) OVER (PARTITION BY other_id, other_field_1 ORDER BY 5) AS previous_other_field_2
    FROM prod.my_other_data

),

/*
This is a very long comment: It is good practice to leave comments in code to
explain complex logic in CTEs or business logic which may not be intuitive to
someone who does not have intimate knowledge of the data source. This can help
new users familiarize themselves with the code quickly.
*/

final AS (

    SELECT
        -- This is a single line comment
        my_data.field_1 AS detailed_field_1,
        my_data.field_2 AS detailed_field_2,
        my_data.detailed_field_3,
        DATE_TRUNC('month', some_cte.date_field_at) AS date_field_month,
        some_cte.data_by_row['id']::NUMBER AS id_field,
        IFF(my_data.detailed_field_3 > my_data.field_2, TRUE, FALSE) AS is_boolean,
        CASE
            WHEN my_data.cancellation_date IS NULL
                AND my_data.expiration_date IS NOT NULL
                THEN my_data.expiration_date
            WHEN my_data.cancellation_date IS NULL
                THEN my_data.start_date + 7 -- There is a reason for this number
            ELSE my_data.cancellation_date
        END AS adjusted_cancellation_date,
        COUNT(*) AS number_of_records,
        SUM(some_cte.field_4) AS field_4_sum,
        MAX(some_cte.field_5) AS field_5_max
    FROM my_data
    LEFT JOIN some_cte
        ON my_data.id = some_cte.other_id
    WHERE my_data.field_1 = 'abc'
        AND (my_data.field_2 = 'def' OR my_data.field_2 = 'ghi')
    GROUP BY 1, 2, 3, 4, 5, 6, 7
    HAVING COUNT(*) > 1
    ORDER BY 8 DESC

)

SELECT *
FROM final