It is the Data team's collective responsibility to promote, respect, and improve our Python Style Guide, since not everything can be caught by the tools we use. This deliberate attention makes a difference and does a lot to ensure high-quality code. The main motivation for having comprehensive and useful Python guidelines is to maintain a high standard of code quality in our work. Because we use this guide to support daily work that is always changing, the guide itself is always subject to iteration. In the long run it will help us reach a world-class code quality we can be proud of. All changes are driven by our values.
Campsite rule: As these guidelines are themselves in a constant Work in Progress (WIP) status, if you work with any code style, hint, or guideline which does not currently adhere to the style guide, please submit a merge request with the relevant changes and tag the Data Platform team to update the guide.
Starting in Jan 2022, all new custom Python extracts should adhere to the Singer standard.
This section gives a high-level overview of how to follow best practices for Python code. Following these recommendations will help us understand the advantages of high code quality and make the most of our code base.
It is difficult to write a Python guide without mentioning the Zen of Python, a cornerstone for ensuring code quality at a high level.
It is a helpful mental exercise when you want to write outstanding Python without overlooking basic ideas.
╰─$ python3
Python 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
PEP stands for Python Enhancement Proposal, and there are several of them. A PEP is a document that describes new features proposed for Python and documents aspects of Python, like design and style, for the community.
As per definition:
PEP 8, sometimes spelled PEP8 or PEP-8, is a document that provides guidelines and best practices on how to write Python code. It was written in 2001 by Guido van Rossum, Barry Warsaw, and Nick Coghlan. The primary focus of PEP 8 is to improve the readability and consistency of Python code.
Why Do We Need PEP 8?
Readability counts.
— The Zen of Python
Code is much more often read than it is written.
— Guido van Rossum
Among many other things, it is vital to underscore the need for clean, clear, and consistent code in order to be able to maintain and scale the codebase.
As a supplement to the Zen of Python, we want to underscore a few more viewpoints that keep our code base in good shape. This is half a joke, half a truth, but it provides a good general overview of the high standards we want to uphold.
Here is our GitLab Zen of Python proposal:
Gratitude and respect for PEP 8
Insist on writing well-structured code
Trust the Pythonic way of thinking and coding, and introduce a good habit of using it on a daily basis
Leverage and promote proper comments and documentation
Always have the Zen of Python on the top of your mind
Boost usage of a modular code style over a script-like approach
Probably a couple more of them count:
With a high-level framework for the proper architectural approach to designing and structuring Python code in place, it is time to dive into the details that make the difference between good and outstanding code.
A programming idiom, put simply, is nothing but a way to write code. Idiomatic Python code is often referred to as being Pythonic. Although there is usually one, and preferably only one, obvious way to do it, the way to write idiomatic Python code can be non-obvious to Python beginners.
So, good idioms must be consciously acquired.
While (almost) any kind of magic is possible with Python, the most explicit and straightforward manner is preferred. Keep it simple and smart.
## Bad!
def foo(*args):
    x, y = args
    return dict(**locals())
## Good!
def bar(x, y):
    return {'x': x, 'y': y}
Arguments can be passed to routines in four different ways:
Positional arguments - for instance, foo(message, recipient)
Keyword arguments - for instance, foo(message, to, cc=None, bcc=None). Here cc and bcc are optional, and evaluate to None when they are not passed another value.
Arbitrary argument list (*args)
Arbitrary keyword argument dictionary (**kwargs)
It is up to the engineer writing the function to determine which arguments are positional arguments and which are optional keyword arguments, and to decide whether to use the advanced techniques of arbitrary argument passing. If the advice above is followed wisely, it is possible and enjoyable to write Python functions that are:
When a function grows in complexity, it is not uncommon to use multiple return statements inside the function's body. However, in order to keep a clear intent and sustainable readability, it is preferable to avoid returning meaningful values from many output points in the body. When a function has multiple main exit points for its normal course, it becomes difficult to debug the returned result, so it may be preferable to keep a single exit point.
## Bad! Probably too complex and difficult to read
def foo(a, b, c):
    if not a:
        return None  # Raising an exception might be better
    if not b:
        return None  # Raising an exception might be better
    # Some complex code trying to compute x from a, b and c
    x = None  # placeholder for the computed value
    if not x:
        # Some Plan-B computation of x
        x = 42
    return x  # yet another exit point for the returned value
## Good!
def bar(a, b, c):
    res = None
    if a and b:
        # Some complex code trying to compute x from a, b and c
        x = None  # placeholder for the computed value
        # Resist temptation to return x if succeeded
        if not x:
            # Some Plan-B computation of x
            x = 42
        res = x
    # One single exit point for the returned value will help
    # when maintaining the code.
    return res
If you know the length of a list or tuple, you can assign names to its elements with unpacking. For example, since enumerate() will provide a tuple of two elements for each item in the list:
# Bad! - can be difficult to read and maintain
for index in range(len(foo_list)):
    print(foo_list[index])  # do something with foo_list[index]
# Good! - this is the idiomatic way to write the for loop
for index, item in enumerate(foo_list):
    print(index, item)  # do something with index and item
You can use this to swap variables as well:
# Good!
a, b = b, a
Nested unpacking works fine as well:
# Good!
a, (b, c) = 1, (2, 3)
The extended unpacking syntax was introduced by PEP 3132:
## Good!
a, *rest = [1, 2, 3]
# a = 1, rest = [2, 3]
a, *middle, c = [1, 2, 3, 4]
# a = 1, middle = [2, 3], c = 4
The ignored variable _ can be part of unpacking as well:
## Bad!
a, _ , c = [1, 2, 3, 4] # This will raise an error
## Good! This will work (* is going before _)
a, *_ , c = [1, 2, 3, 4]
## Good!
_, *rest = [1, 2, 3]
# rest = [2, 3]
Note: unpacking more than 3 values is bad practice. It is allowed, but it rapidly decreases code readability.
## Bad! (if you have more than 3 values)
a, b, c, d = 1, 2, 3, 4
# Good!
a, b, c = 1, 2, 3
d = 4
# Better!
a = 1
b = 2
c = 3
d = 4
This section presents effective techniques for dealing with conventions and how to integrate them into your toolbox.
It is not necessary to explicitly compare a value to True, None, or 0; you can just use it in the if statement. See Truth Value Testing for a list of what is considered false.
## Bad!
if attr == True:
    print('True!')
if attr == None:
    print('attr is None!')
## Good!
# Just check the value
if attr:
    print('attr is truthy!')
# or check for the opposite
if not attr:
    print('attr is falsey!')
# or, since None is considered false, explicitly check for it
if attr is None:
    print('attr is None!')
# same goes for dict, list, and set:
check_list = []
if check_list:
    print('This is not empty list.')
else:
    print('The list is empty.')
There are many ways to do string concatenation. Here is just a short exercise on how to do that in an efficient way.
string1 = 'Python'
string2 = 'Guideline'
## Bad!
print(string1 + " " + string2)
# Python Guideline
## Good!
print('{} {}'.format(string1, string2))
# Python Guideline
## Better!
print(f"{string1} {string2}")
# Python Guideline
As you noticed here, the result is the same, but details make a difference. In the example above we dealt only with strings, but what happens when we introduce more data types within the same code?
string1 = 'Python'
int1 = 2 # now, this is int
## Bad!
# print(string1 + " " + int1)
# TypeError: can only concatenate str (not "int") to str
## Good!
print('{} {}'.format(string1, int1))
# Python 2
## Better!
print(f"{string1} {int1}")
# Python 2
This shows why it is better to use placeholders instead of simple string concatenation (i.e. a + b).
List comprehensions provide a powerful, concise way to work with lists.
Generator expressions follow almost the same syntax as list comprehensions but return a generator instead of a list. This is crucial to remember: performance and memory matter, and it is worth understanding the benefit of using generators where appropriate.
## Bad!
# builds the whole list first and then does the max calculation; the [] creates a list
foo = max([(student.id, student.name) for student in graduates])
## Good!
# creates a generator object that max consumes until it is exhausted; without [] the expression is a generator
bar = max((student.id, student.name) for student in graduates)
Note: performance and memory resources matter!
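The difference between the two forms can be observed directly. The following is a minimal sketch, not a benchmark: the names numbers, squares_list and squares_gen are made up for illustration, and sys.getsizeof only measures the container object itself, not the elements it refers to.

```python
import sys

numbers = range(100_000)

# A list comprehension materialises every element before sum() sees any of them.
squares_list = [n * n for n in numbers]

# A generator expression produces one element at a time, on demand.
squares_gen = (n * n for n in numbers)

# Size of the container objects only (the list grows with its length,
# the generator object stays small regardless of how many items it yields).
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)  # consuming the generator exhausts it

print(total_from_list == total_from_gen)
print(gen_size < list_size)
```

Keep in mind that a generator can be consumed only once; if the values are needed again, a list is the right choice.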
Comprehensions in Python provide a short and concise way to construct new sequences (such as lists, sets, dictionaries, etc.) from sequences that have already been defined. Python supports the following 4 types of comprehensions:
List Comprehensions
Dictionary Comprehensions
Set Comprehensions
Generator Comprehensions - we suggest spending some time to understand generators and how they are implemented in Python, as they are memory optimized and should be an efficient choice for processing a massive set of data.
In other words, any iterable can be part of comprehensions.
Comprehensions are a great example of the Pythonic way of thinking, as they provide clean and neat coding standards.
output_list = [output_exp for var in input_list if (var satisfies this condition)]
Note: nested comprehensions are allowed as well, but it is not best practice to have more than 2 comprehensions in one statement.
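To make the four comprehension types concrete, here is a short sketch. The words list and all variable names are invented for this example only.

```python
words = ["python", "style", "guide", "python"]

# List comprehension: one output element per input element
lengths_list = [len(word) for word in words]

# Dictionary comprehension: key/value pairs (duplicate keys collapse)
lengths_dict = {word: len(word) for word in words}

# Set comprehension: unique values only
unique_words = {word for word in words}

# Generator comprehension: values are produced lazily, when iterated
upper_gen = (word.upper() for word in words)

# The optional condition filters the input
long_words = [word for word in words if len(word) > 5]

print(lengths_list)
print(lengths_dict)
print(sorted(unique_words))
print(list(upper_gen))
print(long_words)
```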
There are plenty of ways to filter an iterable. Let's see some of them and how they fit into high coding standards.
## Bad!
# Never remove items from a list while you are iterating through it.
# Filter elements greater than 4
foo = [3, 4, 5]
for i in foo:
    if i > 4:
        foo.remove(i)
## Bad!
# Don't make multiple passes through the list.
while i in foo:
    foo.remove(i)
## Good!
# Use a list comprehension or generator expression.
# comprehensions create a new list object
filtered_values = [value for value in sequence if value != x]
# generators don't create another list
filtered_values = (value for value in sequence if value != x)
## Good!
# you can use a function as a filter
sequence = [1, 2, 3]
def dummy_filter(member: int) -> bool:
    return member != 2
filtered_values = [value for value in sequence if dummy_filter(value)]
# [1, 3]
As an alternative, the map, filter, and reduce functions can be used for this purpose. A handy link is Map, Filter and Reduce.
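As a quick illustration of these three functions, consider the sketch below. The sequence values and variable names are made up for the example; note that reduce lives in the functools module in Python 3.

```python
from functools import reduce

sequence = [1, 2, 3, 4, 5]

# map applies a function to every element (it returns a lazy iterator)
doubled = list(map(lambda value: value * 2, sequence))

# filter keeps only the elements for which the function returns True
odd_values = list(filter(lambda value: value % 2 == 1, sequence))

# reduce folds the sequence into a single value, starting from 0
total = reduce(lambda left, right: left + right, sequence, 0)

print(doubled)
print(odd_values)
print(total)
```

In many cases a comprehension expresses the same map or filter step more readably, so these functions are best kept for situations where a named function already exists.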
Remember that assignment never creates a new object. If two or more variables refer to the same list, changing one of them changes them all.
# Add three to all list members.
list_a = [3, 4, 5]
list_b = list_a  # list_a and list_b refer to the same list object
for i in range(len(list_a)):
    list_a[i] += 3  # list_b[i] also changes
# for copying a list, use the .copy() method
list_a = [1, 2, 3]
list_b = list_a.copy()
# extend list_b; list_a will stay in its original shape
list_b.extend(list_b)
print(f"list_a: {list_a}")
print(f"list_b: {list_b}")
## list_a: [1, 2, 3]
## list_b: [1, 2, 3, 1, 2, 3]
It’s safer to create a new list object and leave the original alone.
list_a = [3, 4, 5]
list_b = list_a
# assign the variable "list_a" to a new list without changing "list_b"
list_a = [i + 3 for i in list_a]
Use enumerate() to keep a count of your place in the list.
## Good!
foo = [3, 4, 5]
for i, item in enumerate(foo):
    print(i, item)
# prints
# 0 3
# 1 4
# 2 5
Note: The enumerate() function has better readability than handling a counter manually. Moreover, it is better optimized for iterators.
It is always good advice to use a context manager rather than assigning the file object directly when loading data from a file. This approach will automatically close the file for you.
## Bad!
f = open('file.txt')
a = f.read()
print(a)
f.close()  # we always forget something like this.
## Good!
with open('file.txt') as f:
    for line in f:
        print(line)
# This approach will close the file for you
When a logical line of code is longer than the accepted limit, you need to split it over multiple physical lines. The Python interpreter will join consecutive lines if the last character of the line is a backslash. This is helpful in some cases, but should usually be avoided because of its fragility: a whitespace character added to the end of the line, after the backslash, will break the code and may have unexpected results.
## Bad!
my_very_big_string = """When a logical line of code is longer than the accepted limit, \
you need to split it over multiple physical lines. \
The Python interpreter will join consecutive lines if the last character of the line is a backslash."""
from some.deep.module.inside.a.module import a_nice_function, another_nice_function, \
    yet_another_nice_function
## Good!
my_very_big_string = (
    "When a logical line of code is longer than the accepted limit, "
    "you need to split it over multiple physical lines. "
    "The Python interpreter will join consecutive lines if the last character of the line is a backslash."
)
from some.deep.module.inside.a.module import (
    a_nice_function, another_nice_function, yet_another_nice_function)
Following PEP8, we recommend you put blank lines around logical sections of code.
When starting a for loop or if/else block, add a new line above the section to give the code some breathing room. Newlines are cheap - brain time is expensive.
## Bad!
def foo(input_number: int) -> int:
    """
    Do some simple comparing
    """
    res = input_number
    if res == 2:
        return res
    else:
        return res ** 2
## Good!
def bar(input_number: int) -> int:
    """
    Do some simple comparing
    """
    res = input_number

    if res == 2:
        return res
    else:
        return res ** 2
All function signatures should contain type hints, including for the return type, even if it is None.
This is good documentation and can also be used with mypy for type checking and error checking.
## Bad!
def foo(x, y):
    """
    Add two numbers together and return.
    """
    return x + y
## Good!
def foo(x: int, y: int) -> int:
    """
    Add two numbers together and return.
    """
    return x + y
## Good! (for None as return type)
def bar(some_str: str) -> None:
    """
    Print a string.
    """
    print(some_str)
    return
Imports should follow the PEP8 rules and furthermore should be ordered with any import ... statements coming before from .... import ...
## Bad!
from os import environ
import logging
import some_local_module
from requests import get
import pandas as pd
from another_local_module import something
import sys
## Good!
# standard library
import logging
import sys
from os import environ

# third-party packages
import pandas as pd
from requests import get

# local modules
import some_local_module
from another_local_module import something
Also, linters should help you with this issue: mypy, flake8, pylint.
## Good!
def foo(x: int, y: int) -> int:
    """
    Add two numbers together and return the result.
    """
    return x + y
## Good! (for None as return type)
def bar(some_str: str) -> None:
    """
    Print a string.
    This is another proper sentence.
    """
    print(some_str)
    return
## Better! Have a docstring on a module level.
"""
This is a docstring on a module level.
It should be handy for describing the purpose of your module.
"""
def bar(some_str: str) -> None:
    """
    Print a string.
    This is another proper sentence.
    """
    print(some_str)
    return
To make functions as reusable as possible, using environment variables directly in functions is highly discouraged unless there is a very good reason (there is an example of this below). Instead, the best practice is to either pass in the specific variable you want to use or pass all of the environment variables in as a dictionary. This allows you to pass in any dictionary and have it be compatible, while also not requiring the variables to be defined at the environment level.
import os
from typing import Dict
## Bad!
def foo(x: int) -> int:
    """
    Add two numbers together and return.
    """
    return x + int(os.environ["y"])  # environment values are always strings
foo(1)
## Good!
env_vars = os.environ.copy()  # The copy method returns a normal dict of the env vars.
def bar(some_str: str, another_string: str) -> None:
    """
    Print two strings concatenated together.
    """
    print(f"{some_str} {another_string}")
    return
bar("foo", env_vars["bar"])
## Better!
def bar(some_str: str, env_vars: Dict[str, str]) -> None:
    """
    Print two strings concatenated together.
    """
    print(f'{some_str} {env_vars["another_string"]}')
    return
bar("foo", env_vars)
Ideally, never hardcode the date format using datetime.strptime unless absolutely necessary in cases where the format is very unusual. A better solution is to use the generic date parser in the dateutil library, as it handles a large variety of formats very reliably:
## Bad!
datevar = datetime.strptime(tstamp, "%Y-%m-%dT%H:%M:%S%z")
## Good!
from dateutil import parser as date_parser
...
datevar = date_parser.parse(tstamp)
We use a few standard aliases for common third-party packages. They are as follows:
import pandas as pd
import numpy as np
Adding the type to the name is good self-documenting code. When possible, always use descriptive naming for variables, especially with regard to data type. Here are some examples:
data_df is a dataframe
params_dict is a dictionary
retries_int is an int
bash_command_str is a string
If passing a constant through to a function, name each variable that is being passed so that it is clear what each thing is.
Lastly, try to avoid redundant variable naming.
def bar(some_str: str, another_string: str) -> None:
    """
    Print two strings concatenated together.
    """
    print(some_str + another_string)
    return
## Good!
bar(some_str="foo", another_string="bar")
## Better!
some_str = "foo"
another_string = "bar"
bar(some_str, another_string)
## But Bad!
bar(some_str=some_str, another_string=another_string)
If your script is not able to be run even though you've just made it, it most likely needs to be executable. Run the following:
chmod 755 yourscript.py
For an explanation of chmod 755 read this askubuntu page.
Using mutable data structures as default arguments in functions can introduce bugs into your code. This is because a new mutable data structure is created once when the function is defined, and the data structure is used in each successive call.
def append_to(element, to=[]):
    to.append(element)
    return to
my_list = append_to(12)
print(my_list)
my_other_list = append_to(42)
print(my_other_list)
Output:
[12]
[12, 42]
Note: Handy link for this topic: Python gotchas
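A common way to avoid this gotcha is to use None as the default and create the list inside the function. The sketch below reuses the append_to example from above; the type hints are added here for illustration.

```python
from typing import List, Optional


def append_to(element: int, to: Optional[List[int]] = None) -> List[int]:
    """Append element to the given list, or to a fresh list if none is passed."""
    if to is None:
        # A new list is created on every call, not once at definition time
        to = []
    to.append(element)
    return to


my_list = append_to(12)
my_other_list = append_to(42)

print(my_list)
print(my_other_list)
```

With this change each call without an explicit list gets its own list, so the two results no longer share state.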
Our main linter is Black. We use its default configuration.
There is a manual CI job in the review stage (Python section) that will lint the entire repo and return a non-zero exit code if files need to be formatted. It is up to both the MR author and the reviewer to make sure that this job passes before the MR is merged. To lint the entire repo, run the command:
$ jump analytics
$ black .
When writing a Python class to extract data from an API, it is the responsibility of that class to highlight any errors in the API process. Data modelling, source freshness and formatting issues should be highlighted using dbt tests.
Avoid general try/except blocks: they are too broad and make it difficult to find the real error:
## Bad!
try:
    print("Do something")
except:
    print("Caught every type of exception")
## Good!
while maximum_backoff_sec > (2 ** n):
    try:
        print("Do something")
    except APIError as gspread_error:
        if gspread_error.response.status_code in (429, 500, 502, 503):
            self.wait_exponential_backoff(n)
            n = n + 1
        else:
            raise
else:
    error(f"Max retries exceeded, giving up on {file_name}")
## Better! fine-grained error handling
while maximum_backoff_sec > (2 ** n):
    try:
        print("Do something")
    except APIError as gspread_error:
        if gspread_error.response.status_code in (429, 500, 502, 503):
            self.wait_exponential_backoff(n)
            n = n + 1
        else:
            raise
    except AttributeError as attribute_error:
        raise
    except KeyError as key_error:
        print('Caught this error: ' + repr(key_error))
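The snippets above depend on a surrounding class and on an API client. For a version that runs on its own, here is a minimal sketch of the same idea: retry only on a specific, expected exception and let everything else propagate. RetryableError, call_with_backoff and flaky_action are hypothetical names invented for this example and are not part of our code base or of any library.

```python
import time
from typing import Callable, Tuple, Type


class RetryableError(Exception):
    """Hypothetical error standing in for a retryable API failure (e.g. HTTP 429)."""


def call_with_backoff(
    action: Callable[[], str],
    max_attempts: int = 4,
    base_delay_sec: float = 0.01,
    retry_on: Tuple[Type[Exception], ...] = (RetryableError,),
) -> str:
    """Call action, retrying with exponential backoff on selected exceptions only."""
    for attempt in range(max_attempts):
        try:
            return action()
        except retry_on as error:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real error
            delay = base_delay_sec * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error!r}), retrying in {delay}s")
            time.sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


calls = {"count": 0}


def flaky_action() -> str:
    """Fail twice with a retryable error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RetryableError("rate limited")
    return "ok"


result = call_with_backoff(flaky_action)
print(result, calls["count"])
```

Any exception that is not listed in retry_on, such as a KeyError, is not caught here at all, so the real error stays visible to the caller.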
/extract folder, any API clients which may be reused should be stored in the /orchestration folder under the /analytics repo
extract_{source}_{dataset_name} like extract_qualtrics_mailingsends or extract_qualtrics if the script extracts multiple datasets. This script can be considered the main function of the extract, and is the file which gets run as the starting point of the extract DAG.
Pytest is used to run unit tests in the /analytics project. The tests are executed from the root directory of the project with the python_pytest CI pipeline job. The job produces a JUnit report of test results which is then processed by GitLab and displayed on merge requests.
Most functional test frameworks, pytest included, follow the Arrange-Act-Assert model:
Arrange - or set up, the conditions for the test
Act - by calling some function or method
Assert - that some end condition is True (a test will pass) or False (a test will fail)
pytest simplifies the testing workflow by allowing you to use Python's assert keyword directly without any boilerplate code.
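The three steps of the Arrange-Act-Assert model can be sketched as follows. The normalize_name function is a hypothetical function under test, invented only for this example.

```python
def normalize_name(raw_name: str) -> str:
    """Hypothetical function under test: trim whitespace and lowercase."""
    return raw_name.strip().lower()


def test_normalize_name():
    # Arrange: set up the input for the test
    raw_name = "  GitLab  "

    # Act: call the function under test
    result = normalize_name(raw_name)

    # Assert: check the end condition with a plain assert
    assert result == "gitlab"
```

Keeping the three steps visually separate makes it easier to see what a failing test was actually checking.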
New testing file names should follow the pattern test_*.py so they are found by pytest and easily recognizable in the repository.
New testing files should be placed in a directory named test. The test directory should share the same parent directory as the file that is being tested.
A testing file should contain one or more tests.
Test functions should follow the test_* naming pattern.
An individual test is created by defining a function that has one or many plain Python assert statements.
Note: When writing imports, it is important to remember that tests are executed from the root directory.
In the future, additional directories may be added to the PythonPath for ease of testing as need allows.
When creating a test case, keep it simple and clear. The main point is to have small test cases so that they stay consistent and easy to maintain.
# example when test passed
import pytest
def test_example_pass():
    assert 1 == 1
# test.py::test_example_pass PASSED
# example when test failed
import pytest
def test_example_failed():
    assert 1 == 2
# test.py::test_example_failed FAILED
pytest fixtures are a way of providing data, configs or state setup to tests. Fixtures are functions that can return a wide range of values, especially for repeatable tasks and config items. Each test function that depends on a fixture must explicitly accept that fixture as an argument, and the fixture itself is defined with the @pytest.fixture decorator.
import pytest
@pytest.fixture()
def myfixture():
    # define some boring repeatable task needed for test cases
    return "This is my fixture"
# this will pass
def test_example(myfixture):
    assert myfixture == "This is my fixture"
# test.py::test_example PASSED
# this will also pass as myfixture is reused
def test_example_additional(myfixture):
    assert isinstance(myfixture, str)
# test.py::test_example_additional PASSED
A great way to avoid code duplication is to parametrize tests, and for that purpose the magic happens behind the @pytest.mark.parametrize decorator.
The builtin pytest.mark.parametrize decorator enables parametrization of arguments for a test function.
This enables us to test different scenarios, all in one function. With the @pytest.mark.parametrize decorator we can specify the names of the arguments that will be passed to the test function, and a list of arguments corresponding to the names.
import pytest
# here is the magic word to parametrize more than one scenario
@pytest.mark.parametrize("test_value, expected_value", [("1+1", 2), ("2+3", 5), ("6*9", 54)])
def test_eval(test_value, expected_value):
    assert eval(test_value) == expected_value
# test.py::test_eval[1+1-2] PASSED [ 33%]
# test.py::test_eval[2+3-5] PASSED [ 66%]
# test.py::test_eval[6*9-54] PASSED [100%]
In other words, you can think of this decorator as behaving like a zip function: it pairs the argument names with each tuple of values, one tuple per scenario.
Note: for the @pytest.mark.parametrize decorator, the first argument to parametrize() is a comma-delimited string of parameter names. The second argument is a list of either tuples or single values that represent the parameter value(s).
In any large test suite, some of the tests will inevitably be slow. They might test timeout behavior, for example, or they might exercise a broad area of the code. Whatever the reason, it would be nice to avoid running all the slow tests when you’re trying to iterate quickly on a new feature. pytest enables you to define categories for your tests and provides options for including or excluding categories when you run your suite. You can mark a test with any number of categories.
Marking tests is useful for categorizing tests by subsystem or dependencies. If some of your tests require access to a network, for example, then you could create a @pytest.mark.network_access mark for them.
pytest.ini file:
[pytest]
markers =
    network_access: requires network access
    local_test: can run locally
import pytest
@pytest.mark.network_access
def test_network():
    assert 1 == 2
@pytest.mark.local_test
def test_local():
    assert 1 == 1
network_access test(s):
# will fail, just to recognize what we run
╰─$ pytest test.py -m network_access
...
collected 2 items / 1 deselected / 1 selected
test.py F
local_test test(s):
# this will pass
╰─$ pytest test.py -m local_test
...
collected 2 items / 1 deselected / 1 selected
test.py .
If you plan to improve the speed of your tests, then it's useful to know which tests might offer the biggest improvements. pytest can automatically record test durations for you and report the top offenders.
Use the --durations option to the pytest command to include a duration report in your test results. --durations expects an integer value n and will report the slowest n tests.
import pytest
from time import sleep
def test_slow():
    sleep(1)  # make it sleep 1s
    assert 1 == 1
def test_slower():
    sleep(2)  # make it sleep 2s
    assert 1 == 1
def test_slowest():
    sleep(3)  # make it sleep 3s
    assert 1 == 1
pytest:
╰─$ pytest test.py --durations=1
= test session starts =
...
collected 3 items
test.py ... [100%]
= slowest 1 durations =
3.00s call test.py::test_slowest
= 3 passed in 6.03s =
When pytest is not able to meet your needs in more complicated scenarios, handy plugins can be found. So far, we have not needed anything beyond pytest itself in the /analytics repo, but it is good to know that there are useful tools that can help with your work.
random.seed.
Subprocess support, Xdist support, Consistent pytest behavior
pytest plugins.
The Python community provides a comprehensive set of tools to ensure code quality and high standards for the codebase. The idea is to list a couple of tools from our toolbox for code quality and to highlight some interesting items worth considering as potential options for future use.
Libraries already in use and automated in CI pipeline:
Nice to check, explore and consider:
pycodestyle
flake8
yapf
autopep8
HowDOI - good tool for a quick search.
For automating code quality and testing, we are using our own product, the GitLab CI/CD pipeline. Details of the pipelines we use for Python can be found on the page CI jobs (Python).
Since this style guide is for the entire data team, it is important to remember that there is a time and place for using Python and it is usually outside of the data modeling phase.
Stick to SQL for data manipulation tasks where possible.