Python Guide
Motivation
It is the Data team's collective responsibility to promote, respect, and improve our Python style guide, since not everything can be caught by the tools we use. This deliberate attention makes a difference and goes a long way toward ensuring high-quality code. The main motivation for having comprehensive and useful Python guidelines is to maintain a high standard of code quality in our work. Because the daily work this guide supports is always changing, the guide itself is likewise always subject to iteration. In the long run, it will help us achieve world-class code quality we can be proud of. All changes are driven by our values.
Values
Campsite rule: these guidelines are themselves a constant work in progress (WIP). If you work with any code style, hint, or guideline which does not currently adhere to this style guide, please submit a merge request with the relevant changes and tag the Data Platform team to update the guide.
Technology Standardisation
Starting in January 2022, all new custom Python extracts should adhere to the Singer standard.
High-level guidelines
This section provides high-level guidance on following best practices for Python code. Following these recommendations will ensure we fully understand the advantages of high code quality and can leverage our code base.
Zen of Python
It is difficult to resist mentioning the Zen of Python in any Python guide; it is a cornerstone for ensuring code quality at a high level.
It is a helpful mental exercise when you want to write outstanding Python without overlooking basic ideas.
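The Zen of Python ships with the interpreter itself; a minimal way to display it is:

```python
# Importing the built-in `this` module prints the Zen of Python
# aphorisms to standard output.
import this
```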
PEP 8
PEP stands for Python Enhancement Proposal, and there are several of them. A PEP is a document that describes new features proposed for Python and documents aspects of Python, like design and style, for the community.
As per the definition:
PEP 8, sometimes spelled PEP8 or PEP-8, is a document that provides guidelines and best practices on how to write Python code. It was written in 2001 by Guido van Rossum, Barry Warsaw, and Nick Coghlan. The primary focus of PEP 8 is to improve the readability and consistency of Python code.
Why Do We Need PEP 8?
Readability counts.
— The Zen of Python
Code is much more often read than it is written.
— Guido van Rossum
Among many other things, it is vital to underscore the need for clean, clear, and consistent code in order to be able to maintain and scale the codebase.
GitLab Zen of Python
As a supplement to the Zen of Python, we want to underscore a couple more viewpoints to ensure our code base stays in good shape. This is half a joke, half a truth, but it provides a good general overview of the high standards we want to uphold.
Here is our GitLab Zen of Python proposal:
- Gratitude and respect for PEP 8
- Insist on writing well-structured code
- Trust the Pythonic way of thinking and coding, and introduce a good habit of using it on a daily basis
- Leverage and promote proper comments and documentation
- Always have the Zen of Python on the top of your mind
- Boost usage of a modular code style over a script-like approach
A couple more are worth counting:
- Advocate for proper naming of variables, classes, functions, and modules
- Favor a modular code style over a script-like approach
- Prefer using a virtual environment over the existing interpreter
Specific guidelines
With a good high-level framework for the proper architectural approach to designing and structuring Python code, it is now time to dive into the details that make the difference between good and outstanding code.
Project setup - Poetry
For project setup, we are using Poetry.
Poetry is a Python dependency management tool to manage dependencies, packages, and libraries in your Python project. It simplifies your project by resolving dependency complexities and handling installs and updates for you.
The set of commands we use to initiate the project and install Python (in our case, 3.10.3) can be found in the file onboarding_script.zsh.
A good example of how we define the pyproject.toml file that drives the poetry setup is shown below:
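The original snippet is not reproduced here; a minimal pyproject.toml sketch along these lines (the project name, version, and dependencies are illustrative, not the actual file contents) could look like:

```toml
[tool.poetry]
name = "service-ping-metrics-check"   # illustrative
version = "0.1.0"
description = "Service ping metrics check"
authors = ["Data team"]

[tool.poetry.dependencies]
python = "^3.10"
requests = "^2.28"                    # illustrative dependency

[tool.poetry.group.dev.dependencies]
pytest = "^7.1"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
```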
The full file is available at service-ping-metrics-check/pyproject.toml as part of the Service ping metrics check project.
Idioms
A programming idiom, put simply, is nothing more than a way to write code. Idiomatic Python code is often referred to as being Pythonic. Although there is usually one (and preferably only one) obvious way to do it, the way to write idiomatic Python code can be non-obvious to Python beginners.
So, good idioms must be consciously acquired.
Explicit code
While (almost) any kind of magic is possible with Python, the most explicit and straightforward manner is preferred. Keep it simple and smart.
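A short sketch of the idea, adapted from the widely used make_complex example (the function names are illustrative):

```python
# Implicit and harder to follow: the reader must inspect the body
# to learn what the function actually expects.
def make_complex_implicit(*args):
    x, y = args
    return {"x": x, "y": y}


# Explicit and straightforward: the signature documents itself.
def make_complex(x, y):
    return {"x": x, "y": y}
```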
Function arguments
Arguments can be passed to routines in four different ways:
- Positional arguments - for instance, foo(message, recipient)
- Keyword arguments - for instance, foo(message, to, cc=None, bcc=None). Here cc and bcc are optional, and evaluate to None when they are not passed another value.
- Arbitrary argument list (*args)
- Arbitrary keyword argument dictionary (**kwargs)
It is up to the engineer writing the function to determine which arguments are positional arguments and which are optional keyword arguments, and to decide whether to use the advanced techniques of arbitrary argument passing. If the advice above is followed wisely, it is possible and enjoyable to write Python
functions that are:
- easy to read (the name and arguments need no explanations)
- easy to change (adding a new keyword argument does not break other parts of the code)
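The four argument styles above can be sketched as follows (the function names are illustrative):

```python
def send(message, recipient):                    # positional arguments
    return f"{message} -> {recipient}"


def send_mail(message, to, cc=None, bcc=None):   # keyword arguments
    return {"message": message, "to": to, "cc": cc, "bcc": bcc}


def send_many(*args):                            # arbitrary argument list
    return list(args)


def send_with_meta(**kwargs):                    # arbitrary keyword argument dict
    return dict(kwargs)
```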
Returning values
When a function grows in complexity, it is not uncommon to use multiple return statements inside the function's body. However, in order to keep clear intent and sustainable readability, it is preferable to avoid returning meaningful values from many output points in the body. When a function has multiple main exit points for its normal course, it becomes difficult to debug the returned result, so it may be preferable to keep a single exit point.
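A minimal sketch of the single-exit-point style (the function is illustrative):

```python
def divide(a, b):
    """Return a / b, or None when division is not possible."""
    result = None          # one place that holds the outcome
    if b != 0:
        result = a / b
    return result          # single exit point, easy to debug
```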
Unpacking
If you know the length of a list or tuple, you can assign names to its elements with unpacking. For example, enumerate() will provide a tuple of two elements for each item in a list:
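For example:

```python
# Each (index, item) tuple from enumerate() unpacks into two names.
for index, item in enumerate(["a", "b", "c"]):
    print(index, item)
```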
You can use this to swap variables as well:
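For example:

```python
a, b = 1, 2
a, b = b, a    # swap without a temporary variable
```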
Nested unpacking works fine as well:
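For example:

```python
# The target pattern mirrors the nested structure on the right.
a, (b, c) = 1, (2, 3)
```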
A new method of extended unpacking was introduced by PEP 3132:
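For example:

```python
# The starred name soaks up everything between the first and last element.
first, *middle, last = [1, 2, 3, 4, 5]
```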
The ignored variable _ can be part of unpacking as well:
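For example:

```python
# The extension is deliberately ignored, signalled by the `_` name.
filename, _ = "report.csv".split(".")
```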
Note: it is bad practice to unpack more than 3 values. Yes, it is allowed, but it will rapidly decrease code readability.
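A sketch of the readability problem (the record fields are illustrative):

```python
row = ("2023-01-15", "alice", 42, "paid", "EU", True)

# Hard to read: six names for six positions.
date_str, user_str, amount_int, status_str, region_str, active_bool = row

# Easier to follow: name only what you need and keep the rest together.
date_str, user_str, *rest = row
```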
Conventions
This section should expose effective techniques to deal with conventions and how to integrate them into your toolbox.
Check if a variable equals a constant
There is no need to explicitly compare a value to True, or None, or 0; you can just add it to the if statement. See Truth Value Testing for a list of what is considered false.
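For example:

```python
attr = True
values = []

if attr:              # preferred over `if attr == True:`
    result = "attr is truthy"

if not values:        # preferred over `if len(values) == 0:`
    result_2 = "an empty list is falsy"
```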
String concatenation
There are many ways to do string concatenation. Here is just a short exercise on how to do it in an efficient way.
The result is often the same, but the details make a difference, particularly once you introduce more data types within the same code.
This is why it is better to use placeholders instead of simple string concatenation (i.e. a + b).
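A sketch of the difference (the sample data is illustrative):

```python
names = ["Alice", "Bob", "Eve"]

# join() is the efficient way to concatenate many strings.
joined = ", ".join(names)

# f-string placeholders handle mixed data types without explicit casts.
count = 3
summary = f"{count} users: {joined}"

# Naive concatenation forces manual conversion and extra temporaries.
summary_concat = str(count) + " users: " + joined
```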
Short Ways to Manipulate Lists
List comprehensions provide a powerful, concise way to work with lists. Generator expressions follow almost the same syntax as list comprehensions but return a generator instead of a list. This is crucial to remember: performance and memory matter, and it is a great consideration to understand the leverage of using generators where appropriate.
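A sketch of the contrast:

```python
numbers = range(100_000)

squares_list = [n * n for n in numbers]   # materializes the whole list in memory
squares_gen = (n * n for n in numbers)    # lazy: yields one value at a time

total = sum(squares_gen)                  # consumes the generator without a list
```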
Note: performance and memory resources matter!
Comprehensions
Comprehensions in Python provide us with a short and concise way to construct new sequences (such as lists, sets, and dictionaries) using sequences which have already been defined. Python supports the following four types of comprehensions:
- List comprehensions
- Dictionary comprehensions
- Set comprehensions
- Generator comprehensions - we kindly suggest spending some time understanding generators and how they are implemented in Python, as they are memory optimized and an efficient choice for processing massive sets of data.
In other words, any iterable can be part of a comprehension.
Comprehensions are a great example of the Pythonic way of thinking, as they provide clean and neat coding standards.
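All four types in one sketch (the sample data is illustrative):

```python
words = ["data", "platform", "team"]

lengths_list = [len(w) for w in words]       # list comprehension
lengths_dict = {w: len(w) for w in words}    # dictionary comprehension
unique_lengths = {len(w) for w in words}     # set comprehension
lengths_gen = (len(w) for w in words)        # generator comprehension
```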
Note: nested comprehensions are allowed as well, but it is not best practice to have more than 2 comprehensions in one statement.
Filtering iterables
There are plenty of ways to filter an iterable. Let's see some of them and how they fit high coding standards.
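For example:

```python
numbers = [-3, -1, 0, 2, 5]

# A comprehension with a condition is the most readable option.
positives = [n for n in numbers if n > 0]

# filter() achieves the same result in a functional style.
positives_filter = list(filter(lambda n: n > 0, numbers))
```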
As an alternative, the map, filter, and reduce functions can be used for this purpose. A handy link is Map, Filter and Reduce.
Modifying the values in a list
Remember that assignment never creates a new object. If two or more variables refer to the same list, changing one of them changes them all.
It’s safer to create a new list object and leave the original alone.
Use enumerate() to keep a count of your place in the list.
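The three points above in one sketch:

```python
original = [1, 2, 3]

alias = original          # NOT a copy: both names refer to the same list object
alias.append(4)           # ...so this also changes what `original` sees

safe_copy = [n * 10 for n in original]   # a new list leaves the original alone

for i, item in enumerate(original):      # enumerate() tracks the position for you
    print(i, item)
```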
Note: The
enumerate()
function has better readability than handling a counter manually. Moreover, it is better optimized for iterators.
Read From a File
It is always good advice to use a context manager rather than a plain open() assignment when loading data from a file. This approach will automatically close the file for you.
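A self-contained sketch (it writes a sample file first so the read can run anywhere):

```python
# Write a sample file so the read below is self-contained.
with open("sample.txt", "w") as file_handle:
    file_handle.write("line one\nline two\n")

# The context manager closes the file automatically, even on errors.
with open("sample.txt") as file_handle:
    lines = file_handle.read().splitlines()
```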
Line Continuations/Line Length
When a logical line of code is longer than the accepted limit, you need to split it over multiple physical lines. The Python interpreter will join consecutive lines if the last character of the line is a backslash. This is helpful in some cases, but should usually be avoided because of its fragility: a white space added to the end of the line, after the backslash, will break the code and may have unexpected results. A better solution is to use implicit continuation inside parentheses.
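For example:

```python
# Fragile: a stray space after the backslash breaks this.
total = 1 + 2 + \
        3 + 4

# Preferred: implicit continuation inside parentheses.
total_safe = (1 + 2 +
              3 + 4)
```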
Spacing
Following PEP 8, we recommend you put blank lines around logical sections of code.
When starting a for
loop or if/else
block, add a new line above the section to give the code some breathing room. Newlines are cheap - brain time is expensive.
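A sketch of the idea (the function is illustrative):

```python
def process(items):
    totals = []

    for item in items:    # the blank line above gives the loop breathing room
        if item > 0:
            totals.append(item)

    return totals
```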
Type Hints
All function signatures should contain type hints, including for the return type, even if it is None
.
This is good documentation and can also be used with mypy
for type checking and error checking.
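For example (the functions are illustrative):

```python
def greet(name: str, excited: bool = False) -> str:
    """Return a greeting for the given name."""
    suffix = "!" if excited else "."
    return f"Hello, {name}{suffix}"


def log_greeting(name: str) -> None:   # annotate even a None return
    print(greet(name))
```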
Import Order
Imports should follow the PEP 8 rules and, furthermore, should be ordered with any import ... statements coming before from ... import ... statements.
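A sketch of the ordering:

```python
# Standard-library `import ...` statements come first...
import json
import os

# ...followed by `from ... import ...` statements.
from datetime import datetime
from pathlib import Path
```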
Also, linters can help you with this: isort, mypy, flake8, pylint.
Docstrings
- Docstrings should be used in every single file.
- Docstrings should be used in every single function. Since we are using type hints in the function signature there is no requirement to describe each parameter.
- Docstrings should use triple double-quotes and use complete sentences with punctuation.
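A sketch of both a module docstring and a function docstring (the module purpose is illustrative):

```python
"""Utilities for formatting display names."""


def full_name(first: str, last: str) -> str:
    """Return the full name in display order."""
    return f"{first} {last}"
```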
How to integrate Environment Variables
To make functions as reusable as possible, it is highly discouraged (unless there is a very good reason) to use environment variables directly in functions (there is an example of this below). Instead, the best practice is to either pass in the specific variable you want to use or pass all of the environment variables in as a dictionary. This allows you to pass in any dictionary and have it be compatible, while also not requiring the variables to be defined at the environment level.
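A sketch of both styles (the variable name API_KEY is illustrative):

```python
import os


def bad_get_api_key() -> str:
    # Discouraged: reads the environment directly inside the function.
    return os.environ.get("API_KEY", "")


def get_api_key(env: dict) -> str:
    # Preferred: any dictionary can be passed in, which keeps the
    # function reusable and easy to test.
    return env.get("API_KEY", "")
```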
Parsing Dates
Ideally, never hardcode the date format using datetime.strptime unless absolutely necessary in cases where the format is very unusual. A better solution is to use the generic date parser in the dateutil library, as it handles a large variety of formats very reliably:
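A sketch of dateutil's generic parser handling several formats (the sample dates are illustrative; this assumes the python-dateutil package is installed):

```python
from dateutil import parser

# parse() infers the format, so very different strings all work.
first = parser.parse("2023-01-15")
second = parser.parse("Jan 15, 2023 14:30")
third = parser.parse("15/01/2023")
```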
Package Aliases
We use a few standard aliases for common third-party packages. They are as follows:
import pandas as pd
import numpy as np
Variable Naming Conventions
Adding the type to the name is good self-documenting code. When possible, always use descriptive naming for variables, especially with regards to data type. Here are some examples:
- data_df is a dataframe
- params_dict is a dictionary
- retries_int is an int
- bash_command_str is a string
If passing a constant through to a function, name each variable that is being passed so that it is clear what each thing is.
Lastly, try and avoid redundant variable naming.
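A sketch of redundant versus descriptive naming (the values are illustrative):

```python
# Redundant: "variable" and a spelled-out type add no information.
user_name_string_variable = "ada"

# Better: descriptive, using the type-suffix convention from above.
user_name_str = "ada"
retries_int = 3
```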
Making your script executable
If your script cannot be run even though you have just created it, it most likely needs to be made executable. Run the following:
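A self-contained sketch (the script name is illustrative; the first line only exists so the snippet runs anywhere):

```shell
# Create a placeholder script, then make it executable for the owner
# and readable/executable for everyone else.
touch extract_script.py
chmod 755 extract_script.py
```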
For an explanation of chmod 755 read this askubuntu page.
Mutable default function arguments
Using mutable data structures as default arguments in functions can introduce bugs into your code. This is because a new mutable data structure is created once when the function is defined, and the data structure is used in each successive call.
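The classic example of this gotcha:

```python
def append_to(element, to=[]):   # the default list is created only once
    to.append(element)
    return to


print(append_to(12))   # [12]
print(append_to(42))   # [12, 42] -- the same list is reused!
```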
Output:
[12]
[12, 42]
Note: Handy link for this topic: Python gotchas
Exception handling
When writing a python class to extract data from an API it is the responsibility of that class to highlight any errors in the API process. Data modelling, source freshness and formatting issues should be highlighted using dbt tests
.
Avoid the use of general try/except blocks, as they are too broad and will make it difficult to find the real error:
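A sketch of catching a specific, expected error instead of a bare except (the response shape is illustrative):

```python
def fetch_page_count(response: dict) -> int:
    """Return the page count from an API response dictionary."""
    try:
        return int(response["meta"]["page_count"])
    except KeyError as exc:
        # Narrow except: only the error we expect, re-raised with context.
        raise ValueError(f"malformed API response, missing key: {exc}") from exc
```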
Folder structure for new extracts
- All client-specific logic should be stored in the /extract folder; any API clients which may be reused should be stored in the /orchestration folder under the /analytics repo.
- Pipeline-specific operations should be stored in /extract.
- The folder structure in extract should include a file called extract_{source}_{dataset_name}, like extract_qualtrics_mailingsends, or extract_qualtrics if the script extracts multiple datasets. This script can be considered the main function of the extract, and is the file which gets run as the starting point of the extract DAG.
Unit Testing with pytest
Pytest is used to run unit tests in the /analytics
project. The tests are executed from the root directory of the project with the python_pytest
CI pipeline job. The job produces a JUnit
report of test results which is then processed by GitLab
and displayed on merge requests.
Most functional test frameworks, pytest included, follow the Arrange-Act-Assert model:
- Arrange - set up the conditions for the test
- Act - call some function or method
- Assert - check that some end condition is True (the test will pass) or False (the test will fail)
pytest simplifies the testing workflow by allowing you to use Python's assert keyword directly, without any boilerplate code.
Writing New Tests
New testing file names should follow the pattern test_*.py
so they are found by pytest
and easily recognizable in the repository.
New testing files should be placed in a directory named test
, usually under the current working folder. The test directory should share the same parent directory as the file that is being tested. For instance, if you are working on integration xyz, you should probably have a folder structure like the following:
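A sketch of such a layout (the file names are illustrative):

```
extract/
└── xyz/
    ├── extract_xyz.py          # code under test
    └── test/
        └── test_extract_xyz.py
```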
A testing file should contain one or more tests.
Test functions should follow the test_* naming pattern.
An individual test is created by defining a function that has one or many plain Python assert
statements.
- If the asserts are all True, the test passes.
- If any assert is False, the test will fail.
Note: When writing imports, it is important to remember that tests are executed from the root directory.
In the future, additional directories may be added to the PythonPath
for ease of testing as need allows.
Basic Pytest Usage
When creating a test case, keep it simple and clear. The main point is to have small test cases, so they stay consistent and easy to maintain.
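A minimal sketch (the helper and file name are illustrative, e.g. test_math_helpers.py):

```python
def double(value: int) -> int:
    return value * 2


def test_double():
    # Plain asserts are all pytest needs.
    assert double(2) == 4
    assert double(0) == 0
```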
Using Pytest Fixtures
pytest fixtures are a way of providing data, configs, or state setup to tests. Fixtures are functions, decorated with @pytest.fixture, that can return a wide range of values, and they are especially useful for repeatable tasks and config items. Each test function that depends on a fixture must explicitly accept that fixture's name as an argument.
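A minimal sketch (the config contents are illustrative):

```python
import pytest


@pytest.fixture
def default_config():
    # Repeatable setup: every test receives a fresh copy of this dict.
    return {"retries": 3, "timeout_seconds": 30}


def test_retries(default_config):
    # pytest injects the fixture by matching the argument name.
    assert default_config["retries"] == 3
```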
Parametrized Test Functions
A great way to avoid code duplication is to use parametrized tests; the magic happens behind the @pytest.mark.parametrize decorator.
The builtin pytest.mark.parametrize
decorator enables parametrization of arguments for a test function.
This enables us to test different scenarios, all in one function. We make use of the @pytest.mark.parametrize
decorator, where we are able to specify the names of the arguments that will be passed to the test function, and a list of arguments corresponding to the names.
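A minimal sketch (the scenarios are illustrative):

```python
import pytest


@pytest.mark.parametrize(
    "value, expected",
    [(1, 1), (2, 4), (3, 9)],
)
def test_square(value, expected):
    # pytest runs this function once per (value, expected) tuple.
    assert value ** 2 == expected
```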
In other words, you can think of this decorator as behaving like the zip function, producing one tuple of arguments per scenario.
Note: for the decorator
@pytest.mark.parametrize
the first argument to parametrize() is a comma-delimited string of parameter names. The second argument is a list of either tuples or single values that represent the parameter value(s).
Categorizing Tests using marks
In any large test suite, some of the tests will inevitably be slow. They might test timeout behavior, for example, or they might exercise a broad area of the code. Whatever the reason, it would be nice to avoid running all the slow tests when you’re trying to iterate quickly on a new feature. pytest enables you to define categories for your tests and provides options for including or excluding categories when you run your suite. You can mark a test with any number of categories.
Marking tests is useful for categorizing tests by subsystem or dependencies. If some of your tests require access to a network, for example, then you could create a @pytest.mark.network_access
mark for them.
- First, define the markers in the pytest.ini file:
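A minimal pytest.ini sketch registering the two marks used in this section (the descriptions are illustrative):

```ini
[pytest]
markers =
    network_access: tests that require access to a network
    local_test: tests that run fully locally
```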
- Create a test file
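A sketch of a test file using both marks (the test bodies are illustrative stand-ins):

```python
import pytest


@pytest.mark.network_access
def test_fetch_status():
    # Stand-in for a test that would hit a real endpoint.
    assert "https://gitlab.com".startswith("https")


@pytest.mark.local_test
def test_local_math():
    assert 1 + 1 == 2
```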
- Run just the network_access test(s) with: pytest -v -m network_access
- Run just the local_test test(s) with: pytest -v -m local_test
Duration Report
If you plan to improve the speed of your tests, then it’s useful to know which tests might offer the biggest improvements. pytest
can automatically record test durations for you and report the top offenders.
Use the --durations
option to the pytest command to include a duration report in your test results. --durations
expects an integer value n and will report the slowest n number of tests.
For example, pytest --durations=3 reports the three slowest tests in the run.
Using pytest with exceptions
Sometimes you want to assert that a piece of code raises an exception. The solution in pytest is fairly simple; here is an example:
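A minimal sketch using pytest.raises (the function is illustrative):

```python
import pytest


def divide(a: int, b: int) -> float:
    return a / b


def test_divide_by_zero():
    # The test passes only if ZeroDivisionError is actually raised.
    with pytest.raises(ZeroDivisionError):
        divide(1, 0)
```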
Using pytest with a fake RESTful API
Use the unittest.mock library when you need to test your code against a RESTful API. unittest.mock allows you to define API calls and responses, enabling you to test such behaviors within a test environment.
Usage is pretty simple:
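A minimal sketch (the endpoint URL and response shape are illustrative; in production the client could be e.g. a requests session):

```python
from unittest import mock


def get_user_name(client) -> str:
    # `client` is any object with a .get() method whose result has .json().
    response = client.get("https://example.com/api/user/1")
    return response.json()["name"]


fake_client = mock.Mock()
fake_client.get.return_value.json.return_value = {"name": "Ada"}

assert get_user_name(fake_client) == "Ada"
fake_client.get.assert_called_once_with("https://example.com/api/user/1")
```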
Beyond pytest: Useful pytest Plugins
When pytest alone cannot cover your needs in more complicated scenarios, handy plugins can be found. So far we have not found a use for them in the /analytics repo, but it is good to know there are useful tools that can help you do your work.
- pytest-randomly - a pytest plugin to randomly order tests and control random.seed.
- pytest-cov - this plugin produces coverage reports. Compared to just using coverage run, it adds some extras: subprocess support, xdist support, and consistent pytest behavior.
- Full plugin list - a list of pytest plugins.
Tools for supporting coding quality
The Python community provides a comprehensive set of tools to ensure code quality and high standards for the codebase. The idea is to list a couple of tools from our code-quality toolbox and expose some interesting items worth considering as potential options for future use.
Black
Our main formatting tool is Black, run with its default configuration.
There is a manual CI job in the review stage (Python section) that will check the entire repo and return a non-zero exit code if files need to be formatted. It is up to both the MR author and the reviewer to make sure that this job passes before the MR is merged. To format the entire repo, run black . from the repository root. To check files without reformatting them, whether the whole repo or any particular folder or file, run black --check followed by the path.
mypy
Mypy is an optional static type checker for Python that aims to combine the benefits of dynamic (or duck) typing and static typing. Mypy combines the expressive power and convenience of Python with a powerful type system and compile-time type checking, and it type checks standard Python programs. Run it by passing a path, for example: mypy extract/
flake8
Your tool for style guide enforcement: Flake8 is a popular lint wrapper for Python. Under the hood, it runs three other tools and combines their results: pycodestyle (formerly pep8) for checking style, pyflakes for checking syntax, and mccabe for checking complexity. Run it by passing a path, for example: flake8 extract/
pylint
Pylint is a static code analyser for Python.
Pylint analyses your code without actually running it. It checks for errors, enforces a coding standard, looks for code smells, and can make suggestions about how the code could be refactored. Pylint can infer actual values from your code using its internal code representation (astroid). If your code is import logging as argparse, Pylint will know that argparse.error(…) is in fact a logging call and not an argparse call.
Pylint is highly configurable and lets you write plugins to add your own checks (for example, for internal libraries or an internal rule). Pylint also has an ecosystem of existing plugins for popular frameworks and third-party libraries. Run it by passing a path, for example: pylint extract/
xenon
Xenon is a monitoring tool based on Radon. It monitors your code’s complexity. Ideally, Xenon is run every time you commit code. Through command line options, you can set various thresholds for the complexity of your code. It will fail (i.e. it will exit with a non-zero exit code) when any of these requirements is not met.
A typical invocation sets the thresholds on the command line, for example: xenon --max-absolute B --max-modules A --max-average A path/to/code
vulture
Vulture finds unused code in Python programs. This is useful for cleaning up and finding errors in large code bases. If you run Vulture on both your library and test suite you can find untested code.
Due to 🐍Python’s dynamic nature, static code analyzers like Vulture are likely to miss some dead code. Also, code that is only called implicitly may be reported as unused. Nonetheless, Vulture can be a very helpful tool for higher code quality.
Run it by passing one or more paths, for example: vulture path/to/code
A few more handy libraries
For more on how we automated linting to keep code quality at an elevated level, refer to 🐍 Python CI jobs. All of these linters can be run automatically; for that purpose, we created a comprehensive set of commands using a Makefile.
In addition, we recommend checking out, exploring, and considering:
- pycodestyle
- yapf
- autopep8
- howdoi - a good tool for a quick search.
- isort - a Python utility/library to sort imports alphabetically and automatically separate them into sections and by type.
Tools for automating our coding quality standards
For automating code quality and testing, we are using our own product: the GitLab CI/CD pipeline. Details of the pipelines we use for Python can be found on the page CI jobs (Python).
When not to use Python
Since this style guide is for the entire data team, it is important to remember that there is a time and place for using Python, and it is usually outside of the data modeling phase. Stick to SQL for data manipulation tasks where possible.