Jupyter Guide

Guidance on setting up JupyterLab

See related repository

Features

  • Full install of JupyterLab with the most useful extensions pre-installed.
  • Common Python DS/ML libraries (pandas, scikit-learn, SciPy, etc.)
  • Natively connected to Snowflake using your dbt credentials. No login required!
  • Git functionality: push and pull to Git repos natively within JupyterLab (requires ssh credentials)
  • Run any Python file or notebook on your computer or in a GitLab repo; the files do not have to be in the data-science container.
  • Lint Python code with Black natively within Jupyter
  • Need a feature you use but don’t see? Let us know on #bt-data-science!

Getting Started

First, you need to install and launch Rancher Desktop, an open-source container manager, on your local machine.

You have two options when setting up Jupyter via the data-science project. Choose one of the following:

  • Full install (Recommended): Creates a pipenv virtual environment on your local machine and installs Mambaforge and the libraries defined in this Pipfile.
  • Minimal install: Creates a pipenv virtual environment and installs only the libraries defined in this Pipfile. Use this install if you already have a Python environment on your local machine that you would like to use as the base instead of the Mambaforge version. Requires Python 3.9.

Installation Instructions

  1. Prerequisites: before installing, make sure your system is set up with the following:
    • Python3
    • Pip3 (usually aliased as pip).
    • pipenv. If it is not installed, install it from the command line: pip install pipenv
    • On certain versions of macOS, you may need to install the Xcode Command Line Tools. From the command line, run: xcode-select --install
  2. Clone the repo to your local machine: git clone git@gitlab.com:gitlab-data/data-science.git
  3. Navigate to the directory: cd data-science
  4. Based on which version you would like to install, run one of the following:
    • For full install: make setup-jupyter-local
    • For minimal install: make setup-jupyter-local-no-mamba
  5. Launch JupyterLab: make jupyter-local
  6. JupyterLab will launch automatically in your default browser.
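
Before running the make targets, it can help to confirm that the prerequisite tools are actually on your PATH. The following is a minimal sketch, not part of the data-science repo: the helper name missing_tools and the exact tool list are assumptions for illustration (git and make are included because the later steps call them).

```python
import shutil

# Tools named in the prerequisites, plus git and make, which the later steps call.
# This list is an assumption for illustration; adjust it to your machine.
REQUIRED_TOOLS = ["python3", "pip3", "pipenv", "git", "make"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install these before running the setup:", ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out; in normal use you would call missing_tools() with no arguments.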

Linting the repository

Included in the environment setup are all of the libraries needed to lint Jupyter notebooks in the repository. When you launch JupyterLab and open a notebook you should see a new “Format Notebook” icon in the task bar of your notebook. Clicking that button will lint your entire notebook.

Alternatively, after completing the setup instructions above, run the following from the root of the data-science repo:

make lint

This will find and correct any issues according to the Black format.

Connecting to Snowflake

  1. Make sure you have set up a /Users/{your_user_name}/.dbt/profiles.yml file on your local machine, and that it does not include your password. profiles.yml must be placed in the .dbt directory inside your “home” directory; otherwise, you will not be able to connect to Snowflake from your local machine. You can use the example provided here as a reference.
  2. Run through the auth_example notebook in the repo to confirm that you have configured everything successfully. The first time you run it, you will get a browser redirect to authenticate your Snowflake credentials via Okta. After that, if you run the notebook again, you should be able to query data from Snowflake.
  3. If you get an error, Snowflake is likely not properly configured on your machine; most often your .dbt/profiles.yml is not set up correctly. Please refer to the Snowflake and dbt sections of the Data Onboarding Issue.
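
The two conditions in step 1 (the file exists in the right place and holds no password) can be checked before opening the notebook. This is a rough sketch under stated assumptions: the function names password_lines and check_dbt_profile are hypothetical, and the scan is a plain text search for a `password` key rather than a real YAML parse, so treat it as a sanity check only.

```python
from pathlib import Path


def password_lines(profile_text):
    """Return 1-based numbers of lines whose key is `password` (comments ignored)."""
    found = []
    for number, line in enumerate(profile_text.splitlines(), start=1):
        content = line.split("#", 1)[0]  # drop anything after a YAML comment marker
        key = content.split(":", 1)[0].strip().lower()
        if key == "password":
            found.append(number)
    return found


def check_dbt_profile(profile=None):
    """Return a list of problems with the dbt profile file (empty if none found)."""
    profile = Path(profile) if profile is not None else Path.home() / ".dbt" / "profiles.yml"
    if not profile.is_file():
        return [f"{profile} does not exist"]
    return [f"line {n} sets a password; remove it" for n in password_lines(profile.read_text())]


if __name__ == "__main__":
    for problem in check_dbt_profile():
        print(problem)
```

An empty result only means these two checks passed; the auth_example notebook remains the real test of the connection.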

Mounting a local directory

By default, the local install will use the data-science folder as the root directory for Jupyter. This is not very useful when your code, data, and notebooks are in other repositories on your computer. To change this, you will need to create and modify a JupyterLab config file:

  1. Open a terminal and navigate to the data-science repo, e.g. cd repos/data-science. The config file must be created with the pipenv environment set up in the steps above: pipenv run jupyter-lab --generate-config. This creates the file /Users/{your_user_name}/.jupyter/jupyter_lab_config.py.
  2. Browse to the file location and open it in an editor
  3. Search for the following line in the file: #c.ServerApp.root_dir = '' and replace with c.ServerApp.root_dir = '/the/path/to/other/folder/'. If unsure, set the value to your repo directory (i.e. c.ServerApp.root_dir = '/Users/{your_user_name}/repos'). Make sure you remove the # at the beginning of the line.
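  4. Make sure you use forward slashes in your path. Folder names that contain spaces need no special treatment inside the quotes. If you must use backslashes, double each one (for example "\\{your_user_name}\\Any Folder\\More Folders\\"), because a single backslash starts an escape sequence in a Python string.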
  5. Rerun make jupyter-local from the data-science directory and your root directory should now be changed to what you specified above.
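
Because jupyter_lab_config.py is an ordinary Python file, the root path can also be built in code instead of typed by hand, which sidesteps the slash question entirely. A minimal sketch, assuming your repositories live in a folder under your home directory; the helper name notebook_root is hypothetical.

```python
from pathlib import Path


def notebook_root(*parts):
    """Build an absolute path under the home directory, using forward slashes."""
    return Path.home().joinpath(*parts).as_posix()


if __name__ == "__main__":
    # Shows the value that would be assigned to c.ServerApp.root_dir.
    print(notebook_root("repos"))
```

Inside jupyter_lab_config.py, where Jupyter provides the `c` object, the setting would then read c.ServerApp.root_dir = notebook_root("repos").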

Enabling Jupyter Templates

The data science team has created modeling templates that allow you to easily start building predictive models without writing python code from scratch. To enable these templates:

  • In the jupyter_lab_config.py that you created in the Mounting a local directory section, add the following lines, replacing /Users/{your_user_name}/repos/data-science/templates with the path to the data-science/templates folder on your local machine:
c.JupyterLabTemplates.template_dirs = ['/Users/{your_user_name}/repos/data-science/templates']
c.JupyterLabTemplates.include_default = False
  • Launch JupyterLab and you should see a new Template icon. Click the icon and select which template you would like to use.

Increasing Container Memory Allocation

By default, Rancher Desktop will allocate a small percentage of your machine’s memory to run containers. This is likely not enough RAM to work with Jupyter and Python, as data is held in memory. We recommend increasing the memory allocation to avoid out-of-memory errors.

  1. Open Rancher Desktop (on Mac, there should be an icon in the top menu bar), then Preferences, then Kubernetes Settings.
  2. Here you can allocate additional memory and CPUs to be used by your containers. 8 GB and 2 CPUs are recommended, but you may have to increase this further if working with large datasets or an intensive multithreaded process.
  3. Click “Restart Kubernetes”.
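
To judge whether the recommended allocation is enough for a given dataset, a back-of-the-envelope estimate is often sufficient. The sketch below assumes a dense numeric table with 8 bytes per value (the size of a 64-bit float or integer); text columns and the intermediate copies made during processing need more, so read the result as a lower bound. The function name estimate_gib is hypothetical.

```python
def estimate_gib(rows, columns, bytes_per_value=8):
    """Rough in-memory size, in GiB, of a dense numeric table."""
    return rows * columns * bytes_per_value / 2**30


if __name__ == "__main__":
    # 10 million rows by 50 numeric columns comes to roughly 3.7 GiB.
    print(f"{estimate_gib(10_000_000, 50):.1f} GiB")
```

On this estimate, a table of that size already uses close to half of an 8 GB allocation before any model training starts.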

Setting Up Jupyter Extensions

  • The data-science repo comes with many useful JupyterLab extensions pre-installed, including git, variable inspector, collapsible headings, execute time, and system monitor.
  • To get the most out of these (and to avoid having to configure them every time you run the container), create the following file: /Users/{your_user_name}/.jupyter/lab/user-settings/@jupyterlab/notebook-extension/tracker.jupyterlab-settings
  • Within that file, paste the following and save:
{
    "codeCellConfig": {
        "codeFolding": true,
        "lineNumbers": true
    },
    "recordTiming": true
}
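
Creating the nested folders by hand is easy to get wrong, so the file can also be written with a few lines of Python. This is a sketch only: the names SETTINGS, RELATIVE_PATH, render_settings and write_notebook_settings are hypothetical, the path and values are copied from the steps above, and the write overwrites any existing file at that location.

```python
import json
from pathlib import Path

# Settings copied from the guide above.
SETTINGS = {
    "codeCellConfig": {"codeFolding": True, "lineNumbers": True},
    "recordTiming": True,
}

# Location of the settings file relative to the home directory.
RELATIVE_PATH = Path(
    ".jupyter/lab/user-settings/@jupyterlab/notebook-extension/tracker.jupyterlab-settings"
)


def render_settings():
    """Return the settings as JSON text."""
    return json.dumps(SETTINGS, indent=4) + "\n"


def write_notebook_settings(home=None):
    """Write the settings file under `home`, overwriting any existing copy."""
    target = (Path(home) if home is not None else Path.home()) / RELATIVE_PATH
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_settings())
    return target
```

Calling write_notebook_settings() with no argument targets your home directory; pass another directory first if you want to inspect the output before replacing your real settings.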

Updating the Image Instructions

  1. From the data-science repo, pull the latest changes to your local machine: git pull
  2. Run the installation commands:
    • For full install: make setup-jupyter-local
    • For minimal install: make setup-jupyter-local-no-mamba
  3. Launch JupyterLab: make jupyter-local

Some interesting libraries included

Data/Model Analysis

Visualisation tools:

ML libraries

Easy concurrency

  • Modin: Pandas optimization
  • Dask (must be self-installed): Parallel computing
Last modified December 18, 2023: Reword Gitlab to GitLab (ebb703f2)