Pre-commit hooks

This repository uses the Python package pre-commit to manage pre-commit hooks. Pre-commit hooks are actions that run automatically, typically on each commit, to perform a common set of tasks. For example, a pre-commit hook might lint code automatically before it is committed, helping to keep code quality consistent.

Purpose

For this repository, we are using pre-commit for a number of purposes:

  • checking for secrets being committed accidentally (there is a strict definition of a “secret”);

  • checking for any large files (over 5 MB) being committed; and

  • cleaning Jupyter notebooks, which means removing all outputs, execution counts, Python kernels, and, for Google Colaboratory (Colab), stripping out user information.

We have configured pre-commit to run automatically on every commit. Because the hooks run on each commit, contraventions can be flagged before they enter the repository's history, which helps keep our repository in a healthy state.
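To make the large-file check concrete, here is a minimal sketch of the idea in Python. It is not the hook's actual implementation: the function name find_large_files and the constant MAX_BYTES are hypothetical, and reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: "5 MB" is read as 5 * 1024 * 1024 bytes; the real hook may differ.
MAX_BYTES = 5 * 1024 * 1024


def find_large_files(paths, max_bytes=MAX_BYTES):
    """Return the given paths that are files larger than max_bytes."""
    return [
        path
        for path in map(Path, paths)
        if path.is_file() and path.stat().st_size > max_bytes
    ]
```

A check like this would be given the list of files staged for commit and would block the commit if the returned list is not empty.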

Pre-commit hooks and Google Colab

No pre-commit hooks will be run on Google Colab notebooks pushed directly to GitHub. For security reasons, we recommend that you download your notebook manually and commit it locally, so that the pre-commit hooks run on your changes.

Installation

For pre-commit to run, you first need to configure it on your system:

  • install the pre-commit package into your Python environment from requirements.txt; and

  • run pre-commit install in your terminal to set up pre-commit to run when code is committed.

Using the detect-secrets pre-commit hook

Secret detection limitations

The detect-secrets package does its best to prevent secrets being committed accidentally, but it may miss things. Do not rely on it alone; focus on good software development practices! See the definition of a secret for further information.

We use detect-secrets to check that no secrets are accidentally committed. This hook requires you to generate a baseline file if one is not already present within the root directory. To create the baseline file, run the following at the root of the repository:

detect-secrets scan > .secrets.baseline

Next, audit the baseline that has been generated by running:

detect-secrets audit .secrets.baseline

When you run this command, you’ll enter an interactive console. This will present you with a list of high-entropy strings and/or anything else that could be a secret, and ask you to verify whether each one really is a secret. This allows the hook to remember false positives in the future, and to alert you to new secrets.
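For readers unfamiliar with the term, a “high-entropy string” is one whose characters are hard to predict, as is typical of generated keys and tokens. The sketch below computes Shannon entropy in bits per character to illustrate the idea. It is an illustration only: the function name shannon_entropy is hypothetical, and detect-secrets applies its own character sets and thresholds, which are not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Return the Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A repetitive string scores lower than a random-looking token.
print(shannon_entropy("aaaaaaaa"))
print(shannon_entropy("kJ8#qZ2!vR5$"))
```

Strings that score above some threshold are candidates for review, which is why ordinary but random-looking values (hashes, identifiers) can appear as false positives during the audit.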

Definition of a “secret” according to detect-secrets

The detect-secrets documentation, as of January 2021, says it works:

…by running periodic diff outputs against heuristically crafted [regular expression] statements, to identify whether any new secret has been committed.

This means it uses regular expression patterns to scan your code changes for anything that looks like a secret according to the patterns. By definition, there are only a limited number of patterns, so the detect-secrets package cannot detect every conceivable type of secret.
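The following sketch shows the general shape of pattern-based scanning and why it has limits. The patterns and the names PATTERNS and scan_lines are hypothetical examples chosen for illustration; they are not taken from detect-secrets, which ships its own plugins.

```python
import re

# Hypothetical patterns for illustration only; not detect-secrets' own rules.
PATTERNS = {
    "AWS-style access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Private key header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_lines(lines):
    """Yield (line_number, pattern_name) for every line matching a pattern."""
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                yield number, name


changed_lines = [
    'region = "eu-west-2"',
    'key_id = "AKIAIOSFODNN7EXAMPLE"',  # matches the first pattern
    'login = "hunter2"',  # a real password, but no pattern matches it
]
print(list(scan_lines(changed_lines)))
```

The third line is the important one: a value that fits none of the patterns passes unnoticed, which is the limitation described above.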

To understand what types of secrets will be detected, read the detect-secrets documentation on caveats and the list of supported plugins. You should also name any variables that hold secrets using words that will trip the KeywordDetector plugin; see the DENYLIST variable in the KeywordDetector plugin for the full list of words.

If pre-commit detects secrets during commit

If pre-commit detects any secrets when you try to create a commit, it will detail what it found and where to go to check the secret.

If the detected secret is a false positive, there are two options to resolve this, and prevent your commit from being blocked:

In either case, if an actual secret is detected (or a combination of actual secrets and false positives), first remove the actual secret. Then follow either of these processes.

Updating .secrets.baseline

To exclude a false positive, you can also update the .secrets.baseline by repeating the same two commands as in the initial setup.

During auditing, if the detected secret is actually a secret (or other sensitive information), remove the secret and re-commit. There is no need to update the .secrets.baseline file in this case.

If your commit contains a mixture of false positives and actual secrets, remove the actual secrets first before updating and auditing the .secrets.baseline file.

Keeping specific Jupyter notebook outputs

It may be necessary or useful to keep certain output cells of a Jupyter notebook, for example charts or graphs visualising some set of data. To do this, according to the documentation for the nbstripout package, either:

  1. add a keep_output tag to the desired cell; or

  2. add "keep_output": true to the desired cell’s metadata.

You can access cell tags or metadata in Jupyter by enabling the “Tags” or “Edit Metadata” toolbar (View > Cell Toolbar > Tags; View > Cell Toolbar > Edit Metadata).

For the tags approach, enter keep_output in the text field for each desired cell, and press the “Add tag” button. For the metadata approach, press the “Edit Metadata” button on each desired cell, and edit the metadata to look like this:

{
  "keep_output": true
}

This will tell the hook not to strip the resulting output of the desired cell(s), allowing the output(s) to be committed.
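As a rough picture of what the tag and metadata achieve, the sketch below clears the outputs of code cells in a notebook (loaded as a dictionary) unless a cell carries either marker. It is a simplified illustration, not nbstripout's implementation: the function names are hypothetical, and execution counts, kernel information and other metadata are deliberately left out.

```python
def should_keep_output(cell):
    """Return True if the cell is marked with the keep_output tag or metadata."""
    metadata = cell.get("metadata", {})
    return (
        metadata.get("keep_output") is True
        or "keep_output" in metadata.get("tags", [])
    )


def strip_outputs(notebook):
    """Clear the outputs of unmarked code cells, in place, and return the notebook."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code" and not should_keep_output(cell):
            cell["outputs"] = []
    return notebook
```

Both markers live under the cell's metadata in the .ipynb file: the tag appears in the "tags" list, while the metadata approach sets the "keep_output" key directly.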

Tags and metadata on Google Colab

As of March 2020, there is no way to add tags and/or metadata to Google Colab notebooks.

If you want to keep some outputs, it’s strongly suggested that you download the Colab notebook as a .ipynb file, and edit its tags and/or metadata using Jupyter before committing the code.
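If opening Jupyter is inconvenient, the metadata of a downloaded .ipynb file can also be edited with a short script, since the file is plain JSON. The sketch below is one possible approach under stated assumptions: the function name mark_keep_output is hypothetical, cells are selected by their zero-based position in the notebook, and the file is rewritten in place, so work on a copy if in doubt.

```python
import json
from pathlib import Path


def mark_keep_output(notebook_path, cell_indices):
    """Set "keep_output": true in the metadata of the chosen cells and save."""
    path = Path(notebook_path)
    notebook = json.loads(path.read_text(encoding="utf-8"))
    for index in cell_indices:
        cell = notebook["cells"][index]
        cell.setdefault("metadata", {})["keep_output"] = True
    path.write_text(
        json.dumps(notebook, indent=1, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

For example, mark_keep_output("analysis.ipynb", [2, 5]) would mark the third and sixth cells of a hypothetical notebook called analysis.ipynb.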

Pre-commit hooks in the {{ cookiecutter.repo_name }} folder

If you’re changing any pre-commit hooks, note that .pre-commit-config.yaml can be different to {{ cookiecutter.repo_name }}/.pre-commit-config.yaml. This includes which pre-commit hooks are run, as well as on which folders.

It’s strongly recommended that you build an example project to test out your changes.