Documentation

Motivation

Even well written code requires some documentation. Good documentation makes your code easier for someone else to understand and easier to work with. It also helps people who join the project or review the code. Documentation should always be up to date and stored alongside the code itself (for example as part of the code Gitlab or Github repository), so that people don’t have to search for it.

Documentation can also be excessive! Writing readable, automated code (see code standards) reduces the need for lots of code comments and lengthy desk instructions.

4. Document everything that is needed to write and run the code

You should document everything that a contributor would need to know to review, edit or run your code. Things should be documented alongside the code whenever possible, rather than somewhere else that readers will need to find and search for. We have listed some essential forms of documentation below. You may need to make others, depending on the project.

4.1 Create README files for every project

All pipelines should have a README file. README files are easy to create and maintain but are often neglected as part of pipeline development.

The README file should be the first thing that someone looking at your code reads. As a minimum, a README should include:
* Background - what the code is for and the process it executes. A simple and concise summary of what the process does and its inputs and outputs is very helpful here.
* Set up instructions, for example how to install the code, get all the supporting packages you need, prepare your input data, and configure the workflow.
* How to run the code.

Depending on the project, you might add other important information to your README file, such as how somebody should contribute to the code and which style guide to use. After reading the README, anyone should be able to install and run your code, provided they have access to the data.

If you need detailed running instructions, you might create extra documentation for these and link it to your README file.

4.2 Embed design documentation with code

Every project should have some design documentation. How much you need depends on the complexity of the work. At the very least it should cover what the pipeline does, what it uses as its inputs and what outputs it makes. This can be included in the README file, or linked from it if there is a lot of detail. You can also use flow charts to show what the overall design looks like. It is also useful to link to relevant documentation about the methods used and why the workflow is set up as it is.

Documenting the design comprehensively helps everyone understand what the code is supposed to achieve. It also helps you to assess and assure that the code actually does what it is supposed to do. Having a clear idea of the overall design and why it is set up this way also makes it clear whether the code could be improved.

4.3 Document dependencies

Most pipelines rely on external packages or modules. Both R and python have extensive libraries of pre-built functions so you do not need to rewrite commonly used functionality yourself.

Make it as easy as possible for others to replicate the correct environment when running the code. That means documenting the packages you use, for example using requirements files in python or a DESCRIPTION file and other package documentation in R. Generally, packaging your code in python modules or an R package is the best way to make sure dependencies are handled cleanly.

At a minimum, you should have a list of dependencies in your documentation. If you need specific versions of these for the code to work, make sure you let your users know which versions they can use.

4.4 Record manual quality assurance and tests

When tests and checks are not automated, record these alongside the code. Include manual checks and tests in desk instructions, for example in the README file or a separate testing guide. This means other people running the pipeline will be able to replicate your quality assurance steps and verify the work you have done.

For manual tests and checks, maintain a clear record of what was tested, who by, when and what they found. The easiest way to do this is as part of a code review, which you can record using pull requests on GitHub or merge requests on Gitlab.

5. Document what the code is doing

Your scripts should contain enough documentation for other coders to be able to understand what each part of the code is intended to do. However, there is such a thing as too much documentation. Documentation inside the code should not repeat things that are documented elsewhere, for example in the README file.

5.1 Code comments should explain the “why” not the “how”

Use code comments (other than function documentation) sparingly. Code comments are there to explain why the code is written the way it is when it would not be obvious to an experienced programmer. They are not there to explain the syntax to inexperienced programmers or to repeat in plain English the logic that is already there in the code.

Tip

Too many code comments make code hard to read and work with!

The one exception to this rule is when the code is written specifically for learning purposes. A lot of people assume code should be littered with comments, as the tutorial code they learned from was heavily commented. If colleagues struggle to understand production code, you should support them to learn rather than turning pipeline code into tutorial code.

Explanations of methodology or your catalogue of assumptions do not belong in code comments. Rather, you should record them elsewhere and signpost them in your documentation. For example, you can create assumption or methodology documentation files and place them alongside the code in your repository.

It can be useful to note in your code comments when an assumption you have made or method you have used applies directly, but you should reference the assumption or method and provide a link to where more information about it can be found, rather than setting this out in detail in the code comments.

5.2 Document functions and classes

If your code uses functions and classes, you must document them appropriately. In python that means using docstrings. In R it means using roxygen2. These formats are especially useful when packaging code in R or writing modules and packages in python. We recommend using them for all your functions and classes because they give you a standard format to work with that makes it much easier to package code at a later date. Otherwise, you can use a block of code comments.

Good function and class documentation reduces the need to clutter the code with comments and ensures you cover the most important information. Always document what a function or class takes in as its inputs, what it does, and what output it produces. If you can, provide an example of use.