The `just` command provides a list of available recipes:

```
just list
```
Running any of the `just` commands that need it will set up a local environment and install dependencies.
Tests are divided into the following categories.
Each category lives in its own directory (for example `tests/unit`) and has its own `just` command to run it (for example `just test-unit`).
To run all tests as they're run in CI (with code coverage):

```
just test-all
```
To run just one category of tests:

```
just test-<category>
```

For example:

```
just test-unit
```
Additional arguments can be passed to any test commands, for example:

```
just test-unit --verbose --durations=10
```
For maximum flexibility, the `test` command can be used to run individual test files or tests, or to do other clever things with `pytest`. It just delegates to `pytest`.

For example:

```
just test tests/integration/backends/test_tpp.py
```
There are further notes on using `pytest` in the wiki:
https://github.com/opensafely-core/ehrql/wiki/Tips-for-using-pytest
The generative tests use Hypothesis to generate variable definitions (in
the query model) and test data. They then execute the resulting dataset
definitions using all the available query engines, and check that the
results are the same.
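In outline, each generative test draws a dataset definition and some data, runs them through every engine, and asserts the results agree. A minimal sketch, assuming hypothetical strategy and helper names (the real versions live in this repo's generative test suite and may be named differently):

```python
from hypothesis import given, settings

# Hypothetical imports: stand-ins for the real strategies and helpers.
from generative.strategies import dataset_strategy, data_strategy
from generative.helpers import run_with_engine

QUERY_ENGINES = ["in_memory", "sqlite"]

@settings(max_examples=100)
@given(dataset=dataset_strategy(), data=data_strategy())
def test_engines_agree(dataset, data):
    # Execute the same generated dataset definition against every
    # engine and check that the results all match.
    results = [run_with_engine(engine, dataset, data) for engine in QUERY_ENGINES]
    first, *rest = results
    for other in rest:
        assert other == first
```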
To get the most out of our generative tests we want them to run for a
long time and to explore different parts of the query space each time
they're run. But neither of these qualities is desirable in CI.
For this reason, the CI configuration generates only a small number of
examples, of limited query depth, and in a deterministic fashion. They
thus function more as a test that the generative test machinery is
working correctly than as a serious attempt to do generative testing.
When running locally, although randomisation is enabled by default,
the number of examples is still very small (to keep test runtimes
reasonable). To do some "proper" generative testing you can run the command:

```
just test-generative
```

which increases the example count by setting `GENTEST_EXAMPLES`.
It would be worth running these locally when adding new query model
operations or making significant changes to a query engine. You may even want to crank the settings up further, e.g.

```
GENTEST_EXAMPLES=10000 GENTEST_MAX_DEPTH=20 just test-generative
```
You can control which query engines the tests exercise using the environment variable `GENTEST_QUERY_ENGINES`. For instance, if you have made a change to the basic SQL-building logic in `BaseSQLQueryEngine` and you want to rapidly test this with a large number of examples, you could compare just the in-memory and SQLite engines using:

```
GENTEST_QUERY_ENGINES='in_memory sqlite' GENTEST_EXAMPLES=10000 just test-generative
```
In addition to whatever you do locally, a scheduled GitHub Actions workflow runs the generative tests overnight with settings as high as we can get away with, and alerts us in Slack if it finds a failure.
You can get Hypothesis to dump statistics at the end of the run with `--hypothesis-show-statistics`, or (more usefully) dump some of our own statistics about the generated data and queries by setting `GENTEST_DEBUG=t`.
When debugging a failure you'll probably want to reproduce it.
Hypothesis often struggles to shrink the examples it finds, and even
small examples can appear overwhelmingly verbose due to the repetitive
nature of query model reprs. To help with this there is some tooling,
and a process to follow:
1. Copy the `dataset` and `data` arguments from the example that Hypothesis displays and paste them into a new file. (Don't worry about stripping indentation or trailing commas here.)

2. Run the command:

   ```
   just gentest-example-simplify PATH_TO_FILE.py
   ```

   This should transform the copied code into a valid test example and pull out some of the repeated elements into variables (see the sketch after this list).

3. Run the command:

   ```
   just gentest-example-run PATH_TO_FILE.py
   ```

   This should execute the example and confirm that the test fails in the expected way.

4. To further simplify the example you can copy a repeated element, assign it to a variable, and then re-run `gentest-example-simplify` on the file. This will replace occurrences of that element with a reference to the variable.
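For illustration only, a simplified example file might end up looking something like this; the import path and constructor arguments below are assumptions rather than the real query model API, and real examples will differ:

```python
# Hypothetical sketch of a simplified gentest example file.
from ehrql.query_model.nodes import SelectColumn, SelectPatientTable

# gentest-example-simplify pulls repeated elements out into variables:
patients = SelectPatientTable("patients")

dataset = SelectColumn(patients, "date_of_birth")
data = [
    {"type": "patients", "patient_id": 1, "date_of_birth": "2000-01-01"},
]
```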
Hypothesis can generate query graphs that are very deeply nested; after 100 draws in a test example, Hypothesis will return the example as invalid. To avoid this, the variable strategies check for a maximum depth and return a terminal node if the maximum depth is exceeded (a `SelectColumn` node for a series strategy, and a `SelectTable` or `SelectPatientTable` for a table strategy). The max depth defaults to 15 and can be overridden with the environment variable `GENTEST_MAX_DEPTH`.
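The depth-limiting pattern looks roughly like this (a self-contained toy sketch using stand-in node types, not the real strategies):

```python
import hypothesis.strategies as st

MAX_DEPTH = 15  # mirrors the GENTEST_MAX_DEPTH default

# Toy node types: integers stand in for terminal nodes like
# SelectColumn, tuples stand in for compound query model nodes.
terminal = st.integers()

def node(depth=0):
    # Once the maximum depth is reached, stop recursing and return a
    # terminal node rather than letting Hypothesis hit its draw limit.
    if depth >= MAX_DEPTH:
        return terminal
    return st.one_of(
        terminal,
        st.tuples(st.just("compound"), node(depth + 1)),
    )
```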
See the generative tests documentation for more details.
Please think carefully about how to test code that you are adding or changing.
We think that test maintainability is a big risk for this system, so we're trying to be very deliberate about the kind of tests that we write.
You should follow these guidelines, and raise them with the rest of the team for discussion if you think they are problematic.
Contrary to practice in some quarters, we allow disk access by unit tests because it doesn't seem to cause any significant slow-down in those tests at the moment.
We'll keep this under review.
Logging is very verbose and is turned off by default in tests. To turn it on, set the environment variable `LOG_SQL=1` and pass the `-s` option to turn off log capture in pytest:

```
LOG_SQL=1 just test-all -s
```
The files for test categories that target individual modules (for example unit and integration tests) are organized into roughly the same directory structure as the `ehrql` package itself.
Generally a module `ehrql.foo` will have a corresponding test file like `tests/unit/test_foo.py`.
However we do not stick slavishly to this: where appropriate we may collapse tests for submodules like `ehrql.foo.{bar,bam}` into a single test file like `tests/unit/test_foo.py`, or break tests for a module like `ehrql.foo` into multiple test files like `tests/unit/foo/test_{one,another}_aspect.py`.
Test categories that run against ehrQL as a whole or against multiple components (for example spec and acceptance tests) have their own internal structure.
Our approach to code coverage is to fail the build with less than 100% coverage, but be reasonably liberal about allowing lines to be marked as not requiring coverage.
If you make a change that results in a newly un-covered line, you should make a good attempt to test it and expect to have to justify any omissions to PR reviewers;
but for genuinely hard or impossible to hit cases it's okay to mark them as `no cover`.
Any `no cover` pragmas should include a note explaining why the code can't be hit.
Common cases are configured in `pyproject.toml` with their own explanations.
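For example, a marked line with its justification might look like this (illustrative code, using the standard coverage pragma syntax):

```python
def double_if_int(value):
    if isinstance(value, int):
        return value * 2
    # Unreachable: callers validate their arguments before calling us,
    # so this guard exists only for defence in depth.
    raise AssertionError(f"unhandled type: {type(value)}")  # pragma: no cover
```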
For tests that need to run against a database, we run the database in a Docker container.
Each run of the tests starts the database if it's not already running and then leaves it running at the end to speed up future runs.
(Each test cleans out the schema to avoid pollution.)
There is a `just` command to remove the database containers:

```
just remove-database-containers
```
Set the environment variable `LOG_SQL=1` (or anything non-empty) to get all SQL queries logged to the console. To get SQL queries in test runs, also use `-s` to turn off log capture in pytest.
ehrQL is responsible for enforcing certain security boundaries within the OpenSAFELY platform. These are narrowly defined and the sections of the code which handle them are small and well-contained, so the vast majority of changes to ehrQL will not go anywhere near them. Nevertheless, it's important that anyone writing or reviewing ehrQL code be aware of these so they know to be alert for changes which could possibly have an impact.
ehrQL only accesses sensitive patient data while running inside a secure OpenSAFELY job processing pipeline, which means that all the most critical security properties are already enforced. This leaves ehrQL itself with two responsibilities:
* Ensure that each ehrQL job can only access the data to which it is supposed to have access.
* Ensure that patient data is only written to the locations at which it is supposed to be written.
These considerations affect three areas of the codebase.
Users interact with ehrQL by writing Python code which builds a graph of query model objects describing the data to be selected. To ensure that the user's Python code has no direct access to data or ability to modify the environment, we run it in a separate, highly restricted process and retrieve a JSON-serialised specification of the query graph. This is all handled by the `loaders` module.

Functions which need to evaluate user-supplied code should always use the methods provided in the `loaders` module and never import user code directly. (Given the fiddly process needed to import user code, it is unlikely anyone would attempt to do this without looking for the pre-defined loader methods in any case.)
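Schematically, the pattern looks something like the sketch below. This is a simplification of what the `loaders` module actually does, and the `serialize-definition` subcommand name is an assumption rather than a reference to the real entry point:

```python
import json
import subprocess
import sys

def load_query_graph(definition_path):
    # Run the user's definition file in a separate, restricted process
    # (sketched here as a plain subprocess) and read a JSON-serialised
    # query graph back, so user code never executes in-process.
    result = subprocess.run(
        [sys.executable, "-m", "ehrql", "serialize-definition", definition_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```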
The `Backend` class provides a `modify_dataset()` hook which allows the backend to add additional restrictions to the user's query to control what data it returns. Any changes to the query processing workflow must ensure that this hook continues to be called. This is currently enforced by an integration test which ensures that the `TPPBackend.modify_dataset()` hook continues to behave as expected.
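In outline, the hook has this shape (an illustration only, not the real `Backend` API):

```python
class ExampleBackend:
    def modify_dataset(self, dataset):
        # Add backend-specific restrictions to the user's query before
        # execution, e.g. excluding patients whose data must not be
        # returned; returning the dataset unchanged is the trivial case.
        return dataset
```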
The logs which ehrQL produces are treated as being at a different privacy level from the outputs it writes to disk. It is therefore important that the logs themselves never contain individual patient data. Fortunately this property is relatively easy to maintain because so few parts of the codebase deal directly with patient data. Data is retrieved by the `execute_query_with_results()` method on the base query engine. This returns an iterator of rows of data. Functions which consume rows of this iterator must avoid logging any values obtained from the rows. (Functions which merely wrap the iterator without consuming it will never have a reference to any patient data and so are not at risk in the same way.)

The key parts of the codebase which deal with individual rows of data are the `file_formats` module, which handles writing files to disk, and the `sqlalchemy_exec_utils` module, which handles batch fetching of results. Careful attention must be paid to any log calls in these modules to ensure that we are not logging individual patient data.

Note that "logs" here includes everything written to stdout/stderr, including calls to `print`, not just lines written using Python's logging mechanisms.
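The safe pattern when consuming the results iterator is to log aggregates and progress, never the row values themselves. A minimal sketch:

```python
import logging

log = logging.getLogger(__name__)

def write_rows(results, write):
    # `results` is an iterator of rows of patient data; `write`
    # persists a single row to the output file.
    count = 0
    for row in results:
        write(row)  # fine: the row goes to the output file...
        count += 1
    log.info("Wrote %s rows", count)  # ...but only the count is logged
```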
Starting with version 4.0, Bash is licensed under GPLv3.
Because of this, macOS still ships with version 3.2, which is incompatible with some scripts in this repository.
We recommend using Homebrew to install a more recent version, ensuring that the new version is on your `$PATH`, and restarting your Terminal/shell session if necessary.

```
brew install bash
```
We previously used mypy and type annotations to perform correctness checking of the code base.
However, we made the decision to remove this stack after finding it was not a good fit for large parts of the code base.
This does not mean we've abandoned type annotations entirely.
The `query_model` module still makes heavy use of them and implements its own runtime checking to enforce them.
And developers should feel free to use them wherever this aids clarity vs a docstring or a comment.
Dataclasses have also retained their annotations to avoid initialising all fields with `None`.
The documentation in this repository forms part of the main OpenSAFELY documentation.
It can also be built as a standalone documentation site with MkDocs to preview content changes, by running:

```
just docs-serve
```

⚠️ The documentation will look slightly different from OpenSAFELY's. Relative links to sections of the main documentation outside of the `/ehrql` sections will not work (although a scheduled GitHub Action runs overnight to check them). We may be able to improve this later, depending on the behaviour of the MkDocs plugin that we use: see https://github.com/opensafely-core/ehrql/issues/1126
These are handled in the main OpenSAFELY documentation repository.
If you need to redirect URLs (this should be fairly infrequent), make any changes to the `_redirects` file in the main documentation repository, and test them in a preview there.
ehrQL documentation is located in the `docs/` directory. Local configuration is specified in the `mkdocs.yml` located at the repo root.

The `docs/` directory contains some files which are generated from the ehrql code and from other documentation files. Specifically these are files at:
The process for generating these files is described below.
When the main OpenSAFELY documentation is built, it imports the ehrql `docs/` directory and builds it within the main documentation site. This assumes that all generated documentation has been updated already (see below for a description of the pre-commit hooks and GitHub Actions checks that ensure this happens).
The most likely use case for including content from the `opensafely/documentation` repository is the `glossary.md` from the parent documentation.
To do so, use a slightly different snippet syntax:

```
!!! parent_snippet:'includes/glossary.md'
```
Some ehrQL documentation is generated from code in this repo.
See the spec tests docs for further information on writing tests that
contribute to the ehrQL docs.
An intermediate step generates the markdown files that are included in the documentation, which are located in `docs/includes/generated_docs`.

To generate these files, run:

```
just generate-docs
```
If any of the files have changed you should include them in your PR.
To verify the documentation is up to date, run:

```
just docs-check-generated-docs-are-current
```

This will display a diff if there are any uncommitted changes to the generated markdown files. It also runs as a CI step, and will fail if there are changes that need to be committed to the repo.
All of the example tests can be run with:

```
just test-docs-examples
```
Dataset and measures definitions may be included:

* inline in Markdown files in `docs/`, labelled as code blocks with the `ehrql` syntax,
* or as Python `.py` files in `docs/`.
For testing examples, `codelist_from_csv()` is currently patched out to work without any CSV file, and codelist codes are not validated.
The function signatures of `codelist_from_csv()` calls in examples are checked.
This may be improved in future to make the testing more rigorous; see #1694.
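The patching could look roughly like this (a hypothetical sketch; the real fixture and patch target may differ):

```python
from unittest import mock

def fake_codelist_from_csv(filename, column, *args, **kwargs):
    # Return an empty stand-in codelist without reading any CSV file
    # and without validating any codes.
    return []

# Hypothetical patch target: wherever the examples resolve the name from.
with mock.patch("ehrql.codelist_from_csv", fake_codelist_from_csv):
    ...  # run the documentation example here
```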
Examples in the documentation Markdown source will be tested as part of the test suite if you place complete examples in a code block with the `ehrql` syntax label (opening the fence with ```` ```ehrql ````). This will still highlight the code as if it were Python.

⚠️ The `ehrql` syntax label is for inline and complete ehrQL blocks only.

We use the SuperFences extension for extracting Markdown fences. Refer to the SuperFences documentation for more details of the fence format.
Python files in the `docs/` directory are assumed to be working ehrQL dataset or measures definitions. They are also tested in the test suite. If included in the documentation using the snippet syntax, they must be used with a `python` syntax label. (If they were labelled as `ehrql`, the snippet line itself would be extracted from the Markdown, and treated as a dataset definition.)
Merges to the main branch in this repo trigger a deployment of the main OpenSAFELY documentation via a GitHub Action.
This uses the `DOC_WRITE_TOKEN`, which is a GitHub fine-grained personal access token with access to the opensafely/documentation repository and the repository permissions "Read access to metadata" and "Read and Write access to actions".
For notes on how ehrQL is currently deployed in the production OpenSAFELY environment (including how to deploy an updated version), please refer to the backend-server playbook.