Diff of /DEVELOPERS.md [000000] .. [e988c2]

Switch to side-by-side view

--- a
+++ b/DEVELOPERS.md
@@ -0,0 +1,406 @@
+# Notes for developers
+
+## System requirements
+- [`just`](https://github.com/casey/just)
+- Docker
+- recent version of Bash (see macOS notes)
+
+## Local development environment
+The `just` command provides a list of available recipes:
+```
+just list
+```
+
+Running any of the `just` commands that need it will set up a local environment and install dependencies.
+
+## Testing
+
+### Test categories
+
+Tests are divided into the following categories.
+
+<dl>
+   <dt>unit</dt><dd>fast tests of small code units</dd>
+   <dt>spec</dt><dd>tests generated from the ehrQL spec</dd>
+   <dt>generative</dt><dd>query engine tests generated by Hypothesis</dd>
+   <dt>functional</dt><dd>tests which exercise the ehrQL CLI end-to-end</dd>
+   <dt>acceptance</dt><dd>tests which check compatibility with real studies</dd>
+   <dt>integration</dt><dd>tests of detailed code logic that require a database</dd>
+   <dt>docker</dt><dd>tests of the ehrQL docker image</dd>
+   <dt>docs</dt><dd>tests of the documentation examples</dd>
+</dl>
+
+Each category lives in its own directory (for example `tests/unit`) and has its own `just` command to run it (for
+example `just test-unit`).
+
+### Running tests
+
+To run all tests, as they're run in CI with code coverage):
+```
+just test-all
+```
+
+To run just one category of tests:
+```
+just test-<category>
+```
+
+For example:
+```
+just test-unit
+```
+
+Additional arguments can be passed to any test commands, for example:
+```
+just test-unit --verbose --durations=10
+```
+
+For maximum flexibility, the `test` command can be used to run individual test files or tests, or to do other clever things with `pytest`. It just delegates to `pytest`.
+For example:
+```
+just test tests/integration/backends/test_tpp.py
+```
+
+There are further notes on using `pytest` in the wiki here:
+https://github.com/opensafely-core/ehrql/wiki/Tips-for-using-pytest
+
+
+#### Generative tests
+
+The generative tests use Hypothesis to generate variable definitions (in
+the query model) and test data.  They then execute the resulting dataset
+definitions using all the available query engines, and check that the
+results are the same.
+
+To get the most out of our generative tests we want them to run for a
+long time and to explore different parts of the query space each time
+they're run. But neither of these qualities are desirable in CI.
+
+For this reason, the CI configuration generates only a small number of
+examples, of limited query depth, and in a determinstic fashion. They
+thus function more as a test that the generative test machinery is
+working correctly than as a serious attempt to do generative testing.
+
+When running locally, although randomisation is enabled by default,
+the number of examples is still very small (to keep test runtimes
+reasonable). To do some "proper" generative testing you can run the
+command:
+```
+just test-generative
+```
+which increases the example count by setting `GENTEST_EXAMPLES`.
+
+It would be worth running these locally when adding new query model
+operations or making significant changes to a query engine. You may even
+want to crank the settings further e.g.
+```
+GENTEST_EXAMPLES=10000 GENTEST_MAX_DEPTH=20 just test-generative
+```
+
+You can control which query engines the tests exercise using the
+enviornment variable `GENTEST_QUERY_ENGINES`. For instance, if you have
+made a change to the basic SQL-building logic in BaseSQLQueryEngine and
+you want to rapidly test this with a large number of examples you could
+compare just the in-memory and SQLite engines using:
+```
+GENTEST_QUERY_ENGINES='in_memory sqlite' GENTEST_EXAMPLES=10000 just test-generative
+```
+
+In addition to whatever you do locally, a scheduled Github Actions
+workflow runs the generative test overnight with settings as high as we
+can get away with and alerts us in Slack if it finds a failure.
+
+You can get Hypothesis to dump statistics at the end of the run with `--hypothesis-show-statistics`,
+or (more usefully) dump some of our own statistics about the generated data and queries by setting `GENTEST_DEBUG=t`.
+
+When debugging a failure you'll probably want to reproduce it.
+Hypothesis often struggles to shrink the examples it finds, and even
+small examples can appear overwhelmingly verbose due to the repetitive
+nature of query model reprs. To help with this there is some tooling,
+and a process to follow:
+
+ * Copy the `dataset` and `data` arguments from the example that
+   Hypothesis displays and paste them into a new file.  (Don't worry
+   about stripping indentation or trailing commas here.)
+
+ * Run the command:
+   ```
+   just gentest-example-simplify PATH_TO_FILE.py
+   ```
+   This should transform the copied code into a valid test example and
+   pull out some of the repeated elements into variables.
+
+ * Run the command:
+   ```
+   just gentest-example-run PATH_TO_FILE.py
+   ```
+   This should execute the example and confirm that the test fails in
+   the expected way.
+
+ * To further simplify the example you can copy a repeated element,
+   assign it to a variable and then re-run `gentest-example-simplify` on
+   the file. This will replace occurances of that element with a
+   reference to the variable.
+
+Hypothesis can generate query graphs that are very deeply nested; after 100 draws in a test example, hypothesis will return the example as invalid.  In order to avoid this, the
+variable strategies check for a maximum depth and return a terminal node if the maximum depth is exceeded (A `SelectColumn` node for a series strategy, and a `SelectTable` or `SelectPatientTable` for a table strategy). The max depth defaults to 15 and can be overridden with environment variable `GENTEST_MAX_DEPTH`.
+
+See the [generative tests documentation](tests/generative/README.md) for more details.
+
+### Writing tests
+
+Please think carefully about how to test code that you are adding or changing.
+We think that test maintainability is a big risk for this system, so we're trying to be very deliberate about the kind of tests that we write.
+You should follow these guidelines and raise it with the rest of the team for discussion if you think that they are problematic.
+
+* _Major features in ehrQL_ should be motivated by a study used in an **acceptance** test.
+  The number of such studies should be kept small in order that they don't be come a maintenance burden.
+  The studies we use for the acceptance tests will need to be chosen carefully as representative of how we expect ehrQL to be used; they may be real studies or synthetic ones as appropriate.
+* All _erhQL features_ should be covered by **spec** tests.
+* Complex _query language logic_ that is not fully covered by the spec tests should be covered by **unit** tests.
+  To avoid duplication, you should not write unit tests for logic that is adequately covered by the spec tests.
+* The main _functionality of query engines_ will naturally be covered by **spec** tests, which are run against all the query engines.
+* Complex _query engine logic_ that is not fully covered by the spec tests should be covered by **unit** or **integration** tests as appropriate.
+  To avoid duplication, you should not write such tests for logic that is adequately covered by the spec tests.
+* Basic operation of ehrQL as a _CLI tool_ is exercised by one trivial embedded **acceptance** test.
+* Functionality of the _Docker image_ and how it invokes ehrQL are covered by a small set of **docker** tests.
+* Where _backend tables_ do not map directly to the schemas that they implement, it may be helpful to write **integration** tests.
+* All other _supporting logic_ should be covered by **unit** tests. Please avoid the temptation to cover this using acceptance, integration or docker tests that run the ehrQL end-to-end.
+
+Contrary to practice in some quarters we allow disk access by **unit** tests because it doesn't seem to cause any significant slow-down in those tests at the moment.
+We'll keep this under review.
+
+#### Logging in tests
+
+Logging is very verbose and is turned off by default in tests.  To turn it on, set the
+environment variable `LOG_SQL=1` and pass the `-s` option to turn
+off log capture in pytest.
+
+```
+LOG_SQL=1 just test-all -s
+```
+
+### Codebase structure
+
+The files for test categories that target individual modules (for example **unit** and **integration** tests) are organized into roughly the same directory structure as the `ehrql` package itself.
+
+Generally a module `ehrql.foo` will have a corresponding test file like `tests/unit/test_foo.py`.
+However we do not stick slavishly to this: where appropriate we may collapse tests for submodules like `ehrql.foo.{bar,bam}` into a single test file like `tests/unit/test_foo.py`,
+or break tests for a module like `ehrql.foo` into multiple test files like `tests/unit/foo/test_{one,another}_aspect.py`.
+
+Test categories that run against the ehrQL as a whole or against multiple components (for example **spec** and **acceptance** tests) have their own internal structure.
+
+### Code coverage
+
+Our approach to code coverage is to fail the build with less than 100% coverage, but be reasonably liberal about allowing lines to be marked as not requiring coverage.
+If you make a change that results in a newly un-covered line, you should make a good attempt to test it and expect to have to justify any omissions to PR reviewers;
+but for genuinely hard or impossible to hit cases it's okay to mark them as `no cover`.
+
+Any `no cover` pragmas should include a note explaining why the code can't be hit.
+Common cases are configured in `pyproject.toml` with their own explanations.
+
+### Test databases
+
+For tests that need to run against a database, we run the database in a Docker container.
+Each run of the tests starts the database if it's not already running _and then leaves it running_ at the end to speed up future runs.
+(Each test cleans out the schema to avoid pollution.)
+
+There is a `just` command to remove the database containers:
+```
+just remove-database-containers
+```
+
+### Displaying SQL queries
+
+Set the environment variable `LOG_SQL=1` (or anything non-empty) to get all SQL queries logged to the console.  To get SQL queries in [test runs](#logging-in-tests), also use `-s` to turn
+off log capture in pytest.
+
+## ehrQL's security properties
+
+ehrQL is responsible for enforcing certain security boundaries within the OpenSAFELY platform. These are narrowly defined and the sections of the code which handle them are small and well-contained, so the vast majority of changes to ehrQL will not go anywhere near them. Nevertheless, it's important that anyone writing or reviewing ehrQL code be aware of these so they know to be alert for changes which could possibly have an impact.
+
+ehrQL only accesses sensitive patient data while running inside a secure OpenSAFELY job processing pipeline, which means that all the most critical security properties are already enforced. This leaves ehrQL itself with two responsibilities:
+ * Ensure that each ehrQL job can only access the data to which it is supposed to have access.
+ * Ensure that patient data is only written to the locations at which it is supposed to be written.
+
+These considerations affect three areas of the codebase.
+
+
+### 1. Isolating user-supplied code from the rest of ehrQL
+
+Users interact with ehrQL by writing Python code which builds a graph of query model objects describing the data to be selected. To ensure that the user's Python code has no direct access to data or ability to modify the environment we run it in a separate, highly restricted process and retrieve a JSON-serialised specification of the query graph. This is all handled by the [`loaders`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/loaders.py) module.
+
+Functions which need to evaluate user-supplied code should always use the methods provided in the `loaders` module and **never import user code directly**. (Given the fiddly process needed to import user code it is unlikely anyone would attempt to do this _without_ looking for the pre-defined `loader` methods in any case.)
+
+
+### 2. Adding additional restrictions to queries
+
+The `Backend` class provides a [`modify_dataset()`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/backends/base.py#L26-L30) hook which allows the backend to add additional restrictions to the user's query to control what data it returns. Any changes to the query processing workflow **must ensure that this hook continues to be called**. This is currently enforced by an [integration test](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/tests/integration/backends/test_tpp.py#L3142-L3163) which ensures that the [`TPPBackend.modify_dataset()`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/backends/tpp.py#L97) hook continues to behave as expected. 
+
+
+### 3. Avoiding logging of patient data
+
+The logs which ehrQL produces are treated as being at a different privacy level from the outputs it writes to disk. It is therefore important that **the logs themselves never contain individual patient data**. Fortunately this property is relatively easy to maintain because so few parts of the codebase deal directly with patient data. Data is retrieved by the [`execute_query_with_results()`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/query_engines/base_sql.py#L972-L983) method on the base query engine. This returns an iterator of rows of data. Functions which _consume_ rows of this iterator must avoid logging any values obtained from the rows. (Functions which merely wrap the iterator without consuming it will never have a reference to any patient data and so are not at risk in the same way.) 
+
+The key parts of the codebase which deals with individual rows of data are the [`file_formats`](https://github.com/opensafely-core/ehrql/tree/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/file_formats) module, which handles writing files to disk, and the [`sqlalchemy_exec_utils`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/utils/sqlalchemy_exec_utils.py) module, which handles batch fetching of results. Careful attention must be paid to any log calls in this modules to ensure that we are not logging individual patient data.
+
+Note that "logs" here includes everything written to stdout/stderr, including calls to `print`, not just lines written using Python's logging mechanisms.
+
+## macOS
+
+Starting with version 4.0, Bash is licenced under GPLv3.
+Because of this, macOS still ships with version 3.2, which is incompatible with some scripts in this repository.
+We recommend using [homebrew](https://brew.sh/) to install a more recent version, ensure that the new version is on your `$PATH`, and restart your Terminal/shell session if necessary.
+
+```bash
+brew install bash
+```
+
+## Static Type Checking
+We previously used [mypy](https://mypy.readthedocs.io/en/stable/) and type annotations to perform correctness checking of the code base.
+However, we made the decision to remove this stack after finding it was not a good fit for large parts of the code base.
+
+This does not mean we've abandoned type annotations entirely.
+The `query_model` module still makes heavy use of them and implements its own runtime checking to enforce them.
+And developers should feel free to use them wherever this aids clarity vs a docstring or a comment.
+
+Dataclasses have also retained their annotations to avoid initialising all fields with None.
+
+
+## Documentation
+
+The documentation in this repository forms part of the main [OpenSAFELY documentation](https://github.com/opensafely/documentation).
+
+It can also be built as a standalone documentation site with MkDocs to preview content changes, by running:
+
+    just docs-serve
+
+:warning: The documentation will look slightly different from OpenSAFELY's. Relative links to
+sections of the main documentation outside of the /ehrql sections will not work (although
+a scheduled [Github Action](https://github.com/opensafely-core/ehrql/actions/workflows/check-docs-links.yml) runs overnight to check them).
+
+We may be able to improve this later, depending on the behaviour of the mkdocs plugin that
+we use: see https://github.com/opensafely-core/ehrql/issues/1126
+
+### Documentation redirects
+
+These are handled in the main [OpenSAFELY documentation repository](https://github.com/opensafely/documentation).
+If you need to redirect URLs —
+and this should be fairly infrequent —
+make any changes to the `_redirects` file in the main documentation repository,
+and test them in a preview there.
+
+### Structure
+
+ehrQL documentation is located in the [docs](docs/) directory. Local configuration is
+specified in the `mkdocs.yml` located at the repo root.
+
+The `docs/` directory contains some files which are generated from the ehrql code and from
+other documentation files. Specifically these are files at:
+
+ - [docs/includes/generated_docs/](docs/includes/generated_docs/)
+
+The process for generating these files is described below.
+
+When the main OpenSAFELY documentation is built, it imports the ehrql `docs/` directory
+and builds it within the main documentation site. This assumes that all generated documentation
+has been updated already (see below for a description of pre-commit hooks and Github Actions
+that mechanisms that check this happens).
+
+#### Process for updating ehrql documentation
+
+1. Developer makes changes to documentation files or code/tests that generate documentation
+2. Changes committed; pre-commit hook ensures generated docs are up-to-date
+3. PR opened; CI:
+      - ensures generated docs are up to date
+3. PR merged; CI:
+      - triggers a deploy of the main OpenSAFELY documentation site
+4. On a schedule (weekly), CI in the main `opensafely/documentation` repository:
+      - checks all the documentation links are valid
+
+### Using includes from the parent documentation
+
+The most likely use case is including the `glossary.md` from the parent documentation.
+
+To do so, use a slightly different snippet syntax:
+
+```
+!!! parent_snippet:'includes/glossary.md'
+```
+
+### Generating data for documentation
+
+Some ehrQL documentation is generated from code in this repo.
+
+See the [spec tests docs](tests/spec/README.md) for further information on writing tests that
+contribute to the ehrQL docs.
+
+An intermediate step generates the markdown files that are included in the documentation, which are located in `docs/includes/generated_docs`.
+
+To generate these files, run:
+
+    just generate-docs
+
+If any of the files have changed you should include them in your PR.
+
+To verify the documentation is up to date, run:
+
+    just docs-check-generated-docs-are-current
+
+This will display a diff if there are any uncommitted changes to the generated markdown files. It also runs as a CI step, and will fail if there are changes that need to be committed to the repo.
+
+### Testing ehrQL definitions included in the documentation
+
+All of the example tests can be run with:
+
+    just test-docs-examples
+
+Dataset and measures definitions may be included:
+* inline in Markdown files in `docs/`,
+  labelled as code blocks with the `ehrql` syntax,
+* or as Python `.py` files in `docs/`.
+
+#### Examples using `codelist_from_csv()`
+
+For testing examples,
+`codelist_from_csv()` is currently patched out to work without any CSV,
+nor are codelist codes validated.
+
+The function signature of `codelist_from_csv()` calls from examples *is* checked.
+
+This may be improved in future to make the testing more rigorous;
+see #1694.
+
+#### Inline code blocks (Markdown fences)
+
+Examples in the documentation Markdown source will be tested as part of the test suite
+if you place complete examples in a code block with the `ehrql` syntax label: `` ```ehrql ``
+
+This will still highlight the code as if it were Python.
+
+:warning: The `ehrql` syntax label is for inline and complete ehrQL blocks only.
+
+We use the SuperFences extension for extracting Markdown fences.
+Refer to the [SuperFences documentation](https://facelessuser.github.io/pymdown-extensions/extensions/superfences/#nested-fence-format) for more details of the fence format.
+
+#### ehrQL definitions as included Python files
+
+Python files in the `docs/` directory are assumed to be working ehrQL dataset or measures definitions.
+
+They are also tested in the test suite.
+
+If included in the documentation using the snippet syntax,
+they must be used with a `python` syntax label.
+(If they were labelled as `ehrql`,
+the snippet line itself would be extracted from the Markdown,
+and treated as a dataset definition.)
+
+### Updating the main OpenSAFELY documentation repository
+
+Merges to the main branch in this repo trigger a [deployment of the main OpenSAFELY documentation via a Github Action](https://github.com/opensafely-core/ehrql/actions/workflows/deploy-documentation.yml).
+
+This uses the `DOC_WRITE_TOKEN`, which is a Github fine-grained personal access token with access to the [opensafely/documentation repository](https://github.com/opensafely/documentation) and repository permissions `Read access to metadata` and ` Read and Write access to actions`.
+
+## Deployment
+
+For notes on how ehrQL is currently deployed in the production OpenSAFELY environment (including how to deploy an updated version), please refer to the [backend-server playbook](https://github.com/opensafely-core/backend-server/blob/main/playbook.md).