Diff of /DEVELOPERS.md [000000] .. [e988c2]

Switch to unified view

a b/DEVELOPERS.md
1
# Notes for developers
2
3
## System requirements
4
- [`just`](https://github.com/casey/just)
5
- Docker
6
- recent version of Bash (see macOS notes)
7
8
## Local development environment
9
The `just` command provides a list of available recipes:
10
```
11
just list
12
```
13
14
Running any of the `just` commands that need it will set up a local environment and install dependencies.
15
16
## Testing
17
18
### Test categories
19
20
Tests are divided into the following categories.
21
22
<dl>
23
   <dt>unit</dt><dd>fast tests of small code units</dd>
24
   <dt>spec</dt><dd>tests generated from the ehrQL spec</dd>
25
   <dt>generative</dt><dd>query engine tests generated by Hypothesis</dd>
26
   <dt>functional</dt><dd>tests which exercise the ehrQL CLI end-to-end</dd>
27
   <dt>acceptance</dt><dd>tests which check compatibility with real studies</dd>
28
   <dt>integration</dt><dd>tests of detailed code logic that require a database</dd>
29
   <dt>docker</dt><dd>tests of the ehrQL docker image</dd>
30
   <dt>docs</dt><dd>tests of the documentation examples</dd>
31
</dl>
32
33
Each category lives in its own directory (for example `tests/unit`) and has its own `just` command to run it (for
34
example `just test-unit`).
35
36
### Running tests
37
38
To run all tests, as they're run in CI with code coverage):
39
```
40
just test-all
41
```
42
43
To run just one category of tests:
44
```
45
just test-<category>
46
```
47
48
For example:
49
```
50
just test-unit
51
```
52
53
Additional arguments can be passed to any test commands, for example:
54
```
55
just test-unit --verbose --durations=10
56
```
57
58
For maximum flexibility, the `test` command can be used to run individual test files or tests, or to do other clever things with `pytest`. It just delegates to `pytest`.
59
For example:
60
```
61
just test tests/integration/backends/test_tpp.py
62
```
63
64
There are further notes on using `pytest` in the wiki here:
65
https://github.com/opensafely-core/ehrql/wiki/Tips-for-using-pytest
66
67
68
#### Generative tests
69
70
The generative tests use Hypothesis to generate variable definitions (in
71
the query model) and test data.  They then execute the resulting dataset
72
definitions using all the available query engines, and check that the
73
results are the same.
74
75
To get the most out of our generative tests we want them to run for a
76
long time and to explore different parts of the query space each time
77
they're run. But neither of these qualities are desirable in CI.
78
79
For this reason, the CI configuration generates only a small number of
80
examples, of limited query depth, and in a determinstic fashion. They
81
thus function more as a test that the generative test machinery is
82
working correctly than as a serious attempt to do generative testing.
83
84
When running locally, although randomisation is enabled by default,
85
the number of examples is still very small (to keep test runtimes
86
reasonable). To do some "proper" generative testing you can run the
87
command:
88
```
89
just test-generative
90
```
91
which increases the example count by setting `GENTEST_EXAMPLES`.
92
93
It would be worth running these locally when adding new query model
94
operations or making significant changes to a query engine. You may even
95
want to crank the settings further e.g.
96
```
97
GENTEST_EXAMPLES=10000 GENTEST_MAX_DEPTH=20 just test-generative
98
```
99
100
You can control which query engines the tests exercise using the
101
enviornment variable `GENTEST_QUERY_ENGINES`. For instance, if you have
102
made a change to the basic SQL-building logic in BaseSQLQueryEngine and
103
you want to rapidly test this with a large number of examples you could
104
compare just the in-memory and SQLite engines using:
105
```
106
GENTEST_QUERY_ENGINES='in_memory sqlite' GENTEST_EXAMPLES=10000 just test-generative
107
```
108
109
In addition to whatever you do locally, a scheduled Github Actions
110
workflow runs the generative test overnight with settings as high as we
111
can get away with and alerts us in Slack if it finds a failure.
112
113
You can get Hypothesis to dump statistics at the end of the run with `--hypothesis-show-statistics`,
114
or (more usefully) dump some of our own statistics about the generated data and queries by setting `GENTEST_DEBUG=t`.
115
116
When debugging a failure you'll probably want to reproduce it.
117
Hypothesis often struggles to shrink the examples it finds, and even
118
small examples can appear overwhelmingly verbose due to the repetitive
119
nature of query model reprs. To help with this there is some tooling,
120
and a process to follow:
121
122
 * Copy the `dataset` and `data` arguments from the example that
123
   Hypothesis displays and paste them into a new file.  (Don't worry
124
   about stripping indentation or trailing commas here.)
125
126
 * Run the command:
127
   ```
128
   just gentest-example-simplify PATH_TO_FILE.py
129
   ```
130
   This should transform the copied code into a valid test example and
131
   pull out some of the repeated elements into variables.
132
133
 * Run the command:
134
   ```
135
   just gentest-example-run PATH_TO_FILE.py
136
   ```
137
   This should execute the example and confirm that the test fails in
138
   the expected way.
139
140
 * To further simplify the example you can copy a repeated element,
141
   assign it to a variable and then re-run `gentest-example-simplify` on
142
   the file. This will replace occurances of that element with a
143
   reference to the variable.
144
145
Hypothesis can generate query graphs that are very deeply nested; after 100 draws in a test example, hypothesis will return the example as invalid.  In order to avoid this, the
146
variable strategies check for a maximum depth and return a terminal node if the maximum depth is exceeded (A `SelectColumn` node for a series strategy, and a `SelectTable` or `SelectPatientTable` for a table strategy). The max depth defaults to 15 and can be overridden with environment variable `GENTEST_MAX_DEPTH`.
147
148
See the [generative tests documentation](tests/generative/README.md) for more details.
149
150
### Writing tests
151
152
Please think carefully about how to test code that you are adding or changing.
153
We think that test maintainability is a big risk for this system, so we're trying to be very deliberate about the kind of tests that we write.
154
You should follow these guidelines and raise it with the rest of the team for discussion if you think that they are problematic.
155
156
* _Major features in ehrQL_ should be motivated by a study used in an **acceptance** test.
157
  The number of such studies should be kept small in order that they don't be come a maintenance burden.
158
  The studies we use for the acceptance tests will need to be chosen carefully as representative of how we expect ehrQL to be used; they may be real studies or synthetic ones as appropriate.
159
* All _erhQL features_ should be covered by **spec** tests.
160
* Complex _query language logic_ that is not fully covered by the spec tests should be covered by **unit** tests.
161
  To avoid duplication, you should not write unit tests for logic that is adequately covered by the spec tests.
162
* The main _functionality of query engines_ will naturally be covered by **spec** tests, which are run against all the query engines.
163
* Complex _query engine logic_ that is not fully covered by the spec tests should be covered by **unit** or **integration** tests as appropriate.
164
  To avoid duplication, you should not write such tests for logic that is adequately covered by the spec tests.
165
* Basic operation of ehrQL as a _CLI tool_ is exercised by one trivial embedded **acceptance** test.
166
* Functionality of the _Docker image_ and how it invokes ehrQL are covered by a small set of **docker** tests.
167
* Where _backend tables_ do not map directly to the schemas that they implement, it may be helpful to write **integration** tests.
168
* All other _supporting logic_ should be covered by **unit** tests. Please avoid the temptation to cover this using acceptance, integration or docker tests that run the ehrQL end-to-end.
169
170
Contrary to practice in some quarters we allow disk access by **unit** tests because it doesn't seem to cause any significant slow-down in those tests at the moment.
171
We'll keep this under review.
172
173
#### Logging in tests
174
175
Logging is very verbose and is turned off by default in tests.  To turn it on, set the
176
environment variable `LOG_SQL=1` and pass the `-s` option to turn
177
off log capture in pytest.
178
179
```
180
LOG_SQL=1 just test-all -s
181
```
182
183
### Codebase structure
184
185
The files for test categories that target individual modules (for example **unit** and **integration** tests) are organized into roughly the same directory structure as the `ehrql` package itself.
186
187
Generally a module `ehrql.foo` will have a corresponding test file like `tests/unit/test_foo.py`.
188
However we do not stick slavishly to this: where appropriate we may collapse tests for submodules like `ehrql.foo.{bar,bam}` into a single test file like `tests/unit/test_foo.py`,
189
or break tests for a module like `ehrql.foo` into multiple test files like `tests/unit/foo/test_{one,another}_aspect.py`.
190
191
Test categories that run against the ehrQL as a whole or against multiple components (for example **spec** and **acceptance** tests) have their own internal structure.
192
193
### Code coverage
194
195
Our approach to code coverage is to fail the build with less than 100% coverage, but be reasonably liberal about allowing lines to be marked as not requiring coverage.
196
If you make a change that results in a newly un-covered line, you should make a good attempt to test it and expect to have to justify any omissions to PR reviewers;
197
but for genuinely hard or impossible to hit cases it's okay to mark them as `no cover`.
198
199
Any `no cover` pragmas should include a note explaining why the code can't be hit.
200
Common cases are configured in `pyproject.toml` with their own explanations.
201
202
### Test databases
203
204
For tests that need to run against a database, we run the database in a Docker container.
205
Each run of the tests starts the database if it's not already running _and then leaves it running_ at the end to speed up future runs.
206
(Each test cleans out the schema to avoid pollution.)
207
208
There is a `just` command to remove the database containers:
209
```
210
just remove-database-containers
211
```
212
213
### Displaying SQL queries
214
215
Set the environment variable `LOG_SQL=1` (or anything non-empty) to get all SQL queries logged to the console.  To get SQL queries in [test runs](#logging-in-tests), also use `-s` to turn
216
off log capture in pytest.
217
218
## ehrQL's security properties
219
220
ehrQL is responsible for enforcing certain security boundaries within the OpenSAFELY platform. These are narrowly defined and the sections of the code which handle them are small and well-contained, so the vast majority of changes to ehrQL will not go anywhere near them. Nevertheless, it's important that anyone writing or reviewing ehrQL code be aware of these so they know to be alert for changes which could possibly have an impact.
221
222
ehrQL only accesses sensitive patient data while running inside a secure OpenSAFELY job processing pipeline, which means that all the most critical security properties are already enforced. This leaves ehrQL itself with two responsibilities:
223
 * Ensure that each ehrQL job can only access the data to which it is supposed to have access.
224
 * Ensure that patient data is only written to the locations at which it is supposed to be written.
225
226
These considerations affect three areas of the codebase.
227
228
229
### 1. Isolating user-supplied code from the rest of ehrQL
230
231
Users interact with ehrQL by writing Python code which builds a graph of query model objects describing the data to be selected. To ensure that the user's Python code has no direct access to data or ability to modify the environment we run it in a separate, highly restricted process and retrieve a JSON-serialised specification of the query graph. This is all handled by the [`loaders`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/loaders.py) module.
232
233
Functions which need to evaluate user-supplied code should always use the methods provided in the `loaders` module and **never import user code directly**. (Given the fiddly process needed to import user code it is unlikely anyone would attempt to do this _without_ looking for the pre-defined `loader` methods in any case.)
234
235
236
### 2. Adding additional restrictions to queries
237
238
The `Backend` class provides a [`modify_dataset()`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/backends/base.py#L26-L30) hook which allows the backend to add additional restrictions to the user's query to control what data it returns. Any changes to the query processing workflow **must ensure that this hook continues to be called**. This is currently enforced by an [integration test](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/tests/integration/backends/test_tpp.py#L3142-L3163) which ensures that the [`TPPBackend.modify_dataset()`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/backends/tpp.py#L97) hook continues to behave as expected. 
239
240
241
### 3. Avoiding logging of patient data
242
243
The logs which ehrQL produces are treated as being at a different privacy level from the outputs it writes to disk. It is therefore important that **the logs themselves never contain individual patient data**. Fortunately this property is relatively easy to maintain because so few parts of the codebase deal directly with patient data. Data is retrieved by the [`execute_query_with_results()`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/query_engines/base_sql.py#L972-L983) method on the base query engine. This returns an iterator of rows of data. Functions which _consume_ rows of this iterator must avoid logging any values obtained from the rows. (Functions which merely wrap the iterator without consuming it will never have a reference to any patient data and so are not at risk in the same way.) 
244
245
The key parts of the codebase which deals with individual rows of data are the [`file_formats`](https://github.com/opensafely-core/ehrql/tree/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/file_formats) module, which handles writing files to disk, and the [`sqlalchemy_exec_utils`](https://github.com/opensafely-core/ehrql/blob/6e4430426fb31e41a4c95f264628dd89fee6a266/ehrql/utils/sqlalchemy_exec_utils.py) module, which handles batch fetching of results. Careful attention must be paid to any log calls in this modules to ensure that we are not logging individual patient data.
246
247
Note that "logs" here includes everything written to stdout/stderr, including calls to `print`, not just lines written using Python's logging mechanisms.
248
249
## macOS
250
251
Starting with version 4.0, Bash is licenced under GPLv3.
252
Because of this, macOS still ships with version 3.2, which is incompatible with some scripts in this repository.
253
We recommend using [homebrew](https://brew.sh/) to install a more recent version, ensure that the new version is on your `$PATH`, and restart your Terminal/shell session if necessary.
254
255
```bash
256
brew install bash
257
```
258
259
## Static Type Checking
260
We previously used [mypy](https://mypy.readthedocs.io/en/stable/) and type annotations to perform correctness checking of the code base.
261
However, we made the decision to remove this stack after finding it was not a good fit for large parts of the code base.
262
263
This does not mean we've abandoned type annotations entirely.
264
The `query_model` module still makes heavy use of them and implements its own runtime checking to enforce them.
265
And developers should feel free to use them wherever this aids clarity vs a docstring or a comment.
266
267
Dataclasses have also retained their annotations to avoid initialising all fields with None.
268
269
270
## Documentation
271
272
The documentation in this repository forms part of the main [OpenSAFELY documentation](https://github.com/opensafely/documentation).
273
274
It can also be built as a standalone documentation site with MkDocs to preview content changes, by running:
275
276
    just docs-serve
277
278
:warning: The documentation will look slightly different from OpenSAFELY's. Relative links to
279
sections of the main documentation outside of the /ehrql sections will not work (although
280
a scheduled [Github Action](https://github.com/opensafely-core/ehrql/actions/workflows/check-docs-links.yml) runs overnight to check them).
281
282
We may be able to improve this later, depending on the behaviour of the mkdocs plugin that
283
we use: see https://github.com/opensafely-core/ehrql/issues/1126
284
285
### Documentation redirects
286
287
These are handled in the main [OpenSAFELY documentation repository](https://github.com/opensafely/documentation).
288
If you need to redirect URLs —
289
and this should be fairly infrequent —
290
make any changes to the `_redirects` file in the main documentation repository,
291
and test them in a preview there.
292
293
### Structure
294
295
ehrQL documentation is located in the [docs](docs/) directory. Local configuration is
296
specified in the `mkdocs.yml` located at the repo root.
297
298
The `docs/` directory contains some files which are generated from the ehrql code and from
299
other documentation files. Specifically these are files at:
300
301
 - [docs/includes/generated_docs/](docs/includes/generated_docs/)
302
303
The process for generating these files is described below.
304
305
When the main OpenSAFELY documentation is built, it imports the ehrql `docs/` directory
306
and builds it within the main documentation site. This assumes that all generated documentation
307
has been updated already (see below for a description of pre-commit hooks and Github Actions
308
that mechanisms that check this happens).
309
310
#### Process for updating ehrql documentation
311
312
1. Developer makes changes to documentation files or code/tests that generate documentation
313
2. Changes committed; pre-commit hook ensures generated docs are up-to-date
314
3. PR opened; CI:
315
      - ensures generated docs are up to date
316
3. PR merged; CI:
317
      - triggers a deploy of the main OpenSAFELY documentation site
318
4. On a schedule (weekly), CI in the main `opensafely/documentation` repository:
319
      - checks all the documentation links are valid
320
321
### Using includes from the parent documentation
322
323
The most likely use case is including the `glossary.md` from the parent documentation.
324
325
To do so, use a slightly different snippet syntax:
326
327
```
328
!!! parent_snippet:'includes/glossary.md'
329
```
330
331
### Generating data for documentation
332
333
Some ehrQL documentation is generated from code in this repo.
334
335
See the [spec tests docs](tests/spec/README.md) for further information on writing tests that
336
contribute to the ehrQL docs.
337
338
An intermediate step generates the markdown files that are included in the documentation, which are located in `docs/includes/generated_docs`.
339
340
To generate these files, run:
341
342
    just generate-docs
343
344
If any of the files have changed you should include them in your PR.
345
346
To verify the documentation is up to date, run:
347
348
    just docs-check-generated-docs-are-current
349
350
This will display a diff if there are any uncommitted changes to the generated markdown files. It also runs as a CI step, and will fail if there are changes that need to be committed to the repo.
351
352
### Testing ehrQL definitions included in the documentation
353
354
All of the example tests can be run with:
355
356
    just test-docs-examples
357
358
Dataset and measures definitions may be included:
359
* inline in Markdown files in `docs/`,
360
  labelled as code blocks with the `ehrql` syntax,
361
* or as Python `.py` files in `docs/`.
362
363
#### Examples using `codelist_from_csv()`
364
365
For testing examples,
366
`codelist_from_csv()` is currently patched out to work without any CSV,
367
nor are codelist codes validated.
368
369
The function signature of `codelist_from_csv()` calls from examples *is* checked.
370
371
This may be improved in future to make the testing more rigorous;
372
see #1694.
373
374
#### Inline code blocks (Markdown fences)
375
376
Examples in the documentation Markdown source will be tested as part of the test suite
377
if you place complete examples in a code block with the `ehrql` syntax label: `` ```ehrql ``
378
379
This will still highlight the code as if it were Python.
380
381
:warning: The `ehrql` syntax label is for inline and complete ehrQL blocks only.
382
383
We use the SuperFences extension for extracting Markdown fences.
384
Refer to the [SuperFences documentation](https://facelessuser.github.io/pymdown-extensions/extensions/superfences/#nested-fence-format) for more details of the fence format.
385
386
#### ehrQL definitions as included Python files
387
388
Python files in the `docs/` directory are assumed to be working ehrQL dataset or measures definitions.
389
390
They are also tested in the test suite.
391
392
If included in the documentation using the snippet syntax,
393
they must be used with a `python` syntax label.
394
(If they were labelled as `ehrql`,
395
the snippet line itself would be extracted from the Markdown,
396
and treated as a dataset definition.)
397
398
### Updating the main OpenSAFELY documentation repository
399
400
Merges to the main branch in this repo trigger a [deployment of the main OpenSAFELY documentation via a Github Action](https://github.com/opensafely-core/ehrql/actions/workflows/deploy-documentation.yml).
401
402
This uses the `DOC_WRITE_TOKEN`, which is a Github fine-grained personal access token with access to the [opensafely/documentation repository](https://github.com/opensafely/documentation) and repository permissions `Read access to metadata` and ` Read and Write access to actions`.
403
404
## Deployment
405
406
For notes on how ehrQL is currently deployed in the production OpenSAFELY environment (including how to deploy an updated version), please refer to the [backend-server playbook](https://github.com/opensafely-core/backend-server/blob/main/playbook.md).