Switch to unified view

a b/docs/how-to/test-dataset-definition.md
1
# How to test your dataset definition
2
3
This guide shows you how to evaluate and validate the behaviour of your ehrQL query.
4
The [assure](../reference/cli.md#assure) command works like a suite of unit tests for your ehrQL queries.
5
6
You can write assurance tests that help you and others to understand and review the expected behaviour of your dataset definition.
7
Assurance tests also provide confidence that existing functionality remains unchanged when reworking your dataset definition.
8
9
In this guide we will demonstrate how to test ehrQL queries:
10
11
1. Create dataset definition
12
2. Specify test data and expectations
13
3. Run tests
14
15
## Example dataset definition
16
17
First, we're creating a dataset definition with the following specifications:
18
19
- The population includes everyone above the age of 18 on the `index_date` (31st March 2023).
20
- The dataset has the following three columns:
21
    - `age`: Patient's age on the `index_date`.
22
    - `has_asthma_med`: Boolean value specifying whether a patient has received an asthma medication before the `index_date`.
23
    - `latest_asthma_med_date`: Date of the most recent prescription of an asthma medication before the `index_date`.
24
25
```ehrql
26
from ehrql import create_dataset
27
from ehrql.tables.core import patients, medications
28
29
asthma_codes = ["39113311000001107", "39113611000001102"]
30
31
dataset = create_dataset()
32
33
index_date = "2023-03-31"
34
35
dataset.age = patients.age_on(index_date)
36
dataset.define_population(dataset.age > 18)
37
38
latest_asthma_med = (
39
    medications.where(medications.dmd_code.is_in(asthma_codes))
40
    .where(medications.date <= index_date)
41
    .sort_by(medications.date)
42
    .last_for_patient()
43
)
44
45
dataset.has_asthma_med = latest_asthma_med.exists_for_patient()
46
dataset.latest_asthma_med_date = latest_asthma_med.date
47
```
48
49
## Specifying test data and expectations
50
51
Next, you need to provide (1) data for test patients and (2) specify the data that you expect to see in the dataset for each patient after applying your ehrQL queries.
52
Test data and expectations are both defined in a nested dictionary called `test_data`.
53
54
### Data for test patients
55
56
To set up test patients and their data you need to use the following structure:
57
58
* **Patient ID**: Outermost dictionary keys represent patient IDs.
59
* **Table names**: Second-level dictionary keys denote table names from OpenSAFELY backends.
60
    * *One-row-per-patient* tables (e.g., [patients](../reference/schemas/core.md#patients)) are specified in a single dictionary
61
    * *Many-rows-per-patient* tables (e.g., [medications](../reference/schemas/core.md#medications)) are specified in a list that can contain multiple dictionaries, where each dictionary adds one row for the patient to the `medications` table.
62
* **Column names**: Third-level dictionary keys indicate column names in the tables.
63
64
```py
65
test_data = {
66
    1: {
67
        "patients": {
68
            "date_of_birth": date(2020, 1, 1),
69
            "sex": "female"},
70
        "medications": [
71
            {
72
                # First prescription of asthma medication
73
                "date": date(2010, 1, 1),
74
                "dmd_code": "39113311000001107",
75
            },
76
            {
77
                # Second prescription of asthma medication
78
                "date": date(2020, 1, 1),
79
                "dmd_code": "39113311000001107",
80
            },
81
        ],
82
        # Add expectations for patient 1 here
83
        # See next section of this guide
84
    },
85
}
86
```
87
88
In the example above we have created one test patient with the patient ID `1` and added test data for two tables: `patients` and `medications`.
89
The keys of the second-level dictionary match the names of the tables in the dataset definition.
90
To explore how these tables are structured you can look at the column names in the [table schemas](../reference/schemas.md) documentation.
91
Also note that some columns have constraints that need to be specified correctly in your test data.
92
For example, the `date_of_birth` column in the [patients table](../reference/schemas/core.md#patients.date_of_birth) has the following constraints: '*Always the first day of a month*' and '*Never `NULL`*'.
93
94
As mentioned above, adding data is different for *one-row-* and *many-rows-per-patient* tables:
95
96
* `patients` is a *one-row-per-patient* table, so you can only define one dictionary with one key for each column (`date_of_birth` and `sex`) that you want to populate.
97
    Note that you don't have to specify a value for each column in the underlying table.
98
    For example we did not specify `date_of_death` in the dictionary so the column will be missing with the value `None`.
99
    This only works because the `date_of_death` column does not have a '*Never `NULL`*' constraint.
100
* `medications` is a *many-rows-per-patient* table, so you can define a list containing multiple dictionaries (one for each row you want to add to the table) with one key for each column (`date` and `dmd_code`).
101
102
### Expectations for test patients
103
104
Once you have created data for your test patients you need to specify your expectations after applying the ehrQL in your dataset definition to the test patients.
105
Note that you have to specify a list for each table you use in your dataset definition, but this could also be an empty list.
106
First you need to indicate whether you expect the test patient to be in your defined population by providing `True` or `False` to the `expected_in_population` key.
107
If you are expecting a patient in your population you also need to specify the values for the columns you added to your dataset in the `expected_columns` dictionary.
108
Each key in the `expected_columns` dictionary represents one column you added to your dataset.
109
110
In the example below we created three test patients in a separate file (e.g., `analysis/test_dataset_definition.py`), each testing a different element of our dataset definition (e.g., `analysis/dataset_definition.py`):
111
112
* **Patient 1**: Expected in our population because the patient is older than 18 years on the `index_date`.
113
  The three entries in the medications table tests the ehrQL logic that selects the latest medication before the `index_date`.
114
* **Patient 2**: Expected in our population because the patient is older than 18 years on the `index_date`.
115
  However the patient does not have any entries in their `medications` table.
116
  Here we are testing the behaviour of our ehrQL query when a patient was never prescribed a code from the `asthma_codes` codelist
117
* **Patient 3**: Not expected in our population because the patient is younger than 18 years on the `index_date`.
118
119
At the top of your test script you need to import the `date` function and the `dataset` from your dataset definition that you want to test.
120
121
```py
122
from datetime import date
123
from dataset_definition import dataset
124
125
test_data = {
126
    # Expected in population with matching medication
127
    1: {
128
        "patients": {"date_of_birth": date(1950, 1, 1)},
129
        "medications": [
130
            {
131
                # First matching medication
132
                "date": date(2010, 1, 1),
133
                "dmd_code": "39113311000001107",
134
            },
135
            {
136
                # Latest matching medication before index_date
137
                "date": date(2020, 1, 1),
138
                "dmd_code": "39113311000001107",
139
            },
140
            {
141
                # Most recent matching medication, but after index_date
142
                "date": date(2023, 6, 1),
143
                "dmd_code": "39113311000001107",
144
            },
145
        ],
146
        "expected_in_population": True,
147
        "expected_columns": {
148
            "age": 73,
149
            "has_asthma_med": True,
150
            "latest_asthma_med_date": date(2020, 1, 1),
151
        },
152
    },
153
    # Expected in population without matching medication
154
    2: {
155
        "patients": [{"date_of_birth": date(1950, 1, 1)}],
156
        "medications": [],
157
        "expected_in_population": True,
158
        "expected_columns": {
159
            "age": 73,
160
            "has_asthma_med": False,
161
            "latest_asthma_med_date": None,
162
        },
163
    },
164
    # Not expected in population
165
    3: {
166
        "patients": [{"date_of_birth": date(2010, 1, 1)}],
167
        "medications": [],
168
        "expected_in_population": False,
169
    },
170
}
171
```
172
173
## Running the tests
174
175
Finally you can run your assurance tests to verify if your expectations were successful or failed.
176
177
### Option 1: Running tests through the terminal
178
179
To run your tests through the terminal, use the following command:
180
181
```
182
opensafely exec ehrql:v1 assure analysis/test_dataset_definition.py
183
```
184
185
### Option 2: Integrating tests into your `generate-dataset` action
186
187
You can also run your tests every time you execute a `generate-dataset` action by providing your test file using the `--test-data-file` flag in the `project.yaml` file:
188
189
```
190
actions:
191
  generate_dataset:
192
    run: >
193
        ehrql:v1 generate-dataset
194
        analysis/dataset_definition.py
195
        --test-data-file analysis/test_dataset_definition.py
196
        --output outputs/dataset.arrow
197
    outputs:
198
      highly_sensitive:
199
        population: outputs/dataset.arrow
200
201
```
202
203
## Interpreting the results
204
205
### Successful expectations
206
207
If the expected results match the results after applying the ehrQL logic and the test data meets all constraints you will see the following short message in your terminal:
208
209
```
210
Validate test data: All OK!
211
Validate results: All OK!
212
```
213
214
### Failed expectations
215
216
#### Failed constraint validations
217
218
If the test data you provided does not meet the constraints you will see a message with more information.
219
We recommend that you fix the errors so that the test data meets the constraints and is identical to the production data.
220
However, if you are sure that you want to test your ehrQL query with values that do not to the constraints, you can ignore the '*Validate test data*' section of the message.
221
222
```
223
Validate test data: Found errors with 1 patient(s)
224
 * Patient 1 had 1 test data value(s) that did not meet the constraint(s)
225
   * for column 'date_of_birth' with 'Constraint.NotNull()', got 'None'
226
Validate results: All OK!
227
```
228
229
#### Failed result expectations
230
231
You will see a message that helps you to diagnose and fix the problem if your expectations do not match the results.
232
The error message is structured by patient and contains one line for each column with a failed expectation.
233
Each line starts with the column name followed by the value that was specified in the test and the last value shows the result that was obtained after applying the ehrQL logic:
234
235
```
236
Validate test data: All OK!
237
Validate results: Found errors with 1 patient(s)
238
 * Patient 1 had unexpected value(s)
239
   * for column 'age', expected '72', got '73'
240
   * for column 'latest_asthma_med_date', expected '2020-01-01', got '2021-01-01'
241
```