|
a |
|
b/docs/how-to/dummy-data.md |
|
|
1 |
# How to use dummy data in an ehrQL dataset definition |
|
|
2 |
|
|
|
3 |
Because OpenSAFELY doesn't allow direct access to individual patient records, ehrQL allows you |
|
|
4 |
to work with dummy data for developing analytic code on your own computer. |
|
|
5 |
|
|
|
6 |
There are three ways to use dummy data in ehrQL. |
|
|
7 |
|
|
|
8 |
1. [Let ehrQL generate a dummy dataset from your dataset definition](#let-ehrql-generate-a-dummy-dataset-from-your-dataset-definition) |
|
|
9 |
|
|
|
10 |
1. [Supply your own dummy dataset](#supply-your-own-dummy-dataset) |
|
|
11 |
|
|
|
12 |
1. [Supply your own dummy tables](#supply-your-own-dummy-tables) |
|
|
13 |
|
|
|
14 |
|
|
|
15 |
## Let ehrQL generate a dummy dataset from your dataset definition |
|
|
16 |
|
|
|
17 |
ehrQL is designed so that the same command can be used to output a dummy dataset when run on your own computer and then output a real dataset when run inside the secure environment as part of an OpenSAFELY pipeline. |
|
|
18 |
|
|
|
19 |
Refer to the [documentation on using ehrQL in a study](../tutorial/using-ehrql-as-part-of-a-study/index.md) to see how this works. |
|
|
20 |
|
|
|
21 |
You do not need to add anything to the dataset definition itself in order to generate a dummy dataset in this way. ehrQL will use the dataset definition to set up dummy data and generate matching patients. |
|
|
22 |
|
|
|
23 |
By default, ten patients will be generated in a dummy dataset. If you need to increase this number, you can configure it in the dataset definition with: |
|
|
24 |
|
|
|
25 |
``` |
|
|
26 |
dataset.configure_dummy_data(population_size=1000) |
|
|
27 |
``` |
|
|
28 |
|
|
|
29 |
:warning: Increasing the population size will increase the time required to generate the dataset. |
|
|
30 |
|
|
|
31 |
|
|
|
32 |
## Supply your own dummy dataset |
|
|
33 |
|
|
|
34 |
You can provide a dummy dataset file in the following formats. |
|
|
35 |
|
|
|
36 |
|Format |File extension| |
|
|
37 |
|--------------|--------------| |
|
|
38 |
|CSV |.csv | |
|
|
39 |
|Compressed CSV|.csv.gz | |
|
|
40 |
|Arrow |.arrow | |
|
|
41 |
|
|
|
42 |
:warning: Your dummy dataset file must have the relevant file extension shown in the table |
|
|
43 |
above. |
|
|
44 |
|
|
|
45 |
For example, take this dataset definition: |
|
|
46 |
|
|
|
47 |
```ehrql |
|
|
48 |
from ehrql import create_dataset |
|
|
49 |
from ehrql.tables.core import patients, medications |
|
|
50 |
|
|
|
51 |
dataset = create_dataset() |
|
|
52 |
|
|
|
53 |
dataset.define_population(patients.date_of_birth.is_on_or_before("1999-12-31")) |
|
|
54 |
|
|
|
55 |
asthma_codes = ["39113311000001107", "39113611000001102"] |
|
|
56 |
latest_asthma_med = ( |
|
|
57 |
medications.where(medications.dmd_code.is_in(asthma_codes)) |
|
|
58 |
.sort_by(medications.date) |
|
|
59 |
.last_for_patient() |
|
|
60 |
) |
|
|
61 |
|
|
|
62 |
dataset.asthma_med_date = latest_asthma_med.date |
|
|
63 |
dataset.asthma_med_code = latest_asthma_med.dmd_code |
|
|
64 |
``` |
|
|
65 |
|
|
|
66 |
And this dummy dataset, in a CSV file named `dummy.csv`: |
|
|
67 |
|
|
|
68 |
|patient_id|asthma_med_date|asthma_med_code | |
|
|
69 |
|----------|---------------|-----------------| |
|
|
70 |
|1 |2021-01-01 |39113311000001107| |
|
|
71 |
|2 |2020-03-10 |39113311000001107| |
|
|
72 |
|3 |2022-11-22 |39113311000001107| |
|
|
73 |
|4 |2021-06-07 |39113611000001102| |
|
|
74 |
|5 |2023-10-17 |39113611000001102| |
|
|
75 |
|
|
|
76 |
Run the dataset definition with the dummy dataset file: |
|
|
77 |
|
|
|
78 |
``` |
|
|
79 |
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-data-file dummy.csv |
|
|
80 |
``` |
|
|
81 |
|
|
|
82 |
Now, instead of a generated dummy dataset, you'll see the data from the dummy data file that |
|
|
83 |
you provided. |
|
|
84 |
|
|
|
85 |
 |
|
|
86 |
|
|
|
87 |
|
|
|
88 |
### Boolean values |
|
|
89 |
:warning: Note that in CSV data, boolean values must be specified as "T" or "F". |
|
|
90 |
|
|
|
91 |
Add an extra `had_asthma_event` to the dataset above: |
|
|
92 |
|
|
|
93 |
```python |
|
|
94 |
dataset.had_asthma_event = ( |
|
|
95 |
medications.where(medications.dmd_code.is_in(asthma_codes)) |
|
|
96 |
.exists_for_patient() |
|
|
97 |
) |
|
|
98 |
``` |
|
|
99 |
|
|
|
100 |
This is a boolean value and the updated dummy dataset would look like this: |
|
|
101 |
|
|
|
102 |
|patient_id|asthma_med_date|asthma_med_code |had_asthma_event| |
|
|
103 |
|----------|---------------|-----------------|----------------| |
|
|
104 |
|1 |2021-01-01 |39113311000001107|T | |
|
|
105 |
|2 |2020-03-10 |39113311000001107|T | |
|
|
106 |
|3 |2022-11-22 |39113311000001107|T | |
|
|
107 |
|4 |2021-06-07 |39113611000001102|T | |
|
|
108 |
|5 |2023-10-17 |39113611000001102|T | |
|
|
109 |
|
|
|
110 |
|
|
|
111 |
### Dummy dataset errors |
|
|
112 |
|
|
|
113 |
ehrQL will check the column names, types and categorical values in your dummy dataset file. If |
|
|
114 |
errors are found, they will be shown in the terminal output. |
|
|
115 |
|
|
|
116 |
|
|
|
117 |
### Using ehrQL to generate a base dummy dataset |
|
|
118 |
|
|
|
119 |
Generating your own dummy dataset can be difficult (or tedious!), especially if you have a |
|
|
120 |
large number of output columns, or you need a large amount of data. One option is to use ehrQL |
|
|
121 |
to generate an initial dataset, and then modify it as you need. |
|
|
122 |
|
|
|
123 |
Run the dataset definition with an output path: |
|
|
124 |
|
|
|
125 |
``` |
|
|
126 |
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --output dataset.csv |
|
|
127 |
``` |
|
|
128 |
|
|
|
129 |
Now you can edit `dataset.csv` as you want, and rerun the dataset definition, using it as the |
|
|
130 |
dummy data file: |
|
|
131 |
|
|
|
132 |
``` |
|
|
133 |
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-data-file dataset.csv |
|
|
134 |
``` |
|
|
135 |
|
|
|
136 |
## Supply your own dummy tables |
|
|
137 |
|
|
|
138 |
Instead of supplying a single dummy dataset file that contains the exact output data, you can |
|
|
139 |
provide ehrQL with a path to a folder of dummy data tables. Dummy data tables are supplied as |
|
|
140 |
CSV files, with one file per table, representing the data in the backend database. Refer to |
|
|
141 |
the [table schema reference](../reference/schemas.md) for documentation on the available tables |
|
|
142 |
and columns. ehrQL will use the dummy tables as the backend data from which to extract the dataset. This is a good way to test that your dataset |
|
|
143 |
definition is working as expected. |
|
|
144 |
|
|
|
145 |
The [running ehrQL as a standalone action](../explanation/running-ehrql.md#2-running-ehrql-as-a-standalone-action-via-opensafely-exec) documentation |
|
|
146 |
demonstrates how to to run a dataset definition against the example dummy tables in the |
|
|
147 |
ehrQL tutorial. |
|
|
148 |
|
|
|
149 |
### Generating dummy tables |
|
|
150 |
|
|
|
151 |
ehrQL can be used to create dummy tables for you, using your dataset definition to determine the |
|
|
152 |
specific tables that are required. |
|
|
153 |
|
|
|
154 |
Try this out by running the following command against the simple dataset definition above: |
|
|
155 |
|
|
|
156 |
``` |
|
|
157 |
opensafely exec ehrql:v1 create-dummy-tables dataset_definition.py dummy-folder |
|
|
158 |
``` |
|
|
159 |
|
|
|
160 |
 |
|
|
161 |
|
|
|
162 |
A new folder, `dummy-folder`, has been created, which contains just the two tables that the |
|
|
163 |
dataset definition requires - `patients.csv` and `medications.csv`. |
|
|
164 |
|
|
|
165 |
Now you can run ehrQl with these generated tables instead: |
|
|
166 |
|
|
|
167 |
``` |
|
|
168 |
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables dummy-folder |
|
|
169 |
``` |