Diff of /docs/how-to/dummy-data.md [000000] .. [e988c2]

Switch to unified view

a b/docs/how-to/dummy-data.md
1
# How to use dummy data in an ehrQL dataset definition
2
3
Because OpenSAFELY doesn't allow direct access to individual patient records, ehrQL allows you
4
to work with dummy data for developing analytic code on your own computer.
5
6
There are three ways to use dummy data in ehrQL.
7
8
1. [Let ehrQL generate a dummy dataset from your dataset definition](#let-ehrql-generate-a-dummy-dataset-from-your-dataset-definition)
9
10
1. [Supply your own dummy dataset](#supply-your-own-dummy-dataset)
11
12
1. [Supply your own dummy tables](#supply-your-own-dummy-tables)
13
14
15
## Let ehrQL generate a dummy dataset from your dataset definition
16
17
ehrQL is designed so that the same command can be used to output a dummy dataset when run on your own computer and then output a real dataset when run inside the secure environment as part of an OpenSAFELY pipeline.
18
19
Refer to the [documentation on using ehrQL in a study](../tutorial/using-ehrql-as-part-of-a-study/index.md) to see how this works.
20
21
You do not need to add anything to the dataset definition itself in order to generate a dummy dataset in this way. ehrQL will use the dataset definition to set up dummy data and generate matching patients.
22
23
By default, ten patients will be generated in a dummy dataset. If you need to increase this number, you can configure it in the dataset definition with:
24
25
```
26
dataset.configure_dummy_data(population_size=1000)
27
```
28
29
:warning: Increasing the population size will increase the time required to generate the dataset.
30
31
32
## Supply your own dummy dataset
33
34
You can provide a dummy dataset file in the following formats.
35
36
|Format        |File extension|
37
|--------------|--------------|
38
|CSV           |.csv          |
39
|Compressed CSV|.csv.gz       |
40
|Arrow         |.arrow        |
41
42
:warning: Your dummy dataset file must have the relevant file extension shown in the table
43
above.
44
45
For example, take this dataset definition:
46
47
```ehrql
48
from ehrql import create_dataset
49
from ehrql.tables.core import patients, medications
50
51
dataset = create_dataset()
52
53
dataset.define_population(patients.date_of_birth.is_on_or_before("1999-12-31"))
54
55
asthma_codes = ["39113311000001107", "39113611000001102"]
56
latest_asthma_med = (
57
    medications.where(medications.dmd_code.is_in(asthma_codes))
58
    .sort_by(medications.date)
59
    .last_for_patient()
60
)
61
62
dataset.asthma_med_date = latest_asthma_med.date
63
dataset.asthma_med_code = latest_asthma_med.dmd_code
64
```
65
66
And this dummy dataset, in a CSV file named `dummy.csv`:
67
68
|patient_id|asthma_med_date|asthma_med_code  |
69
|----------|---------------|-----------------|
70
|1         |2021-01-01     |39113311000001107|
71
|2         |2020-03-10     |39113311000001107|
72
|3         |2022-11-22     |39113311000001107|
73
|4         |2021-06-07     |39113611000001102|
74
|5         |2023-10-17     |39113611000001102|
75
76
Run the dataset definition with the dummy dataset file:
77
78
```
79
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-data-file dummy.csv
80
```
81
82
Now, instead of a generated dummy dataset, you'll see the data from the dummy data file that
83
you provided.
84
85
![A screenshot of VS Code, showing the terminal after the `opensafely exec` command was run](opensafely_exec_dummy_data_file.png)
86
87
88
### Boolean values
89
:warning:  Note that in CSV data, boolean values must be specified as "T" or "F".
90
91
Add an extra `had_asthma_event` to the dataset above:
92
93
```python
94
dataset.had_asthma_event = (
95
    medications.where(medications.dmd_code.is_in(asthma_codes))
96
    .exists_for_patient()
97
)
98
```
99
100
This is a boolean value and the updated dummy dataset would look like this:
101
102
|patient_id|asthma_med_date|asthma_med_code  |had_asthma_event|
103
|----------|---------------|-----------------|----------------|
104
|1         |2021-01-01     |39113311000001107|T               |
105
|2         |2020-03-10     |39113311000001107|T               |
106
|3         |2022-11-22     |39113311000001107|T               |
107
|4         |2021-06-07     |39113611000001102|T               |
108
|5         |2023-10-17     |39113611000001102|T               |
109
110
111
### Dummy dataset errors
112
113
ehrQL will check the column names, types and categorical values in your dummy dataset file. If
114
errors are found, they will be shown in the terminal output.
115
116
117
### Using ehrQL to generate a base dummy dataset
118
119
Generating your own dummy dataset can be difficult (or tedious!), especially if you have a
120
large number of output columns, or you need a large amount of data. One option is to use ehrQL
121
to generate an initial dataset, and then modify it as you need.
122
123
Run the dataset definition with an output path:
124
125
```
126
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --output dataset.csv
127
```
128
129
Now you can edit `dataset.csv` as you want, and rerun the dataset definition, using it as the
130
dummy data file:
131
132
```
133
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-data-file dataset.csv
134
```
135
136
## Supply your own dummy tables
137
138
Instead of supplying a single dummy dataset file that contains the exact output data, you can
139
provide ehrQL with a path to a folder of dummy data tables. Dummy data tables are supplied as
140
CSV files, with one file per table, representing the data in the backend database. Refer to
141
the [table schema reference](../reference/schemas.md) for documentation on the available tables
142
and columns. ehrQL will use the dummy tables as the backend data from which to extract the dataset. This is a good way to test that your dataset
143
definition is working as expected.
144
145
The [running ehrQL as a standalone action](../explanation/running-ehrql.md#2-running-ehrql-as-a-standalone-action-via-opensafely-exec) documentation
146
demonstrates how to to run a dataset definition against the example dummy tables in the
147
ehrQL tutorial.
148
149
### Generating dummy tables
150
151
ehrQL can be used to create dummy tables for you, using your dataset definition to determine the
152
specific tables that are required.
153
154
Try this out by running the following command against the simple dataset definition above:
155
156
```
157
opensafely exec ehrql:v1 create-dummy-tables dataset_definition.py dummy-folder
158
```
159
160
![A screenshot of VS Code, showing the terminal after the `create-dummy-tables` command was run](opensafely_exec_create_dummy_tables.png)
161
162
A new folder, `dummy-folder`, has been created, which contains just the two tables that the
163
dataset definition requires - `patients.csv` and `medications.csv`.
164
165
Now you can run ehrQl with these generated tables instead:
166
167
```
168
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables dummy-folder
169
```