Switch to unified view

a b/docs/explanation/running-ehrql.md
1
You can run ehrQL in two places:
2
3
* on your own computer,
4
  where you can try out ehrQL against _dummy tables_,
5
  and test that your analysis code runs correctly
6
* on an OpenSAFELY backend database,
7
  to use ehrQL with _real tables_
8
9
## Running ehrQL on your own computer against dummy tables
10
11
There are three ways to run ehrQL on your own computer against dummy tables:
12
13
1. using the OpenSAFELY VS Code extension, to explore the contents of ehrQL elements and datasets
14
1. as a standalone action, to test your dataset definition, via `opensafely exec`
15
1. as the first step in an OpenSAFELY pipeline, to test the whole pipeline, via `opensafely run`
16
17
### 1. Exploring ehrQL using the OpenSAFELY VS Code extension
18
19
The [OpenSAFELY VS Code extension](../explanation/vscode-extension.md) uses dummy tables to allow you to inspect the contents of ehrQL tables, columns, datasets and queries as you are writing
20
ehrQL in VS Code.
21
22
23
### 2. Running ehrQL as a standalone action via `opensafely exec`
24
25
To actually run your ehrQL queries against real tables,
26
you need to write a dataset definition and save it in a file.
27
28
But first, while you are developing an ehrQL query,
29
you can run your dataset definition against dummy tables
30
to produce an output file that you can inspect.
31
32
:computer: Copy and paste the following dataset definition
33
into a new file called `dataset_definition.py`:
34
35
```ehrql
36
from ehrql import create_dataset
37
from ehrql.tables.core import patients, medications
38
39
dataset = create_dataset()
40
41
dataset.define_population(patients.date_of_birth.is_on_or_before("1999-12-31"))
42
43
asthma_codes = ["39113311000001107", "39113611000001102"]
44
latest_asthma_med = (
45
    medications.where(medications.dmd_code.is_in(asthma_codes))
46
    .sort_by(medications.date)
47
    .last_for_patient()
48
)
49
50
dataset.med_date = latest_asthma_med.date
51
dataset.med_code = latest_asthma_med.dmd_code
52
```
53
54
:grey_question: Can you work out what the dataset definition will generate?
55
56
Make sure you save the file!
57
58
:computer: Use the command below to run your dataset definition with ehrQL.
59
60
```
61
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv
62
```
63
64
:notepad_spiral: ehrQL dataset definitions are written in Python.
65
But, unlike typical Python code,
66
we instead run the dataset definition via the OpenSAFELY CLI.
67
The OpenSAFELY CLI internally uses a correctly configured version of Python
68
to run the dataset definition.
69
70
#### What each part of this command does
71
72
* `opensafely exec ehrql:v1` uses the OpenSAFELY CLI to run ehrQL.
73
  The `v1` after the `:` refers to the version of ehrQL being used.
74
* `generate-dataset` instructs ehrQL to generate a dataset from the dataset definition.
75
* `dataset_definition.py` specifies the filename of the dataset definition to use.
76
    * The dataset definition file is in the directory that we are running `opensafely exec`
77
      so we do not need to specify the full path to the file in this case.
78
* `--dummy-tables example-data` specifies that the dummy CSV input data is in the `example-data` directory.
79
    * :notepad_spiral: If the `--dummy-tables` option is omitted,
80
      randomly generated data will be used instead.
81
* `--output output/dataset.csv` specifies the path to the output CSV file.
82
    * :notepad_spiral: If the `--output` option is omitted,
83
      the output will be not be saved to a file,
84
      but displayed on screen.
85
86
#### What you should see when you run the command
87
88
You should see output displayed similar to this:
89
90
```
91
2023-04-19 08:53:41 [info     ] Compiling dataset definition from dataset_definition.py [ehrql.main]
92
2023-04-19 08:53:41 [info     ] Generating dummy dataset       [ehrql.main]
93
2023-04-19 08:53:41 [info     ] Reading CSV data from example-data [ehrql.main]
94
2023-04-19 08:53:41 [info     ] Building dataset and writing results [ehrql.main]
95
```
96
97
:notepad_spiral: The date and time you see will differ from that here.
98
99
#### The output file
100
101
The output will be stored in a file called `dataset.csv` in the `output` directory.
102
103
The file will contain the following CSV data:
104
105
```
106
patient_id,med_date,med_code
107
0,2014-01-11,39113611000001102
108
1,2018-09-21,39113311000001107
109
4,2017-05-11,39113611000001102
110
5,2019-07-06,39113611000001102
111
6,,
112
7,,
113
8,,
114
9,,
115
```
116
117
:notepad_spiral: The bottom 4 rows in the generated dataset show that there are 4 patients in the defined population that do not have any record for the medications specified in the dataset definition.
118
119
:computer: Try running the ehrQL dataset definition again,
120
without the `--dummy-tables` and `--output` options:
121
122
```
123
opensafely exec ehrql:v1 generate-dataset dataset_definition.py
124
```
125
126
By not specifying the dummy tables to use,
127
random data is used as the data source to generate the dataset,
128
instead of using the sample dummy tables that you downloaded previously.
129
130
By not specifying the output file,
131
the output is displayed on screen.
132
This can be useful for quickly seeing output
133
while working on a dataset definition,
134
instead of saving to a file,
135
and then opening the file.
136
137
138
#### When things go wrong
139
140
If your dataset definition contains some invalid ehrQL,
141
an error message will be displayed on the screen.
142
143
This is one example:
144
145
```
146
$ opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv
147
2023-04-21 17:53:42 [info     ] Compiling dataset definition from dataset_definition.py [ehrql.main]
148
Error loading file 'dataset_definition.py':
149
```
150
```pycon
151
Traceback (most recent call last):
152
  File "/workspace/dataset_definition.py", line 10, in <module>
153
    dataset.med_date = latest_asthma_med.dat
154
                       ^^^^^^^^^^^^^^^^^^^
155
AttributeError: 'medications' object has no attribute 'dat'
156
```
157
158
Refer to [the catalogue of errors](../how-to/errors.md) for help with interpreting error messages.
159
160
### 3. Running ehrQL in an OpenSAFELY pipeline via `opensafely run`
161
162
To run your ehrQL queries as part of an OpenSAFELY pipeline with `opensafely run`,
163
you need to have a file called `project.yaml`.
164
165
:notepad_spiral: There is considerably more technical detail on [the project pipeline in the OpenSAFELY documentation](https://docs.opensafely.org/actions-pipelines/).
166
167
:computer: Copy the following into a file called
168
`project.yaml`:
169
170
```yaml
171
version: '4.0'
172
173
actions:
174
  generate_dataset:
175
    run: ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv.gz
176
    outputs:
177
      highly_sensitive:
178
        dataset: output/dataset.csv.gz
179
180
  summarise_dataset:
181
    run: python:latest summarise_dataset.py
182
    needs: [generate_dataset]
183
    outputs:
184
     moderately_sensitive:
185
        dataset: output/summary.txt
186
```
187
188
:notepad_spiral: Users already familiar with the [OpenSAFELY research template](https://github.com/opensafely/research-template) may notice that the research template already includes a basic `project.yaml` file that can be edited.
189
Here, for the purposes of this tutorial,
190
to skip setting up the template,
191
we create this file entirely by hand.
192
193
The `project.yaml` file defines two actions: `generate_dataset` and `summarise_dataset`.
194
Each of these actions specifies one or more `outputs`.
195
196
:notepad_spiral: The definitions of "highly sensitive" and "moderately sensitive" are explained in the [`project.yaml` documentation](https://docs.opensafely.org/actions-pipelines/#projectyaml-format).
197
198
The `generate_dataset` action's `run:` command should look familiar from the previous section.
199
However, note that the `--output` path is now to a compressed CSV file (`dataset.csv.gz`).
200
201
:notepad_spiral: We recommend the use of compressed CSV files when generating a dataset definition as part of an OpenSAFELY pipeline.
202
203
`summarise_dataset` uses a Python script called `summarise_dataset.py`.
204
Copy the following into a file called `summarise_dataset.py`.
205
206
```python
207
import pandas as pd
208
209
dataframe = pd.read_csv("output/dataset.csv.gz")
210
num_rows = len(dataframe)
211
212
with open("output/summary.txt", "w") as f:
213
    f.write(f"There are {num_rows} patients in the population\n")
214
```
215
216
:grey_question: Even if you don't know how to use [pandas](https://pandas.pydata.org/),
217
can you guess at what this code might do before you run the OpenSAFELY project?
218
219
:computer: Use the command below to run all of the actions
220
in `project.yaml`:
221
222
    opensafely run run_all
223
224
:notepad_spiral: If is this is the first time you have used `opensafely exec`,
225
the OpenSAFELY CLI may fetch some other Docker images (`python` and `busybox`) needed to run the action.
226
227
#### What you should see when you run the command
228
229
You should see in the logs output displayed similar to this:
230
231
```
232
$ opensafely run run_all
233
234
Running actions: generate_dataset, summarise_dataset
235
236
jobrunner.run loop started
237
generate_dataset: Preparing your code and workspace files
238
...
239
summarise_dataset: Extracting output file: output/summary.txt
240
summarise_dataset: Finished recording results
241
summarise_dataset: Completed successfully
242
summarise_dataset: Cleaning up container and volume
243
244
=> generate_dataset
245
   Completed successfully
246
247
   log file: metadata/generate_dataset.log
248
   outputs:
249
     output/dataset.csv.gz  - highly_sensitive
250
251
=> summarise_dataset
252
   Completed successfully
253
254
   log file: metadata/summarise_dataset.log
255
   outputs:
256
     output/summary.txt  - moderately_sensitive
257
```
258
259
:notepad_spiral: Some of the middle lines of this log have been omitted.
260
261
#### The output files
262
263
The `generate_dataset` action will generate a compressed CSV file called `dataset.csv.gz` in the `output` directory.
264
If you unzip this, you should see the same output as the previous example.
265
266
The `summarise_dataset` action will generate a small text file called `summary.txt` in the `output` directory.
267
This will tell you how many patients are in your population.
268
269
## Running ehrQL on an OpenSAFELY backend database
270
271
Once you are happy with your ehrQL queries and any analysis code,
272
you can submit your project to run against real data in an OpenSAFELY backend database.
273
274
To submit your project to run against real data, refer to the
275
[existing documentation on using the OpenSAFELY jobs site](https://docs.opensafely.org/jobs-site).
276
277
:notepad_spiral: You will require approval for an OpenSAFELY project,
278
before you can submit your project to the jobs site.
279
280
## Questions
281
282
* :grey_question: Why would you use the ehrQL `show()` function?
283
* :grey_question: Which `opensafely` command would you use to run just a dataset definition as a single action?
284
* :grey_question: Which `opensafely` command would you use to run an entire OpenSAFELY project consisting of multiple actions?