|
a |
|
b/docs/explanation/running-ehrql.md |
|
|
1 |
You can run ehrQL in two places: |
|
|
2 |
|
|
|
3 |
* on your own computer, |
|
|
4 |
where you can try out ehrQL against _dummy tables_, |
|
|
5 |
and test that your analysis code runs correctly |
|
|
6 |
* on an OpenSAFELY backend database, |
|
|
7 |
to use ehrQL with _real tables_ |
|
|
8 |
|
|
|
9 |
## Running ehrQL on your own computer against dummy tables |
|
|
10 |
|
|
|
11 |
There are three ways to run ehrQL on your own computer against dummy tables: |
|
|
12 |
|
|
|
13 |
1. using the OpenSAFELY VS Code extension, to explore the contents of ehrQL elements and datasets |
|
|
14 |
1. as a standalone action, to test your dataset definition, via `opensafely exec` |
|
|
15 |
1. as the first step in an OpenSAFELY pipeline, to test the whole pipeline, via `opensafely run` |
|
|
16 |
|
|
|
17 |
### 1. Exploring ehrQL using the OpenSAFELY VS Code extension |
|
|
18 |
|
|
|
19 |
The [OpenSAFELY VS Code extension](../explanation/vscode-extension.md) uses dummy tables to allow you to inspect the contents of ehrQL tables, columns, datasets and queries as you are writing |
|
|
20 |
ehrQL in VS Code. |
|
|
21 |
|
|
|
22 |
|
|
|
23 |
### 2. Running ehrQL as a standalone action via `opensafely exec` |
|
|
24 |
|
|
|
25 |
To actually run your ehrQL queries against real tables, |
|
|
26 |
you need to write a dataset definition and save it in a file. |
|
|
27 |
|
|
|
28 |
But first, while you are developing an ehrQL query, |
|
|
29 |
you can run your dataset definition against dummy tables |
|
|
30 |
to produce an output file that you can inspect. |
|
|
31 |
|
|
|
32 |
:computer: Copy and paste the following dataset definition |
|
|
33 |
into a new file called `dataset_definition.py`: |
|
|
34 |
|
|
|
35 |
```ehrql |
|
|
36 |
from ehrql import create_dataset |
|
|
37 |
from ehrql.tables.core import patients, medications |
|
|
38 |
|
|
|
39 |
dataset = create_dataset() |
|
|
40 |
|
|
|
41 |
dataset.define_population(patients.date_of_birth.is_on_or_before("1999-12-31")) |
|
|
42 |
|
|
|
43 |
asthma_codes = ["39113311000001107", "39113611000001102"] |
|
|
44 |
latest_asthma_med = ( |
|
|
45 |
medications.where(medications.dmd_code.is_in(asthma_codes)) |
|
|
46 |
.sort_by(medications.date) |
|
|
47 |
.last_for_patient() |
|
|
48 |
) |
|
|
49 |
|
|
|
50 |
dataset.med_date = latest_asthma_med.date |
|
|
51 |
dataset.med_code = latest_asthma_med.dmd_code |
|
|
52 |
``` |
|
|
53 |
|
|
|
54 |
:grey_question: Can you work out what the dataset definition will generate? |
|
|
55 |
|
|
|
56 |
Make sure you save the file! |
|
|
57 |
|
|
|
58 |
:computer: Use the command below to run your dataset definition with ehrQL. |
|
|
59 |
|
|
|
60 |
``` |
|
|
61 |
opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv |
|
|
62 |
``` |
|
|
63 |
|
|
|
64 |
:notepad_spiral: ehrQL dataset definitions are written in Python. |
|
|
65 |
But, unlike typical Python code, |
|
|
66 |
we instead run the dataset definition via the OpenSAFELY CLI. |
|
|
67 |
The OpenSAFELY CLI internally uses a correctly configured version of Python |
|
|
68 |
to run the dataset definition. |
|
|
69 |
|
|
|
70 |
#### What each part of this command does |
|
|
71 |
|
|
|
72 |
* `opensafely exec ehrql:v1` uses the OpenSAFELY CLI to run ehrQL. |
|
|
73 |
The `v1` after the `:` refers to the version of ehrQL being used. |
|
|
74 |
* `generate-dataset` instructs ehrQL to generate a dataset from the dataset definition. |
|
|
75 |
* `dataset_definition.py` specifies the filename of the dataset definition to use. |
|
|
76 |
* The dataset definition file is in the directory that we are running `opensafely exec` |
|
|
77 |
so we do not need to specify the full path to the file in this case. |
|
|
78 |
* `--dummy-tables example-data` specifies that the dummy CSV input data is in the `example-data` directory. |
|
|
79 |
* :notepad_spiral: If the `--dummy-tables` option is omitted, |
|
|
80 |
randomly generated data will be used instead. |
|
|
81 |
* `--output output/dataset.csv` specifies the path to the output CSV file. |
|
|
82 |
* :notepad_spiral: If the `--output` option is omitted, |
|
|
83 |
the output will be not be saved to a file, |
|
|
84 |
but displayed on screen. |
|
|
85 |
|
|
|
86 |
#### What you should see when you run the command |
|
|
87 |
|
|
|
88 |
You should see output displayed similar to this: |
|
|
89 |
|
|
|
90 |
``` |
|
|
91 |
2023-04-19 08:53:41 [info ] Compiling dataset definition from dataset_definition.py [ehrql.main] |
|
|
92 |
2023-04-19 08:53:41 [info ] Generating dummy dataset [ehrql.main] |
|
|
93 |
2023-04-19 08:53:41 [info ] Reading CSV data from example-data [ehrql.main] |
|
|
94 |
2023-04-19 08:53:41 [info ] Building dataset and writing results [ehrql.main] |
|
|
95 |
``` |
|
|
96 |
|
|
|
97 |
:notepad_spiral: The date and time you see will differ from that here. |
|
|
98 |
|
|
|
99 |
#### The output file |
|
|
100 |
|
|
|
101 |
The output will be stored in a file called `dataset.csv` in the `output` directory. |
|
|
102 |
|
|
|
103 |
The file will contain the following CSV data: |
|
|
104 |
|
|
|
105 |
``` |
|
|
106 |
patient_id,med_date,med_code |
|
|
107 |
0,2014-01-11,39113611000001102 |
|
|
108 |
1,2018-09-21,39113311000001107 |
|
|
109 |
4,2017-05-11,39113611000001102 |
|
|
110 |
5,2019-07-06,39113611000001102 |
|
|
111 |
6,, |
|
|
112 |
7,, |
|
|
113 |
8,, |
|
|
114 |
9,, |
|
|
115 |
``` |
|
|
116 |
|
|
|
117 |
:notepad_spiral: The bottom 4 rows in the generated dataset show that there are 4 patients in the defined population that do not have any record for the medications specified in the dataset definition. |
|
|
118 |
|
|
|
119 |
:computer: Try running the ehrQL dataset definition again, |
|
|
120 |
without the `--dummy-tables` and `--output` options: |
|
|
121 |
|
|
|
122 |
``` |
|
|
123 |
opensafely exec ehrql:v1 generate-dataset dataset_definition.py |
|
|
124 |
``` |
|
|
125 |
|
|
|
126 |
By not specifying the dummy tables to use, |
|
|
127 |
random data is used as the data source to generate the dataset, |
|
|
128 |
instead of using the sample dummy tables that you downloaded previously. |
|
|
129 |
|
|
|
130 |
By not specifying the output file, |
|
|
131 |
the output is displayed on screen. |
|
|
132 |
This can be useful for quickly seeing output |
|
|
133 |
while working on a dataset definition, |
|
|
134 |
instead of saving to a file, |
|
|
135 |
and then opening the file. |
|
|
136 |
|
|
|
137 |
|
|
|
138 |
#### When things go wrong |
|
|
139 |
|
|
|
140 |
If your dataset definition contains some invalid ehrQL, |
|
|
141 |
an error message will be displayed on the screen. |
|
|
142 |
|
|
|
143 |
This is one example: |
|
|
144 |
|
|
|
145 |
``` |
|
|
146 |
$ opensafely exec ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv |
|
|
147 |
2023-04-21 17:53:42 [info ] Compiling dataset definition from dataset_definition.py [ehrql.main] |
|
|
148 |
Error loading file 'dataset_definition.py': |
|
|
149 |
``` |
|
|
150 |
```pycon |
|
|
151 |
Traceback (most recent call last): |
|
|
152 |
File "/workspace/dataset_definition.py", line 10, in <module> |
|
|
153 |
dataset.med_date = latest_asthma_med.dat |
|
|
154 |
^^^^^^^^^^^^^^^^^^^ |
|
|
155 |
AttributeError: 'medications' object has no attribute 'dat' |
|
|
156 |
``` |
|
|
157 |
|
|
|
158 |
Refer to [the catalogue of errors](../how-to/errors.md) for help with interpreting error messages. |
|
|
159 |
|
|
|
160 |
### 3. Running ehrQL in an OpenSAFELY pipeline via `opensafely run` |
|
|
161 |
|
|
|
162 |
To run your ehrQL queries as part of an OpenSAFELY pipeline with `opensafely run`, |
|
|
163 |
you need to have a file called `project.yaml`. |
|
|
164 |
|
|
|
165 |
:notepad_spiral: There is considerably more technical detail on [the project pipeline in the OpenSAFELY documentation](https://docs.opensafely.org/actions-pipelines/). |
|
|
166 |
|
|
|
167 |
:computer: Copy the following into a file called |
|
|
168 |
`project.yaml`: |
|
|
169 |
|
|
|
170 |
```yaml |
|
|
171 |
version: '4.0' |
|
|
172 |
|
|
|
173 |
actions: |
|
|
174 |
generate_dataset: |
|
|
175 |
run: ehrql:v1 generate-dataset dataset_definition.py --dummy-tables example-data --output output/dataset.csv.gz |
|
|
176 |
outputs: |
|
|
177 |
highly_sensitive: |
|
|
178 |
dataset: output/dataset.csv.gz |
|
|
179 |
|
|
|
180 |
summarise_dataset: |
|
|
181 |
run: python:latest summarise_dataset.py |
|
|
182 |
needs: [generate_dataset] |
|
|
183 |
outputs: |
|
|
184 |
moderately_sensitive: |
|
|
185 |
dataset: output/summary.txt |
|
|
186 |
``` |
|
|
187 |
|
|
|
188 |
:notepad_spiral: Users already familiar with the [OpenSAFELY research template](https://github.com/opensafely/research-template) may notice that the research template already includes a basic `project.yaml` file that can be edited. |
|
|
189 |
Here, for the purposes of this tutorial, |
|
|
190 |
to skip setting up the template, |
|
|
191 |
we create this file entirely by hand. |
|
|
192 |
|
|
|
193 |
The `project.yaml` file defines two actions: `generate_dataset` and `summarise_dataset`. |
|
|
194 |
Each of these actions specifies one or more `outputs`. |
|
|
195 |
|
|
|
196 |
:notepad_spiral: The definitions of "highly sensitive" and "moderately sensitive" are explained in the [`project.yaml` documentation](https://docs.opensafely.org/actions-pipelines/#projectyaml-format). |
|
|
197 |
|
|
|
198 |
The `generate_dataset` action's `run:` command should look familiar from the previous section. |
|
|
199 |
However, note that the `--output` path is now to a compressed CSV file (`dataset.csv.gz`). |
|
|
200 |
|
|
|
201 |
:notepad_spiral: We recommend the use of compressed CSV files when generating a dataset definition as part of an OpenSAFELY pipeline. |
|
|
202 |
|
|
|
203 |
`summarise_dataset` uses a Python script called `summarise_dataset.py`. |
|
|
204 |
Copy the following into a file called `summarise_dataset.py`. |
|
|
205 |
|
|
|
206 |
```python |
|
|
207 |
import pandas as pd |
|
|
208 |
|
|
|
209 |
dataframe = pd.read_csv("output/dataset.csv.gz") |
|
|
210 |
num_rows = len(dataframe) |
|
|
211 |
|
|
|
212 |
with open("output/summary.txt", "w") as f: |
|
|
213 |
f.write(f"There are {num_rows} patients in the population\n") |
|
|
214 |
``` |
|
|
215 |
|
|
|
216 |
:grey_question: Even if you don't know how to use [pandas](https://pandas.pydata.org/), |
|
|
217 |
can you guess at what this code might do before you run the OpenSAFELY project? |
|
|
218 |
|
|
|
219 |
:computer: Use the command below to run all of the actions |
|
|
220 |
in `project.yaml`: |
|
|
221 |
|
|
|
222 |
opensafely run run_all |
|
|
223 |
|
|
|
224 |
:notepad_spiral: If is this is the first time you have used `opensafely exec`, |
|
|
225 |
the OpenSAFELY CLI may fetch some other Docker images (`python` and `busybox`) needed to run the action. |
|
|
226 |
|
|
|
227 |
#### What you should see when you run the command |
|
|
228 |
|
|
|
229 |
You should see in the logs output displayed similar to this: |
|
|
230 |
|
|
|
231 |
``` |
|
|
232 |
$ opensafely run run_all |
|
|
233 |
|
|
|
234 |
Running actions: generate_dataset, summarise_dataset |
|
|
235 |
|
|
|
236 |
jobrunner.run loop started |
|
|
237 |
generate_dataset: Preparing your code and workspace files |
|
|
238 |
... |
|
|
239 |
summarise_dataset: Extracting output file: output/summary.txt |
|
|
240 |
summarise_dataset: Finished recording results |
|
|
241 |
summarise_dataset: Completed successfully |
|
|
242 |
summarise_dataset: Cleaning up container and volume |
|
|
243 |
|
|
|
244 |
=> generate_dataset |
|
|
245 |
Completed successfully |
|
|
246 |
|
|
|
247 |
log file: metadata/generate_dataset.log |
|
|
248 |
outputs: |
|
|
249 |
output/dataset.csv.gz - highly_sensitive |
|
|
250 |
|
|
|
251 |
=> summarise_dataset |
|
|
252 |
Completed successfully |
|
|
253 |
|
|
|
254 |
log file: metadata/summarise_dataset.log |
|
|
255 |
outputs: |
|
|
256 |
output/summary.txt - moderately_sensitive |
|
|
257 |
``` |
|
|
258 |
|
|
|
259 |
:notepad_spiral: Some of the middle lines of this log have been omitted. |
|
|
260 |
|
|
|
261 |
#### The output files |
|
|
262 |
|
|
|
263 |
The `generate_dataset` action will generate a compressed CSV file called `dataset.csv.gz` in the `output` directory. |
|
|
264 |
If you unzip this, you should see the same output as the previous example. |
|
|
265 |
|
|
|
266 |
The `summarise_dataset` action will generate a small text file called `summary.txt` in the `output` directory. |
|
|
267 |
This will tell you how many patients are in your population. |
|
|
268 |
|
|
|
269 |
## Running ehrQL on an OpenSAFELY backend database |
|
|
270 |
|
|
|
271 |
Once you are happy with your ehrQL queries and any analysis code, |
|
|
272 |
you can submit your project to run against real data in an OpenSAFELY backend database. |
|
|
273 |
|
|
|
274 |
To submit your project to run against real data, refer to the |
|
|
275 |
[existing documentation on using the OpenSAFELY jobs site](https://docs.opensafely.org/jobs-site). |
|
|
276 |
|
|
|
277 |
:notepad_spiral: You will require approval for an OpenSAFELY project, |
|
|
278 |
before you can submit your project to the jobs site. |
|
|
279 |
|
|
|
280 |
## Questions |
|
|
281 |
|
|
|
282 |
* :grey_question: Why would you use the ehrQL `show()` function? |
|
|
283 |
* :grey_question: Which `opensafely` command would you use to run just a dataset definition as a single action? |
|
|
284 |
* :grey_question: Which `opensafely` command would you use to run an entire OpenSAFELY project consisting of multiple actions? |