Switch to unified view

a b/docs/explanation/measures.md
1
# Using the measures framework
2
3
## Introduction
4
5
The measures framework is used to calculate **quotients** (i.e. a numerator divided by a denominator) and to see how these **vary over time** and when broken down by different **groupings**.
6
7
The numerators, denominators and groups are all defined using ehrQL queries, just like we would use with a dataset definition. But the way those definitions are used is a bit different.
8
9
To explain the concepts involved we'll start with a basic but complete example and walk through each of the elements.
10
11
12
## Basic example
13
14
Suppose we want to know what proportion of the patients prescribed atorvastatin tablets in a given month were prescribed 80mg tablets, with the results broken down by sex and calculated for each of the first six months of 2022. Here's how we'd do that using the measures framework:
15
16
### Code
17
18
```ehrql
19
from ehrql import INTERVAL, case, create_measures, months, when
20
from ehrql.tables.core import medications, patients
21
22
# Every measure definitions file must include this line
23
measures = create_measures()
24
25
# Disable disclosure control for demonstration purposes.
26
# Values will neither be suppressed nor rounded.
27
measures.configure_disclosure_control(enabled=False)
28
29
# Small codelist for demonstration purposes; the real list would be longer
30
atorvastatin_tablets = [
31
    "39733211000001101",
32
    "39695411000001103",
33
    "39733011000001106",
34
    "39733111000001107",
35
]
36
37
atorvastatin_80mg_tablets = [
38
    "39733211000001101",
39
]
40
41
# The use of the special INTERVAL placeholder below is the key part of
42
# any measure definition as it allows the definition to be evaluated
43
# over a range of different intervals, rather than a fixed pair of dates
44
rx_in_interval = medications.where(
45
    medications.date.is_during(INTERVAL)
46
)
47
atorvastatin_rx = rx_in_interval.where(
48
    medications.dmd_code.is_in(atorvastatin_tablets)
49
)
50
atorvastatin_80_rx = rx_in_interval.where(
51
    medications.dmd_code.is_in(atorvastatin_80mg_tablets)
52
)
53
54
has_recorded_sex = patients.sex.is_in(["male", "female"])
55
56
measures.define_measure(
57
    name="atorva_80",
58
    numerator=atorvastatin_80_rx.exists_for_patient(),
59
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
60
    group_by={
61
        "sex": patients.sex
62
    },
63
    intervals=months(6).starting_on("2022-01-01"),
64
)
65
```
66
67
### Running the example
68
69
You can save this file as `measure_definition.py` and then run the [`generate-measures`](../reference/cli.md#generate-measures) command on it:
70
```
71
opensafely exec ehrql:v1 generate-measures measure_definition.py --output measures.csv
72
```
73
74
### Results
75
76
This should produce a file called `measures.csv` whose contents look something like this:
77
78
measure | interval_start | interval_end | ratio | numerator | denominator | sex
79
-- | -- | -- | -- | -- | -- | --
80
atorva_80 | 2022-01-01 | 2022-01-31 | 0 | 0 | 5 | female
81
atorva_80 | 2022-01-01 | 2022-01-31 | 0.2 | 2 | 10 | male
82
atorva_80 | 2022-02-01 | 2022-02-28 | 0 | 0 | 3 | female
83
atorva_80 | 2022-02-01 | 2022-02-28 | 0.2 | 1 | 5 | male
84
atorva_80 | 2022-03-01 | 2022-03-31 | 0.5 | 3 | 6 | female
85
atorva_80 | 2022-03-01 | 2022-03-31 | 0.2 | 2 | 10 | male
86
atorva_80 | 2022-04-01 | 2022-04-30 | 0.444 | 4 | 9 | male
87
atorva_80 | 2022-04-01 | 2022-04-30 | 0.125 | 1 | 8 | female
88
atorva_80 | 2022-05-01 | 2022-05-31 | 0.286 | 2 | 7 | male
89
atorva_80 | 2022-05-01 | 2022-05-31 | 0.2 | 1 | 5 | female
90
atorva_80 | 2022-06-01 | 2022-06-30 | 0.25 | 2 | 8 | male
91
atorva_80 | 2022-06-01 | 2022-06-30 | 0.5 | 5 | 10 | female
92
93
94
Here the **`measure`** column always has the same value (`atorva_80`) because we only have a single measure defined, but if we had defined multiple measures then this would tell us which measure each row relates to.
95
96
The **`interval_start`** and **`interval_end`** columns show the date range (inclusive) covered by each row. In this case there are 12 rows because each of the six months appears twice: once where `sex=male` and again where `sex=female`.
97
98
The **`denominator`** column gives the number of patients for each interval-sex combination which match our denominator definition. The **`numerator`** column gives the number of patients which _also_ match the numerator definition. That is, a patient must match both the denominator and numerator definitions to be included in the numerator.
99
100
The **`ratio`** is simply `numerator` divided by `denominator`.
101
102
The **`sex`** column gives the value of the expression we supplied when defining `group_by`. Had we defined any other groupings (say `age_band`) then they would appear as additional columns here.
103
104
105
## Core concepts
106
107
### Defining a measure
108
109
A measure definition always starts by creating a measures collection object:
110
```python
111
measures = create_measures()
112
```
113
114
Each individual measure is then defined by calling [`measures.define_measure()`](../reference/language.md#Measures.define_measure):
115
```python
116
measures.define_measure(
117
    name="atorva_80",
118
    numerator=atorvastatin_80_rx.exists_for_patient(),
119
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
120
    group_by={
121
        "sex": patients.sex
122
    },
123
    intervals=months(6).starting_on("2022-01-01"),
124
)
125
```
126
127
The **`name`** argument is just used so we can identify our measure in
128
the output. It can be anything you like so long as it contains only
129
alphanumeric and underscore characters and starts with a letter.
130
131
The **`denominator`** argument defines the condition that patients must
132
match to be included in the denominator i.e. it is a [boolean patient
133
series](../reference/language.md#BoolPatientSeries). (It is also possible
134
to supply an integer here, but we'll cover this later.)
135
136
The **`numerator`** arguments defines the _additional_ condition that
137
patients must also meet to be included in the numerator. (Again, this
138
can also be an integer as we'll discuss later.)
139
140
The **`group_by`** argument defines how we would like our results broken
141
down. It's optional – leaving it out means that we'll get a single row
142
for each time interval – but most measures define at least one set of
143
groups. It is supplied as a dictionary mapping group names to group
144
definitions.
145
146
As we saw in the example above, each group name defined here ends up as
147
column in the results. Each group definition should be a "categorical"
148
patient series, that is: a patient series which takes only a fixed set
149
of values.
150
151
The **`interval`** argument defines the time periods over which the
152
measure will be calculated. This is given as a list of start date/end
153
date pairs but, as typing these all out by hand would be laborious, we
154
provide several convenience functions for generating such lists:
155
[`years`](../reference/language.md#years.starting_on),
156
[`months`](../reference/language.md#months.starting_on) and
157
[`weeks`](../reference/language.md#weeks.starting_on).
158
159
160
### The `INTERVAL` placeholder
161
162
The key difference between the ehrQL we'd write for a dataset definition and the ehrQL we write for a measure comes in this line here:
163
```python
164
rx_in_interval = medications.where(
165
    medications.date.is_during(INTERVAL)
166
)
167
```
168
169
This filters the medications table to include just those prescribed during "the current interval" without specifying exactly what the interval is.
170
171
In a dataset definition we would need to reference a specific pair of dates here e.g.:
172
```python
173
rx_in_interval = medications.where(
174
    medications.date.is_on_or_between("2022-01-01", "2022-01-31")
175
)
176
```
177
178
But the query we use in our measure definition needs to be evaluated over a range of date intervals, not just one. We can handle this by using the special `INTERVAL` value as a placeholder for the start and end dates of the intervals:
179
```python
180
rx_in_interval = medications.where(
181
    medications.date.is_on_or_between(INTERVAL.start_date, INTERVAL.end_date)
182
)
183
```
184
185
However, it's a bit cumbersome to have to type `is_on_or_between()`, `start_date` and `end_date` every time so we provide `is_during()` as a convenient shorthand:
186
```python
187
rx_in_interval = medications.where(
188
    medications.date.is_during(INTERVAL)
189
)
190
```
191
192
### Disclosure control
193
194
By default, numerators and denominators are subject to disclosure control.
195
First, values less than or equal to seven are replaced with zero (suppressed);
196
then, values are rounded to the nearest five.
197
198
We disabled disclosure control with this line here:
199
200
```python
201
measures.configure_disclosure_control(enabled=False)
202
```
203
204
However, we should carefully consider whether we wish to disable disclosure control when
205
running inside of the secure environment. If we don't, then we should remove that line.
206
207
## Additional notes
208
209
### Multiple measures
210
211
We're not restricted to just one measure per file. If we wanted to add
212
another measure using the same numerator and denominator but this time
213
broken down by age band, then we would start by defining an age band variable
214
like this:
215
```python
216
age = patients.age_on(INTERVAL.start_date)
217
age_band = case(
218
    when((age >= 0) & (age < 20)).then("0-19"),
219
    when((age >= 20) & (age < 40)).then("20-39"),
220
    when((age >= 40) & (age < 60)).then("40-59"),
221
    when((age >= 60) & (age < 80)).then("60-79"),
222
    when(age >= 80).then("80+"),
223
)
224
```
225
226
And then we can define a new measure exactly like the previous one but
227
with different `name` and using `age_band` in our `group_by` argument:
228
229
```python
230
measures.define_measure(
231
    name="atorva_80_by_age",
232
    numerator=atorvastatin_80_rx.exists_for_patient(),
233
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
234
    group_by={
235
        "age_band": age_band
236
    },
237
    intervals=months(6).starting_on("2022-01-01"),
238
)
239
```
240
241
242
### Removing duplication
243
244
As our second measure shares so much in common with our first it seems
245
redundant to have to specify the numerator, denominator and intervals
246
all over again. Furthermore, if we ever wanted to make changes to these
247
definitions we'd need to make the change in multiple places to keep them
248
consistent.
249
250
To avoid this we can use
251
[`measures.define_defaults()`](../reference/language.md#Measures.define_defaults)
252
to set values which we know are going to be common between all the
253
measures in our file:
254
```python
255
measures.define_defaults(
256
    numerator=atorvastatin_80_rx.exists_for_patient(),
257
    denominator=atorvastatin_rx.exists_for_patient() & has_recorded_sex,
258
    intervals=months(6).starting_on("2022-01-01"),
259
)
260
```
261
262
And then we can define our two measures using:
263
```python
264
measures.define_measure(
265
    name="atorva_80",
266
    group_by={
267
        "sex": patients.sex
268
    },
269
)
270
271
measures.define_measure(
272
    name="atorva_80_by_age",
273
    group_by={
274
        "age_band": age_band
275
    },
276
)
277
```
278
279
280
### Grouping by multiple features
281
282
The above example defines two separate measures, one grouping by sex and
283
the other by age band. But it is also possible to define a single
284
measure that groups by sex _and_ age band simultaneously:
285
```python
286
measures.define_measure(
287
    name="atorva_80_by_age_and_sex",
288
    group_by={
289
        "sex": patients.sex,
290
        "age_band": age_band,
291
    },
292
)
293
```
294
295
This will produce a row of output for each possible combination of sex
296
and age band: `male,0-19` `female,0-19`, `male,20-39`, `female,20-39`
297
and so on for all ten combinations.
298
299
As we defined our measure to cover monthly intervals over a six month
300
period, this means that this single measure will produce 60 rows – ten
301
for each month.
302
303
304
### Dummy data
305
306
When run outside of the secure environment ehrQL will generate dummy
307
data for measures just as it does with datasets. However, note that
308
generating meaningful results for measures defined over a large time
309
period and with many different groupings can require significantly more
310
dummy data than would be needed for a dataset definition. The measures
311
framework will make a crude attempt to guess how many dummy patients it
312
should generate, but you may need to adjust this number using the
313
[`measures.configure_dummy_data()`](../reference/language.md#Measures.configure_dummy_data)
314
method.
315
316
For more information about using dummy data with measures please see our
317
[how-to guide](../how-to/dummy-measures-data.md).