You can either read this page from start to end
to get an idea of the kinds of queries you can make with ehrQL.
Or you can use the navigation bar at the top-right of this page,
to see a list of the examples,
and then jump to a specific example of interest.
The examples are organised firstly by the table which they pull data from -
for a more complete guide to the tables, refer to the
Table Schemas section of the
ehrQL documentation.
define_population()
In each of these examples,
we specify that the population is all patients
via dataset.define_population(patients.exists_for_patient())
.
In practice,
you will likely want to adapt an example to filter to a specific population of interest.
Refer to the define_population()
documentation.
codelist_from_csv()
⚠️ Some examples refer to CSV codelists using the
codelist_from_csv
function,
but are incomplete.
To actually use these code example,
you will need to correctly complete the function call.
The codelists are not provided as a part of these examples.
For example, instead of:
asthma_codelist = codelist_from_csv("XXX", column="YYY")
you will need a line more like:
asthma_codelist = codelist_from_csv("your-asthma-codelist.csv", column="code")
which provides the filename your-asthma-codelist.csv
and the name of the CSV column with codes.
Some codelists will have a category column that groups individual codes into categories. For example, this codelist for ethnicity has 2 category columns, which represent categories at both 6 and 16 levels. To make use of these categories, you can use codelist_from_csv()
as follows:
ethnicity_codelist = codelist_from_csv("ethnicity_codelist_with_categories", column="snomedcode", category_column="Grouping_6")
If you include an argument for category_column
, the codelist returned will be a dictionary mapping individual codes to their respective categories. Without the category_column
argument, the codelist returned will be a list of codes.
You can see an example of how to access these categories within your dataset definition below.
Examples for the patients table.
from ehrql import create_dataset
from ehrql.tables.core import patients
dataset = create_dataset()
dataset.sex = patients.sex
dataset.define_population(patients.exists_for_patient())
The possible values are "female", "male", "intersex", and "unknown".
from ehrql import create_dataset
from ehrql.tables.core import patients
dataset = create_dataset()
dataset.date_of_birth = patients.date_of_birth
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.core import patients
dataset = create_dataset()
dataset.age = patients.age_on("2023-01-01")
dataset.define_population(patients.exists_for_patient())
Alternatively, using a native Python date
:
from datetime import date
from ehrql import create_dataset
from ehrql.tables.core import patients
dataset = create_dataset()
dataset.age = patients.age_on(date(2023, 1, 1))
dataset.define_population(patients.exists_for_patient())
Or using an index_date
variable:
from ehrql import create_dataset
from ehrql.tables.core import patients
index_date = "2023-01-01"
dataset = create_dataset()
dataset.age = patients.age_on(index_date)
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, case, when
from ehrql.tables.core import patients
dataset = create_dataset()
age = patients.age_on("2023-01-01")
dataset.age_band = case(
when(age < 20).then("0-19"),
when(age < 40).then("20-39"),
when(age < 60).then("40-59"),
when(age < 80).then("60-79"),
when(age >= 80).then("80+"),
otherwise="missing",
)
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.core import patients
dataset = create_dataset()
dataset.date_of_death = patients.date_of_death
dataset.define_population(patients.exists_for_patient())
:notepad_spiral: This value comes from the patient's EHR record. You can find more information about the accuracy of this value in the reference schema.
Examples for the ons_deaths table.
from ehrql import create_dataset
from ehrql.tables.core import ons_deaths, patients
dataset = create_dataset()
dataset.date_of_death = ons_deaths.date
dataset.underlying_cause_of_death = ons_deaths.underlying_cause_of_death
dataset.cause_of_death = ons_deaths.cause_of_death_01
dataset.define_population(patients.exists_for_patient())
:notepad_spiral: There are currently multiple cause of death fields. We aim to resolve these to a single feature in the future.
The ons_deaths
table has multiple "cause of death" fields. Using the
cause_of_death_is_in()
method we can match a codelist against all of these at once.
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import ons_deaths, patients
dataset = create_dataset()
cause_of_death_X_codelist = codelist_from_csv("XXX", column="YYY")
dataset.died_with_X = ons_deaths.cause_of_death_is_in(cause_of_death_X_codelist)
dataset.define_population(patients.exists_for_patient())
Examples for the TPP addresses table.
from ehrql import create_dataset
from ehrql.tables.tpp import addresses, patients
dataset = create_dataset()
dataset.imd = addresses.for_patient_on("2023-01-01").imd_rounded
dataset.define_population(patients.exists_for_patient())
The original IMD ranking is rounded to the nearest 100.
The rounded IMD ranking ranges from 0 to 32,800.
See this code comment about how we choose one address if a patient has multiple registered addresses on the given date.
from ehrql import create_dataset
from ehrql.tables.tpp import addresses, patients
dataset = create_dataset()
patient_address = addresses.for_patient_on("2023-01-01")
dataset.imd_quintile = patient_address.imd_quintile
dataset.imd_decile = patient_address.imd_decile
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.tpp import addresses, patients
dataset = create_dataset()
dataset.rural_urban = addresses.for_patient_on("2023-01-01").rural_urban_classification
dataset.define_population(patients.exists_for_patient())
The meaning of this value is as follows:
from ehrql import create_dataset
from ehrql.tables.tpp import addresses, patients
dataset = create_dataset()
dataset.msoa_code = addresses.for_patient_on("2023-01-01").msoa_code
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.tpp import addresses, patients
dataset = create_dataset()
address = addresses.for_patient_on("2023-01-01")
dataset.imd_rounded = address.imd_rounded
dataset.rural_urban_classification = address.rural_urban_classification
dataset.msoa_code = address.msoa_code
dataset.define_population(patients.exists_for_patient())
Examples for the practice_registrations table.
from ehrql import create_dataset
from ehrql.tables.tpp import practice_registrations, patients
dataset = create_dataset()
dataset.practice = practice_registrations.for_patient_on("2023-01-01").practice_pseudo_id
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.tpp import practice_registrations, patients
dataset = create_dataset()
dataset.stp = practice_registrations.for_patient_on("2023-01-01").practice_stp
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.tpp import practice_registrations, patients
dataset = create_dataset()
dataset.region = practice_registrations.for_patient_on("2023-01-01").practice_nuts1_region_name
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset
from ehrql.tables.tpp import practice_registrations, patients
dataset = create_dataset()
registration = practice_registrations.for_patient_on("2023-01-01")
dataset.practice = registration.practice_pseudo_id
dataset.stp = registration.practice_stp
dataset.region = registration.practice_nuts1_region_name
dataset.define_population(patients.exists_for_patient())
The following example ensures that the dataset only includes patients registered at a
single practice for the entire duration of the study, plus at least 3 months prior to the
study start.
from ehrql import create_dataset, codelist_from_csv, months
from ehrql.tables.tpp import patients, practice_registrations
study_start_date = "2022-01-01"
study_end_date = "2022-12-31"
dataset = create_dataset()
# find registrations that exist for the full study period, and at least 3 months
# prior
registrations = (
practice_registrations.where(
practice_registrations.start_date.is_on_or_before(study_start_date - months(3))
)
.except_where(
practice_registrations.end_date.is_on_or_before(study_end_date)
)
)
dataset.define_population(registrations.exists_for_patient())
Examples for the clinical_events table.
Ethnicity can be defined using a codelist. There are a lot of individual codes that can used to indicate a patients' fine-grained ethnicity. To make analysis more manageable, ethnicity is therefore commonly grouped into higher level categories. Above, we described how you can import codelists that have a category column. You can use a codelist with a category column to map clinical event codes for ethnicity to higher level categories as in this example:
from ehrql import create_dataset
from ehrql.tables.core import clinical_events, patients
from ehrql import codelist_from_csv
dataset = create_dataset()
ethnicity_codelist = codelist_from_csv(
"ethnicity_codelist_with_categories",
column="snomedcode",
category_column="Grouping_6",
)
dataset.latest_ethnicity_code = (
clinical_events.where(clinical_events.snomedct_code.is_in(ethnicity_codelist))
.where(clinical_events.date.is_on_or_before("2023-01-01"))
.sort_by(clinical_events.date)
.last_for_patient()
.snomedct_code
)
dataset.latest_ethnicity_group = dataset.latest_ethnicity_code.to_category(
ethnicity_codelist
)
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.has_had_asthma_diagnosis = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.has_recent_asthma_diagnosis = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_between("2022-07-01", "2023-01-01")
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
The first_for_patient()
and last_for_patient()
methods can only be used on a sorted frame.
Frames can be sorted by calling the sort_by()
method with the column to sort the frame by.
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.first_asthma_diagnosis_date = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).sort_by(
clinical_events.date
).first_for_patient().date
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.last_asthma_diagnosis_date = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).sort_by(
clinical_events.date
).last_for_patient().date
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
hba1c_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
hba1c_events = clinical_events.where(
clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
)
earliest_min_hba1c_event = hba1c_events.sort_by(
clinical_events.numeric_value, clinical_events.date
).first_for_patient()
earliest_max_hba1c_event = hba1c_events.sort_by(
# Note the leading minus sign to sort numeric_value in reverse order
-clinical_events.numeric_value, clinical_events.date
).first_for_patient()
latest_min_hba1c_event = hba1c_events.sort_by(
# Note the leading minus sign to sort numeric_value in reverse order
-clinical_events.numeric_value, clinical_events.date
).last_for_patient()
latest_max_hba1c_event = hba1c_events.sort_by(
clinical_events.numeric_value, clinical_events.date
).last_for_patient()
dataset.date_of_first_min_hba1c_observed = earliest_min_hba1c_event.date
dataset.date_of_first_max_hba1c_observed = earliest_max_hba1c_event.date
dataset.date_of_last_min_hba1c_observed = latest_min_hba1c_event.date
dataset.date_of_last_max_hba1c_observed = latest_max_hba1c_event.date
dataset.value_of_first_min_hba1c_observed = earliest_min_hba1c_event.numeric_value
dataset.value_of_first_max_hba1c_observed = earliest_max_hba1c_event.numeric_value
dataset.value_of_last_min_hba1c_observed = latest_min_hba1c_event.numeric_value
dataset.value_of_last_max_hba1c_observed = latest_max_hba1c_event.numeric_value
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.first_asthma_diagnosis_code = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).sort_by(
clinical_events.date
).first_for_patient().snomedct_code
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.first_asthma_diagnosis_date = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).sort_by(
clinical_events.date
).first_for_patient().date
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
first_asthma_diagnosis = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).sort_by(
clinical_events.date
).first_for_patient()
dataset.first_asthma_diagnosis_code = first_asthma_diagnosis.snomedct_code
dataset.first_asthma_diagnosis_date = first_asthma_diagnosis.date
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
hba1c_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.mean_hba1c = clinical_events.where(
clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).numeric_value.mean_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.has_recent_asthma_diagnosis = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_between("2022-07-01", "2023-01-01")
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv, weeks
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
index_date = "2022-07-01"
dataset = create_dataset()
dataset.has_recent_asthma_diagnosis = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_between(index_date, index_date + weeks(2))
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv, months
from ehrql.tables.core import clinical_events, patients
diabetes_codelist = codelist_from_csv("XXX", column="YYY")
hba1c_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
first_diabetes_code_date = clinical_events.where(
clinical_events.snomedct_code.is_in(diabetes_codelist)
).sort_by(
clinical_events.date
).first_for_patient().date
dataset.count_of_hba1c_tests_6mo_post_first_diabetes_code = clinical_events.where(
clinical_events.snomedct_code.is_in(hba1c_codelist)
).where(
clinical_events.date.is_on_or_between(first_diabetes_code_date, first_diabetes_code_date + months(6))
).count_for_patient()
dataset.define_population(patients.exists_for_patient())
Data quality issues with many sources may result in events apparently happening in future dates (e.g. 9999-01-01), it is useful to filter these from your analysis.
from datetime import date
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.has_recent_asthma_diagnosis = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date > "2022-07-01"
).where(
clinical_events.date < date.today()
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
from datetime import date
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.year_of_first = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).sort_by(
clinical_events.date
).first_for_patient().date.year
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import clinical_events, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
asthma_review_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
first_asthma_diagnosis_date = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).sort_by(clinical_events.date).first_for_patient().date
first_asthma_review_date = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_review_codelist)
).where(
clinical_events.date.is_on_or_after(first_asthma_diagnosis_date)
).sort_by(clinical_events.date).first_for_patient().date
dataset.weeks_between_diagnosis_and_review = (first_asthma_review_date - first_asthma_diagnosis_date).weeks
dataset.define_population(patients.exists_for_patient())
Examples for the TPP apcs table.
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.tpp import apcs, patients
cardiac_diagnosis_codes = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.has_recent_cardiac_admission = apcs.where(
apcs.primary_diagnosis.is_in(cardiac_diagnosis_codes)
).where(
apcs.admission_date.is_on_or_between("2022-07-01", "2023-01-01")
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
Examples for the medications table.
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import medications, patients
statin_medications = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.has_recent_statin_prescription = medications.where(
medications.dmd_code.is_in(statin_medications)
).where(
medications.date.is_on_or_between("2022-07-01", "2023-01-01")
).exists_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import medications, patients
statin_medications = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.number_of_statin_prescriptions_in_last_year = medications.where(
medications.dmd_code.is_in(statin_medications)
).where(
medications.date.is_on_or_between("2022-01-01", "2023-01-01")
).count_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import medications, patients
statin_medications = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.first_statin_prescription_date = medications.where(
medications.dmd_code.is_in(statin_medications)
).where(
medications.date.is_on_or_after("2022-07-01")
).sort_by(
medications.date
).first_for_patient().date
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import medications, patients
statin_medications = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
dataset.last_statin_prescription_date = medications.where(
medications.dmd_code.is_in(statin_medications)
).where(
medications.date.is_on_or_after("2022-07-01")
).sort_by(
medications.date
).last_for_patient().date
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv
from ehrql.tables.core import medications, patients
amoxicillin_codelist = codelist_from_csv("XXX", column="YYY")
winter_months = [10,11,12,1,2,3]
dataset = create_dataset()
dataset.winter_amoxicillin_count = medications.where(
medications.dmd_code.is_in(amoxicillin_codelist)
).where(
medications.date.month.is_in(winter_months)
).count_for_patient()
dataset.define_population(patients.exists_for_patient())
from ehrql import create_dataset, codelist_from_csv, weeks
from ehrql.tables.core import clinical_events, medications, patients
asthma_codelist = codelist_from_csv("XXX", column="YYY")
inhaled_corticosteroid_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
first_asthma_diagnosis_date = clinical_events.where(
clinical_events.snomedct_code.is_in(asthma_codelist)
).where(
clinical_events.date.is_on_or_after("2022-07-01")
).sort_by(
clinical_events.date
).first_for_patient().date
dataset.first_asthma_diagnosis_date = first_asthma_diagnosis_date
dataset.count_ics_prescriptions_2wks_post_diagnosis = medications.where(
medications.dmd_code.is_in(inhaled_corticosteroid_codelist)
).where(
medications.date.is_on_or_between(first_asthma_diagnosis_date,first_asthma_diagnosis_date + weeks(2))
).count_for_patient()
dataset.define_population(patients.exists_for_patient())
Note that in these examples, the periods defined are illustrative only.
from ehrql import case, create_dataset, codelist_from_csv, when, weeks
from ehrql.tables.tpp import medications, patients, practice_registrations
def meets_registrations_criteria(medication_date):
# For this medication date, find whether a registration exists where
# the start date and end dates are within a 12 weeks
# prior/after to the prescription
start_cutoff_date = medication_date - weeks(12)
end_cutoff_date = medication_date + weeks(12)
return (
practice_registrations.where(
practice_registrations.start_date.is_on_or_before(start_cutoff_date)
)
.except_where(
practice_registrations.end_date.is_on_or_before(end_cutoff_date)
)
.exists_for_patient()
)
medication_codelist = codelist_from_csv("XXX", column="YYY")
dataset = create_dataset()
# First relevant prescription per patient
first_prescription = (
medications.where(
medications.dmd_code.is_in(medication_codelist)
)
.sort_by(medications.date)
.first_for_patient()
)
# Include only prescriptions that fall within accepatable registration dates
dataset.prescription_date = case(
when(meets_registrations_criteria(first_prescription.date))
.then(first_prescription.date)
)
dataset.define_population(patients.exists_for_patient())
Examples for the TPP decision support values table.
from ehrql import create_dataset
from ehrql.tables.tpp import decision_support_values
dataset = create_dataset()
latest_efi_record = (
decision_support_values
.electronic_frailty_index()
.sort_by(decision_support_values.calculation_date)
.last_for_patient()
)
dataset.latest_efi = latest_efi_record.numeric_value
dataset.latest_efi_date = latest_efi_record.calculation_date
dataset.define_population(decision_support_values.exists_for_patient())