ehrql / Git / Diff of /docs/tutorial/building-a-dataset/index.md

Models:
philipB/
ehrql
Downloads: 1
Diff of /docs/tutorial/building-a-dataset/index.md [000000] .. [e988c2]
Switch to side-by-side view

--- a
+++ b/docs/tutorial/building-a-dataset/index.md
@@ -0,0 +1,172 @@
+We can now move on to implementing some of the QOF business rules.
+
+In this tutorial, we'll look at rule **DM006**, which is about finding the percentage of patients on the register, with a diagnosis of clinical proteinuria or micro-albuminuria who are currently treated with angiotensin-converting enzymes (ACEs) or angiotensin II receptor blockers (ARBs).
+
+The QOF rules define "currently treated" to mean having a medication within 180 days of the index date.
+
+To do this, we'll needs some more codelists.
+The diagnosis codelists ([proteinuria][1] and [micro-albuminuria][2]) are published by QOF; the medication codeslists aren't, so we'll use ones developed for OpenSAFELY studies ([ACEs][3] and [ARBs][4]).
+
+## Medications
+
+We have already seen how to determine whether a patient has a clinical event matching a code in a codelist.
+Medications are recorded in a similar way to clinical events, and can be queried using the same kinds of ehrQL.
+
+```ehrql
+from ehrql import codelist_from_csv, days, show
+from ehrql.tables.core import clinical_events, medications
+
+
+index_date = "2024-03-31"
+
+proteinuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-prt_cod.csv", column="code")
+microalbuminuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-mal_cod.csv", column="code")
+ace_codes = codelist_from_csv("codelists/opensafely-ace-inhibitor-medications.csv", column="code")
+arb_codes = codelist_from_csv("codelists/opensafely-angiotensin-ii-receptor-blockers-arbs.csv", column="code")
+
+previous_events = clinical_events.where(clinical_events.date.is_on_or_before(index_date))
+recent_meds = medications.where(medications.date.is_on_or_between(index_date - days(180), index_date))
+
+has_proteinuria_diagnosis = (
+    previous_events
+    .where(clinical_events.snomedct_code.is_in(proteinuria_codes))
+    .exists_for_patient()
+)
+
+has_microalbuminuria_diagnosis = (
+    previous_events
+    .where(clinical_events.snomedct_code.is_in(microalbuminuria_codes))
+    .exists_for_patient()
+)
+
+has_arb_treatment = (
+    recent_meds
+    .where(medications.dmd_code.is_in(arb_codes))
+    .exists_for_patient()
+)
+
+has_ace_treatment = (
+    recent_meds
+    .where(medications.dmd_code.is_in(ace_codes))
+    .exists_for_patient()
+)
+
+show(
+    has_proteinuria_diagnosis,
+    has_microalbuminuria_diagnosis,
+    has_arb_treatment,
+    has_ace_treatment,
+)
+```
+
+Notice that we've added a couple of intermediate variables (`previous_events` and `recent_meds`) to reduce duplication in our code.
+
+## Datasets
+
+So far, we've been using `show()` to show us what our queries return when running against dummy data.
+
+But we can't use `show()` to extract data for a real study!
+
+Instead, we need to define a dataset.  As a reminder: a dataset is a new table containing one row per patient, and a dataset definition consists of a population definition and a set of column definitions.
+
+We then use `dataset = create_dataset()` to create a new dataset object, `dataset.define_population(...)` to define the population, and `dataset.column_name = ...` to define the columns:
+
+
+```ehrql
+from ehrql import create_dataset, codelist_from_csv, days, show
+from ehrql.tables.core import patients, practice_registrations, clinical_events, medications
+
+index_date = "2024-03-31"
+
+diabetes_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-dm_cod.csv", column="code")
+resolved_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-dmres_cod.csv", column="code")
+proteinuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-prt_cod.csv", column="code")
+microalbuminuria_codes = codelist_from_csv("codelists/nhsd-primary-care-domain-refsets-mal_cod.csv", column="code")
+ace_codes = codelist_from_csv("codelists/opensafely-ace-inhibitor-medications.csv", column="code")
+arb_codes = codelist_from_csv("codelists/opensafely-angiotensin-ii-receptor-blockers-arbs.csv", column="code")
+
+previous_events = clinical_events.where(clinical_events.date.is_on_or_before(index_date))
+recent_meds = medications.where(medications.date.is_on_or_between(index_date - days(180), index_date))
+
+aged_17_or_older = (index_date - patients.date_of_birth).years >= 17
+was_alive = patients.date_of_death.is_null() | (patients.date_of_death < index_date)
+was_registered = (
+    practice_registrations.where(practice_registrations.start_date <= index_date)
+    .except_where(practice_registrations.end_date < index_date)
+    .exists_for_patient()
+)
+
+last_diagnosis_date = (
+    previous_events
+    .where(clinical_events.snomedct_code.is_in(diabetes_codes))
+    .sort_by(clinical_events.date)
+    .last_for_patient()
+    .date
+)
+last_resolved_date = (
+    previous_events
+    .where(clinical_events.snomedct_code.is_in(resolved_codes))
+    .sort_by(clinical_events.date)
+    .last_for_patient()
+    .date
+)
+
+has_unresolved_diabetes = last_diagnosis_date.is_not_null() & (
+    last_resolved_date.is_null() | (last_resolved_date < last_diagnosis_date)
+)
+
+on_register = aged_17_or_older & was_alive & was_registered & has_unresolved_diabetes
+
+has_proteinuria_diagnosis = (
+    previous_events
+    .where(clinical_events.snomedct_code.is_in(proteinuria_codes))
+    .exists_for_patient()
+)
+
+has_microalbuminuria_diagnosis = (
+    previous_events
+    .where(clinical_events.snomedct_code.is_in(microalbuminuria_codes))
+    .exists_for_patient()
+)
+
+has_arb_treatment = (
+    recent_meds
+    .where(medications.dmd_code.is_in(arb_codes))
+    .exists_for_patient()
+)
+
+has_ace_treatment = (
+    recent_meds
+    .where(medications.dmd_code.is_in(ace_codes))
+    .exists_for_patient()
+)
+
+dataset = create_dataset()
+dataset.define_population(on_register)
+
+dataset.prt_or_mal = has_proteinuria_diagnosis | has_microalbuminuria_diagnosis
+dataset.ace_or_arb = has_arb_treatment | has_ace_treatment
+
+show(dataset)
+```
+
+Note that it is essential that the dataset you create is given the name `dataset`.
+
+> Question: what's the relationship between the number of rows in the dataset and the `on_register` series?
+>
+> Question: what happens if you try to set the dataset's population to something that is not a boolean patient series?
+>
+> Question: what happens if you try to set a dataset column to something that is not a patient series?
+>
+> Question: what happens if you try to reuse a name for a dataset's column?
+
+Next: [Using ehrQL as part of a study](../using-ehrql-as-part-of-a-study/index.md)
+
+!!! abstract "Feedback"
+    Had enough? That's not a problem, but if you could fill in this very short [feedback form][5]{:target="_blank"} we'd really appreciate it.
+
+[1]: https://www.opencodelists.org/codelist/nhsd-primary-care-domain-refsets/prt_cod/
+[2]: https://www.opencodelists.org/codelist/nhsd-primary-care-domain-refsets/mal_cod/
+[3]: https://www.opencodelists.org/codelist/opensafely/ace-inhibitor-medications/
+[4]: https://www.opencodelists.org/codelist/opensafely/angiotensin-ii-receptor-blockers-arbs/
+[5]: https://docs.google.com/forms/d/e/1FAIpQLSeouuTXPnwShAjBllyln4tl2Q52PMG_aUhpma4odpE2MmCngg/viewform