--- a +++ b/docs/how-to/errors.md @@ -0,0 +1,883 @@ +## ehrQL error messages + +If an error is found in your dataset definition, +ehrQL will stop running and give you an error message. +ehrQL error messages are shown as a [Python error report, known as a "traceback"](https://realpython.com/python-traceback/). + +:notepad_spiral: The error messages are from Python because ehrQL runs in Python. + +These error messages can be confusing to read, +but they also give you lots of information to use to debug +and fix your dataset definition. + +### Example error message + +Let's look at an example of an error report: + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 7, in <module> + dataset._age = age + ^^^^^^^^^^^^^ +AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age') +``` + +* The *traceback* tells you what code actually caused the error. + The traceback shows both the filename + and where the error occurred in the file. +* There is an error message at the end. + The error message shows what kind of error occurred + (here, an `AttributeError`), + followed by details of what the problem is. + +## How to use this page + +### Structure of this page + +For each error, there is: + +1. a simple code example that causes the error +1. the error details +1. the simple code example modified to fix the error + +### Finding an error on this page + +If you are working with ehrQL, +and encounter an error, +this page may help you. + +Because of the included code examples and errors, +this is a long page. + +Here are some tips on narrowing down the search. + +#### Using the table of contents + +Skim the table of contents navigation bar on the right-hand side of this page +to see if any of the general descriptions of errors apply +to what you are trying to do. 
+ +#### Using your browser's "Find text in page" feature + +Use the "Find text in page" feature of your browser +to search for parts of the error report. +Let's look at the example given above again: + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 7, in <module> + dataset._age = age + ^^^^^^^^^^^^^ +AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age') +``` + +The first part of this traceback depends on the specific code that has been written here. +It shows: + +* the name of the file: + `dataset_definition.py` stored in the `analysis` directory +* the line number in the file causing the error: + line 7 +* the line of code causing the error + +All of these will vary depending on the code being run. +These are useful to point you to where your error is. + +However, they are less useful to search for on this page, +because this part of the error report will vary. +What *will* stay more constant is the final error message. +Searching in this page for parts of that line, +for example `AttributeError` or `Variable names must start with a letter`, +may show you the relevant error. + +:warning: This page covers many of the common ehrQL errors you may see, +but is not an exhaustive list. + +:warning: Notice that even the error message may contain references to the precise code. +In this example: `you defined a variable '_age'`. + +:question: Can you find the part of this page that does explain this error? + + +## Python syntax errors + +These can occur because Python has its own syntactic rules +that ehrQL code must also adhere to. + +### Code indentation error + +Python has particular rules about indentation. +If a dataset definition contains indentation errors, +the error message will tell you about them. +For example, there is an indentation error in the following dataset definition. 
+ +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age = patients.age_on("2023-01-01") + dataset.define_population(dataset.age > 16) # This line has incorrect indentation. +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Error loading file 'analysis/dataset_definition.py': + + File "/workspace/analysis/dataset_definition.py", line 6 + dataset.define_population(dataset.age > 16) +IndentationError: unexpected indent +``` + +The error message tells us that there is an indentation error, and also the line that +the error occurred on. + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age = patients.age_on("2023-01-01") +dataset.define_population(dataset.age > 16) # This line now has correct indentation. +``` + +### Forbidden feature names + +Python has constraints on allowed variable names, which also apply to the names of dataset features. +For example, a name — `age!` — with a non-alphanumeric character is invalid: + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age! = patients.age_on("2023-01-01") # age! is an invalid feature name. +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Error loading file 'analysis/dataset_definition.py': + + File "/workspace/analysis/dataset_definition.py", line 5 + dataset.age! = patients.age_on("2023-01-01") # age! is an invalid feature name. 
+ ^ +SyntaxError: invalid syntax +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age = patients.age_on("2023-01-01") # We have changed the invalid feature name, "age!", to a valid one, "age". +``` + +## Common ehrQL errors + +These errors are specific to ehrQL, +rather than Python. + +### Forgetting to set a population + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age = patients.age_on("2023-01-01") +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +``` +A population has not been defined; define one with define_population() +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age = patients.age_on("2023-01-01") +dataset.define_population(dataset.age > 16) # Here we have now defined a population for the dataset. +``` + +### Invalid feature name: `population` is a reserved name + +There are a few constraints on feature names in ehrQL. 
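
The naming constraints that appear in the error messages on this page can be approximated with a small pre-check. This is only a sketch in plain Python: `is_valid_feature_name`, `VALID_NAME` and `RESERVED_NAMES` are hypothetical names that are not part of ehrQL, and the pattern assumes that "alphanumeric" means ASCII letters and digits, which may not match ehrQL's real check exactly.

```python
import re

# Assumption: a letter first, then ASCII letters, digits or underscores.
# This mirrors the wording of the error message, not ehrQL's actual code.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Names that this page shows being rejected as feature names.
RESERVED_NAMES = {"population", "variables"}


def is_valid_feature_name(name):
    """Return True if `name` looks usable as a dataset feature name."""
    if name in RESERVED_NAMES:
        return False
    return VALID_NAME.fullmatch(name) is not None
```

A check like this could be run over a list of planned feature names before writing the dataset definition; ehrQL itself remains the authority on which names it accepts.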
+ +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.population = patients.age_on("2023-01-01") > 16 +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 6, in <module> + dataset.population = patients.age_on("2023-01-01") > 16 + ^^^^^^^^^^^^^^^^^^ +AttributeError: Cannot set variable 'population'; use define_population() instead +``` + +#### Fixed dataset definition :heavy_check_mark: + +Define population with the `define_population` syntax: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.define_population(patients.age_on("2023-01-01") > 16) +``` + +Or rename the feature, if it is required as a separate output: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.over_16 = patients.age_on("2023-01-01") > 16 +``` + +### Invalid feature name: `variables` is a reserved name + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.variables = patients.age_on("2023-01-01") > 16 +... +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 5, in <module> + dataset.variables = patients.age_on("2023-01-01") > 16 + ^^^^^^^^^^^^^^^^^ +AttributeError: 'variables' is not an allowed variable name +``` +#### Fixed dataset definition :heavy_check_mark: + +Rename the feature to something other than `variables`. 
+ +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age_greater_than_16 = patients.age_on("2023-01-01") > 16 +... +``` + +### Invalid feature name: feature names must not start with underscores + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population(age > 16) +dataset._age = age +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 7, in <module> + dataset._age = age + ^^^^^^^^^^^^^ +AttributeError: Variable names must start with a letter, and contain only alphanumeric characters and underscores (you defined a variable '_age') +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population(age > 16) +dataset.age = age # _age feature renamed to remove the leading underscore. +``` + +### Re-defining a feature + +In the following dataset definition, `dataset.age` is first defined as `age` and then defined again as `age1`. 
+ +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2000-01-01") +age1 = patients.age_on("2023-01-01") +dataset.define_population(age > 16) +dataset.age = age +dataset.age = age1 +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 9, in <module> + dataset.age = age1 + ^^^^^^^^^^^ +AttributeError: 'age' is already set and cannot be reassigned +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2000-01-01") +age1 = patients.age_on("2023-01-01") +dataset.define_population(age > 16) +dataset.age = age +dataset.age1 = age1 # The second age feature now has a unique name on the dataset +``` + +### Undefined features + +A feature must be assigned to the dataset before it can be read from the dataset. In the following dataset definition, `age` has been +defined as a standalone variable, but the last line reads `dataset.age` without ever assigning a value to it: + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2000-01-01") +dataset.define_population(age > 16) +dataset.age +``` + +Run the dataset definition with: +``` +opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 8, in <module> + dataset.age +AttributeError: Variable 'age' has not been defined +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = 
patients.age_on("2000-01-01") +dataset.define_population(age > 16) +dataset.age = age # dataset.age is now defined +``` + +### Trying to set a feature that has more than one row per patient + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.tpp import practice_registrations + +dataset = create_dataset() +dataset.registered_on = practice_registrations.start_date +``` + +The `practice_registrations` table contains multiple rows per patient. + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 5, in <module> + dataset.registered_on = practice_registrations.start_date + ^^^^^^^^^^^^^^^^^^^^^ +TypeError: Invalid variable 'registered_on'. Dataset variables must return one row per patient +``` + +#### Fixed dataset definition :heavy_check_mark: + +To return the latest `registered_on` date, first sort the practice registrations table, find the +last registration for each patient, and *then* get the start date. 
+ +```python +from ehrql import create_dataset +from ehrql.tables.tpp import practice_registrations + +dataset = create_dataset() +latest_registration_per_patient = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient() +dataset.registered_on = latest_registration_per_patient.start_date +``` + +### Trying to set a feature to a row rather than a value + +In the following dataset definition, we have reduced the practice registrations to one row per patient, but +we have not selected a value as the feature: + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.tpp import practice_registrations + +dataset = create_dataset() +dataset.registered_on = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient() +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 5, in <module> + dataset.registered_on = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient() + ^^^^^^^^^^^^^^^^^^^^^ +TypeError: Invalid variable 'registered_on'. Dataset variables must be values not whole rows +``` + +Fix the dataset definition by setting the feature to a single value, in this case, `start_date`. + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.tpp import practice_registrations + +dataset = create_dataset() +latest_registration_per_patient = practice_registrations.sort_by(practice_registrations.start_date).last_for_patient() +dataset.registered_on = latest_registration_per_patient.start_date +``` + +### Type errors in ehrQL expressions + +Many ehrQL comparisons require the elements being compared to be of the same type. 
+ +In the following dataset definition, `age` is an integer, but in the last line we +try to define the population by comparing age to the string `"10"`. + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population(age >= "10") +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 6, in <module> + dataset.define_population(age >= "10") + ^^^^^^^^^^^ +ehrql.query_model.nodes.TypeValidationError: GE.rhs requires 'ehrql.query_model.nodes.Series[int]' but got 'ehrql.query_model.nodes.Series[str]' +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population(age >= 10) # age is now being compared to the integer 10 +``` + +### Invalid keywords "and", "or", "not" + +In normal Python, logical operations can be performed using the keywords `and`, `or` and `not`. In ehrQL, +these keywords are not supported and will raise an error. + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population((age >= 16) and (age <= 80)) +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 6, in <module> + dataset.define_population((age >= 16) and (age <= 80)) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^ +TypeError: The keywords 'and', 'or', and 'not' cannot be used with ehrQL, please use the operators '&', '|' and '~' instead. +(You will also see this error if you try use a chained comparison, such as 'a < b < c'.) 
+``` + +#### Fixed dataset definition :heavy_check_mark: + +As described in the error message, use the operator `&` instead: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population((age >= 16) & (age <= 80)) +``` + +### Chaining comparisons + +Chained comparisons are not allowed in ehrQL. + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population(16 < age <= 80) +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 6, in <module> + dataset.define_population(16 < age <= 80) + ^^^^^^^^^^^^^^ +TypeError: The keywords 'and', 'or', and 'not' cannot be used with ehrQL, please use the operators '&', '|' and '~' instead. +(You will also see this error if you try use a chained comparison, such as 'a < b < c'.) +``` + +#### Fixed dataset definition :heavy_check_mark: + +Split the chained comparison into two separate comparisons, combined with the operator `&`: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.define_population((age > 16) & (age <= 80)) # Equivalent to 16 < age <= 80 +``` + +### Trying to perform arithmetic operations with an integer column and a float constant + +In the following dataset, `age` is an integer. We cannot subtract a float from it. 
+ +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.age_minus_5 = age - 5.5 +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 6, in <module> + dataset.age_minus_5 = age - 5.5 + ~~~~^~~~~ +ehrql.query_model.nodes.TypeValidationError: Subtract.rhs requires 'ehrql.query_model.nodes.Series[int]' but got 'ehrql.query_model.nodes.Series[float]' +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +age = patients.age_on("2023-01-01") +dataset.age_minus_5 = age - 5 +``` + +### Calculate a date difference without specifying return units + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age_in_may = "2023-05-01" - patients.date_of_birth +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 5, in <module> + dataset.age_in_may = "2023-05-01" - patients.date_of_birth + ^^^^^^^^^^^^^^^^^^ +TypeError: Invalid variable 'age_in_may'. 
Dataset variables must be values not whole rows +``` + +To fix this error, specify the units of the date difference that you want in the feature: + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.age_in_may = ("2023-05-01" - patients.date_of_birth).years +``` + +### Trying to subtract/add constants to dates + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.date_at_age_16 = patients.date_of_birth + 16 +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 5, in <module> + dataset.date_at_age_16 = patients.date_of_birth + 16 + ~~~~~~~~~~~~~~~~~~~~~~~^~~~ +TypeError: unsupported operand type(s) for +: 'DatePatientSeries' and 'int' +``` + +ehrQL cannot add an integer to a date - it needs to know what sort of time unit +we are adding (days, months, years). 
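
Plain Python dates behave in a similar way, which may make this error easier to recognise. The following sketch uses only the standard library `datetime` module rather than ehrQL, so it illustrates the general idea and not ehrQL's own implementation; the variable names are chosen for illustration.

```python
from datetime import date, timedelta

date_of_birth = date(2000, 1, 1)

# Adding a bare integer is ambiguous (days? months? years?), so Python refuses.
message = None
try:
    date_of_birth + 16
except TypeError as error:
    message = str(error)

# Stating the unit explicitly removes the ambiguity.
sixteen_days_later = date_of_birth + timedelta(days=16)
```

In both plain Python and ehrQL, the remedy is to state the time unit explicitly instead of passing a bare number.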
+ +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset, years +from ehrql.tables.core import patients + +dataset = create_dataset() +dataset.date_at_age_16 = patients.date_of_birth + years(16) +``` + +### Incorrectly referencing a table column + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import clinical_events + +dataset = create_dataset() +first_event = clinical_events.sort_by(date).first_for_patient() +dataset.event_date = first_event.date +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 5, in <module> + first_event = clinical_events.sort_by(date).first_for_patient() + ^^^^ +NameError: name 'date' is not defined +``` + +#### Fixed dataset definition :heavy_check_mark: + +Columns must be specified as the table attribute: + +```python +from ehrql import create_dataset +from ehrql.tables.core import clinical_events + +dataset = create_dataset() +first_event = clinical_events.sort_by(clinical_events.date).first_for_patient() +dataset.event_date = first_event.date +``` + +### Specifying a default for `case` which is a different type to the values + +In the following dataset definition, two age groups are defined as integers (1 and 2). A default +value (for patients who don't fall into one of the categories) is defined as "unknown". This is +an error - any default value given for a case statement must be of the same type (or None). 
+ +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset, case, when +from ehrql.tables.core import patients + +dataset = create_dataset() + +age = patients.age_on("2023-01-01") +dataset.age_group = case( + when(age < 10).then(1), + when(age > 80).then(2), + otherwise="unknown", +) +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 7, in <module> + dataset.age_group = case( + ^^^^^ +ehrql.query_model.nodes.TypeValidationError: Case.default requires 'ehrql.query_model.nodes.Series[int] | None' but got 'ehrql.query_model.nodes.Series[str]' +``` + +#### Fixed dataset definition :heavy_check_mark: + +```python +from ehrql import create_dataset, case, when +from ehrql.tables.core import patients + +dataset = create_dataset() + +age = patients.age_on("2023-01-01") +dataset.age_group = case( + when(age < 10).then(1), + when(age > 80).then(2), + otherwise=0, +) +``` + +### Using `is_in` without a container + +#### Failing dataset definition :x: + +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() + +age = patients.age_on("2023-01-01") +dataset.age_30 = age.is_in(30) +``` + +#### Error + +```pytb +Traceback (most recent call last): + File "/workspace/analysis/dataset_definition.py", line 7, in <module> + dataset.age_30 = age.is_in(30) + ^^^^^^^^^^^^^ +ehrql.query_model.nodes.TypeValidationError: In.rhs requires 'ehrql.query_model.nodes.Series[collections.abc.Set[int]]' but got 'ehrql.query_model.nodes.Series[int]' +``` + +:notepad_spiral: This is also an error: + +```python +dataset.age_30_or_40 = age.is_in(30, 40) +``` + +#### Fixed dataset definition :heavy_check_mark: + +Arguments passed to `is_in` must be wrapped in a Python container: a set, list or tuple. +All of the following features defined with `is_in` are valid. 
+ +```python +from ehrql import create_dataset +from ehrql.tables.core import patients + +dataset = create_dataset() + +age = patients.age_on("2023-01-01") +dataset.age_30_list = age.is_in([30]) +dataset.age_30_or_40_set = age.is_in({30, 40}) +dataset.age_30_or_40_tuple = age.is_in((30, 40)) +```
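
If the values come from elsewhere in your code and may be either a single value or a collection, a small helper can wrap them before they reach `is_in`. This is a minimal sketch in plain Python: `as_container` is a hypothetical helper, not part of ehrQL, and it assumes that strings should be treated as single values, not as collections of characters.

```python
from collections.abc import Iterable


def as_container(values):
    """Return `values` as a list, wrapping a single value if necessary."""
    # Strings and bytes are iterable, but here they count as single values.
    if isinstance(values, (str, bytes)) or not isinstance(values, Iterable):
        return [values]
    return list(values)
```

Under these assumptions, `as_container(30)` could be passed to `is_in` in place of the bare `30` from the failing example.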