--- a +++ b/tests/generative/README.md @@ -0,0 +1,424 @@ +# Generative Tests Overview + +The generative (property-based) tests use Hypothesis to generate query model variable definitions. + +There are detailed docstrings in the code, and a description of how to run the tests in the +[developer docs](../../DEVELOPERS.md#generative-tests). This document is an attempt to +provide an overview and introduction to what the tests are doing. + +There is one main test, `test_query_model` in [test_query_model.py](test_query_model.py). This +generates a variable definition, executes it using the MSSQL, SQLite and in-memory query engines, +and checks that the results are the same. + +Note that because the inputs are automatically generated, we can't test that the results are +*correct*, but we rely on other (unit, integration, acceptance) tests to check the exact outputs +produced by the query engines, and we assume that the engines are sufficiently different in +implementation that it is a sufficient test to check that they always produce the same results +from the same inputs. + +## Background + +The most difficult part of the generative tests is teaching Hypothesis how to construct queries +out of query model objects. The query model has various constraints or validation rules which tell +you, given a particular object, what sort of operations you can apply to it and what other kinds of +object you can combine it with. + +One way of trying to produce valid queries is just to generate lots of different objects, try sticking +them together and reject any structures that are invalid. This is what we did at first. It has the benefit +of being simple, but as the number of different types in the query model grows it becomes harder and harder +to randomly generate valid examples and eventually it stops working altogether. + +The approach we take now involves, effectively, applying the query model validation rules backwards. We start +by deciding on the sort of object we want to end up with e.g. a Series of one-row-per-patient integers. Then +we ask, given we've got this object, what sorts of operation could validly produce an object like that? We +then get Hypothesis to pick one. That operation in turn requires other inputs and so, again, we ask what +sorts of operation could validly produce it and get Hypothesis to pick one. This process repeats until we +reach a "terminal node" i.e. an operation which doesn't require any inputs. (And if Hypothesis doesn't +naturally give us a terminal node after reaching a certain depth, we force it to choose one.) + +This is very efficient at generating large and complex queries for testing. But applying the validation rules +backwards is fundamentally quite a mind-stretching exercise so don't be surprised if it takes a little while +for everything to fall into place. + +## Terminology + +We are used to thinking about query model things such as `Series` as representing a concrete column +in a table. The generative tests require a mental shift from thinking in terms of concrete objects to +thinking in terms of `strategies` - i.e. a `value` strategy is a recipe for generating a `Value`, +rather than a `Value` itself. + +### Building strategies + +Strategies for query model nodes are built in hypothesis by using +[`hypothesis.strategies.builds`](https://hypothesis.readthedocs.io/en/latest/data.html#hypothesis.strategies.builds) +or the [`hypothesis.strategies.composite`](https://hypothesis.readthedocs.io/en/latest/data.html#composite-strategies) +decorator. + +Calling `builds` with a callable and strategies for the callable's arguments +creates a new strategy that works by drawing the arguments and then +passing them to the callable +e.g. if we have defined strategies for choosing integers and strings, we can create a query model +`Value` strategy with: + +``` +from hypothesis import strategies as st + +integer_strategy = st.integers(min_value=0, max_value=10) +str_strategy = st.text(alphabet=["a", "b", "c"], min_size=0, max_size=3) +value = st.builds(Value, value=st.one_of(integer_strategy, str_strategy)) +``` + +`value` here is still a *strategy*, not a concrete thing. + +``` +>>> value +builds(Value, value=one_of(integers(min_value=0, max_value=10), text(alphabet=['a', 'b', 'c'], max_size=3))) +``` + +We can get an actual example by calling `example()` on a strategy: +``` +>>> value.example() +Value(value='ca') +``` + +(The `example()` method is intended to be used for exploration only, and not in tests or strategy +definitions. Hypothesis will complain if you try to do that.) + +If we need to reason about the examples being drawn, we can use the `composite` decorator. This gives us a magic `draw` argument that we can use to get examples out of a component strategy: + +``` +@st.composite +def value(draw, integer_strategy, str_strategy): + raw_value = draw(st.one_of(integer_strategy, str_strategy)) + if isinstance(raw_value, str): + # do something + ... + return Value(value=raw_value) +``` + +Note that the function we've written here creates one example `Value`, which we're calling in +the normal way, with a concrete `value` keyword argument. The `composite` decorator works by +taking a function like this which returns *one* example, and converting it into function that +returns a *strategy* that produces such examples. + +When we call `value()`, we get the function back: +``` +>>> value(integer_strategy, str_strategy) +value(integer_strategy=integers(min_value=0, max_value=10), str_strategy=text(alphabet=['a', 'b', 'c'], max_size=3)) +``` + +As it's a strategy, we can call example() on it to get an actual example: +``` +>>> value_st = value(integer_strategy, str_strategy) +>>> value_st.example() +Value(value=5) +``` + +### Value vs Series + +Note that in the query model, a `Value` is a type of `Series` which wraps a single static value +in a one-row-per-patient series which just has that value for every patient. The variable strategies +treat `Value` somewhat differently to other types of `Series`; the overall strategy for a `series` +selects from all possible query model nodes that return a `Series`, *except* `Value`. This is to +avoid generating a lot of examples that do not involve the database at all. + +It's import to remember that when a strategy for a node uses input arguments that are +`series` and `value` strategies, those *both* represent `Series` nodes of different types. + + +## test_query_model + +`test_query_model` is the main generative test, and takes a `variable` and `data`, both hypothesis +strategies. + +``` +@hyp.given(variable=variable_strategy, data=data_strategy) +@hyp.settings(**settings) +def test_query_model(query_engines, variable, data, recorder): + recorder.record_inputs(variable, data) # this is used to record and report on some helpful data about the tests + run_test(query_engines, data, variable, recorder) +``` + +## data strategy + +We define a simplified table schema that contains columns of various types. The [`data_setup`](data_setup.py) uses this schema to setup up a number of patient and event tables (2 of each), and +the [`data_strategy`](data_strategies.py) populates the tables with hypothesis-generated data +for each test. + +## value strategies + +The value strategies are strategies defining simple types; `int`, `bool`, `date`, `float` and `str`. + +``` +value_strategies = { + int: st.integers(min_value=0, max_value=10), + bool: st.booleans(), + datetime.date: st.dates( + min_value=datetime.date(2010, 1, 1), max_value=datetime.date(2020, 12, 31) + ), + float: st.floats(min_value=0.0, max_value=11.0, width=16, allow_infinity=False), + str: st.text(alphabet=["a", "b", "c"], min_size=0, max_size=3), +} +``` + +The same value strategies are used for defining both data strategies and variable (query model) +strategies. Generally, we try to define these values with narrow enough ranges that they will +sometimes overlap, and we can test equality. e.g. if we are testing an addition operation that +takes 2 ints, we can be reasonably sure that `st.integers(min_value=0, max_value=10)` will +test addition of two ints that are the same at some point. + +## variable strategies + +The variable strategies are the most complex part of ehrQL's generative test strategies. + +A variable is defined by calling `variable()` in [`variable_strategies.py`](variable_strategies.py), +with the tables, schema and value strategies as described above. + +`variable` defines lots of inner functions, each of which returns a strategy for creating the +thing it is named for, not the thing itself. +For example, the `value` inner function returns a strategy for creating `Value` objects, and not a `Value` object itself. + +A valid variable is a `Series`, chosen by selecting a type (one of the types in `value_strategies`) and a frame +that the variable must be consistent with (with one row per patient, because we require that any +variable on a dataset represents one row per patient). The variable returned will be a +series of the chosen type, consistent with the chosen frame. + +``` +def variable(patient_tables, event_tables, schema, value_strategies): + ... + + @st.composite + def valid_variable(draw): + type_ = draw(any_type()) + frame = draw(one_row_per_patient_frame()) + return draw(series(type_, frame)) + + return valid_variable() +``` + +Let's take a closer look at the elements of the `valid_variable`. + +`type_` is a type, drawn using the `any_type()` strategy, that chooses one of the types used as the keys for `value_strategies`: +``` + def any_type(): + return st.sampled_from(list(value_strategies.keys())) +``` + +`frame` is a frame with one row per patient, drawn using the `one_row_per_patient_frame()` +strategy. That could be a simple patient table (picked from one of the two +`SelectPatientTable` nodes defined by our data strategies), or a `PickOneRowPerPatient` +node, which is the result of a number of sorting and filtering operations. + +Each of the sort and filter operations are themselves defined by strategies which require a +series to sort/filter on, so a frame strategy can quickly become deeply nested. The nesting +can potentially become so deep that it exceeds Hypothesis' max allowed depth, resulting in the +generated example being discarded; to prevent this, various strategies check the depth, and if +it's gone too far, they return a strategy that will pick a "terminal" node; i.e. one that we know +won't recurse any more. In the case of a one-row-per-patient frame, that is a `SelectPatientTable` +node, which can only take one of the two patient tables. + +``` + def one_row_per_patient_frame(): + if depth_exceeded(): + return select_patient_table() + return st.one_of(select_patient_table(), pick_one_row_per_patient_frame()) +``` + +### The `series` strategy + +The `series` strategy is where things become particularly complex! + +`series()` is a strategy for choosing another strategy. Specifically, it chooses a strategy +that will generate a particular type of `Series` node. Whenever a series is needed, we call `series()` +passing in the type of the series that we want to generate, and a frame that it should be consistent +with. The `series()` strategy does the job of finding all the available strategies, and choosing one. + +>**NOTE** +> +>The `type_` and `frame` arguments to `series()` can be thought of as describing properties of the +>series we want to generate from the strategy. So when we pass in `int` as the `type_` and a patient +>frame, it means that we expect to generate an int-type series that has one row per patient. The way +>that series is generated (and whether it uses the `type_` and `frame` arguments itself) is determined +>by the specific strategy that `series()` chooses. + + +Within `series()`, we define `series_constraints`; this is a dict of all possible strategies for +operations that produce a series. The keys are strategy callables. The values are 2-tuples representing the possible return types of the series that will be generated by that strategy, +and the possible domains of the frame it can be drawn from. + +Frame domains indicate whether a particular `Series` node needs to be drawn from a patient table (i.e.one-row-per-patient), a non-patient table, or either. + +The `series()` strategy chooses from the possible operation strategies that meet the constraints of +the `type_` and `frame` passed into it. + +### A simple example: count + +For example, if we only had strategies for `exists` and `count`, and we call `series(int, patient_table)`, +where `patient_table` has been chosen via `one_row_per_patient_frame()`: +``` + def series(type_, frame): + ... + # define contraints for possible series strategies + series_constraints = { + exists: ({bool}, DomainConstraint.PATIENT), + count: ({int}, DomainConstraint.PATIENT), + } + ... + def constraints_match(s): + ... + + # find possible series strategies that match constraints, and choose one + possible_series = [s for s in series_types if constraints_match(s)] + series_strategy = draw(st.sampled_from(possible_series)) + + # draw a series from the chosen strategy + return draw(series_strategy(type_, frame)) + +``` +`exists` returns a bool series with one row per patient. + +`count` returns an int series with one row per patient. + +The `frame` we've passed in matches the domain constriants for both strategies (it's a +one-row-per-patient frame). However, `exists` produces a bool series, and we need a strategy +that produces an int series, so `count` is the only possible strategy that matches. `series()` +will draw from the `count` strategy. + +Note that the `type_` and `frame` are always passed on as arguments to the selected series +strategy, so that all series operation strategies have the same function signature. +However, whether they are actually used depends on the individual strategy. + +For individual series strategies, it's important to remember that the `type_` argument they +receive represents the type of the **resulting series**; it may be important for that particular +series node that we know what that resulting type is +(e.g. see the example of [`Add`](#a-more-complex-example-add) below). + +Following through to the `count` strategy, called by our call to `series(int, frame)`: + +``` + def count(_type, _frame): + return st.builds(AggregateByPatient.Count, any_frame()) +``` + +`count` is passed a type (int) from the `series()` strategy; this is the expected return +type of the series, but it's not required here; it's also not necessary to use the +one-row-per-patient frame we passed in. (By convention, the arguments are named with leading +underscores to indicate this). + +`count` generates an `AggregateByPatient.Count` node which can drawn from either patient-level +or an event-level frame, so we let Hypothesis choose one to use as the input to +`AggregateByPatient.Count`. (Note that the `series` returned *from* `AggregateByPatient.Count` +will be a one-row-per-patient series, consistent with `frame`.) + + +### A more complex example: add + +If we now assume we also have a stratgey for `add`, and we again call `series(int, frame)`, +where `frame` has been chosen via `one_row_per_patient_frame()`: +``` + def series(type_, frame): + ... + # define contraints for possible series strategies + series_constraints = { + exists: ({bool}, DomainConstraint.PATIENT), + count: ({int}, DomainConstraint.PATIENT), + add: ({int, float}, DomainConstraint.PATIENT)) + } + ... +``` +`add` returns either an int or a float series, with one row per patient. + +This time, the `frame` we've passed in again matches the domain constriants for all strategies. +We need a strategy that produces an int series, so now `series` can select from the `count` and +`add` strategies. Let's assume it chooses `add`. + +``` + def add(type_, frame): + ... +``` + +`add` is passed a type (int) from the `series()` strategy; this is the **expected return +type** of the series. In the case of the `add` strategy, this IS important. An `add` strategy +produces an `Add` query model node, which takes two series as arguments, and returns another series. +It can return an int or a float, and the aruments can be either int or float, BUT they all must be +the same. + +This is the definition of `Add` in [`query_model.nodes`](ehrql/query_model/nodes.py) +``` +class Function: + + class Add(Series[Numeric]): + lhs: Series[Numeric] + rhs: Series[Numeric] +``` + +A numeric type `Series` can be float or int, but both arguments to `Add` must be of the same type, +and return another `Series` of that type. In other words, if the expected return type is an int, we +know that we need to build our `add` strategy with int inputs. + +So, for each of the lhs and rhs of the `Add` operation, we need a Hypothesis strategy that +produces an int `Series`. We can do that with `series(type_, frame)`. + +However, there's a bit more to consider here. As mentioned before, a `Value` is also a type of +`Series`. `Value`s represent constants, and in order to cover the case where the `rhs` or `lhs` is +a constant, we need our strategy to allow a `Value` to be chosen as well. In addition, we have a +`frame` passed to the `add` strategy, but the `lhs` and `rhs` arguments can be chosen from different +tables, so we need our strategy to allow for this too. + +So, we have two arguments; we say that one of them MUST be an int series drawn from the provided frame. This ensures that we don't end up with operations on two `Value` series, which don't touch +the database. + +For the second argument, this could be a series drawn from the same frame as the first series, or +from a different frame, OR it could be a value. + +And finally, these arguments could be either the `lhs` or the `rhs` of the `Add` operation, so we +need the strategy to choose that too. + +This logic is implemented in the `binary_operation_with_types()` helper method, which deals with the +many query model nodes operate on two inputs, a `lhs` and `rhs` `Series`. + +### Adding a new variable strategy + +Let's assume we want to add the `GT` node as a new variable strategy. + +First, look at how the `GT` node is implemented: + +``` +class Function: + class GT(Series[bool]): + lhs: Series[Comparable] + rhs: Series[Comparable] +``` + +A `GT` node takes two `Comparable` series as inputs, and returns a boolean series. A `Comparable` +series can be of any type except bool, but the `lhs` and `rhs` series must be of the same type. + +We can add a new entry to the `series_constraints` dict in `series()`, defining the return type of +the series (bool), and the domain constraint. +A `GT` operation can return a series consistent with *either* a one-row-per-patient frame, or a many-rows-per-patient frame: + +``` + def series(type_, frame): + ... + # define contraints for possible series strategies + series_constraints = { + exists: ({bool}, DomainConstraint.PATIENT), + count: ({int}, DomainConstraint.PATIENT), + gt: ({bool}, DomainContstraint.ANY), + } + ... +``` + +Next we define the `gt` strategy. We can use a helper function `any_comparable_type()` to draw a +suitable type. Note that the `_type` passed in to the `gt` function is the expected return type of the series (bool), and in this case it does not have an impact on the implementation of the `gt` +strategy. + +This is a binary operation like many other nodes, so the implementation follows the same strategy as +described for the `add` strategy above. + +``` + @st.composite + def gt(draw, _type, frame): + type_ = draw(any_comparable_type()) + return draw(binary_operation(type_, frame, Function.GT)) +```