a b/tests/generative/README.md
# Generative Tests Overview

The generative (property-based) tests use Hypothesis to generate query model variable definitions.

There are detailed docstrings in the code, and a description of how to run the tests in the
[developer docs](../../DEVELOPERS.md#generative-tests). This document is an attempt to
provide an overview and introduction to what the tests are doing.

There is one main test, `test_query_model` in [test_query_model.py](test_query_model.py). This
generates a variable definition, executes it using the MSSQL, SQLite and in-memory query engines,
and checks that the results are the same.

Note that because the inputs are automatically generated, we can't test that the results are
*correct*. Instead, we rely on other (unit, integration, acceptance) tests to check the exact outputs
produced by the query engines, and we assume that the engines are sufficiently different in
implementation that checking they always produce the same results from the same inputs is a
sufficient test.

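The idea of comparing engines against each other, rather than against known answers, can be sketched in plain Python. This is only an illustration: the two "engines" below are hypothetical stand-ins that count events per patient in different ways, and `assert_engines_agree` is not a function from the test suite.

```python
from collections import defaultdict


def count_events_with_dict(events):
    """Count events per patient using a dict accumulator."""
    counts = defaultdict(int)
    for patient_id, _value in events:
        counts[patient_id] += 1
    return dict(counts)


def count_events_with_sort(events):
    """Count events per patient by sorting the ids and scanning runs."""
    counts = {}
    previous = None
    for patient_id in sorted(patient_id for patient_id, _value in events):
        if patient_id != previous:
            counts[patient_id] = 0
            previous = patient_id
        counts[patient_id] += 1
    return counts


def assert_engines_agree(engines, events):
    """Run every engine on the same input and fail if any result differs."""
    results = {name: engine(events) for name, engine in engines.items()}
    first_name, first_result = next(iter(results.items()))
    for name, result in results.items():
        assert result == first_result, f"{name} disagrees with {first_name}"
    return first_result


# Hypothetical engines and data, for illustration only.
engines = {"dict": count_events_with_dict, "sort": count_events_with_sort}
events = [(1, "a"), (2, "b"), (1, "c")]
agreed = assert_engines_agree(engines, events)
```

The check never states what the right answer is; it only fails when two independently written implementations disagree, which is the property the generative tests rely on.
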
## Background

The most difficult part of the generative tests is teaching Hypothesis how to construct queries
out of query model objects. The query model has various constraints or validation rules which tell
you, given a particular object, what sort of operations you can apply to it and what other kinds of
object you can combine it with.

One way of trying to produce valid queries is just to generate lots of different objects, try sticking
them together and reject any structures that are invalid. This is what we did at first. It has the benefit
of being simple, but as the number of different types in the query model grows it becomes harder and harder
to randomly generate valid examples and eventually it stops working altogether.

The approach we take now involves, effectively, applying the query model validation rules backwards. We start
by deciding on the sort of object we want to end up with, e.g. a Series of one-row-per-patient integers. Then
we ask, given we've got this object, what sorts of operation could validly produce an object like that? We
then get Hypothesis to pick one. That operation in turn requires other inputs and so, again, we ask what
sorts of operation could validly produce it and get Hypothesis to pick one. This process repeats until we
reach a "terminal node", i.e. an operation which doesn't require any inputs. (And if Hypothesis doesn't
naturally give us a terminal node after reaching a certain depth, we force it to choose one.)

This is very efficient at generating large and complex queries for testing. But applying the validation rules
backwards is fundamentally quite a mind-stretching exercise so don't be surprised if it takes a little while
for everything to fall into place.

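As a rough illustration of working backwards from the desired result type, here is a toy sketch. It uses the standard library's `random` module in place of Hypothesis, and the operation table, `MAX_DEPTH` and all function names are invented for this example; none of them come from the real strategies.

```python
import random

# Hypothetical toy operations: name -> (return type, argument types).
# Operations with no arguments are "terminal nodes".
OPERATIONS = {
    "int_column": (int, ()),
    "bool_column": (bool, ()),
    "add": (int, (int, int)),
    "greater_than": (bool, (int, int)),
    "negate": (bool, (bool,)),
}
MAX_DEPTH = 3


def generate(return_type, rng, depth=0):
    """Build a nested tuple describing an expression that returns `return_type`."""
    # Ask: which operations could validly produce this type? Once the depth
    # limit is reached, only terminal nodes remain eligible.
    candidates = [
        name
        for name, (result, arg_types) in OPERATIONS.items()
        if result is return_type and (depth < MAX_DEPTH or not arg_types)
    ]
    name = rng.choice(candidates)
    _, arg_types = OPERATIONS[name]
    # Each argument is generated the same way, one level deeper.
    return (name, *(generate(t, rng, depth + 1) for t in arg_types))


def is_valid(expression):
    """Check that every argument has the type its parent operation requires."""
    name, *children = expression
    _, arg_types = OPERATIONS[name]
    return len(children) == len(arg_types) and all(
        OPERATIONS[child[0]][0] is t and is_valid(child)
        for child, t in zip(children, arg_types)
    )


def depth_of(expression):
    """Number of levels in the expression tree."""
    return 1 + max((depth_of(child) for child in expression[1:]), default=0)


tree = generate(int, random.Random(0))
```

Because each step only ever picks from operations that can produce the required type, every generated tree is valid by construction and nothing has to be rejected afterwards.
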
## Terminology

We are used to thinking about query model things such as `Series` as representing a concrete column
in a table. The generative tests require a mental shift from thinking in terms of concrete objects to
thinking in terms of `strategies` - i.e. a `value` strategy is a recipe for generating a `Value`,
rather than a `Value` itself.

### Building strategies

Strategies for query model nodes are built in Hypothesis by using
[`hypothesis.strategies.builds`](https://hypothesis.readthedocs.io/en/latest/data.html#hypothesis.strategies.builds)
or the [`hypothesis.strategies.composite`](https://hypothesis.readthedocs.io/en/latest/data.html#composite-strategies)
decorator.

Calling `builds` with a callable and strategies for the callable's arguments
creates a new strategy that works by drawing the arguments and then
passing them to the callable.
For example, if we have defined strategies for choosing integers and strings, we can create a query model
`Value` strategy with:

```
from hypothesis import strategies as st

from ehrql.query_model.nodes import Value

integer_strategy = st.integers(min_value=0, max_value=10)
str_strategy = st.text(alphabet=["a", "b", "c"], min_size=0, max_size=3)
# Draw either an int or a str, then pass it to Value as the `value` argument
value = st.builds(Value, value=st.one_of(integer_strategy, str_strategy))
```

`value` here is still a *strategy*, not a concrete thing.

```
>>> value
builds(Value, value=one_of(integers(min_value=0, max_value=10), text(alphabet=['a', 'b', 'c'], max_size=3)))
```

We can get an actual example by calling `example()` on a strategy:

```
>>> value.example()
Value(value='ca')
```

(The `example()` method is intended to be used for exploration only, and not in tests or strategy
definitions. Hypothesis will complain if you try to do that.)

If we need to reason about the examples being drawn, we can use the `composite` decorator. This
gives us a magic `draw` argument that we can use to get examples out of a component strategy:

```
@st.composite
def value(draw, integer_strategy, str_strategy):
    # draw() turns a strategy into one concrete example
    raw_value = draw(st.one_of(integer_strategy, str_strategy))
    if isinstance(raw_value, str):
        # do something
        ...
    return Value(value=raw_value)
```

Note that the function we've written here creates one example `Value`, which we're calling in
the normal way, with a concrete `value` keyword argument. The `composite` decorator works by
taking a function like this which returns *one* example, and converting it into a function that
returns a *strategy* that produces such examples.

When we call `value()`, we get a strategy back, not a `Value`:

```
>>> value(integer_strategy, str_strategy)
value(integer_strategy=integers(min_value=0, max_value=10), str_strategy=text(alphabet=['a', 'b', 'c'], max_size=3))
```

As it's a strategy, we can call `example()` on it to get an actual example:

```
>>> value_st = value(integer_strategy, str_strategy)
>>> value_st.example()
Value(value=5)
```

### Value vs Series

Note that in the query model, a `Value` is a type of `Series` which wraps a single static value
in a one-row-per-patient series which just has that value for every patient. The variable strategies
treat `Value` somewhat differently to other types of `Series`; the overall strategy for a `series`
selects from all possible query model nodes that return a `Series`, *except* `Value`. This is to
avoid generating a lot of examples that do not involve the database at all.

It's important to remember that when a strategy for a node uses input arguments that are
`series` and `value` strategies, those *both* represent `Series` nodes of different types.

## test_query_model

`test_query_model` is the main generative test, and takes a `variable` and `data`, both drawn from
Hypothesis strategies.

```
@hyp.given(variable=variable_strategy, data=data_strategy)
@hyp.settings(**settings)
def test_query_model(query_engines, variable, data, recorder):
    # record and report on some helpful data about the tests
    recorder.record_inputs(variable, data)
    run_test(query_engines, data, variable, recorder)
```

## data strategy

We define a simplified table schema that contains columns of various types. The
[`data_setup`](data_setup.py) uses this schema to set up a number of patient and event tables
(2 of each), and the [`data_strategy`](data_strategies.py) populates the tables with
Hypothesis-generated data for each test.

## value strategies

The value strategies are strategies defining simple types: `int`, `bool`, `date`, `float` and `str`.

```
import datetime

from hypothesis import strategies as st

value_strategies = {
    int: st.integers(min_value=0, max_value=10),
    bool: st.booleans(),
    datetime.date: st.dates(
        min_value=datetime.date(2010, 1, 1), max_value=datetime.date(2020, 12, 31)
    ),
    float: st.floats(min_value=0.0, max_value=11.0, width=16, allow_infinity=False),
    str: st.text(alphabet=["a", "b", "c"], min_size=0, max_size=3),
}
```

The same value strategies are used for defining both data strategies and variable (query model)
strategies. Generally, we try to define these values with narrow enough ranges that they will
sometimes overlap, so that we can test equality. For example, if we are testing an addition operation
that takes 2 ints, we can be reasonably sure that `st.integers(min_value=0, max_value=10)` will
at some point produce two ints that are the same.

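To get a feel for why narrow ranges matter, the following back-of-the-envelope sketch estimates how likely it is that at least one generated pair of values is equal. It assumes values are drawn uniformly and independently, which Hypothesis does not promise, so the numbers are indicative only; the function name and the wide comparison range are invented for this example.

```python
from fractions import Fraction


def probability_of_equal_pair(range_size, n_examples):
    """Chance that at least one of `n_examples` independent pairs is equal,
    assuming both values are drawn uniformly from `range_size` options."""
    p_single = Fraction(1, range_size)
    return 1 - (1 - p_single) ** n_examples


# Ints in 0..10 give 11 possible values per draw.
narrow = probability_of_equal_pair(11, 100)
# A hypothetical 32-bit range, for comparison.
wide = probability_of_equal_pair(2**32, 100)
```

Under these assumptions, 100 examples from the narrow range almost certainly include an equal pair, while the same number of examples from the wide range almost certainly do not.
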
## variable strategies

The variable strategies are the most complex part of ehrQL's generative test strategies.

A variable is defined by calling `variable()` in [`variable_strategies.py`](variable_strategies.py),
with the tables, schema and value strategies as described above.

`variable` defines lots of inner functions, each of which returns a strategy for creating the
thing it is named for, not the thing itself.
For example, the `value` inner function returns a strategy for creating `Value` objects, and not a
`Value` object itself.

A valid variable is a `Series`, chosen by selecting a type (one of the types in `value_strategies`) and a frame
that the variable must be consistent with (with one row per patient, because we require that any
variable on a dataset represents one row per patient). The variable returned will be a
series of the chosen type, consistent with the chosen frame.

```
def variable(patient_tables, event_tables, schema, value_strategies):
    ...

    @st.composite
    def valid_variable(draw):
        type_ = draw(any_type())
        frame = draw(one_row_per_patient_frame())
        return draw(series(type_, frame))

    return valid_variable()
```

Let's take a closer look at the elements of `valid_variable`.

`type_` is a type, drawn using the `any_type()` strategy, which chooses one of the types used as the
keys of `value_strategies`:

```
    def any_type():
        return st.sampled_from(list(value_strategies.keys()))
```

`frame` is a frame with one row per patient, drawn using the `one_row_per_patient_frame()`
strategy. That could be a simple patient table (picked from one of the two
`SelectPatientTable` nodes defined by our data strategies), or a `PickOneRowPerPatient`
node, which is the result of a number of sorting and filtering operations.

Each of the sort and filter operations is itself defined by a strategy which requires a
series to sort/filter on, so a frame strategy can quickly become deeply nested. The nesting
can potentially become so deep that it exceeds Hypothesis's maximum allowed depth, resulting in the
generated example being discarded. To prevent this, various strategies check the depth, and if
it's gone too far, they return a strategy that will pick a "terminal" node, i.e. one that we know
won't recurse any more. In the case of a one-row-per-patient frame, that is a `SelectPatientTable`
node, which can only take one of the two patient tables.

```
    def one_row_per_patient_frame():
        if depth_exceeded():
            # too deep: only offer the terminal node
            return select_patient_table()
        return st.one_of(select_patient_table(), pick_one_row_per_patient_frame())
```

### The `series` strategy

The `series` strategy is where things become particularly complex!

`series()` is a strategy for choosing another strategy. Specifically, it chooses a strategy
that will generate a particular type of `Series` node. Whenever a series is needed, we call `series()`
passing in the type of the series that we want to generate, and a frame that it should be consistent
with. The `series()` strategy does the job of finding all the available strategies, and choosing one.

>**NOTE**
>
>The `type_` and `frame` arguments to `series()` can be thought of as describing properties of the
>series we want to generate from the strategy. So when we pass in `int` as the `type_` and a patient
>frame, it means that we expect to generate an int-type series that has one row per patient. The way
>that series is generated (and whether it uses the `type_` and `frame` arguments itself) is determined
>by the specific strategy that `series()` chooses.

Within `series()`, we define `series_constraints`; this is a dict of all possible strategies for
operations that produce a series. The keys are strategy callables. The values are 2-tuples representing
the possible return types of the series that will be generated by that strategy,
and the possible domains of the frame it can be drawn from.

Frame domains indicate whether a particular `Series` node needs to be drawn from a patient table
(i.e. one-row-per-patient), a non-patient table, or either.

The `series()` strategy chooses from the possible operation strategies that meet the constraints of
the `type_` and `frame` passed into it.

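The matching step can be pictured with a small standalone sketch. This is not the real `constraints_match` implementation: operation names stand in for the strategy callables, the `NON_PATIENT` member and the `is_patient_frame` flag are assumptions made for this example, and the table entries simply mirror the simplified examples used elsewhere in this document.

```python
from enum import Enum, auto


class DomainConstraint(Enum):
    PATIENT = auto()
    NON_PATIENT = auto()  # assumed name for the "non-patient table" case
    ANY = auto()


# Hypothetical constraints table: operation name -> (return types, frame domain).
SERIES_CONSTRAINTS = {
    "exists": ({bool}, DomainConstraint.PATIENT),
    "count": ({int}, DomainConstraint.PATIENT),
    "add": ({int, float}, DomainConstraint.PATIENT),
    "gt": ({bool}, DomainConstraint.ANY),
}


def matching_operations(type_, is_patient_frame):
    """Return the operations able to produce a `type_` series for this kind of frame."""
    frame_domain = (
        DomainConstraint.PATIENT if is_patient_frame else DomainConstraint.NON_PATIENT
    )
    return [
        name
        for name, (return_types, domain) in SERIES_CONSTRAINTS.items()
        # Both the return type and the frame domain must be compatible.
        if type_ in return_types and domain in (DomainConstraint.ANY, frame_domain)
    ]


int_patient_options = matching_operations(int, is_patient_frame=True)
bool_event_options = matching_operations(bool, is_patient_frame=False)
```

In the real strategy, Hypothesis then picks one of the matching candidates; here the list of candidates is simply returned so the filtering is easy to inspect.
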
### A simple example: count

For example, if we only had strategies for `exists` and `count`, and we call `series(int, patient_table)`,
where `patient_table` has been chosen via `one_row_per_patient_frame()`:

```
    @st.composite
    def series(draw, type_, frame):
        ...
        # define constraints for possible series strategies
        series_constraints = {
            exists: ({bool}, DomainConstraint.PATIENT),
            count: ({int}, DomainConstraint.PATIENT),
        }
        ...
        def constraints_match(s):
            ...

        # find possible series strategies that match constraints, and choose one
        possible_series = [s for s in series_constraints if constraints_match(s)]
        series_strategy = draw(st.sampled_from(possible_series))

        # draw a series from the chosen strategy
        return draw(series_strategy(type_, frame))
```

`exists` returns a bool series with one row per patient.

`count` returns an int series with one row per patient.

The `frame` we've passed in matches the domain constraints for both strategies (it's a
one-row-per-patient frame). However, `exists` produces a bool series, and we need a strategy
that produces an int series, so `count` is the only possible strategy that matches. `series()`
will draw from the `count` strategy.

Note that the `type_` and `frame` are always passed on as arguments to the selected series
strategy, so that all series operation strategies have the same function signature.
However, whether they are actually used depends on the individual strategy.

For individual series strategies, it's important to remember that the `type_` argument they
receive represents the type of the **resulting series**; it may be important for that particular
series node that we know what that resulting type is
(e.g. see the example of [`Add`](#a-more-complex-example-add) below).

Following through to the `count` strategy, called by our call to `series(int, frame)`:

```
    def count(_type, _frame):
        return st.builds(AggregateByPatient.Count, any_frame())
```

`count` is passed a type (int) from the `series()` strategy; this is the expected return
type of the series, but it's not required here. It's also not necessary to use the
one-row-per-patient frame we passed in. (By convention, the arguments are named with leading
underscores to indicate this.)

`count` generates an `AggregateByPatient.Count` node which can be drawn from either a patient-level
or an event-level frame, so we let Hypothesis choose one to use as the input to
`AggregateByPatient.Count`. (Note that the `series` returned *from* `AggregateByPatient.Count`
will be a one-row-per-patient series, consistent with `frame`.)

### A more complex example: add

If we now assume we also have a strategy for `add`, and we again call `series(int, frame)`,
where `frame` has been chosen via `one_row_per_patient_frame()`:

```
    @st.composite
    def series(draw, type_, frame):
        ...
        # define constraints for possible series strategies
        series_constraints = {
            exists: ({bool}, DomainConstraint.PATIENT),
            count: ({int}, DomainConstraint.PATIENT),
            add: ({int, float}, DomainConstraint.PATIENT),
        }
        ...
```

`add` returns either an int or a float series, with one row per patient.

This time, the `frame` we've passed in again matches the domain constraints for all strategies.
We need a strategy that produces an int series, so now `series` can select from the `count` and
`add` strategies. Let's assume it chooses `add`.

```
    def add(type_, frame):
        ...
```

`add` is passed a type (int) from the `series()` strategy; this is the **expected return
type** of the series. In the case of the `add` strategy, this IS important. An `add` strategy
produces an `Add` query model node, which takes two series as arguments, and returns another series.
It can return an int or a float, and the arguments can be either int or float, BUT they all must be
the same.

This is the definition of `Add` in [`query_model.nodes`](ehrql/query_model/nodes.py):

```
class Function:

    class Add(Series[Numeric]):
        lhs: Series[Numeric]
        rhs: Series[Numeric]
```

A numeric type `Series` can be float or int, but both arguments to `Add` must be of the same type,
and it returns another `Series` of that type. In other words, if the expected return type is an int, we
know that we need to build our `add` strategy with int inputs.

So, for each of the lhs and rhs of the `Add` operation, we need a Hypothesis strategy that
produces an int `Series`. We can do that with `series(type_, frame)`.

However, there's a bit more to consider here. As mentioned before, a `Value` is also a type of
`Series`. `Value`s represent constants, and in order to cover the case where the `rhs` or `lhs` is
a constant, we need our strategy to allow a `Value` to be chosen as well. In addition, we have a
`frame` passed to the `add` strategy, but the `lhs` and `rhs` arguments can be chosen from different
tables, so we need our strategy to allow for this too.

So, we have two arguments; we say that one of them MUST be an int series drawn from the provided
frame. This ensures that we don't end up with operations on two `Value` series, which don't touch
the database.

For the second argument, this could be a series drawn from the same frame as the first series, or
from a different frame, OR it could be a value.

And finally, these arguments could be either the `lhs` or the `rhs` of the `Add` operation, so we
need the strategy to choose that too.

This logic is implemented in the `binary_operation_with_types()` helper function, which deals with the
many query model nodes that operate on two inputs, a `lhs` and a `rhs` `Series`.

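The three choices described above (one required series from the given frame, a free choice for the other argument, and a free choice of side) can be sketched as follows. This is a toy model and not the real `binary_operation_with_types()`: it uses the standard library's `random` module instead of Hypothesis, and the function name, the `("series", frame)` / `("value", None)` tuples and the frame labels are all invented for illustration.

```python
import random


def choose_binary_arguments(rng, frame, other_frames):
    """Pick descriptions of the two inputs to a binary operation."""
    # One argument always comes from the provided frame, so the operation
    # is guaranteed to touch the database.
    required = ("series", frame)
    # The other may share that frame, use a different frame, or be a constant.
    options = (
        [("series", frame)]
        + [("series", other) for other in other_frames]
        + [("value", None)]
    )
    other = rng.choice(options)
    # Either argument may end up on the left-hand side.
    lhs, rhs = rng.choice([(required, other), (other, required)])
    return lhs, rhs


# Hypothetical frame labels, for illustration only.
lhs, rhs = choose_binary_arguments(random.Random(0), "p0", ["p1", "e0"])
```

Whatever is chosen, at most one side can be a constant, so the generated operation never consists of two `Value` inputs.
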
### Adding a new variable strategy

Let's assume we want to add the `GT` node as a new variable strategy.

First, look at how the `GT` node is implemented:

```
class Function:
    class GT(Series[bool]):
        lhs: Series[Comparable]
        rhs: Series[Comparable]
```

A `GT` node takes two `Comparable` series as inputs, and returns a boolean series. A `Comparable`
series can be of any type except bool, but the `lhs` and `rhs` series must be of the same type.

We can add a new entry to the `series_constraints` dict in `series()`, defining the return type of
the series (bool), and the domain constraint.
A `GT` operation can return a series consistent with *either* a one-row-per-patient frame, or a
many-rows-per-patient frame:

```
    @st.composite
    def series(draw, type_, frame):
        ...
        # define constraints for possible series strategies
        series_constraints = {
            exists: ({bool}, DomainConstraint.PATIENT),
            count: ({int}, DomainConstraint.PATIENT),
            gt: ({bool}, DomainConstraint.ANY),
        }
        ...
```

Next we define the `gt` strategy. We can use a helper function `any_comparable_type()` to draw a
suitable type. Note that the `_type` passed in to the `gt` function is the expected return type of
the series (bool), and in this case it does not have an impact on the implementation of the `gt`
strategy.

This is a binary operation like many other nodes, so the implementation follows the same strategy as
described for the `add` strategy above.

```
    @st.composite
    def gt(draw, _type, frame):
        # the input type is independent of the (bool) return type
        type_ = draw(any_comparable_type())
        return draw(binary_operation(type_, frame, Function.GT))
```