Diff of /README.md [000000] .. [811e40]

eligibility_criteria_parser
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

To install the module, run the following commands:

``` sh
bash$ git clone https://github.com/megaduks/criteria_parser.git

bash$ cd criteria_parser

bash$ pip install -r requirements.txt

bash$ pip install -e '.[dev]'
```

Next, run `dvc pull` to download the data:

``` bash
bash$ dvc pull
```

## How to use

The function `load_chia()` downloads the entire dataset and returns it
as a dataframe:

``` python
from eligibility_criteria_parser.core import *

df = load_chia()
```

``` python
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>ct_no</th>
      <th>criteria</th>
      <th>mode</th>
      <th>drugs</th>
      <th>persons</th>
      <th>procedures</th>
      <th>conditions</th>
      <th>devices</th>
      <th>visits</th>
      <th>scopes</th>
      <th>observations</th>
      <th>measurements</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>NCT03124329</td>
      <td>Male and female individuals between ages of 18...</td>
      <td>inclusion</td>
      <td>None</td>
      <td>[ages]</td>
      <td>None</td>
      <td>[gingival recession defects, recession defects]</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>[cervical restorations extending to the CEJ]</td>
      <td>[recession, keratinized gingiva, Miller]</td>
    </tr>
    <tr>
      <th>1</th>
      <td>NCT02796378</td>
      <td>Elevated blood-cholesterol</td>
      <td>inclusion</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>[blood-cholesterol]</td>
    </tr>
    <tr>
      <th>2</th>
      <td>NCT03216967</td>
      <td>Adult patients Kidney transplant recipients Pa...</td>
      <td>inclusion</td>
      <td>[calcineurin inhibitor, mycophenolic acid]</td>
      <td>[Adult]</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>[Viremia, pregnancy test, blood ß-HCG dosage]</td>
    </tr>
    <tr>
      <th>3</th>
      <td>NCT02200978</td>
      <td>Patients less than 16 years old with newly dia...</td>
      <td>inclusion</td>
      <td>None</td>
      <td>[old]</td>
      <td>None</td>
      <td>[acute promyelocytic leukemia]</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>[PML-RARa]</td>
    </tr>
    <tr>
      <th>4</th>
      <td>NCT01314898</td>
      <td>Male and/or female healthy volunteers, age 18 ...</td>
      <td>inclusion</td>
      <td>None</td>
      <td>[Male, female, age, Females]</td>
      <td>None</td>
      <td>[healthy, childbearing potential]</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>None</td>
      <td>[Body Mass Index (BMI), total body weight]</td>
    </tr>
  </tbody>
</table>
</div>

The dataset consists of 2000 clinical trial criteria. Each row holds the
trial ID, the criterion text, the mode, and nine entity columns (`drugs`
through `measurements`), which gives 12 columns in total:

``` python
df.shape
```

    (2000, 12)

To extract a particular entity, use the `get_annotations()` function. It
accepts the name of the annotated entity, the number of examples to
retrieve (`n`), and a flag (`random`) that selects random or ordered
retrieval of examples.

The result is a list of tuples. Each tuple contains the clinical trial
ID, the text of the criterion, and the list of annotated entities.

``` python
examples = get_annotations("drugs", n=5, random=False)
examples
```

    [('NCT03216967',
      'Adult patients Kidney transplant recipients Patients treated by a calcineurin inhibitor and mycophenolic acid Viremia >= 3 log UI/ml Patients who have given written informed consent Negative pregnancy test (blood ß-HCG dosage)',
      ['calcineurin inhibitor', 'mycophenolic acid']),
     ('NCT00730301',
      'Patient diagnosed by HRCT Core Lab with eligible heterogeneous disease distribution and at least one complete oblique fissure.  Age from 40 to 75 years  BMI < 32 kg/m2  FEV1 < 40% of predicted value, FEV1/FVC < 70%  TLC > 120% predicted, RV > 150% predicted.  Stable with < 20 mg prednisone (or equivalent) qd  PaCO2 < 50mm Hg  PaO2 > 45 mm Hg on room air  6-min walk of > 50m (without rehabilitation) or > 100m (with rehabilitation)  Nonsmoking for 4 months prior to initial interview and throughout screening  The patient agrees to all protocol required follow-up intervals.  The patient has no child bearing potential  The patient is willing and able to complete protocol required baseline assessments and procedures ',
      ['prednisone']),
     ('NCT02715466',
      'Male or female patients = 18 and = 85 years of age Women of child bearing potential must test negative on standard pregnancy test (urine or serum) Patients with body weight = 55 kg and = 140 kg and body mass index (BMI) = 18 kg/m2 Patients diagnosed severe sepsis / septic shock at admission on Intensive Care Unit who can be enrolled within 90 min after admission OR patients diagnosed severe sepsis / septic shock during Intensive Care Unit stay who can be enrolled within 90 min after diagnosis Patients where antibiotic therapy has already been started (prior to randomization) Patient who are fluid responsive. Fluid responsiveness is defined as increase of > 10% in mean arterial pressure (MAP) after passive leg raising (PLR) Signed informed consent by patient, legal representative or authorized person or deferred consent',
      ['antibiotic therapy']),
     ('NCT02735902',
      'The patient or his/her representative must have given free and informed consent and signed the consent The patient must be insured or beneficiary of a health insurance plan The patient is available for 12 months of follow-up The patient underwent a successful transcutaneous implant procedure for an aortic valve within the past 24 hours The patient was receiving anti-vitamin K (AVK) treatment before percutaneous implantation of the aortic valve',
      ['anti-vitamin K', 'AVK']),
     ('NCT00989261',
      '1. Males and females age ≥18 years in second relapse or refractory.  2. Males and females age ≥60 years in first relapse or refractory.  3. Must have baseline bone marrow sample taken.  4. Morphologically documented primary AML or AML secondary to myelodysplastic syndrome (MDS with ≥20% bone marrow or peripheral blasts), as defined by the World Health Organization (WHO) criteria, confirmed by pathology review at treating institution.  5. Able to swallow the liquid study drug.  6. ECOG performance status of 0 to 2  7. In the absence of rapidly progressing disease, the interval from prior treatment to time of AC220 administration will be at least 2 weeks for cytotoxic agents or at least 5 half-lives for noncytotoxic agents. The use of chemotherapeutic or antileukemic agents other than hydroxyurea is not permitted during the study with the possible exception of intrathecal (IT) therapy at the discretion of the Investigator and with the agreement of the Sponsor.  8. Persistent chronic clinically significant non-hematological toxicities from prior treatment must be ≤Grade 1.  9. Prior therapy with FLT3 inhibitors is permitted, except previous treatment with AC220.  10. Serum creatinine ≤1.5 × ULN and glomerular filtration rate (GFR) > 30 mL/min  11. Serum potassium, magnesium, and calcium levels should be at least within institutional normal limits.  12. Total serum bilirubin ≤1.5 × ULN  13. Serum aspartate transaminase (AST) and/or alanine transaminase (ALT) ≤2.5 × ULN  14. Females of childbearing potential must have a negative pregnancy test (urine β-hCG).  15. Females of childbearing potential and sexually mature males must agree to use a medically accepted method of contraception throughout the study.  16. Written informed consent must be provided. ',
      ['FLT3 inhibitors', 'AC220'])]
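
The behaviour described above can be approximated on a few hand-copied
rows. This is only a sketch of what `get_annotations()` appears to do
(skip rows where the entity is not annotated, then take the first `n` or
a random sample); `get_annotations_sketch`, its `rows` argument, and the
`seed` parameter are assumptions made for illustration and are not part
of the library.

```python
import random as rnd
from typing import Dict, List, Optional, Tuple

Example = Tuple[str, str, List[str]]


def get_annotations_sketch(
    rows: List[Dict[str, object]],
    entity: str,
    n: int,
    random: bool = False,
    seed: Optional[int] = None,  # hypothetical, only for reproducible sampling
) -> List[Example]:
    # Keep only the rows in which the requested entity is annotated.
    annotated = [
        (row["ct_no"], row["criteria"], row[entity])
        for row in rows
        if row.get(entity)
    ]
    if random:
        # Random retrieval: sample without replacement.
        return rnd.Random(seed).sample(annotated, min(n, len(annotated)))
    # Ordered retrieval: the first n annotated rows.
    return annotated[:n]


# Three rows copied from the df.head() preview (texts are truncated there).
rows = [
    {"ct_no": "NCT02796378", "criteria": "Elevated blood-cholesterol", "drugs": None},
    {
        "ct_no": "NCT03216967",
        "criteria": "Adult patients Kidney transplant recipients Pa...",
        "drugs": ["calcineurin inhibitor", "mycophenolic acid"],
    },
    {
        "ct_no": "NCT02200978",
        "criteria": "Patients less than 16 years old with newly dia...",
        "drugs": None,
    },
]

print(get_annotations_sketch(rows, "drugs", n=5))
```

Only the `NCT03216967` row survives the filter, which matches the first
tuple in the output above.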

To use this data for prompting, the IDs, criteria, and annotations have
to be separated into three lists.

``` python
ids, criteria, ents_true = map(list, zip(*examples))

print(ids[:3])
print(criteria[:3])
print(ents_true[:3])
```

    ['NCT03216967', 'NCT00730301', 'NCT02715466']
    ['Adult patients Kidney transplant recipients Patients treated by a calcineurin inhibitor and mycophenolic acid Viremia >= 3 log UI/ml Patients who have given written informed consent Negative pregnancy test (blood ß-HCG dosage)', 'Patient diagnosed by HRCT Core Lab with eligible heterogeneous disease distribution and at least one complete oblique fissure.  Age from 40 to 75 years  BMI < 32 kg/m2  FEV1 < 40% of predicted value, FEV1/FVC < 70%  TLC > 120% predicted, RV > 150% predicted.  Stable with < 20 mg prednisone (or equivalent) qd  PaCO2 < 50mm Hg  PaO2 > 45 mm Hg on room air  6-min walk of > 50m (without rehabilitation) or > 100m (with rehabilitation)  Nonsmoking for 4 months prior to initial interview and throughout screening  The patient agrees to all protocol required follow-up intervals.  The patient has no child bearing potential  The patient is willing and able to complete protocol required baseline assessments and procedures ', 'Male or female patients = 18 and = 85 years of age Women of child bearing potential must test negative on standard pregnancy test (urine or serum) Patients with body weight = 55 kg and = 140 kg and body mass index (BMI) = 18 kg/m2 Patients diagnosed severe sepsis / septic shock at admission on Intensive Care Unit who can be enrolled within 90 min after admission OR patients diagnosed severe sepsis / septic shock during Intensive Care Unit stay who can be enrolled within 90 min after diagnosis Patients where antibiotic therapy has already been started (prior to randomization) Patient who are fluid responsive. Fluid responsiveness is defined as increase of > 10% in mean arterial pressure (MAP) after passive leg raising (PLR) Signed informed consent by patient, legal representative or authorized person or deferred consent']
    [['calcineurin inhibitor', 'mycophenolic acid'], ['prednisone'], ['antibiotic therapy']]

The last step is to prepare two utility functions:

- prompting function: creates a prompt for a given example
- deprompting function: reads the answer from the language model and
  extracts the predicted entities

Below is an example of a simple prompting function. It builds a template
from the first `n_shots` examples and appends the `criterion` for which
the language model has to generate a response.

``` python
from typing import List, Tuple

def simple_prompt(criterion: str, examples: List[Tuple[str, str, List[str]]], entity: str, n_shots: int) -> str:
    # Build the few-shot part of the prompt from the first n_shots examples.
    TEXT = ""
    for _, c, e in examples[:n_shots]:
        TEXT += f"""[text]: {c} \n###\n[{entity}]: {e} \n###\n"""

    # Append the target criterion and leave the entity slot open for the model.
    return f"""{TEXT}[text]: {criterion} \n###\n[{entity}]:"""
```

As can be seen from the signature, the function accepts the following
inputs:

- `criterion`: the input example
- `examples`: list of tuples (clinical trial ID, criterion, true
  entities) from which the few-shot examples are taken
- `entity`: the name of the entity
- `n_shots`: the number of examples to include in the prompt

The `examples` input has exactly the same structure as the output of the
`get_annotations()` function.
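
Because `simple_prompt` takes the first `n_shots` entries of `examples`,
a target criterion that is itself among those entries would appear in
the prompt together with its annotation. One way to avoid this is to
filter the examples first. The helper below is a hypothetical sketch
(`select_shots` is not part of the library, and the trial IDs are
made-up placeholders).

```python
from typing import List, Tuple

Example = Tuple[str, str, List[str]]


def select_shots(examples: List[Example], criterion: str, n_shots: int) -> List[Example]:
    # Drop every example whose text equals the target criterion,
    # then keep at most n_shots of the remaining ones.
    return [ex for ex in examples if ex[1] != criterion][:n_shots]


examples = [
    ("NCT-A", "Stable with < 20 mg prednisone", ["prednisone"]),
    ("NCT-B", "Prior therapy with FLT3 inhibitors is permitted", ["FLT3 inhibitors"]),
    ("NCT-C", "Elevated blood-cholesterol", []),
]

target = examples[1][1]
shots = select_shots(examples, target, n_shots=2)
print([ct_id for ct_id, _, _ in shots])  # ['NCT-A', 'NCT-C']
```

The filtered list can then be passed as the `examples` argument of the
prompting function.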

Let’s test the prompt generated by the function:

``` python
ct_id, criterion, e_true = examples[-1]

print(f"criterion: {criterion} \n\n annotated drugs: {e_true}")
```

    criterion: 1. Males and females age ≥18 years in second relapse or refractory.  2. Males and females age ≥60 years in first relapse or refractory.  3. Must have baseline bone marrow sample taken.  4. Morphologically documented primary AML or AML secondary to myelodysplastic syndrome (MDS with ≥20% bone marrow or peripheral blasts), as defined by the World Health Organization (WHO) criteria, confirmed by pathology review at treating institution.  5. Able to swallow the liquid study drug.  6. ECOG performance status of 0 to 2  7. In the absence of rapidly progressing disease, the interval from prior treatment to time of AC220 administration will be at least 2 weeks for cytotoxic agents or at least 5 half-lives for noncytotoxic agents. The use of chemotherapeutic or antileukemic agents other than hydroxyurea is not permitted during the study with the possible exception of intrathecal (IT) therapy at the discretion of the Investigator and with the agreement of the Sponsor.  8. Persistent chronic clinically significant non-hematological toxicities from prior treatment must be ≤Grade 1.  9. Prior therapy with FLT3 inhibitors is permitted, except previous treatment with AC220.  10. Serum creatinine ≤1.5 × ULN and glomerular filtration rate (GFR) > 30 mL/min  11. Serum potassium, magnesium, and calcium levels should be at least within institutional normal limits.  12. Total serum bilirubin ≤1.5 × ULN  13. Serum aspartate transaminase (AST) and/or alanine transaminase (ALT) ≤2.5 × ULN  14. Females of childbearing potential must have a negative pregnancy test (urine β-hCG).  15. Females of childbearing potential and sexually mature males must agree to use a medically accepted method of contraception throughout the study.  16. Written informed consent must be provided.  

     annotated drugs: ['FLT3 inhibitors', 'AC220']

``` python
prompt = simple_prompt(criterion=criterion, examples=examples, entity="drugs", n_shots=3)

print(prompt)
```

    [text]: Adult patients Kidney transplant recipients Patients treated by a calcineurin inhibitor and mycophenolic acid Viremia >= 3 log UI/ml Patients who have given written informed consent Negative pregnancy test (blood ß-HCG dosage) 
    ###
    [drugs]: ['calcineurin inhibitor', 'mycophenolic acid'] 
    ###
    [text]: Patient diagnosed by HRCT Core Lab with eligible heterogeneous disease distribution and at least one complete oblique fissure.  Age from 40 to 75 years  BMI < 32 kg/m2  FEV1 < 40% of predicted value, FEV1/FVC < 70%  TLC > 120% predicted, RV > 150% predicted.  Stable with < 20 mg prednisone (or equivalent) qd  PaCO2 < 50mm Hg  PaO2 > 45 mm Hg on room air  6-min walk of > 50m (without rehabilitation) or > 100m (with rehabilitation)  Nonsmoking for 4 months prior to initial interview and throughout screening  The patient agrees to all protocol required follow-up intervals.  The patient has no child bearing potential  The patient is willing and able to complete protocol required baseline assessments and procedures  
    ###
    [drugs]: ['prednisone'] 
    ###
    [text]: Male or female patients = 18 and = 85 years of age Women of child bearing potential must test negative on standard pregnancy test (urine or serum) Patients with body weight = 55 kg and = 140 kg and body mass index (BMI) = 18 kg/m2 Patients diagnosed severe sepsis / septic shock at admission on Intensive Care Unit who can be enrolled within 90 min after admission OR patients diagnosed severe sepsis / septic shock during Intensive Care Unit stay who can be enrolled within 90 min after diagnosis Patients where antibiotic therapy has already been started (prior to randomization) Patient who are fluid responsive. Fluid responsiveness is defined as increase of > 10% in mean arterial pressure (MAP) after passive leg raising (PLR) Signed informed consent by patient, legal representative or authorized person or deferred consent 
    ###
    [drugs]: ['antibiotic therapy'] 
    ###
    [text]: 1. Males and females age ≥18 years in second relapse or refractory.  2. Males and females age ≥60 years in first relapse or refractory.  3. Must have baseline bone marrow sample taken.  4. Morphologically documented primary AML or AML secondary to myelodysplastic syndrome (MDS with ≥20% bone marrow or peripheral blasts), as defined by the World Health Organization (WHO) criteria, confirmed by pathology review at treating institution.  5. Able to swallow the liquid study drug.  6. ECOG performance status of 0 to 2  7. In the absence of rapidly progressing disease, the interval from prior treatment to time of AC220 administration will be at least 2 weeks for cytotoxic agents or at least 5 half-lives for noncytotoxic agents. The use of chemotherapeutic or antileukemic agents other than hydroxyurea is not permitted during the study with the possible exception of intrathecal (IT) therapy at the discretion of the Investigator and with the agreement of the Sponsor.  8. Persistent chronic clinically significant non-hematological toxicities from prior treatment must be ≤Grade 1.  9. Prior therapy with FLT3 inhibitors is permitted, except previous treatment with AC220.  10. Serum creatinine ≤1.5 × ULN and glomerular filtration rate (GFR) > 30 mL/min  11. Serum potassium, magnesium, and calcium levels should be at least within institutional normal limits.  12. Total serum bilirubin ≤1.5 × ULN  13. Serum aspartate transaminase (AST) and/or alanine transaminase (ALT) ≤2.5 × ULN  14. Females of childbearing potential must have a negative pregnancy test (urine β-hCG).  15. Females of childbearing potential and sexually mature males must agree to use a medically accepted method of contraception throughout the study.  16. Written informed consent must be provided.  
    ###
    [drugs]:

Similarly, a deprompting function has to be created to parse the answer
from the language model and extract only the part relevant to the
predicted entities. Below is an example of a simple deprompting
function. It assumes that the output of the language model **does not
contain the input prompt**. The function removes all punctuation and all
mentions of the entity name, splits the rest on whitespace, and returns
a list of the unique terms generated by the language model.
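
``` python
import string
from typing import List

def simple_deprompt(model_output: str, entity: str) -> List[str]:
    # Strip punctuation, drop mentions of the entity name, and split on
    # whitespace; the set removes duplicates (order is not preserved).
    return list(
        set(
            model_output.translate(str.maketrans("", "", string.punctuation))
            .replace(entity, "")
            .split()
        )
    )
```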
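
To see what this kind of deprompting returns, the function can be run on
a made-up model answer. The block restates the function so that it runs
on its own; `fake_output` is an invented string, not real BioGPT output.
Note that splitting on whitespace breaks multi-word entities into
separate tokens.

```python
import string
from typing import List


def simple_deprompt(model_output: str, entity: str) -> List[str]:
    # Remove punctuation, remove the entity name, split on whitespace,
    # and keep each token once.
    cleaned = model_output.translate(str.maketrans("", "", string.punctuation))
    return list(set(cleaned.replace(entity, "").split()))


fake_output = " ['FLT3 inhibitors', 'AC220', 'AC220']"

# The set has no stable order, so sort before printing.
tokens = sorted(simple_deprompt(fake_output, "drugs"))
print(tokens)  # ['AC220', 'FLT3', 'inhibitors']
```

Here the annotated entity `FLT3 inhibitors` comes back as the two tokens
`FLT3` and `inhibitors`, which matters when predictions are compared
with the true entities.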

The prediction is performed by the
[`fit_prompt`](https://Mikołaj%20Morzy.github.io/eligibility_criteria_parser/core.html#fit_prompt)
function, which expects the following parameters:

- `examples`: list of examples for which to perform prompting
- `entity`: name of the entity
- `model`: an object representing the BioGPT model
- `prompt_fun`: a handle to the prompting function
- `deprompt_fun`: a handle to the deprompting function

Assuming the BioGPT model has been correctly initialized under the
`model` variable, the function is invoked as follows:

``` python
# from fairseq.models.transformer_lm import TransformerLanguageModel

# model = TransformerLanguageModel.from_pretrained(
#     "biogpt/checkpoints/Pre-trained-BioGPT",
#     "checkpoint.pt",
#     "biogpt/BioGPT/data",
#     tokenizer='moses',
#     bpe='fastbpe',
#     bpe_codes="biogpt/BioGPT/data/bpecodes",
#     min_len=100,
#     max_len_b=2048,
#     cuda=True,
#     verbose=False,
# )

model = None  # placeholder: initialize the model as in the commented-out code above

ents_pred = fit_prompt(examples, "drugs", model, simple_prompt, simple_deprompt)
```
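
The internals of `fit_prompt` are not shown in this README. The loop
below is only a guess at the general shape of such a function, written
so that it runs without BioGPT: `fit_prompt_sketch`, the `generate`
callable that stands in for the model object, the `n_shots` default, and
all `toy_*` helpers are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str, List[str]]


def fit_prompt_sketch(
    examples: List[Example],
    entity: str,
    generate: Callable[[str], str],  # hypothetical stand-in for the model
    prompt_fun: Callable[[str, List[Example], str, int], str],
    deprompt_fun: Callable[[str, str], List[str]],
    n_shots: int = 3,
) -> List[List[str]]:
    predictions = []
    for i, (_, criterion, _) in enumerate(examples):
        # Use the other examples as few-shot demonstrations.
        shots = examples[:i] + examples[i + 1:]
        prompt = prompt_fun(criterion, shots, entity, n_shots)
        # Ask the model, then parse its answer into a list of entities.
        predictions.append(deprompt_fun(generate(prompt), entity))
    return predictions


# Toy stand-ins so that the sketch runs without a language model.
def toy_prompt(criterion, examples, entity, n_shots):
    shots = "".join(f"[text]: {c}\n[{entity}]: {e}\n" for _, c, e in examples[:n_shots])
    return f"{shots}[text]: {criterion}\n[{entity}]:"


def toy_generate(prompt):
    # Pretend the model answers with the last word of the target criterion.
    target = prompt.rsplit("[text]: ", 1)[1].split("\n")[0]
    return target.split()[-1]


def toy_deprompt(model_output, entity):
    return model_output.split()


toy_examples = [
    ("NCT-A", "Stable with prednisone", ["prednisone"]),
    ("NCT-B", "Treated with AC220", ["AC220"]),
]

ents_pred = fit_prompt_sketch(toy_examples, "drugs", toy_generate, toy_prompt, toy_deprompt)
print(ents_pred)  # [['prednisone'], ['AC220']]
```

The result has one list of predicted entities per input example, the
same shape as the list of true entities.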

Finally, the results can be computed with a single function,
`prompt_score()`, which accepts two lists: the true entities and the
entities predicted by the language model. Both arguments are lists of
lists of strings. The true entities come from the output of the
`get_annotations()` function (the `ents_true` list above), and the
predicted entities are the result of the `fit_prompt()` function.

The result is a dictionary with one key for each mode of the Jaccard
coefficient (*strict, left, right, relaxed*). Each value is a tuple of
four numbers:

- mean Jaccard score of entity matches
- standard deviation of the Jaccard scores of entity matches
- mean percentage coverage of entities
- standard deviation of the percentage coverages
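
How `prompt_score()` defines the four modes is not described here. As a
rough illustration of the underlying measure only, the sketch below
computes the plain Jaccard coefficient between the set of true entities
and the set of predicted entities for each example, then the mean and
standard deviation over all examples. The function names, the use of the
population standard deviation, and the convention that two empty sets
score 1.0 are assumptions, not the library's implementation.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def jaccard(true: List[str], pred: List[str]) -> float:
    # |A ∩ B| / |A ∪ B| on sets of entity strings.
    a, b = set(true), set(pred)
    if not a and not b:
        return 1.0  # assumed convention: nothing expected, nothing predicted
    return len(a & b) / len(a | b)


def jaccard_summary(
    ents_true: List[List[str]], ents_pred: List[List[str]]
) -> Tuple[float, float]:
    # One score per example, then mean and population standard deviation.
    scores = [jaccard(t, p) for t, p in zip(ents_true, ents_pred)]
    return mean(scores), pstdev(scores)


ents_true = [["prednisone"], ["FLT3 inhibitors", "AC220"]]
ents_pred = [["prednisone"], ["AC220", "hydroxyurea"]]

mean_score, std_score = jaccard_summary(ents_true, ents_pred)
print(round(mean_score, 3), round(std_score, 3))
```

The first example matches exactly (score 1), while the second shares one
entity out of three distinct ones (score 1/3).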