# Detecting dates

We now know how to match a terminology and qualify detected entities, which covers most use cases for a typical medical NLP project.
In this tutorial, we'll see how to use EDS-NLP to detect and normalise date mentions using `eds.dates`.

This can have many applications, for dating medical events in particular.
The `eds.consultation_dates` component, for instance,
combines the date detection capabilities with a few simple patterns to detect the date of the consultation, when mentioned in clinical reports.

## Dates in clinical notes

Consider the following example:

=== "French"

    ```
    Le patient est admis le 21 janvier pour une douleur dans le cou.
    Il se plaint d'une douleur chronique qui a débuté il y a trois ans.
    ```

=== "English"

    ```
    The patient is admitted on January 21st for neck pain.
    He complains about chronic pain that started three years ago.
    ```

Clinical notes contain many different types of dates. To name a few examples:

| Type     | Description                         | Examples                                         |
| -------- | ----------------------------------- | ------------------------------------------------ |
| Absolute | Explicit date                       | `2022-03-03`                                     |
| Partial  | Date missing the day, month or year | `le 3 janvier/on January 3rd`, `en 2021/in 2021` |
| Relative | Relative dates                      | `hier/yesterday`, `le mois dernier/last month`   |
| Duration | Durations                           | `pendant trois mois/for three months`            |

!!! warning

    We show an English example just to explain the issue.
    EDS-NLP remains a **French-language** medical NLP library.

## Extracting dates

The following snippet adds the `eds.dates` component to the pipeline:

```python
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.dates())  # (1)

text = (
    "Le patient est admis le 21 janvier pour une douleur dans le cou.\n"
    "Il se plaint d'une douleur chronique qui a débuté il y a trois ans."
)

# Detecting dates becomes trivial
doc = nlp(text)

# Likewise, accessing detected dates is hassle-free
dates = doc.spans["dates"]  # (2)
```

1. The date detection component is declared with `eds.dates`
2. Dates are saved in the `#!python doc.spans["dates"]` key

After this, accessing dates and their normalisation becomes trivial:

```python
# ↑ Omitted code above ↑

dates  # (1)
# Out: [21 janvier, il y a trois ans]
```

1. `dates` is a list of spaCy `Span` objects.

## Normalisation

We can review each date and get its normalisation:

| `date.text`        | `date._.date`                               |
| ------------------ | ------------------------------------------- |
| `21 janvier`       | `#!python {"day": 21, "month": 1}`          |
| `il y a trois ans` | `#!python {"direction": "past", "year": 3}` |

Dates detected by the pipeline component are parsed into a dictionary-like object.
It includes all the information that is actually contained in the text.

To get a more usable representation, you may call the `to_datetime()` method.
If there's enough information, the date will be represented
as a `datetime.datetime` or `datetime.timedelta` object. If some information is missing,
it will return `None`.
In that case, you can set the `infer_from_context` parameter to `True`
and, optionally, provide a value for `note_datetime`.

!!! note "Date normalisation"

    Since dates can be missing some information (e.g. `en août`), we refrain from
    outputting a `datetime` object in that case. Doing so would amount to guessing,
    and we made the choice of letting you decide how you want to handle missing dates.
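
If you do decide to fill in the gaps yourself, one possible policy is to borrow the missing year from the note's date and to fall back on a default day or month. The sketch below illustrates this with plain dictionaries and the standard library only. Note that `complete_partial_date` is a hypothetical helper written for this tutorial, not part of EDS-NLP, and that the dictionaries merely mimic the parsed output shown in the table above.

```python
from datetime import datetime
from typing import Dict, Optional


def complete_partial_date(
    parts: Dict[str, int],
    note_datetime: Optional[datetime] = None,
    default_day: int = 1,
    default_month: int = 1,
) -> Optional[datetime]:
    """Hypothetical helper: build a datetime from a partial absolute date."""
    year = parts.get("year")
    if year is None:
        # Without a year, we can only guess if the note's date is known
        if note_datetime is None:
            return None
        year = note_datetime.year
    month = parts.get("month", default_month)
    day = parts.get("day", default_day)
    return datetime(year, month, day)


note_datetime = datetime(1999, 8, 27)

# "21 janvier": the year is borrowed from the note
print(complete_partial_date({"day": 21, "month": 1}, note_datetime))
# Out: 1999-01-21 00:00:00

# "mai 1995": the day falls back to the chosen default
print(complete_partial_date({"month": 5, "year": 1995}, default_day=15))
# Out: 1995-05-15 00:00:00

# Without a note date, a missing year cannot be inferred
print(complete_partial_date({"day": 21, "month": 1}))
# Out: None
```

Whether borrowing the year from the note is appropriate depends on your data: a note written in January that mentions `décembre` most likely refers to the previous year.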

## What next?

The `eds.dates` pipe component's role is merely to detect and normalise dates.
It is the user's responsibility to use this information in a downstream application.

For instance, you could use this pipeline to date medical entities. Let's do that.

### A medical event tagger

Our pipeline will detect entities and dates separately,
and we will post-process the output `Doc` object to determine
whether a given entity can be linked to a date.

```python
import edsnlp, edsnlp.pipes as eds
from datetime import datetime

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.dates())
nlp.add_pipe(
    eds.matcher(
        regex=dict(admission=["admissions?", "admise?", "prise? en charge"]),
        attr="LOWER",
    )
)

text = (
    "Le patient est admis le 12 avril pour une douleur "
    "survenue il y a trois jours. "
    "Il avait été pris en charge l'année dernière. "
    "Il a été diagnostiqué en mai 1995."
)

doc = nlp(text)
```

At this point, the document is ready to be post-processed: its `ents` and `#!python spans["dates"]` are populated:

```python
# ↑ Omitted code above ↑

doc.ents
# Out: (admis, pris en charge)

doc.spans["dates"]
# Out: [12 avril, il y a trois jours, l'année dernière, mai 1995]

note_datetime = datetime(year=1999, month=8, day=27)

for i, date in enumerate(doc.spans["dates"]):
    print(
        i,
        " - ",
        date,
        " - ",
        date._.date.to_datetime(
            note_datetime=note_datetime, infer_from_context=False, tz=None
        ),
    )
# Out: 0 - 12 avril - None
# Out: 1 - il y a trois jours - 1999-08-24 00:00:00
# Out: 2 - l'année dernière - 1998-08-27 00:00:00
# Out: 3 - mai 1995 - None

for i, date in enumerate(doc.spans["dates"]):
    print(
        i,
        " - ",
        date,
        " - ",
        date._.date.to_datetime(
            note_datetime=note_datetime,
            infer_from_context=True,
            tz=None,
            default_day=15,
        ),
    )
# Out: 0 - 12 avril - 1999-04-12 00:00:00
# Out: 1 - il y a trois jours - 1999-08-24 00:00:00
# Out: 2 - l'année dernière - 1998-08-27 00:00:00
# Out: 3 - mai 1995 - 1995-05-15 00:00:00
```
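
The values obtained for the relative dates follow from simple arithmetic on `note_datetime`. As a rough illustration, the following sketch reproduces that arithmetic with the standard library. The `resolve_relative` function and its `direction`, `unit` and `value` arguments are hypothetical names chosen for this example; they do not reflect how EDS-NLP implements `to_datetime()`.

```python
from datetime import datetime, timedelta


def resolve_relative(
    note_datetime: datetime, direction: str, unit: str, value: int
) -> datetime:
    """Hypothetical helper: shift the note's date by a relative offset."""
    sign = -1 if direction == "past" else 1
    if unit == "year":
        # timedelta has no notion of years, so shift the year field directly
        # (29 February would need special care, which this sketch ignores)
        return note_datetime.replace(year=note_datetime.year + sign * value)
    if unit == "day":
        return note_datetime + sign * timedelta(days=value)
    raise ValueError(f"Unsupported unit: {unit}")


note_datetime = datetime(year=1999, month=8, day=27)

# "il y a trois jours": three days before the note
print(resolve_relative(note_datetime, "past", "day", 3))
# Out: 1999-08-24 00:00:00

# "l'année dernière": one year before the note
print(resolve_relative(note_datetime, "past", "year", 1))
# Out: 1998-08-27 00:00:00
```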

As a first heuristic, let's consider that an entity can be linked to a date if the two are in the same
sentence. In the case where multiple dates are present, we'll select the closest one.

```python title="utils.py"
from spacy.tokens import Span
from typing import List, Optional


def candidate_dates(ent: Span) -> List[Span]:
    """Return every date in the same sentence as the entity"""
    return [date for date in ent.doc.spans["dates"] if date.sent == ent.sent]


def get_event_date(ent: Span) -> Optional[Span]:
    """Link an entity to the closest date in the sentence, if any"""

    dates = candidate_dates(ent)  # (1)

    if not dates:
        return None

    # Distance, in tokens, between the entity and each candidate date
    dates = sorted(
        dates,
        key=lambda d: min(abs(d.start - ent.end), abs(ent.start - d.end)),
    )

    return dates[0]  # (2)
```

1. Get all dates present in the same sentence.
2. Sort the dates by their distance to the entity, and keep the first item.
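
To see how the sort key behaves, it can help to replace the spaCy spans with plain token offsets. The following stand-alone sketch does just that. The `Mention` tuple and the offsets are made up for illustration, and only approximate how the example sentence would be tokenised.

```python
from typing import List, NamedTuple, Optional


class Mention(NamedTuple):
    """Hypothetical stand-in for a spaCy Span: token offsets, end excluded."""

    text: str
    start: int
    end: int


def token_gap(ent: Mention, date: Mention) -> int:
    """Same key as in `get_event_date`: smallest distance between boundaries."""
    return min(abs(date.start - ent.end), abs(ent.start - date.end))


def closest(ent: Mention, dates: List[Mention]) -> Optional[Mention]:
    """Return the date with the smallest gap, or None if there is none."""
    return min(dates, key=lambda d: token_gap(ent, d), default=None)


# Illustrative offsets for the sentence:
# "Le patient est admis le 12 avril pour une douleur survenue il y a trois jours."
ent = Mention("admis", 3, 4)
dates = [
    Mention("12 avril", 5, 7),
    Mention("il y a trois jours", 11, 16),
]

for date in dates:
    print(date.text, token_gap(ent, date))
# Out: 12 avril 1
# Out: il y a trois jours 7

print(closest(ent, dates).text)
# Out: 12 avril
```

Using `min` instead of sorting gives the same result here, since only the first item of the sorted list is kept.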

We can apply this simple function:

```python
import edsnlp, edsnlp.pipes as eds
from datetime import datetime

from utils import get_event_date  # defined in utils.py above

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.dates())
nlp.add_pipe(
    eds.matcher(
        regex=dict(admission=["admissions?", "admise?", "prise? en charge"]),
        attr="LOWER",
    )
)

text = (
    "Le patient est admis le 12 avril pour une douleur "
    "survenue il y a trois jours. "
    "Il avait été pris en charge l'année dernière."
)

doc = nlp(text)
now = datetime.now()

for ent in doc.ents:
    if ent.label_ != "admission":
        continue
    date = get_event_date(ent)
    if date is None:
        # No date was found in the same sentence as the entity
        continue
    day = date._.date.to_datetime(now).strftime("%d/%m/%Y")
    duration = date._.date.to_duration(now)
    print(f"{ent.text:<20}{date.text:<20}{day:<15}{duration}")
# The exact values depend on the day the code is run
# Out: admis               12 avril            12/04/2023     21 weeks 4 days 6 hours 3 minutes 26 seconds
# Out: pris en charge      l'année dernière    10/09/2022     -1 year
```

The results can be summarised as follows (the exact values depend on the date on which the code is run, since it relies on `datetime.now()`):

| `ent`          | `get_event_date(ent)` | `get_event_date(ent)._.date.to_datetime()` |
|----------------|-----------------------|--------------------------------------------|
| admis          | 12 avril              | `2020-04-12T00:00:00+02:00`                |
| pris en charge | l'année dernière      | `-1 year`                                  |