Diff of /docs/tutorials/endlines.md [000000] .. [cad161]

--- a
+++ b/docs/tutorials/endlines.md
@@ -0,0 +1,105 @@
+# Detecting end-of-lines
+
+A common problem in medical corpora is that the character `\n` does not necessarily mark a real line break, as it does in other domains.
+
+For example, it is common to find texts like:
+
+```
+Il doit prendre
+le medicament indiqué 3 fois par jour. Revoir médecin
+dans 1 mois.
+```
+
+!!! note "Inserted new line characters"
+
+    This issue is especially common in clinical notes that have been extracted from PDF documents.
+    In that case, the new line character may have been deliberately inserted by the doctor or, more likely,
+    added to preserve the layout when the PDF was generated.
+
+The aim of this tutorial is to train an unsupervised model to detect these _false endlines_ and to use it for inference.
+The implemented model is based on the work of Zweigenbaum et al. [@zweigenbaum2016].
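
To make the notion of a _false endline_ concrete, here is a minimal sketch of a naive rule-based baseline. It is not the unsupervised model trained in this tutorial, and the rule itself (and the names `FALSE_ENDLINE` and `merge_false_endlines`) is an assumption chosen for illustration: a new line is treated as false when the previous character does not close a sentence or a heading and the next line starts with a lowercase letter.

```python
import re

# Naive rule (an assumption for illustration, not the tutorial's model):
# a newline is "false" when the previous character is not closing
# punctuation and the next line starts with a lowercase letter.
FALSE_ENDLINE = re.compile(r"(?<![.!?:;])\n(?=[a-zà-ÿ])")


def merge_false_endlines(text: str) -> str:
    """Replace newlines matched by the naive rule with a single space."""
    return FALSE_ENDLINE.sub(" ", text)


note = (
    "Il doit prendre\n"
    "le medicament indiqué 3 fois par jour. Revoir médecin\n"
    "dans 1 mois.\n"
)
merged = merge_false_endlines(note)
print(merged)
```

A rule like this is brittle: it misses false endlines that precede a capitalised word or a number, and it can wrongly merge list items that start in lowercase. That is the kind of case a model fitted on the corpus is meant to handle better.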
+
+## Training the model
+
+Let's train the model using an example corpus of three documents:
+
+```python
+import edsnlp
+from edsnlp.pipes.core.endlines.model import EndLinesModel
+
+nlp = edsnlp.blank("eds")
+
+text1 = """Le patient est arrivé hier soir.
+Il est accompagné par son fils
+
+ANTECEDENTS
+Il a fait une TS en 2010;
+Fumeur, il est arrêté il a 5 mois
+Chirurgie de coeur en 2011
+CONCLUSION
+Il doit prendre
+le medicament indiqué 3 fois par jour. Revoir médecin
+dans 1 mois.
+DIAGNOSTIC :
+
+Antecedents Familiaux:
+- 1. Père avec diabète
+"""
+
+text2 = """J'aime le \nfromage...\n"""
+text3 = (
+    "\n"
+    "Intervention(s) - acte(s) réalisé(s) :\n"
+    "Parathyroïdectomie élective le [DATE]"
+)
+
+texts = [
+    text1,
+    text2,
+    text3,
+]
+
+corpus = nlp.pipe(texts)
+
+# Fit the model
+endlines = EndLinesModel(nlp=nlp)  # (1)
+df = endlines.fit_and_predict(corpus)  # (2)
+
+# Save model
+PATH = "/tmp/path_to_model"
+endlines.save(PATH)
+```
+
+1. Initialize the [`EndLinesModel`][edsnlp.pipes.core.endlines.model.EndLinesModel]
+   object, then fit it on (and predict over) the training corpus.
+2. The corpus should be an iterable of edsnlp documents.
+
+## Use a trained model for inference
+
+```{ .python .no-check }
+import edsnlp, edsnlp.pipes as eds
+
+nlp = edsnlp.blank("eds")
+
+PATH = "/tmp/path_to_model"  # same path as the one passed to `endlines.save`
+nlp.add_pipe(eds.endlines(model_path=PATH))  # (1)
+nlp.add_pipe(eds.sentences())  # (2)
+
+# text1, text2 and text3 are the example documents defined in the training section
+docs = list(nlp.pipe([text1, text2, text3]))
+
+doc = docs[1]
+doc
+# Out: J'aime le
+# Out: fromage...
+
+list(doc.sents)[0]
+# Out: J'aime le
+# Out: fromage...
+```
+
+1. You should specify the path to the trained model here.
+2. All false new lines are excluded by setting their `tag` to 'EXCLUDED', and the `tag` of every true new line is set to 'ENDLINE'.
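
The tagging described above can be used to rebuild a cleaned text. The sketch below assumes that the `tag` mentioned in the tutorial is exposed through spaCy's `Token.tag_` string attribute. The helper `join_false_endlines` and the stand-in tokens are hypothetical: they only mimic the few token attributes used here, so that the example runs without a trained model, and they do not reproduce the actual tokenization.

```python
from types import SimpleNamespace


def join_false_endlines(tokens):
    """Rebuild the text, turning newline tokens tagged 'EXCLUDED' into spaces."""
    parts = []
    for token in tokens:
        if token.tag_ == "EXCLUDED":
            # Avoid a double space when the previous token already ends with one
            if not (parts and parts[-1].endswith(" ")):
                parts.append(" ")
        else:
            parts.append(token.text + token.whitespace_)
    return "".join(parts)


# Hypothetical stand-in tokens mimicking "J'aime le \nfromage...\n"
tokens = [
    SimpleNamespace(text="J'", whitespace_="", tag_=""),
    SimpleNamespace(text="aime", whitespace_=" ", tag_=""),
    SimpleNamespace(text="le", whitespace_=" ", tag_=""),
    SimpleNamespace(text="\n", whitespace_="", tag_="EXCLUDED"),
    SimpleNamespace(text="fromage", whitespace_="", tag_=""),
    SimpleNamespace(text="...", whitespace_="", tag_=""),
    SimpleNamespace(text="\n", whitespace_="", tag_="ENDLINE"),
]

cleaned = join_false_endlines(tokens)
print(repr(cleaned))
```

With a real pipeline, the same function could presumably be called on a processed `doc` directly, since it only reads `text`, `whitespace_` and `tag_` from each token.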
+
+## Declared extensions
+
+Marking false new lines as excluded lets downstream matchers skip them (see [normalisation](../pipes/core/normalizer.md) for more detail).