[cad161]: / docs / tokenizers.md

Download this file

36 lines (24 with data), 1.4 kB

Tokenizers

In addition to the standard spaCy FrenchLanguage (fr), EDS-NLP offers a new language better fit
for French clinical documents: EDSLanguage (eds). Additionally, the EDSLanguage document creation should be around 5-6 times faster than
the fr language. The main differences lie in the tokenization process.

A comparison of the two tokenization methods is demonstrated below:

Example FrenchLanguage EDSLanguage
ACR5 [ACR5] [ACR, 5]
26.5/ [26.5/] [26.5, /]
\n \n CONCLUSION [\n \n, CONCLUSION] [\n, \n, CONCLUSION]
l'artère [l', artère] [l', artère] (same)
Dr. Pichon [Dr, ., Pichon] [Dr., Pichon]
B.H.HP.A.7.A [B.H.HP.A.7.A] [B., H., HP., A, 7, A, 0]

To instantiate one of the two languages, you can call the spacy.blank method.

=== "EDSLanguage"

```python
import edsnlp

nlp = edsnlp.blank("eds")
```

=== "FrenchLanguage"

```python
import edsnlp

nlp = edsnlp.blank("fr")
```