# Tokenizers

In addition to the standard spaCy `FrenchLanguage` (`fr`), EDS-NLP offers a language better suited to
French clinical documents: `EDSLanguage` (`eds`). The main differences lie in the tokenization process.
Document creation with `EDSLanguage` should also be around 5-6 times faster than with the `fr` language.

The table below compares the two tokenization methods:

| Example            | FrenchLanguage            | EDSLanguage                               |
|--------------------|---------------------------|-------------------------------------------|
| `ACR5`             | \[`ACR5`\]                | \[`ACR`, `5`\]                            |
| `26.5/`            | \[`26.5/`\]               | \[`26.5`, `/`\]                           |
| `\n \n CONCLUSION` | \[`\n \n`, `CONCLUSION`\] | \[`\n`, `\n`, `CONCLUSION`\]              |
| `l'artère`         | \[`l'`, `artère`\]        | \[`l'`, `artère`\] (same)                 |
| `Dr. Pichon`       | \[`Dr`, `.`, `Pichon`\]   | \[`Dr.`, `Pichon`\]                       |
| `B.H.HP.A7A0`      | \[`B.H.HP.A7A0`\]         | \[`B.`, `H.`, `HP.`, `A`, `7`, `A`, `0`\] |

To instantiate either language, call `edsnlp.blank` with the corresponding language code.

=== "EDSLanguage"

    ```python
    import edsnlp

    # Blank pipeline that uses the EDS tokenizer
    nlp = edsnlp.blank("eds")
    ```

=== "FrenchLanguage"

    ```python
    import edsnlp

    # Blank pipeline that uses the standard spaCy French tokenizer
    nlp = edsnlp.blank("fr")
    ```
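
The comparison table and the speed difference mentioned above can be checked on your own text. The following is a minimal sketch rather than part of EDS-NLP: the helper names `token_texts`, `compare_tokenizers`, `mean_seconds_per_doc` and `demo` are hypothetical, and the code assumes only that the object returned by `edsnlp.blank` can be called on a string and yields tokens exposing a `.text` attribute, as spaCy `Doc` objects do.

```python
from time import perf_counter


def token_texts(nlp, text):
    """Return the token strings that the pipeline `nlp` produces for `text`."""
    return [token.text for token in nlp(text)]


def compare_tokenizers(pipelines, examples):
    """Map each example to {pipeline name: list of token strings}."""
    return {
        text: {name: token_texts(nlp, text) for name, nlp in pipelines.items()}
        for text in examples
    }


def mean_seconds_per_doc(nlp, texts, repeats=5):
    """Average wall-clock time to build one document over `repeats` passes."""
    start = perf_counter()
    for _ in range(repeats):
        for text in texts:
            nlp(text)
    return (perf_counter() - start) / (repeats * len(texts))


def demo():
    # Hypothetical driver: imported lazily so the helpers above stay usable
    # in environments where EDS-NLP is not installed.
    import edsnlp

    pipelines = {"fr": edsnlp.blank("fr"), "eds": edsnlp.blank("eds")}
    examples = ["ACR5", "26.5/", "l'artère", "Dr. Pichon"]

    # Side-by-side token lists for each example
    for text, tokens in compare_tokenizers(pipelines, examples).items():
        print(repr(text), tokens)

    # Rough per-document timing; results depend on the machine and the texts
    for name, nlp in pipelines.items():
        print(name, mean_seconds_per_doc(nlp, examples * 100))
```

Calling `demo()` in an environment with EDS-NLP installed prints the token lists for both languages, followed by a rough per-document timing for each. Short strings like these are not representative of full clinical notes, so treat the timing as indicative only and measure on your own documents.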