# Tokenizers

In addition to the standard spaCy `FrenchLanguage` (`fr`), EDS-NLP offers a language better suited to French clinical documents: `EDSLanguage` (`eds`). Document creation with `EDSLanguage` should also be around 5-6 times faster than with the `fr` language. The main differences lie in the tokenization process.

The table below compares the two tokenization methods:

| Example            | FrenchLanguage            | EDSLanguage                               |
|--------------------|---------------------------|-------------------------------------------|
| `ACR5`             | \[`ACR5`\]                | \[`ACR`, `5`\]                            |
| `26.5/`            | \[`26.5/`\]               | \[`26.5`, `/`\]                           |
| `\n \n CONCLUSION` | \[`\n \n`, `CONCLUSION`\] | \[`\n`, `\n`, `CONCLUSION`\]              |
| `l'artère`         | \[`l'`, `artère`\]        | \[`l'`, `artère`\] (same)                 |
| `Dr. Pichon`       | \[`Dr`, `.`, `Pichon`\]   | \[`Dr.`, `Pichon`\]                       |
| `B.H.HP.A.7.A`     | \[`B.H.HP.A.7.A`\]        | \[`B.`, `H.`, `HP.`, `A`, `7`, `A`, `0`\] |

To instantiate either language, call `edsnlp.blank` with the corresponding language code.

=== "EDSLanguage"

    ```python
    import edsnlp

    nlp = edsnlp.blank("eds")
    ```

=== "FrenchLanguage"

    ```python
    import edsnlp

    nlp = edsnlp.blank("fr")
    ```
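
To check how each language tokenizes your own text, a minimal sketch along the following lines may help. The helpers `token_texts`, `compare_tokenizers` and `main` are hypothetical names introduced for illustration and are not part of EDS-NLP; the only library call used is `edsnlp.blank`, as in the snippets above. Exact tokens may vary between EDS-NLP and spaCy versions, so no output is shown.

```python
def token_texts(nlp, text):
    """Return the token strings that the pipeline `nlp` produces for `text`."""
    return [token.text for token in nlp(text)]


def compare_tokenizers(pipelines, examples):
    """Map each example to the tokens produced by every named pipeline."""
    return {
        text: {name: token_texts(nlp, text) for name, nlp in pipelines.items()}
        for text in examples
    }


def main():
    # Imported lazily so the helpers above do not depend on EDS-NLP.
    import edsnlp

    pipelines = {"fr": edsnlp.blank("fr"), "eds": edsnlp.blank("eds")}
    examples = ["ACR5", "26.5/", "Dr. Pichon"]
    for text, tokens in compare_tokenizers(pipelines, examples).items():
        print(repr(text), tokens)
```

Calling `main()` in an environment where EDS-NLP is installed prints, for each example, the tokens produced by `fr` and `eds` side by side.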