Diff of /docs/tokenizers.md [000000] .. [cad161]

--- a
+++ b/docs/tokenizers.md
@@ -0,0 +1,35 @@
+# Tokenizers
+
+
+In addition to the standard spaCy `FrenchLanguage` (`fr`), EDS-NLP offers a language better suited
+to French clinical documents: `EDSLanguage` (`eds`). The main differences lie in the tokenization process.
+Creating a document with `EDSLanguage` should also be around 5-6 times faster than with the `fr` language.
+
+The table below compares the two tokenization methods:
+
+| Example            | FrenchLanguage            | EDSLanguage                               |
+|--------------------|---------------------------|-------------------------------------------|
+| `ACR5`             | \[`ACR5`\]                | \[`ACR`, `5`\]                            |
+| `26.5/`            | \[`26.5/`\]               | \[`26.5`, `/`\]                           |
+| `\n \n CONCLUSION` | \[`\n \n`, `CONCLUSION`\] | \[`\n`, `\n`, `CONCLUSION`\]              |
+| `l'artère`         | \[`l'`, `artère`\]        | \[`l'`, `artère`\] (same)                 |
+| `Dr. Pichon`       | \[`Dr`, `.`, `Pichon`\]   | \[`Dr.`, `Pichon`\]                       |
+| `B.H.HP.A.7.A`     | \[`B.H.HP.A.7.A`\]        | \[`B.`, `H.`, `HP.`, `A`, `7`, `A`, `0`\] |
+
+To instantiate one of the two languages, call the `edsnlp.blank` function with the language code.
+
+=== "EDSLanguage"
+
+    ```python
+    import edsnlp
+
+    nlp = edsnlp.blank("eds")
+    ```
+
+=== "FrenchLanguage"
+
+    ```python
+    import edsnlp
+
+    nlp = edsnlp.blank("fr")
+    ```
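
The kind of splitting shown in the first three rows of the comparison table can be illustrated with a small standalone sketch. This is a toy regular expression written for illustration only, not the EDS-NLP tokenizer: the names `TOKEN_RE` and `toy_tokenize` are hypothetical, and the pattern does not reproduce the remaining rows (elisions such as `l'`, abbreviations such as `Dr.`, or dotted codes).

```python
import re

# Toy pattern (an assumption for illustration, NOT the EDS-NLP implementation).
# Alternatives, tried in order:
#   1. a number with an optional decimal part, e.g. "26.5"
#   2. a run of letters, e.g. "ACR" or "CONCLUSION"
#   3. a single newline, so consecutive newlines stay separate tokens
#   4. any other single non-space, non-word character, e.g. "/"
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[^\W\d_]+|\n|[^\w\s]")


def toy_tokenize(text):
    """Return the list of toy tokens found in `text` (plain spaces are dropped)."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for example in ["ACR5", "26.5/", "\n \n CONCLUSION"]:
        print(repr(example), "->", toy_tokenize(example))
```

Separating letter runs from digit runs and keeping each newline as its own token is what makes patterns such as section titles or scores easier to match downstream; the real tokenizer handles many more cases than this sketch.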