# Converters {: #converters }

Data can be read from and written to various sources, such as JSON/BRAT/CSV files or dataframes, which expect a key-value representation rather than Doc objects.

For that purpose, we document here a set of converters that can be used to convert between these representations and Doc objects.

Converters can be configured in the `from_*` (or `read_*` in the case of files) and `to_*` (or `write_*` in the case of files) methods, depending on the chosen `converter` argument, which can be:

- a function, in which case it will be interpreted as a custom converter
- a string, in which case it will be interpreted as the name of a pre-defined converter

## No converter (`converter=None`) {: #none }

Except in `read_standoff` and `write_standoff`, the default converter is `None`. When `converter=None`, readers output the raw content of the input data (most often dictionaries) and writers expect dictionaries. This can be useful if you plan to use Streams without converting the data to Doc objects, for instance to parallelize the execution of a function on raw JSON samples, Parquet files or simple lists.

```python
import edsnlp.data


def complex_func(n):
    return n * n


stream = edsnlp.data.from_iterable(range(20))
stream = stream.map(complex_func)
stream = stream.set_processing(num_cpu_workers=2)
res = list(stream)
```

## Custom converter {: #custom }

You can always define your own converter functions to convert between your data and Doc objects.

### Reading from a custom schema

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Doc
from edsnlp.data.converters import get_current_tokenizer
from typing import Dict


def convert_row_to_dict(row: Dict) -> Doc:
    # The tokenizer will be inferred from the pipeline
    doc = get_current_tokenizer()(row["custom_content"])
    doc._.note_id = row["custom_id"]
    doc._.note_datetime = row["custom_datetime"]
    # ...
    return doc


nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.covid())

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    # The dataframe to read from
    dataframe,
    # How to convert each row to a Doc object
    converter=convert_row_to_dict,
)
docs = docs.map_pipeline(nlp)
```

### Writing to a custom schema

```{ .python .no-check }
def convert_doc_to_row(doc: Doc) -> Dict:
    return {
        "custom_id": doc._.note_id,
        "custom_content": doc.text,
        "custom_datetime": doc._.note_datetime,
        # ...
    }


# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
docs.write_parquet(
    "path/to/output_folder",
    # How to convert Doc objects to JSON-like samples
    converter=convert_doc_to_row,
)
```

!!! note "One row per entity"

    This function can also return a list of dicts, for instance one dict per detected entity, which will be treated as multiple rows by dataframe writers (e.g., `to_pandas`, `to_spark`, `write_parquet`).

    ```{ .python .no-check }
    from typing import List

    def convert_ents_to_rows(doc: Doc) -> List[Dict]:
        return [
            {
                "note_id": doc._.note_id,
                "ent_text": ent.text,
                "ent_label": ent.label_,
                "custom_datetime": doc._.note_datetime,
                # ...
            }
            for ent in doc.ents
        ]


    docs.write_parquet(
        "path/to/output_folder",
        # How to convert entities of Doc objects to JSON-like samples
        converter=convert_ents_to_rows,
    )
    ```

## OMOP (`converter="omop"`) {: #omop }

OMOP is a schema used in the medical domain, based on the [OMOP Common Data Model](https://ohdsi.github.io/CommonDataModel/). We are mainly interested in the `note` table, which contains the clinical notes, and we deviate from the original schema by adding an *optional* `entities` column that can be computed from the `note_nlp` table.

Therefore, a complete OMOP-style document would look like this:

```{ .json }
{
  "note_id": 0,
  "note_text": "Le patient ...",
  "entities": [
    {
      "note_nlp_id": 0,
      "start_char": 3,
      "end_char": 10,
      "lexical_variant": "patient",
      "note_nlp_source_value": "person",

      # optional fields
      "negated": False,
      "certainty": "probable",
      ...
    },
    ...
  ],

  # optional fields
  "custom_doc_field": "..."
  ...
}
```

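
As a quick sanity check of the offsets in this schema, here is a plain-Python sketch (no EDS-NLP involved; the record below is a made-up example) showing that `start_char`/`end_char` are character offsets into `note_text`, so slicing recovers each entity's lexical variant:

```python
# A made-up OMOP-style record following the schema above
record = {
    "note_id": 0,
    "note_text": "Le patient est malade.",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
        }
    ],
}

# Slicing note_text with the character offsets should
# yield exactly the stored lexical variant
for entity in record["entities"]:
    snippet = record["note_text"][entity["start_char"]:entity["end_char"]]
    assert snippet == entity["lexical_variant"]
```
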
### Converting OMOP data to Doc objects {: #edsnlp.data.converters.OmopDict2DocConverter }

::: edsnlp.data.converters.OmopDict2DocConverter
    options:
        heading_level: 4
        show_source: false

### Converting Doc objects to OMOP data {: #edsnlp.data.converters.OmopDoc2DictConverter }

::: edsnlp.data.converters.OmopDoc2DictConverter
    options:
        heading_level: 4
        show_source: false

## Standoff (`converter="standoff"`) {: #standoff }

Standoff refers mostly to the [BRAT standoff format](https://brat.nlplab.org/standoff.html), but that format does not specify how the annotations should be stored in a JSON-like schema. We use the following schema:

```{ .json }
{
  "doc_id": 0,
  "text": "Le patient ...",
  "entities": [
    {
      "entity_id": 0,
      "label": "drug",
      "fragments": [{
        "start": 0,
        "end": 10
      }],
      "attributes": {
        "negated": True,
        "certainty": "probable"
      }
    },
    ...
  ]
}
```

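
To make the fragment offsets concrete, here is a small plain-Python illustration (a made-up record, unrelated to EDS-NLP's internal converter): entities may span several fragments (e.g., discontinuous mentions), and each fragment resolves to a slice of `text`:

```python
# A made-up standoff-style record following the schema above
annotated = {
    "doc_id": 0,
    "text": "Le patient prend du paracetamol.",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{"start": 20, "end": 31}],
            "attributes": {"negated": False, "certainty": "probable"},
        }
    ],
}

# Resolve each fragment of each entity to its surface text
for entity in annotated["entities"]:
    pieces = [annotated["text"][f["start"]:f["end"]] for f in entity["fragments"]]
    print(entity["label"], pieces)  # prints: drug ['paracetamol']
```
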
### Converting Standoff data to Doc objects {: #edsnlp.data.converters.StandoffDict2DocConverter }

::: edsnlp.data.converters.StandoffDict2DocConverter
    options:
        heading_level: 4
        show_source: false

### Converting Doc objects to Standoff data {: #edsnlp.data.converters.StandoffDoc2DictConverter }

::: edsnlp.data.converters.StandoffDoc2DictConverter
    options:
        heading_level: 4
        show_source: false

## Entities (`converter="ents"`) {: #edsnlp.data.converters.EntsDoc2DictConverter }

We also provide a simple one-way (export) converter that converts a Doc object into a list of dictionaries, one per entity, which can be used to write to a dataframe. The schema of each produced row is the following:

```{ .json }
{
    "note_id": 0,
    "start": 3,
    "end": 10,
    "label": "drug",
    "lexical_variant": "patient",

    # Optional fields
    "negated": False,
    "certainty": "probable",
    ...
}
```

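
For illustration, building such rows by hand from character offsets could look like the following plain-Python sketch (made-up data; the actual converter operates on Doc objects and their entity spans):

```python
# Made-up note text and entity character offsets (start, end, label)
note_text = "Le patient prend du paracetamol."
entities = [(20, 31, "drug")]

# One output row per entity, following the schema above
rows = [
    {
        "note_id": 0,
        "start": start,
        "end": end,
        "label": label,
        "lexical_variant": note_text[start:end],
    }
    for start, end, label in entities
]
```
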
::: edsnlp.data.converters.EntsDoc2DictConverter
    options:
        heading_level: 4
        show_source: false