Data can be read from and written to various sources, such as JSON/BRAT/CSV files or dataframes, which expect a key-value representation rather than Doc objects.
For that purpose, we document here a set of converters that can be used to convert between these representations and Doc objects.
Converters can be configured in the `from_*` (or `read_*` in the case of files) and `to_*` (or `write_*` in the case of files) methods, depending on the chosen `converter` argument, which can be one of the following:

## `converter=None` {: #none }

Except in `read_standoff` and `write_standoff`, the default converter is `None`. When `converter=None`, readers output the raw content of the input data (most often dictionaries) and writers expect dictionaries. This is useful if you plan to use Streams without converting to Doc objects, for instance to parallelize the execution of a function on raw JSON, Parquet files or simple lists.

```{ .python .no-check }
import edsnlp.data

def complex_func(n):
    return n * n

stream = edsnlp.data.from_iterable(range(20))
stream = stream.map(complex_func)
stream = stream.set_processing(num_cpu_workers=2)
res = list(stream)
```

You can always define your own converter functions to convert between your data and Doc objects.

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Doc
from edsnlp.data.converters import get_current_tokenizer
from typing import Dict


def convert_row_to_doc(row: Dict) -> Doc:
    # Tokenizer will be inferred from the pipeline
    doc = get_current_tokenizer()(row["custom_content"])
    doc._.note_id = row["custom_id"]
    doc._.note_datetime = row["custom_datetime"]
    # ...
    return doc


nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.covid())

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    # The input dataframe (assumed to be defined beforehand)
    dataframe,
    # How to convert JSON-like samples to Doc objects
    converter=convert_row_to_doc,
)
docs = docs.map_pipeline(nlp)


def convert_doc_to_row(doc: Doc) -> Dict:
    return {
        "custom_id": doc._.note_id,
        "custom_content": doc.text,
        "custom_datetime": doc._.note_datetime,
        # ...
    }


# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
docs.write_parquet(
    "path/to/output_folder",
    # How to convert Doc objects to JSON-like samples
    converter=convert_doc_to_row,
)
```

!!! note "One row per entity"

    This function can also return a list of dicts, for instance one dict per detected entity, that will be treated as multiple rows in dataframe writers (e.g., `to_pandas`, `to_spark`, `write_parquet`).

    ```{ .python .no-check }
    from typing import Dict, List

    def convert_ents_to_rows(doc: Doc) -> List[Dict]:
        return [
            {
                "note_id": doc._.note_id,
                "ent_text": ent.text,
                "ent_label": ent.label_,
                "custom_datetime": doc._.note_datetime,
                # ...
            }
            for ent in doc.ents
        ]

    docs.write_parquet(
        "path/to/output_folder",
        # How to convert entities of Doc objects to JSON-like samples
        converter=convert_ents_to_rows,
    )
    ```

## `converter="omop"` {: #omop }

OMOP is a schema used in the medical domain, based on the OMOP Common Data Model. We are mainly interested in the `note` table, which contains the clinical notes, and deviate from the original schema by adding an optional `entities` column that can be computed from the `note_nlp` table.

Therefore, a complete OMOP-style document would look like this:

```{ .python .no-check }
{
    "note_id": 0,
    "note_text": "Le patient ...",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
            # optional fields
            "negated": False,
            "certainty": "probable",
            ...
        },
        ...
    ],
    # optional fields
    "custom_doc_field": "...",
    ...
}
```
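
As an illustration of how the optional `entities` column can be computed from the `note_nlp` table, here is a minimal pure-Python sketch. `attach_entities` is a hypothetical helper (not part of EDS-NLP), and the sample rows assume that `note_nlp` already carries `start_char` and `end_char` columns as in the schema above.

```python
from collections import defaultdict
from typing import Dict, List


def attach_entities(notes: List[Dict], note_nlp: List[Dict]) -> List[Dict]:
    """Nest `note_nlp` rows under their parent note as an `entities` list."""
    by_note = defaultdict(list)
    for row in note_nlp:
        # Keep every column except the join key inside the entity dict
        by_note[row["note_id"]].append(
            {key: value for key, value in row.items() if key != "note_id"}
        )
    # Notes without any note_nlp row get an empty `entities` list
    return [
        {**note, "entities": by_note.get(note["note_id"], [])} for note in notes
    ]


# Hypothetical sample rows mirroring the schema above
notes = [{"note_id": 0, "note_text": "Le patient ..."}]
note_nlp = [
    {
        "note_nlp_id": 0,
        "note_id": 0,
        "start_char": 3,
        "end_char": 10,
        "lexical_variant": "patient",
        "note_nlp_source_value": "person",
    }
]

omop_docs = attach_entities(notes, note_nlp)
```

Each resulting dictionary has the same shape as the OMOP-style document above; with real data, the same grouping could be done with a dataframe library instead.
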

::: edsnlp.data.converters.OmopDict2DocConverter
    options:
        heading_level: 4
        show_source: false

::: edsnlp.data.converters.OmopDoc2DictConverter
    options:
        heading_level: 4
        show_source: false

## `converter="standoff"` {: #standoff }

Standoff mostly refers to the BRAT standoff format, which does not specify how the annotations should be stored in a JSON-like schema. We use the following schema:

```{ .python .no-check }
{
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{
                "start": 0,
                "end": 10
            }],
            "attributes": {
                "negated": True,
                "certainty": "probable"
            }
        },
        ...
    ]
}
```
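
To relate this schema to BRAT's `.ann` files, the rough sketch below renders one such dictionary as standoff annotation lines (`T` lines for text-bound entities, `A` lines for attributes). `standoff_dict_to_ann` is a hypothetical helper written for illustration only; it is not the EDS-NLP writer, and `write_standoff` may differ in details such as identifiers and attribute formatting.

```python
from typing import Dict, List


def standoff_dict_to_ann(doc: Dict) -> List[str]:
    """Render a standoff-style dict as BRAT `.ann` lines (illustrative only)."""
    lines = []
    attr_id = 0
    for ent_id, ent in enumerate(doc["entities"], start=1):
        fragments = ent["fragments"]
        # BRAT separates the fragments of a discontinuous entity with ";"
        offsets = ";".join(f"{f['start']} {f['end']}" for f in fragments)
        mention = " ".join(doc["text"][f["start"]:f["end"]] for f in fragments)
        lines.append(f"T{ent_id}\t{ent['label']} {offsets}\t{mention}")
        for name, value in ent.get("attributes", {}).items():
            if value is False:
                # A false binary attribute is simply left out
                continue
            attr_id += 1
            # True becomes a binary attribute, other values are written out
            suffix = "" if value is True else f" {value}"
            lines.append(f"A{attr_id}\t{name} T{ent_id}{suffix}")
    return lines


example = {
    "doc_id": 0,
    "text": "Le patient ...",
    "entities": [
        {
            "entity_id": 0,
            "label": "drug",
            "fragments": [{"start": 0, "end": 10}],
            "attributes": {"negated": True, "certainty": "probable"},
        }
    ],
}

ann_lines = standoff_dict_to_ann(example)
print("\n".join(ann_lines))
```
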

::: edsnlp.data.converters.StandoffDict2DocConverter
    options:
        heading_level: 4
        show_source: false

::: edsnlp.data.converters.StandoffDoc2DictConverter
    options:
        heading_level: 4
        show_source: false

## `converter="ents"` {: #edsnlp.data.converters.EntsDoc2DictConverter }

We also provide a simple one-way (export) converter that turns a Doc into a list of dictionaries, one per entity, which can be used to write to a dataframe. Each produced row follows this schema:

```{ .python .no-check }
{
    "note_id": 0,
    "start": 3,
    "end": 10,
    "label": "drug",
    "lexical_variant": "patient",
    # Optional fields
    "negated": False,
    "certainty": "probable",
    ...
}
```
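
To make the one-row-per-entity layout concrete without running a pipeline, the sketch below flattens a plain OMOP-style dictionary (as described in the OMOP section) into rows of this shape. `omop_dict_to_ent_rows` is a hypothetical helper, and mapping `note_nlp_source_value` to `label` is an assumption made for this illustration; the actual converter operates on Doc objects.

```python
from typing import Dict, List

# Entity fields that are renamed or dropped when building a row
CORE_FIELDS = {
    "note_nlp_id",
    "start_char",
    "end_char",
    "lexical_variant",
    "note_nlp_source_value",
}


def omop_dict_to_ent_rows(doc: Dict) -> List[Dict]:
    """Flatten one OMOP-style dict into one row per entity (illustrative)."""
    rows = []
    for ent in doc.get("entities", []):
        row = {
            "note_id": doc["note_id"],
            "start": ent["start_char"],
            "end": ent["end_char"],
            "label": ent["note_nlp_source_value"],  # assumed mapping
            "lexical_variant": ent["lexical_variant"],
        }
        # Remaining entity fields (e.g. negated) become optional columns
        row.update({k: v for k, v in ent.items() if k not in CORE_FIELDS})
        rows.append(row)
    return rows


sample_doc = {
    "note_id": 0,
    "note_text": "Le patient ...",
    "entities": [
        {
            "note_nlp_id": 0,
            "start_char": 3,
            "end_char": 10,
            "lexical_variant": "patient",
            "note_nlp_source_value": "person",
            "negated": False,
        }
    ],
}

rows = omop_dict_to_ent_rows(sample_doc)
```

A list of such rows, concatenated over all documents, is what a dataframe writer would store as a table.
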

::: edsnlp.data.converters.EntsDoc2DictConverter
    options:
        heading_level: 4
        show_source: false