
# Converters {: #converters }

Data can be read from and written to various sources, like JSON/BRAT/CSV files or dataframes, which expect a key-value representation and not Doc objects.
For that purpose, we document here a set of converters that can be used to convert between these representations and Doc objects.

Converters can be configured in the `from_*` (or `read_*` in the case of files) and `to_*` (or `write_*` in the case of files) methods, via the `converter` argument, which can be:

- a function, in which case it will be interpreted as a custom converter
- a string, in which case it will be interpreted as the name of a pre-defined converter
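
For instance, with a hypothetical stream `docs` of annotated Doc objects, the two styles look like this (a minimal sketch):

```{ .python .no-check }
# Pre-defined converter, referenced by name
df = docs.to_pandas(converter="ents")

# Custom converter, passed as a function
df = docs.to_pandas(converter=lambda doc: {"note_id": doc._.note_id, "text": doc.text})
```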

## No converter (`converter=None`) {: #none }

Except in `read_standoff` and `write_standoff`, the default converter is `None`. When `converter=None`, readers output the raw content of the input data (most often dictionaries) and writers expect dictionaries. This can actually be useful if you plan to use Streams without converting to Doc objects, for instance to parallelize the execution of a function on raw JSON, Parquet files or simple lists.

```python
import edsnlp.data


def complex_func(n):
    return n * n


# Without a converter, the stream processes the raw items (here, integers)
stream = edsnlp.data.from_iterable(range(20))
stream = stream.map(complex_func)
# Parallelize the computation over 2 CPU workers
stream = stream.set_processing(num_cpu_workers=2)
res = list(stream)
```

## Custom converter {: #custom }

You can always define your own converter functions to convert between your data and Doc objects.

### Reading from a custom schema

```{ .python .no-check }
import edsnlp, edsnlp.pipes as eds
from spacy.tokens import Doc
from edsnlp.data.converters import get_current_tokenizer
from typing import Dict


def convert_row_to_doc(row: Dict) -> Doc:
    # The tokenizer will be inferred from the pipeline
    doc = get_current_tokenizer()(row["custom_content"])
    doc._.note_id = row["custom_id"]
    doc._.note_datetime = row["custom_datetime"]
    # ...
    return doc


nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.covid())

# Any kind of reader (`edsnlp.data.read/from_...`) can be used here
docs = edsnlp.data.from_pandas(
    # The dataframe to read from
    dataframe,
    # How to convert each row to a Doc object
    converter=convert_row_to_doc,
)
docs = docs.map_pipeline(nlp)
```

### Writing to a custom schema

```{ .python .no-check }
def convert_doc_to_row(doc: Doc) -> Dict:
    return {
        "custom_id": doc._.note_id,
        "custom_content": doc.text,
        "custom_datetime": doc._.note_datetime,
        # ...
    }


# Any kind of writer (`edsnlp.data.write/to_...`) can be used here
docs.write_parquet(
    "path/to/output_folder",
    # How to convert Doc objects to JSON-like samples
    converter=convert_doc_to_row,
)
```

!!! note "One row per entity"

    This function can also return a list of dicts, for instance one dict per detected entity, that will be treated as multiple rows in dataframe writers (e.g., `to_pandas`, `to_spark`, `write_parquet`).

    ```{ .python .no-check }
    from typing import List


    def convert_ents_to_rows(doc: Doc) -> List[Dict]:
        return [
            {
                "note_id": doc._.note_id,
                "ent_text": ent.text,
                "ent_label": ent.label_,
                "custom_datetime": doc._.note_datetime,
                # ...
            }
            for ent in doc.ents
        ]


    docs.write_parquet(
        "path/to/output_folder",
        # How to convert entities of Doc objects to JSON-like samples
        converter=convert_ents_to_rows,
    )
    ```

## OMOP (`converter="omop"`) {: #omop }

OMOP is a schema used in the medical domain, based on the OMOP Common Data Model. We are mainly interested in the `note` table, which contains
the clinical notes, and we deviate from the original schema by adding an optional `entities` column that can be computed from the `note_nlp` table.

Therefore, a complete OMOP-style document would look like this:

```{ .python .no-check }
{
  "note_id": 0,
  "note_text": "Le patient ...",
  "entities": [
    {
      "note_nlp_id": 0,
      "start_char": 3,
      "end_char": 10,
      "lexical_variant": "patient",
      "note_nlp_source_value": "person",

      # optional fields
      "negated": False,
      "certainty": "probable",
      ...
    },
    ...
  ],

  # optional fields
  "custom_doc_field": "...",
  ...
}
```
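
In practice, reading or writing OMOP-style data only requires naming the converter (a minimal sketch; the paths are hypothetical):

```{ .python .no-check }
import edsnlp

# Read OMOP-style rows and convert them to Doc objects
docs = edsnlp.data.read_json("path/to/notes.jsonl", converter="omop")

# ... apply a pipeline with docs.map_pipeline(nlp) ...

# Convert the annotated Doc objects back to OMOP-style rows
df = docs.to_pandas(converter="omop")
```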

### Converting OMOP data to Doc objects {: #edsnlp.data.converters.OmopDict2DocConverter }

::: edsnlp.data.converters.OmopDict2DocConverter
    options:
        heading_level: 4
        show_source: false

### Converting Doc objects to OMOP data {: #edsnlp.data.converters.OmopDoc2DictConverter }

::: edsnlp.data.converters.OmopDoc2DictConverter
    options:
        heading_level: 4
        show_source: false

## Standoff (`converter="standoff"`) {: #standoff }

Standoff refers mostly to the BRAT standoff format, but this format does not specify how
the annotations should be stored in a JSON-like schema. We use the following schema:

```{ .python .no-check }
{
  "doc_id": 0,
  "text": "Le patient ...",
  "entities": [
    {
      "entity_id": 0,
      "label": "drug",
      "fragments": [{
        "start": 0,
        "end": 10
      }],
      "attributes": {
        "negated": True,
        "certainty": "probable"
      }
    },
    ...
  ]
}
```
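
`read_standoff` and `write_standoff` already default to this converter, so reading a BRAT directory can be as simple as (a minimal sketch; the path is hypothetical):

```{ .python .no-check }
import edsnlp

# `converter="standoff"` is the default for standoff readers and writers
docs = edsnlp.data.read_standoff("path/to/brat/folder")
```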

### Converting Standoff data to Doc objects {: #edsnlp.data.converters.StandoffDict2DocConverter }

::: edsnlp.data.converters.StandoffDict2DocConverter
    options:
        heading_level: 4
        show_source: false

### Converting Doc objects to Standoff data {: #edsnlp.data.converters.StandoffDoc2DictConverter }

::: edsnlp.data.converters.StandoffDoc2DictConverter
    options:
        heading_level: 4
        show_source: false

## Entities (`converter="ents"`) {: #edsnlp.data.converters.EntsDoc2DictConverter }

We also provide a simple one-way (export) converter to convert Doc objects into a list of dictionaries,
one per entity, that can be used to write to a dataframe. The schema of each produced row is the following:

```{ .python .no-check }
{
    "note_id": 0,
    "start": 3,
    "end": 10,
    "label": "drug",
    "lexical_variant": "patient",

    # Optional fields
    "negated": False,
    "certainty": "probable",
    ...
}
```
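
For example, to build one dataframe row per entity from a stream of annotated documents (a minimal sketch; `docs` is assumed to be a stream of Doc objects):

```{ .python .no-check }
df = docs.to_pandas(converter="ents")
```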

::: edsnlp.data.converters.EntsDoc2DictConverter
    options:
        heading_level: 4
        show_source: false