Diff of /docs/data/parquet.md [000000] .. [cad161]

# Parquet

??? abstract "TLDR"

    ```{ .python .no-check }
    import edsnlp

    stream = edsnlp.data.read_parquet(path, converter="omop")
    stream = stream.map_pipeline(nlp)
    res = stream.write_parquet(path, converter="omop")
    # or equivalently
    edsnlp.data.write_parquet(stream, path, converter="omop")
    ```

We provide methods to read and write documents (raw or annotated) from and to Parquet files.

As an example, imagine that we have the following documents stored using the OMOP schema (Parquet files are binary and not human-readable; the records are shown as text here for the sake of the example):

```{ title="data.pq" }
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...
```

You can also store multiple Parquet files in a directory; the reader will read all of them.

## Reading Parquet files {: #edsnlp.data.parquet.read_parquet }

::: edsnlp.data.parquet.read_parquet
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false

## Writing Parquet files {: #edsnlp.data.parquet.write_parquet }

::: edsnlp.data.parquet.write_parquet
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false