# Parquet

??? abstract "TLDR"

    ```{ .python .no-check }
    import edsnlp

    stream = edsnlp.data.read_parquet(path, converter="omop")
    stream = stream.map_pipeline(nlp)
    res = stream.to_parquet(path, converter="omop")
    # or equivalently
    edsnlp.data.to_parquet(stream, path, converter="omop")
    ```

We provide methods to read and write documents (raw or annotated) from and to parquet files.

As an example, imagine that we have the following documents stored using the OMOP schema (parquet files are not actually stored as human-readable text, but we show them this way for the sake of the example):

```{ title="data.pq" }
{ "note_id": 0, "note_text": "Le patient ...", "note_datetime": "2021-10-23", "entities": [...] }
{ "note_id": 1, "note_text": "Autre doc ...", "note_datetime": "2022-12-24", "entities": [] }
...
```

You could also have multiple parquet files in a directory; the reader will read them all.

## Reading Parquet files {: #edsnlp.data.parquet.read_parquet }

::: edsnlp.data.parquet.read_parquet
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false

## Writing Parquet files {: #edsnlp.data.parquet.write_parquet }

::: edsnlp.data.parquet.write_parquet
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false