# Pandas

??? abstract "TLDR"

    ```{ .python .no-check }
    import edsnlp

    stream = edsnlp.data.from_pandas(df, converter="omop")
    stream = stream.map_pipeline(nlp)
    res = stream.to_pandas(converter="omop")
    # or equivalently
    edsnlp.data.to_pandas(stream, converter="omop")
    ```

We provide methods to read and write documents (raw or annotated) from and to Pandas DataFrames.

As an example, imagine that we have the following OMOP dataframe (we'll name it `note_df`):

| note_id | note_text                                     | note_datetime |
|--------:|:----------------------------------------------|:--------------|
|       0 | Le patient est admis pour une pneumopathie... | 2021-10-23    |

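To follow along, `note_df` can be built by hand. The snippet below is a minimal sketch that reproduces the single row shown above. The text is kept truncated as in the table, and in practice the dataframe would come from your own OMOP `note` table.

```python
import pandas as pd

# Toy stand-in for an OMOP `note` table, mirroring the row shown above.
# The text is truncated exactly as in the table; real notes hold the full text.
note_df = pd.DataFrame(
    {
        "note_id": [0],
        "note_text": ["Le patient est admis pour une pneumopathie..."],
        "note_datetime": pd.to_datetime(["2021-10-23"]),
    }
)

print(note_df.dtypes)
```
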
## Reading from a Pandas DataFrame {: #edsnlp.data.pandas.from_pandas }

::: edsnlp.data.pandas.from_pandas
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false


## Writing to a Pandas DataFrame {: #edsnlp.data.pandas.to_pandas }

::: edsnlp.data.pandas.to_pandas
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false


## Importing entities from a Pandas DataFrame

If you have a dataframe with entities (e.g., `note_nlp` in OMOP), you must join it with the dataframe containing the raw text (e.g., `note` in OMOP) to obtain a single dataframe with the entities next to the raw text. For instance, suppose we have the following `note_nlp` dataframe, which we will name `note_nlp_df`:

| note_nlp_id | note_id | start_char | end_char | note_nlp_source_value | lexical_variant |
|------------:|--------:|-----------:|---------:|:----------------------|:----------------|
|           0 |       0 |         46 |       57 | disease               | coronavirus     |
|           1 |       0 |         77 |       88 | drug                  | paracétamol     |
|         ... |     ... |        ... |      ... | ...                   | ...             |

```{ .python .no-check }
import pandas as pd

df = (
    note_df
    .set_index("note_id")
    .join(
        # Collect the entities of each note into a list of dicts (one per entity)
        note_nlp_df
        .set_index("note_id")
        .groupby(level=0)
        .apply(pd.DataFrame.to_dict, orient="records")
        .rename("entities")
    )
).reset_index()
```

This yields a single dataframe in which each note carries its entities:

| note_id | note_text     | note_datetime | entities                                           |
|--------:|---------------|---------------|---------------------------------------------------:|
|       0 | Le patient... | 2021-10-23    | `[{"note_nlp_id": 0, "start_char": 46, ...}, ...]` |
|     ... | ...           | ...           | ...                                                |
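
The join can be checked end to end on toy data. The sketch below is an illustration under stated assumptions, not the library's own code: both dataframes are built by hand, a hypothetical second note without entities is added, and the grouping is written as an explicit loop instead of `groupby(...).apply(...)` so that each step is visible.

```python
import pandas as pd

# Hand-built stand-ins for the OMOP `note` and `note_nlp` tables.
# Note 1 is hypothetical and has no entities.
note_df = pd.DataFrame(
    {
        "note_id": [0, 1],
        "note_text": ["Le patient est admis pour une pneumopathie...", "..."],
        "note_datetime": ["2021-10-23", "2021-10-24"],
    }
)
note_nlp_df = pd.DataFrame(
    {
        "note_nlp_id": [0, 1],
        "note_id": [0, 0],
        "start_char": [46, 77],
        "end_char": [57, 88],
        "note_nlp_source_value": ["disease", "drug"],
        "lexical_variant": ["coronavirus", "paracétamol"],
    }
)

# One list of entity records per note, keyed by note_id.
entities_by_note = {
    note_id: group.drop(columns="note_id").to_dict(orient="records")
    for note_id, group in note_nlp_df.groupby("note_id")
}

# Attach the lists to the notes; notes without entities get an empty list.
df = note_df.copy()
df["entities"] = [entities_by_note.get(note_id, []) for note_id in df["note_id"]]

print(df.loc[0, "entities"][0]["lexical_variant"])  # coronavirus
```

Using an empty list for notes without entities is a design choice of this sketch. With a plain left `join`, such notes would instead get a missing value in the `entities` column, which downstream code may need to handle.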