a b/notebooks/sections/section-dataset.md
---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.11.4
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---

```python
%reload_ext autoreload
%autoreload 2
```

```python
import os

import pandas as pd

import context  # local helper module, presumably sets up the project import path
from edsnlp.utils.brat import BratConnector
```

# Sections dataset

This notebook reuses the [work done by Ivan Lerner at the EDS](https://gitlab.eds.aphp.fr/IvanL/section_dataset).

```python
# Directory holding the BRAT files of the section dataset
data_dir = '../../data/section_dataset/'
```

```python
brat = BratConnector(data_dir)
```

```python
# Load the raw texts and their BRAT annotations
texts, annotations = brat.get_brat()
```
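
For readers without access to `BratConnector`, the sketch below shows roughly what reading BRAT standoff annotations involves. It is not the edsnlp implementation: `parse_ann` is a hypothetical helper, the sample annotation text is invented for illustration, and only contiguous text-bound (`T`) lines are handled. The `lexical_variant` column used in this notebook presumably corresponds to the covered text.

```python
import io


def parse_ann(stream):
    """Parse text-bound annotations ('T' lines) from a BRAT .ann stream.

    Hypothetical helper for illustration; discontinuous spans are not supported.
    """
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, ...
        # Format: ID <TAB> LABEL START END <TAB> COVERED TEXT
        ann_id, info, text = line.split("\t", 2)
        label, start, end = info.split(" ")
        rows.append(
            {
                "id": ann_id,
                "label": label,
                "start": int(start),
                "end": int(end),
                "lexical_variant": text,
            }
        )
    return rows


# Invented sample, not taken from the dataset
sample = (
    "T1\tsection 0 11\tAntécédents\n"
    "R1\tlink Arg1:T1 Arg2:T2\n"
    "T2\tsection 40 53\tMode de vie :\n"
)
rows = parse_ann(io.StringIO(sample))
```

Each resulting dictionary could then be turned into one row of a DataFrame, which is the shape the rest of the notebook works with.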

```python
# Unique annotated strings (lexical variants)
df = annotations[['lexical_variant']].drop_duplicates()
```

```python
# Empty column, to be filled in by hand
df['section'] = ''
```

```python
df.to_csv('sections.tsv', sep='\t', index=False)
```

```python
annotated = pd.read_csv('sections.tsv', sep='\t')
```
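
One detail of this round trip is easy to miss: an empty string written to a TSV file comes back as `NaN` with the default `read_csv` settings. The sketch below illustrates this on an in-memory buffer with invented rows. Whether the notebook needs `keep_default_na=False` depends on how the file is annotated afterwards, so treat this as an illustration rather than a required change.

```python
import io

import pandas as pd

# Invented rows, standing in for the unique lexical variants
df = pd.DataFrame({"lexical_variant": ["Antécédents", "Conclusion"]})
df["section"] = ""

# Write to an in-memory buffer instead of 'sections.tsv'
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)

# Default behaviour: empty fields are parsed as NaN
buffer.seek(0)
default = pd.read_csv(buffer, sep="\t")

# With keep_default_na=False, empty fields stay empty strings
buffer.seek(0)
as_text = pd.read_csv(buffer, sep="\t", keep_default_na=False)
```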

```python
annotated.to_csv('annotated_sections.csv', index=False)
```

```python
# Load the manually annotated spreadsheet
annotated = pd.read_excel('sections.xlsx', sheet_name='Annotation', engine='openpyxl')
```

```python
annotated.columns = ['lexical_variant', 'section', 'keep', 'comment']
```

```python
# Rows with an empty `keep` cell default to 'Oui' (yes)
annotated['keep'] = annotated['keep'].fillna('Oui') == 'Oui'
```

```python
annotated = annotated.query('keep')[['lexical_variant', 'section']]
```
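
The effect of the `keep` handling can be checked on a toy frame. The rows below are invented, and the value `'Non'` is an assumption about what the spreadsheet contains for rejected rows; the notebook itself only tests for `'Oui'`.

```python
import pandas as pd

# Invented rows; None stands for a cell left empty in the spreadsheet
toy = pd.DataFrame(
    {
        "lexical_variant": ["Antécédents", "Conclusion", "Page 1"],
        "section": ["antécédents", "conclusion", None],
        "keep": [None, "Oui", "Non"],
    }
)

# Empty cells default to 'Oui'; anything other than 'Oui' becomes False
toy["keep"] = toy["keep"].fillna("Oui") == "Oui"

# Keep only the accepted rows and the two columns of interest
kept = toy.query("keep")[["lexical_variant", "section"]]
```

Any value other than `'Oui'` (including a typo such as `'oui'`) is treated as a rejection, which may be worth checking before filtering.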

```python
# Number of annotations per section
annotated.merge(annotations, on='lexical_variant').section.value_counts()
```

```python
annotated['lexical_variant'] = annotated['lexical_variant'].str.lower()
```

```python
annotated_unnaccented = annotated.copy()
```

```python
from unidecode import unidecode
```

```python
# Strip accents from the lexical variants
annotated_unnaccented['lexical_variant'] = annotated_unnaccented['lexical_variant'].apply(unidecode)
```
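
If `unidecode` is not available, accents can also be removed with the standard library. This is a different technique from the one used above (Unicode decomposition rather than transliteration): `strip_accents` is a hypothetical helper, and characters without a decomposition, such as `œ`, are left unchanged, whereas `unidecode` would transliterate them.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after NFKD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Invented examples of lowercased section titles
variants = ["antécédents", "évolution", "traitement à l'entrée"]
unaccented = [strip_accents(v) for v in variants]
```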

```python
# Keep only the unaccented variants
# annotated = pd.concat([annotated, annotated_unnaccented])
annotated = annotated_unnaccented
```

```python
annotated = annotated.drop_duplicates()
```

```python
annotated = annotated.sort_values(['lexical_variant', 'section'])
```

```python
annotated
```

```python
# Map each section name to its list of lexical variants
sections = {
    section.replace(' ', '_'): list(annotated.query('section == @section').lexical_variant)
    for section in annotated.section.unique()
}
```
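
The same mapping can be built in a single pass with `groupby`, instead of one `query` call per section. The sketch below uses invented rows and is only an alternative formulation. One difference to be aware of: `groupby` silently drops rows whose section is missing, whereas iterating over `unique()` would include the missing value and fail on `.replace`.

```python
import pandas as pd

# Invented rows, already lowercased and unaccented
toy = pd.DataFrame(
    {
        "lexical_variant": ["antecedents", "atcd", "mode de vie"],
        "section": ["antécédents", "antécédents", "mode de vie"],
    }
)

# One list of lexical variants per section, in order of first appearance
grouped = toy.groupby("section", sort=False)["lexical_variant"].apply(list)

# Same key convention as above: spaces replaced by underscores
sections = {name.replace(" ", "_"): variants for name, variants in grouped.items()}
```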

```python
# Print each section as a Python assignment
for k, v in sections.items():
    print(unidecode(k), '=', v)
    print()
```

```python
# Map each section name to an ASCII identifier
sections = {
    section: unidecode(section.replace(' ', '_'))
    for section in annotated.section.unique()
}
```

```python
for k, v in sections.items():
    print(f"{k!r}: {v},")
```