# CoNLL

??? abstract "TLDR"

    ```{ .python .no-check }
    import edsnlp

    # `path` and `nlp` are placeholders for your CoNLL data and your pipeline
    stream = edsnlp.data.read_conll(path)
    # Apply the pipeline to every document of the stream
    stream = stream.map_pipeline(nlp)
    ```

EDS-NLP's CoNLL reader lets you load CoNLL-formatted files into your project.

There are many CoNLL formats, each tied to a different shared task. One of the most common is CoNLL-U, which is used for dependency parsing. In a CoNLL file, each line corresponds to a token and holds several columns of information about it, such as its index, form, lemma, POS tag, and dependency relation.

If your files do not follow the default CoNLL-U layout, you can pass the column names through the `columns` parameter. If `columns` is unset, the reader looks for a comment containing `# global.columns` to infer the column names. If no such comment is found, it falls back to the default CoNLL-U columns:

```
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
```
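The lookup order described above (explicit `columns`, then a `# global.columns` comment, then the defaults) can be sketched in plain Python. This is only an illustration: `resolve_columns` is a hypothetical helper, not part of EDS-NLP, and it assumes the comment uses the CoNLL-U Plus syntax `# global.columns = NAME1 NAME2 ...`.

```python
# Default CoNLL-U column names, as listed above.
DEFAULT_COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def resolve_columns(lines, columns=None):
    """Hypothetical helper: choose column names in the order described above."""
    # 1. An explicit `columns` argument always wins.
    if columns is not None:
        return list(columns)
    # 2. Otherwise, look for a `# global.columns = ...` comment
    #    (assumed syntax: whitespace-separated names after the "=" sign).
    for line in lines:
        line = line.strip()
        if line.startswith("#") and "global.columns" in line:
            names = line.partition("=")[2].split()
            if names:
                return names
    # 3. Fall back to the default CoNLL-U columns.
    return list(DEFAULT_COLUMNS)


header = ["# global.columns = ID FORM UPOS HEAD DEPREL"]
print(resolve_columns(header))  # ['ID', 'FORM', 'UPOS', 'HEAD', 'DEPREL']
```

The reader's actual implementation may differ. The sketch only makes the precedence between the three sources explicit.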

A typical CoNLL file looks like this:

```{ title="sample.conllu" }
1	euh	euh	INTJ	_	_	5	discourse	_	SpaceAfter=No
2	,	,	PUNCT	_	_	1	punct	_	_
3	il	lui	PRON	_	Gender=Masc|Number=Sing|Person=3|PronType=Prs	5	expl:subj	_	_
...
```
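To make the column layout concrete, here is a minimal sketch that splits the token lines of the sample above into dictionaries keyed by the default CoNLL-U column names. `parse_token_line` and `parse_feats` are hypothetical helpers written for this illustration. They are not EDS-NLP functions, and they assume tab-separated columns, as in CoNLL-U.

```python
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def parse_token_line(line, columns=COLUMNS):
    """Hypothetical helper: map one tab-separated token line to a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} columns, got {len(values)}")
    return dict(zip(columns, values))


def parse_feats(feats):
    """Turn a FEATS value such as 'Gender=Masc|Number=Sing' into a dict."""
    if feats == "_":  # "_" marks an empty field
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


# The three token lines of the sample file, rebuilt with explicit tabs.
sample = [
    "\t".join(["1", "euh", "euh", "INTJ", "_", "_", "5", "discourse", "_", "SpaceAfter=No"]),
    "\t".join(["2", ",", ",", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "\t".join(["3", "il", "lui", "PRON", "_",
               "Gender=Masc|Number=Sing|Person=3|PronType=Prs",
               "5", "expl:subj", "_", "_"]),
]

# Skip empty lines and comments, keep token lines only.
tokens = [parse_token_line(l) for l in sample if l and not l.startswith("#")]
print(tokens[0]["FORM"], tokens[0]["DEPREL"])  # euh discourse
print(parse_feats(tokens[2]["FEATS"])["Gender"])  # Masc
```

In practice `edsnlp.data.read_conll` handles this parsing for you. The sketch is only meant to show how each line maps onto the columns.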

## Reading CoNLL files {: #edsnlp.data.conll.read_conll }

::: edsnlp.data.conll.read_conll
    options:
        heading_level: 3
        show_source: false
        show_toc: false
        show_bases: false