# Train and Evaluate Models

In this guide, we show how to train and evaluate your own models. We assume that you created a corpus with the appropriate [data format](01_data_format.md) in `data/corpus/<corpus_name>`.

## Model Training Scripts

All training scripts are located under `deidentify/methods/` and are prefixed with `run_`. For example, use `deidentify/methods/bilstmcrf/run_bilstmcrf.py` to train a BiLSTM-CRF model.

Each script takes a set of arguments that you can print as follows:

```sh
python deidentify/methods/bilstmcrf/run_bilstmcrf.py --help
```

Below is a list of available scripts:

```sh
> tree -P run_*.py -I __* deidentify/methods/
deidentify/methods/
├── bilstmcrf
│   ├── run_bilstmcrf.py                    # Train a BiLSTM-CRF
│   └── run_bilstmcrf_training_sample.py    # Train a BiLSTM-CRF with a fraction of the training set
├── crf
│   ├── run_crf.py                          # Train a CRF model
│   ├── run_crf_hyperopt.py                 # Perform a random search for a CRF model
│   ├── run_crf_learning_curve.py           # Print a learning curve for a CRF model
│   └── run_crf_training_sample.py          # Train a CRF with a fraction of the training set
└── deduce
    └── run_deduce.py                       # Run the DEDUCE tagger on your dataset
```
|
|
All scripts save their predictions and model artifacts (e.g., pickle files, training logs) to `output/predictions/<corpus_name>/<script>_<run_id>/`. This allows you to evaluate the predictions at a later stage.
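
For example, for the `demo_run` BiLSTM-CRF run used in the example below, you could list a run's artifacts like this (a minimal sketch; the exact file names depend on the script):

```python
from pathlib import Path

# Run directory following the <corpus_name>/<script>_<run_id> convention;
# "ons" and "bilstmcrf_demo_run" are the example names used in this guide.
run_dir = Path('output/predictions/ons/bilstmcrf_demo_run')

# Print every artifact the run produced (predictions, pickles, training logs).
for path in sorted(run_dir.rglob('*')):
    print(path.relative_to(run_dir))
```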
|
|
## Example: Train and Evaluate a BiLSTM-CRF Model

Execute the command below to run the BiLSTM-CRF pipeline on the corpus `ons` (a.k.a. NUT) with run ID `demo_run`:

```sh
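# Flag semantics (inferred from the flag names; run with --help to confirm):
#   --pooled_contextual_embeddings  use pooled contextual string embeddings
#   --train_with_dev                also use the dev split for training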
python deidentify/methods/bilstmcrf/run_bilstmcrf.py ons demo_run \
    --pooled_contextual_embeddings \
    --train_with_dev
```
|
|
The script saves the train/dev/test set predictions to `output/predictions/ons/bilstmcrf_demo_run`. We can use the script below to evaluate a single run:
|
|
```sh
python deidentify/evaluation/evaluate_run.py nl data/corpus/ons/test/ data/corpus/ons/test/ output/predictions/ons/bilstmcrf_demo_run/test/
```
|
|
It should print an evaluation report at the entity level, token level, and blind token level for each PHI tag. Example:
|
|
```
> python deidentify/evaluation/evaluate_run.py nl data/corpus/ons/test/ data/corpus/ons/test/ output/predictions/ons/bilstmcrf_demo_run/test/

entity level          tp: 3168 - fp: 288 - fn: 469 - tn: 0 - precision: 0.9167 - recall: 0.8710 - accuracy: 0.8071 - f1-score: 0.8933
Address               tp: 132 - fp: 26 - fn: 24 - tn: 0 - precision: 0.8354 - recall: 0.8462 - accuracy: 0.7253 - f1-score: 0.8408
Age                   tp: 30 - fp: 8 - fn: 11 - tn: 0 - precision: 0.7895 - recall: 0.7317 - accuracy: 0.6122 - f1-score: 0.7595
Care_Institute        tp: 142 - fp: 65 - fn: 74 - tn: 0 - precision: 0.6860 - recall: 0.6574 - accuracy: 0.5053 - f1-score: 0.6714
Date                  tp: 739 - fp: 59 - fn: 64 - tn: 0 - precision: 0.9261 - recall: 0.9203 - accuracy: 0.8573 - f1-score: 0.9232
Email                 tp: 10 - fp: 1 - fn: 0 - tn: 0 - precision: 0.9091 - recall: 1.0000 - accuracy: 0.9091 - f1-score: 0.9524
Hospital              tp: 7 - fp: 2 - fn: 3 - tn: 0 - precision: 0.7778 - recall: 0.7000 - accuracy: 0.5833 - f1-score: 0.7369
ID                    tp: 12 - fp: 3 - fn: 13 - tn: 0 - precision: 0.8000 - recall: 0.4800 - accuracy: 0.4286 - f1-score: 0.6000
Initials              tp: 111 - fp: 23 - fn: 67 - tn: 0 - precision: 0.8284 - recall: 0.6236 - accuracy: 0.5522 - f1-score: 0.7116
Internal_Location     tp: 28 - fp: 10 - fn: 27 - tn: 0 - precision: 0.7368 - recall: 0.5091 - accuracy: 0.4308 - f1-score: 0.6021
Name                  tp: 1856 - fp: 67 - fn: 85 - tn: 0 - precision: 0.9652 - recall: 0.9562 - accuracy: 0.9243 - f1-score: 0.9607
Organization_Company  tp: 71 - fp: 20 - fn: 65 - tn: 0 - precision: 0.7802 - recall: 0.5221 - accuracy: 0.4551 - f1-score: 0.6256
Other                 tp: 0 - fp: 0 - fn: 4 - tn: 0 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
Phone_fax             tp: 16 - fp: 2 - fn: 0 - tn: 0 - precision: 0.8889 - recall: 1.0000 - accuracy: 0.8889 - f1-score: 0.9412
Profession            tp: 11 - fp: 1 - fn: 31 - tn: 0 - precision: 0.9167 - recall: 0.2619 - accuracy: 0.2558 - f1-score: 0.4074
SSN                   tp: 0 - fp: 1 - fn: 0 - tn: 0 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
URL_IP                tp: 3 - fp: 0 - fn: 1 - tn: 0 - precision: 1.0000 - recall: 0.7500 - accuracy: 0.7500 - f1-score: 0.8571

token level           tp: 4894 - fp: 308 - fn: 500 - tn: 1810993 - precision: 0.9408 - recall: 0.9073 - accuracy: 0.8583 - f1-score: 0.9237
Address               tp: 217 - fp: 22 - fn: 29 - tn: 120845 - precision: 0.9079 - recall: 0.8821 - accuracy: 0.8097 - f1-score: 0.8948
Age                   tp: 48 - fp: 11 - fn: 13 - tn: 121041 - precision: 0.8136 - recall: 0.7869 - accuracy: 0.6667 - f1-score: 0.8000
Care_Institute        tp: 266 - fp: 78 - fn: 81 - tn: 120688 - precision: 0.7733 - recall: 0.7666 - accuracy: 0.6259 - f1-score: 0.7699
Date                  tp: 1835 - fp: 66 - fn: 36 - tn: 119176 - precision: 0.9653 - recall: 0.9808 - accuracy: 0.9473 - f1-score: 0.9730
Email                 tp: 10 - fp: 1 - fn: 0 - tn: 121102 - precision: 0.9091 - recall: 1.0000 - accuracy: 0.9091 - f1-score: 0.9524
Hospital              tp: 11 - fp: 3 - fn: 3 - tn: 121096 - precision: 0.7857 - recall: 0.7857 - accuracy: 0.6471 - f1-score: 0.7857
ID                    tp: 12 - fp: 3 - fn: 12 - tn: 121086 - precision: 0.8000 - recall: 0.5000 - accuracy: 0.4444 - f1-score: 0.6154
Initials              tp: 113 - fp: 21 - fn: 72 - tn: 120907 - precision: 0.8433 - recall: 0.6108 - accuracy: 0.5485 - f1-score: 0.7085
Internal_Location     tp: 47 - fp: 11 - fn: 45 - tn: 121010 - precision: 0.8103 - recall: 0.5109 - accuracy: 0.4563 - f1-score: 0.6267
Name                  tp: 2135 - fp: 60 - fn: 80 - tn: 118838 - precision: 0.9727 - recall: 0.9639 - accuracy: 0.9385 - f1-score: 0.9683
Organization_Company  tp: 119 - fp: 27 - fn: 89 - tn: 120878 - precision: 0.8151 - recall: 0.5721 - accuracy: 0.5064 - f1-score: 0.6723
Other                 tp: 0 - fp: 0 - fn: 5 - tn: 121108 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
Phone_fax             tp: 38 - fp: 2 - fn: 0 - tn: 121073 - precision: 0.9500 - recall: 1.0000 - accuracy: 0.9500 - f1-score: 0.9744
Profession            tp: 40 - fp: 3 - fn: 34 - tn: 121036 - precision: 0.9302 - recall: 0.5405 - accuracy: 0.5195 - f1-score: 0.6837
URL_IP                tp: 3 - fp: 0 - fn: 1 - tn: 121109 - precision: 1.0000 - recall: 0.7500 - accuracy: 0.7500 - f1-score: 0.8571

token (blind)         tp: 5016 - fp: 187 - fn: 379 - tn: 115532 - precision: 0.9641 - recall: 0.9297 - accuracy: 0.8986 - f1-score: 0.9466
ENT                   tp: 5016 - fp: 187 - fn: 379 - tn: 115532 - precision: 0.9641 - recall: 0.9297 - accuracy: 0.8986 - f1-score: 0.9466
```
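
The derived measures on each line follow from the raw `tp`/`fp`/`fn` counts: precision is tp/(tp+fp), recall is tp/(tp+fn), f1 is their harmonic mean, and the accuracy column corresponds to tp/(tp+fp+fn), so true negatives are not used. A quick sanity check against the entity-level totals above:

```python
# Recompute the entity-level summary line from its raw counts.
tp, fp, fn = 3168, 288, 469

precision = tp / (tp + fp)      # 3168 / 3456 = 0.9167
recall = tp / (tp + fn)         # 3168 / 3637 = 0.8710
accuracy = tp / (tp + fp + fn)  # 3168 / 3925 = 0.8071
f1 = 2 * precision * recall / (precision + recall)  # 0.8933

print(f'precision: {precision:.4f} - recall: {recall:.4f} - '
      f'accuracy: {accuracy:.4f} - f1-score: {f1:.4f}')
```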
|
|
You can use the `evaluate_corpus.py` script to evaluate all runs for a given corpus. The script produces a CSV file with the evaluation measures for each corpus part (i.e., train/dev/test), which you can use for further analysis.
|
|
```sh
> python deidentify/evaluation/evaluate_corpus.py <corpus_name> <language>
[...]
> tree output/evaluation/<corpus_name>
output/evaluation/<corpus_name>
├── summary_dev.csv
├── summary_test.csv
└── summary_train.csv
```
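
The summaries are plain CSV files, so you can load them with any tabular tool for the follow-up analysis, e.g. with pandas (a minimal sketch; the exact columns are whatever `evaluate_corpus.py` writes):

```python
import pandas as pd

# Load the test-set summary for the example corpus "ons".
# Column names are not assumed here; inspect them first.
df = pd.read_csv('output/evaluation/ons/summary_test.csv')
print(df.columns.tolist())
print(df.head())
```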