# Train and Evaluate Models

In this guide, we show how to train and evaluate your own models. We assume that you have created a corpus in the appropriate data format under `data/corpus/<corpus_name>`.
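The commands in this guide assume the corpus is split into `train`, `dev`, and `test` parts (the evaluation step below reads `data/corpus/<corpus_name>/test/`). Below is a minimal sketch to verify that layout before training; the directory names are the only assumption:

```python
from pathlib import Path

corpus = Path("data/corpus/ons")  # replace "ons" with your <corpus_name>

# The training and evaluation scripts expect one directory per corpus part.
for part in ("train", "dev", "test"):
    part_dir = corpus / part
    if not part_dir.is_dir():
        raise FileNotFoundError(f"Missing corpus part: {part_dir}")
print("Corpus layout looks good.")
```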

## Model Training Scripts

All training scripts are located under `deidentify/methods/` and are prefixed with `run_`. For example, use `deidentify/methods/bilstmcrf/run_bilstmcrf.py` to train a BiLSTM-CRF model.

Each script takes a set of arguments that you can print as follows:

```sh
python deidentify/methods/bilstmcrf/run_bilstmcrf.py --help
```

Below is a list of available scripts:

```
> tree -P 'run_*.py' -I '__*' deidentify/methods/
deidentify/methods/
├── bilstmcrf
│   ├── run_bilstmcrf.py                    # Train a BiLSTM-CRF
│   └── run_bilstmcrf_training_sample.py    # Train a BiLSTM-CRF with a fraction of the training set
├── crf
│   ├── run_crf.py                          # Train a CRF model
│   ├── run_crf_hyperopt.py                 # Perform a random search for a CRF model
│   ├── run_crf_learning_curve.py           # Print a learning curve for a CRF model
│   └── run_crf_training_sample.py          # Train a CRF with a fraction of the training set
└── deduce
    └── run_deduce.py                       # Run the DEDUCE tagger on your dataset
```

All scripts save their predictions and model artifacts (e.g., pickle files, training logs) to `output/predictions/<corpus_name>/<script>_<run_id>/`. This allows you to evaluate the predictions at a later stage.
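Because each run gets its own directory, you can enumerate past runs programmatically. A minimal sketch, assuming only the `output/predictions/` layout described above:

```python
from pathlib import Path

predictions = Path("output/predictions")

# Each subdirectory of a corpus folder corresponds to one <script>_<run_id> run.
for corpus_dir in sorted(predictions.iterdir()):
    if not corpus_dir.is_dir():
        continue
    runs = sorted(d.name for d in corpus_dir.iterdir() if d.is_dir())
    print(f"{corpus_dir.name}: {runs}")
```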

## Example: Train and Evaluate a BiLSTM-CRF Model

Execute the command below to run the BiLSTM-CRF pipeline on the `ons` corpus (a.k.a. NUT) with run ID `demo_run`:

```sh
python deidentify/methods/bilstmcrf/run_bilstmcrf.py ons demo_run \
    --pooled_contextual_embeddings \
    --train_with_dev
```

The script saves the train/dev/test set predictions to `output/predictions/ons/bilstmcrf_demo_run/`. We can use the script below to evaluate a single run:

```sh
python deidentify/evaluation/evaluate_run.py nl \
    data/corpus/ons/test/ \
    data/corpus/ons/test/ \
    output/predictions/ons/bilstmcrf_demo_run/test/
```

It prints an evaluation report at the entity level, token level, and blind token level for each PHI tag. Example:

```
> python deidentify/evaluation/evaluate_run.py nl data/corpus/ons/test/ data/corpus/ons/test/ output/predictions/ons/bilstmcrf_demo_run/test/

entity level            tp: 3168  - fp: 288   - fn: 469   - tn: 0     - precision: 0.9167 - recall: 0.8710 - accuracy: 0.8071 - f1-score: 0.8933
Address                 tp: 132   - fp: 26    - fn: 24    - tn: 0     - precision: 0.8354 - recall: 0.8462 - accuracy: 0.7253 - f1-score: 0.8408
Age                     tp: 30    - fp: 8     - fn: 11    - tn: 0     - precision: 0.7895 - recall: 0.7317 - accuracy: 0.6122 - f1-score: 0.7595
Care_Institute          tp: 142   - fp: 65    - fn: 74    - tn: 0     - precision: 0.6860 - recall: 0.6574 - accuracy: 0.5053 - f1-score: 0.6714
Date                    tp: 739   - fp: 59    - fn: 64    - tn: 0     - precision: 0.9261 - recall: 0.9203 - accuracy: 0.8573 - f1-score: 0.9232
Email                   tp: 10    - fp: 1     - fn: 0     - tn: 0     - precision: 0.9091 - recall: 1.0000 - accuracy: 0.9091 - f1-score: 0.9524
Hospital                tp: 7     - fp: 2     - fn: 3     - tn: 0     - precision: 0.7778 - recall: 0.7000 - accuracy: 0.5833 - f1-score: 0.7369
ID                      tp: 12    - fp: 3     - fn: 13    - tn: 0     - precision: 0.8000 - recall: 0.4800 - accuracy: 0.4286 - f1-score: 0.6000
Initials                tp: 111   - fp: 23    - fn: 67    - tn: 0     - precision: 0.8284 - recall: 0.6236 - accuracy: 0.5522 - f1-score: 0.7116
Internal_Location       tp: 28    - fp: 10    - fn: 27    - tn: 0     - precision: 0.7368 - recall: 0.5091 - accuracy: 0.4308 - f1-score: 0.6021
Name                    tp: 1856  - fp: 67    - fn: 85    - tn: 0     - precision: 0.9652 - recall: 0.9562 - accuracy: 0.9243 - f1-score: 0.9607
Organization_Company    tp: 71    - fp: 20    - fn: 65    - tn: 0     - precision: 0.7802 - recall: 0.5221 - accuracy: 0.4551 - f1-score: 0.6256
Other                   tp: 0     - fp: 0     - fn: 4     - tn: 0     - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
Phone_fax               tp: 16    - fp: 2     - fn: 0     - tn: 0     - precision: 0.8889 - recall: 1.0000 - accuracy: 0.8889 - f1-score: 0.9412
Profession              tp: 11    - fp: 1     - fn: 31    - tn: 0     - precision: 0.9167 - recall: 0.2619 - accuracy: 0.2558 - f1-score: 0.4074
SSN                     tp: 0     - fp: 1     - fn: 0     - tn: 0     - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
URL_IP                  tp: 3     - fp: 0     - fn: 1     - tn: 0     - precision: 1.0000 - recall: 0.7500 - accuracy: 0.7500 - f1-score: 0.8571

token level             tp: 4894  - fp: 308   - fn: 500   - tn: 1810993 - precision: 0.9408 - recall: 0.9073 - accuracy: 0.8583 - f1-score: 0.9237
Address                 tp: 217   - fp: 22    - fn: 29    - tn: 120845 - precision: 0.9079 - recall: 0.8821 - accuracy: 0.8097 - f1-score: 0.8948
Age                     tp: 48    - fp: 11    - fn: 13    - tn: 121041 - precision: 0.8136 - recall: 0.7869 - accuracy: 0.6667 - f1-score: 0.8000
Care_Institute          tp: 266   - fp: 78    - fn: 81    - tn: 120688 - precision: 0.7733 - recall: 0.7666 - accuracy: 0.6259 - f1-score: 0.7699
Date                    tp: 1835  - fp: 66    - fn: 36    - tn: 119176 - precision: 0.9653 - recall: 0.9808 - accuracy: 0.9473 - f1-score: 0.9730
Email                   tp: 10    - fp: 1     - fn: 0     - tn: 121102 - precision: 0.9091 - recall: 1.0000 - accuracy: 0.9091 - f1-score: 0.9524
Hospital                tp: 11    - fp: 3     - fn: 3     - tn: 121096 - precision: 0.7857 - recall: 0.7857 - accuracy: 0.6471 - f1-score: 0.7857
ID                      tp: 12    - fp: 3     - fn: 12    - tn: 121086 - precision: 0.8000 - recall: 0.5000 - accuracy: 0.4444 - f1-score: 0.6154
Initials                tp: 113   - fp: 21    - fn: 72    - tn: 120907 - precision: 0.8433 - recall: 0.6108 - accuracy: 0.5485 - f1-score: 0.7085
Internal_Location       tp: 47    - fp: 11    - fn: 45    - tn: 121010 - precision: 0.8103 - recall: 0.5109 - accuracy: 0.4563 - f1-score: 0.6267
Name                    tp: 2135  - fp: 60    - fn: 80    - tn: 118838 - precision: 0.9727 - recall: 0.9639 - accuracy: 0.9385 - f1-score: 0.9683
Organization_Company    tp: 119   - fp: 27    - fn: 89    - tn: 120878 - precision: 0.8151 - recall: 0.5721 - accuracy: 0.5064 - f1-score: 0.6723
Other                   tp: 0     - fp: 0     - fn: 5     - tn: 121108 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
Phone_fax               tp: 38    - fp: 2     - fn: 0     - tn: 121073 - precision: 0.9500 - recall: 1.0000 - accuracy: 0.9500 - f1-score: 0.9744
Profession              tp: 40    - fp: 3     - fn: 34    - tn: 121036 - precision: 0.9302 - recall: 0.5405 - accuracy: 0.5195 - f1-score: 0.6837
URL_IP                  tp: 3     - fp: 0     - fn: 1     - tn: 121109 - precision: 1.0000 - recall: 0.7500 - accuracy: 0.7500 - f1-score: 0.8571

token (blind)           tp: 5016  - fp: 187   - fn: 379   - tn: 115532 - precision: 0.9641 - recall: 0.9297 - accuracy: 0.8986 - f1-score: 0.9466
ENT                     tp: 5016  - fp: 187   - fn: 379   - tn: 115532 - precision: 0.9641 - recall: 0.9297 - accuracy: 0.8986 - f1-score: 0.9466
```
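Precision, recall, and F1 follow the usual definitions computed from the tp/fp/fn counts; the accuracy column in this report matches tp / (tp + fp + fn) in the rows above, i.e., true negatives are ignored. A minimal sketch that reproduces the entity-level row from the report (the function name is ours, not part of the package):

```python
def report_metrics(tp: int, fp: int, fn: int) -> dict:
    """Reproduce the metric columns of the evaluation report from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1-score": f1}

# Entity-level counts from the report above.
print(report_metrics(tp=3168, fp=288, fn=469))
# -> precision 0.9167, recall 0.8710, accuracy 0.8071, f1-score 0.8933
```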

You can use the `evaluate_corpus.py` script to evaluate all runs for a given corpus. The script produces a CSV file with the evaluation measures for each corpus part (i.e., train/dev/test) that you can use for further analysis.

```
> python deidentify/evaluation/evaluate_corpus.py <corpus_name> <language>
[...]
> tree output/evaluation/<corpus_name>
output/evaluation/<corpus_name>
├── summary_dev.csv
├── summary_test.csv
└── summary_train.csv
```
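For further analysis, the summary files can be loaded with pandas. A minimal sketch, assuming only the file layout shown above; the exact column names depend on the script's output, so we just inspect them here rather than assume them:

```python
import pandas as pd

# Load the per-run test-set summary produced by evaluate_corpus.py.
summary = pd.read_csv("output/evaluation/ons/summary_test.csv")

# Inspect the available columns and a few rows before analysing further.
print(summary.columns.tolist())
print(summary.head())
```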