--- a
+++ b/docs/05_hsdm2020_english_datasets.md
@@ -0,0 +1,33 @@
+# Benchmark on English Datasets
+
+In [our paper](TODO), we also test the performance of the CRF and BiLSTM-CRF on two English
+datasets: [nursing notes](https://physionet.org/content/deidentifiedmedicaltext/1.0/) and [i2b2/UTHealth 2014](https://www.i2b2.org/NLP/DataSets/).
+Both datasets can be obtained after signing a data use agreement with the corresponding research institutes. Below, we show how to convert those datasets to the [standoff format](01_data_format.md) used throughout this project. The datasets are placed in `data/corpus/nursing` and `data/corpus/i2b2`. Afterwards, the datasets can be used to [train and evaluate models](02_train_evaluate_models.md) on them.
+
+## Nursing Notes
+
+Download raw notes and PHI annotations:
+   - `id.text` from https://physionet.org/content/deidentifiedmedicaltext/1.0/
+   - `id-phi.phrase` from https://physionet.org/content/deid/1.1/
+
+Assuming the `id.text` and `id-phi.phrase` files are located in `data/raw/nursing-notes/`, the nursing notes Corpus can be generated as follows:
+
+```sh
+# Convert nursing notes corpus to brat format
+NN_DATA=data/raw/nursing-notes
+python deidentify/dataset/nursing2brat.py \
+    $NN_DATA/id.text \
+    $NN_DATA/id-phi.phrase \
+    $NN_DATA/brat/
+
+# Split nursing notes into 60/20/20 train/dev/test set
+python deidentify/dataset/brat2corpus.py nursing $NN_DATA/brat
+```
+
+## i2b2/UTHealth
+
+The script assumes that `training-PHI-Gold-Set1`, `training-PHI-Gold-Set2`, and `testing-PHI-Gold-fixed` are located in `data/raw/i2b2/`. The corpus can then be generated as follows:
+
+```sh
+python deidentify/dataset/uthealth2corpus.py
+```