# Surrogate Generation

Before using our dataset to develop de-identification methods, we turned it into a dummy dataset by
replacing protected health information (PHI) with artificial but realistic replacements, a process
called surrogate generation. We wrote a (quite ad-hoc) implementation of the surrogate generation
method described by Stubbs et al. (2015). Below, we explain how to apply this method to a new
dataset.

## Apply Surrogate Generation Method

### Setup

The surrogate generation scripts use the Python `locale` module to support internationalization.
Currently, the scripts assume that the `en_US.UTF-8`, `nl_NL.UTF-8` and `de_DE.UTF-8` locales are
installed.

```sh
# Verify that the required locales are installed
locale -a

# If any are missing, generate them. This requires the `locales` package
# (e.g., apt-get install locales):
sudo dpkg-reconfigure locales
```
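The same check can also be done from Python before running the scripts. The following is a minimal sketch and not part of the repository: the helper name `missing_locales` is hypothetical, and the choice of `LC_TIME` as the category to probe is an assumption (any category would do for an availability check). It tries to activate each required locale with `locale.setlocale` and restores the previous setting afterwards.

```python
import locale

# Locale names taken from the setup notes above.
REQUIRED_LOCALES = ("en_US.UTF-8", "nl_NL.UTF-8", "de_DE.UTF-8")


def missing_locales(names=REQUIRED_LOCALES):
    """Return the locale names that cannot be activated on this system."""
    # Calling setlocale without a second argument only queries the current setting.
    original = locale.setlocale(locale.LC_TIME)
    missing = []
    try:
        for name in names:
            try:
                locale.setlocale(locale.LC_TIME, name)
            except locale.Error:
                missing.append(name)
    finally:
        # Always restore the setting that was active before the check.
        locale.setlocale(locale.LC_TIME, original)
    return missing


if __name__ == "__main__":
    absent = missing_locales()
    if absent:
        print("Missing locales:", ", ".join(absent))
    else:
        print("All required locales are installed.")
```

If the list is not empty, generate the reported locales as described above and run the check again.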

### Step 1: Generate surrogates

First, we will generate the surrogates. The command below assumes that your dataset is located in `data/gold-annotations` and is in [standoff format](01_data_format.md).

```sh
python deidentify/surrogates/generate_surrogates.py \
    data/gold-annotations/ \
    data/surrogate-mapping/gold-surrogate-mapping.csv
```

This will output a `.csv` file in the following form:

```csv
doc_id,ann_id,text,start,end,tag,surrogate,manual_surrogate,checked
example-1,T1,"van Janssen, Jan",23,39,Name,"Linders, Xandro",,False
example-1,T2,j.van.jansen@nedap.nl,41,62,Email,t.njg.nmmeso@rcrmb.nl,,False
```
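To inspect the mapping table programmatically, it can be parsed with Python's standard `csv` module. The sketch below is illustrative only: the `SurrogateRow` class and `read_mapping` function are hypothetical names, and it assumes that `checked` is serialized as the strings `True` or `False`, as in the sample rows. It also checks that each span length matches the length of the annotated text.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class SurrogateRow:
    doc_id: str
    ann_id: str
    text: str
    start: int
    end: int
    tag: str
    surrogate: str
    manual_surrogate: str
    checked: bool


def read_mapping(csv_file):
    """Parse the mapping table from an open file object into typed rows."""
    rows = []
    for record in csv.DictReader(csv_file):
        rows.append(SurrogateRow(
            doc_id=record["doc_id"],
            ann_id=record["ann_id"],
            text=record["text"],
            start=int(record["start"]),
            end=int(record["end"]),
            tag=record["tag"],
            surrogate=record["surrogate"],
            manual_surrogate=record["manual_surrogate"],
            # Assumption: the column holds the literal strings "True" / "False".
            checked=record["checked"] == "True",
        ))
    return rows


# The two sample rows shown above.
SAMPLE = '''doc_id,ann_id,text,start,end,tag,surrogate,manual_surrogate,checked
example-1,T1,"van Janssen, Jan",23,39,Name,"Linders, Xandro",,False
example-1,T2,j.van.jansen@nedap.nl,41,62,Email,t.njg.nmmeso@rcrmb.nl,,False
'''

rows = read_mapping(io.StringIO(SAMPLE))
for row in rows:
    # The span length should match the length of the annotated text.
    assert row.end - row.start == len(row.text)
    print(row.doc_id, row.ann_id, row.tag, repr(row.text), "->", repr(row.surrogate))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.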

### Step 2: Revise automatic replacements

Import the `.csv` file into your favorite spreadsheet editor and fix any automatic replacement errors by adding an entry in the `manual_surrogate` column of the respective row. At a minimum, the surrogates for the `OTHER` category must be added manually. Afterwards, export the table to `.csv` again. The next step reads the revised table from `data/surrogate-mapping/gold-surrogate-mapping-revised.csv`.
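Before moving on, it can help to confirm that no `OTHER` row was overlooked. The following is a rough sketch, not a script from the repository: the function name `rows_needing_manual_surrogate` is hypothetical, the sample rows are invented for illustration, and the exact spelling of the tag in the file is an assumption, which is why the comparison ignores case.

```python
import csv
import io


def rows_needing_manual_surrogate(csv_file, tag="other"):
    """Return rows of the given tag whose `manual_surrogate` cell is still empty."""
    return [
        row for row in csv.DictReader(csv_file)
        if row["tag"].lower() == tag.lower() and not row["manual_surrogate"].strip()
    ]


# Hypothetical rows for illustration only; they are not taken from the dataset.
REVISED = '''doc_id,ann_id,text,start,end,tag,surrogate,manual_surrogate,checked
example-1,T1,"van Janssen, Jan",23,39,Name,"Linders, Xandro",,False
example-1,T3,ward B2,70,77,Other,,,False
example-1,T4,room 12,80,87,Other,,room 48,False
'''

pending = rows_needing_manual_surrogate(io.StringIO(REVISED))
for row in pending:
    print(f"{row['doc_id']} {row['ann_id']}: no manual surrogate for {row['text']!r}")
```

In this sample only `T3` is reported, because `T4` already has a manual entry and `T1` belongs to a different category.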

### Step 3: Rewrite documents/annotation files

Use the following script to replace PHI in the original `*.txt/*.ann` files with the surrogates from the mapping table.

```sh
python deidentify/surrogates/rewrite_dataset.py \
    data/surrogate-mapping/gold-surrogate-mapping-revised.csv \
    data/gold-annotations/ \
    data/surrogate-annotations/
```
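Because a surrogate rarely has the same length as the text it replaces, rewriting a document shifts the character offsets of all later annotations. The sketch below illustrates that bookkeeping on a toy string. It is not the code of `rewrite_dataset.py`: the function name `apply_surrogates` and the sample text are hypothetical, and it assumes that the annotated spans do not overlap.

```python
def apply_surrogates(text, replacements):
    """Replace spans in `text` and return the new text plus the new offsets.

    `replacements` is a list of (start, end, surrogate) tuples that refer to
    offsets in the original text. Spans are assumed not to overlap.
    """
    pieces = []
    new_spans = []
    cursor = 0      # position in the original text
    new_length = 0  # length of the rewritten text so far
    for start, end, surrogate in sorted(replacements):
        # Copy the unchanged text between the previous span and this one.
        unchanged = text[cursor:start]
        pieces.append(unchanged)
        new_length += len(unchanged)
        # Insert the surrogate and record where it lands in the new text.
        pieces.append(surrogate)
        new_spans.append((new_length, new_length + len(surrogate)))
        new_length += len(surrogate)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans


# Toy example (not from the dataset): "Jan" spans 6-9, "a@b.nl" spans 17-23.
original = "Name: Jan, mail: a@b.nl"
rewritten, spans = apply_surrogates(
    original, [(6, 9, "Xandro"), (17, 23, "x@y.nl")]
)
print(rewritten)  # Name: Xandro, mail: x@y.nl
print(spans)      # [(6, 12), (20, 26)]
```

The second span moves from 17-23 to 20-26 because the first surrogate is three characters longer than the name it replaces.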

## References

* Amber Stubbs, Özlem Uzuner, Christopher Kotfila, Ira Goldstein, and Peter Szolovits. 2015. Challenges in Synthesizing Surrogate PHI in Narrative EMRs. In *Medical Data Privacy Handbook*, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.). Springer International Publishing, 717–735. DOI: https://doi.org/10.1007/978-3-319-23633-9_27