# Surrogate Generation

Before using our dataset for the development of de-identification methods, we turned it into a
dummy dataset by replacing protected health information (PHI) with artificial but realistic
replacements, a process called surrogate generation. We wrote a (fairly ad hoc) implementation of the
surrogate generation method described by Stubbs et al. (2015). Below, we explain how to apply this
method to a new dataset.

## Apply Surrogate Generation Method

### Setup

The surrogate generation scripts use the Python `locale` module to support internationalization.
Currently, the scripts assume that the `en_US.UTF-8`, `nl_NL.UTF-8` and `de_DE.UTF-8` locales are
installed.

```sh
# Verify that the required locales are installed
locale -a

# If not, generate the missing locales using the `locales` package (e.g., apt-get install locales):
sudo dpkg-reconfigure locales
```

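The same check can be done from Python before running the scripts. The following is a minimal sketch, not part of the repository: the helper name `missing_locales` is hypothetical, and probing the `LC_TIME` category is an assumption, since the text does not say which locale categories the scripts set.

```python
import locale

# Locales named in the setup instructions above
REQUIRED_LOCALES = ["en_US.UTF-8", "nl_NL.UTF-8", "de_DE.UTF-8"]


def missing_locales(names=REQUIRED_LOCALES):
    """Return the locale names that cannot be activated on this system.

    Hypothetical helper; LC_TIME is an arbitrary choice of category.
    """
    saved = locale.setlocale(locale.LC_TIME)  # query the current setting
    missing = []
    try:
        for name in names:
            try:
                locale.setlocale(locale.LC_TIME, name)
            except locale.Error:
                # Raised when the locale is not installed or not recognized
                missing.append(name)
    finally:
        locale.setlocale(locale.LC_TIME, saved)  # restore the original setting
    return missing


if __name__ == "__main__":
    print("Missing locales:", missing_locales() or "none")
```

If the helper reports missing locales, generate them with `dpkg-reconfigure locales` as shown above and run the check again.
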
### Step 1: Generate surrogates

First, we will generate the surrogates. The command below assumes that your dataset is located in `data/gold-annotations` and is in [standoff format](01_data_format.md).

```sh
python deidentify/surrogates/generate_surrogates.py \
    data/gold-annotations/ \
    data/surrogate-mapping/gold-surrogate-mapping.csv
```

This will output a `.csv` file in the following form:

```csv
doc_id,ann_id,text,start,end,tag,surrogate,manual_surrogate,checked
example-1,T1,"van Janssen, Jan",23,39,Name,"Linders, Xandro",,False
example-1,T2,j.van.jansen@nedap.nl,41,62,Email,t.njg.nmmeso@rcrmb.nl,,False
```

### Step 2: Revise automatic replacements

Import the `.csv` file into your favorite spreadsheet editor and fix any automatic replacement errors by adding an entry in the `manual_surrogate` column of the respective row. At a minimum, the surrogates for the `OTHER` category must be added manually. Afterwards, export the table to `.csv` again; the command in Step 3 expects the revised file at `data/surrogate-mapping/gold-surrogate-mapping-revised.csv`.

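To find the rows that still need attention, the mapping file can also be filtered with Python's `csv` module. This is a sketch under assumptions: the helper `rows_needing_review` is hypothetical, the second sample row is made up for illustration, and the tag is compared case-insensitively because the exact spelling of the `OTHER` tag in the file is not shown here.

```python
import csv
import io


def rows_needing_review(csv_file):
    """Yield rows tagged OTHER that have no manual surrogate yet."""
    for row in csv.DictReader(csv_file):
        # Assumption: the tag spelling may differ in case ("OTHER", "Other")
        is_other = row["tag"].strip().lower() == "other"
        has_manual = bool(row["manual_surrogate"].strip())
        if is_other and not has_manual:
            yield row


# Illustrative data in the mapping format; the T3 row is invented.
sample = io.StringIO(
    "doc_id,ann_id,text,start,end,tag,surrogate,manual_surrogate,checked\n"
    'example-1,T1,"van Janssen, Jan",23,39,Name,"Linders, Xandro",,False\n'
    "example-1,T3,some phrase,70,81,Other,,,False\n"
)
for row in rows_needing_review(sample):
    print(row["doc_id"], row["ann_id"], row["text"])
```

For a real mapping file, pass a file object opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.
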
### Step 3: Rewrite documents/annotation files

Use the following script to replace PHI in the original `*.txt`/`*.ann` files with the surrogates from the mapping table.

```sh
python deidentify/surrogates/rewrite_dataset.py \
    data/surrogate-mapping/gold-surrogate-mapping-revised.csv \
    data/gold-annotations/ \
    data/surrogate-annotations/
```


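To illustrate what rewriting involves, the sketch below replaces character spans in a text using the `start`, `end`, `surrogate` and `manual_surrogate` columns of the mapping table. It is not the code of `rewrite_dataset.py`: the function `apply_surrogates` is hypothetical, it assumes that spans do not overlap, and it assumes that a non-empty `manual_surrogate` takes precedence over the automatic `surrogate`.

```python
def apply_surrogates(text, rows):
    """Replace each annotated span in `text` with its surrogate."""
    # Work from the last span to the first so that earlier offsets stay valid
    # even when a surrogate is longer or shorter than the original text.
    for row in sorted(rows, key=lambda r: int(r["start"]), reverse=True):
        start, end = int(row["start"]), int(row["end"])
        # Assumption: a manual surrogate, if present, overrides the automatic one.
        replacement = row["manual_surrogate"] or row["surrogate"]
        text = text[:start] + replacement + text[end:]
    return text


# Made-up example rows; offsets are stored as strings, as in a CSV file.
rows = [
    {"start": "0", "end": "3", "surrogate": "Xandro", "manual_surrogate": ""},
    {"start": "8", "end": "12", "surrogate": "Pete", "manual_surrogate": "Kees"},
]
print(apply_surrogates("Jan met Piet", rows))
```

The sketch only rewrites the text. The offsets in the matching `*.ann` file also change whenever a surrogate differs in length from the original span, so a complete rewrite has to update those as well.
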
## References

* Amber Stubbs, Özlem Uzuner, Christopher Kotfila, Ira Goldstein, and Peter Szolovits. 2015. Challenges in Synthesizing Surrogate PHI in Narrative EMRs. In *Medical Data Privacy Handbook*, Aris Gkoulalas-Divanis and Grigorios Loukides (Eds.). Springer International Publishing, 717–735. DOI: https://doi.org/10.1007/978-3-319-23633-9_27