# Unsupervised Extractive Summarization

Implements a summarizers package for text summarization and evaluation. The package includes Lexrank and folder2rouge. It also provides scripts for producing batch summaries using the Lexrank package on PubMed and NOTEEVENTS data, and using the bert-extractive-summarizer PyPI package on PubMed data.

## Setup

After following the EHRKit installation instructions, install the dependencies for this subpackage:

```
cd EHRKit/Sanya/
pip install -r requirements.txt
```

For folder2rouge, first set up files2rouge by following the instructions at https://github.com/pltrdy/files2rouge

## Scripts

**generate_summaries.py** generates Lexrank summaries for a directory of source documents using this summarizers package.
- --train: Directory path containing the training documents, which are used for calculating idf scores.
- --saveto: Directory path for saving the generated summaries (each saved with the extension *name*.sum).
- --test (optional): Directory path containing the documents to summarize. Defaults to the documents in the train directory.
- --ntrain (optional): Train on only the first n documents. Default is all documents.
- --ntest (optional): Summarize only the first n documents. Default is all documents.
- --threshold (optional): Threshold for the Lexrank algorithm. Default is 0.03.
- --size (optional): Size (number of sentences) of the summaries produced. Default is 1.

```
> python generate_summaries.py --train /data/lily/jmg277/nc_text/source/ --saveto /data/lily/sn482/summaries
```

**rouge_scores.py** generates ROUGE scores for the summaries produced using this summarizers package.
- --summaries: Directory path containing the summaries, in the format name.sum.
- --references: Directory path containing the reference summaries, in the format name.tgt. Each .tgt file must have the same number of newlines as the corresponding .sum file.
- --saveto (optional): File path for saving the ROUGE score results.

```
> python rouge_scores.py --summaries /data/lily/sn482/summaries --references /data/lily/sn482/reference_abstracts/
```

**/scripts/bert_summaries.py** generates BERT extractive summaries for a directory of source documents using the bert-extractive-summarizer PyPI package:
- --source: Directory path containing the source documents to summarize.
- --saveto: Directory path for saving the generated summaries (each saved with the extension *name*.sum).
- --n (optional): Summarize only the first n documents. Default is all documents.
- --ratio (optional): Ratio of the number of sentences in the summary to the number in the source document. Default is 0.05.

```
> python bert_pubmed.py --source /data/lily/jmg277/nc_text_body/source --saveto /data/lily/sn482/pubmed_summaries/bert_script_summaries/
```
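
For reference, the script is a thin wrapper around the bert-extractive-summarizer package. A minimal sketch of what the package itself does for a single document (the file path is a placeholder; the script's exact logic may differ):

```
from summarizer import Summarizer

# Read one source document (placeholder path).
with open("example_document.txt") as f:
    text = f.read()

# bert-extractive-summarizer picks sentences using a BERT-based model;
# ratio controls summary length relative to the source document.
model = Summarizer()
summary = model(text, ratio=0.05)
print(summary)
```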

**/scripts/noteevents_lexrank.py** generates Lexrank summaries for notes in the MIMIC-III NOTEEVENTS table using this summarizers package, both note by note and over each patient's entire history.
- --saveto: Directory path for saving the generated summaries. Summaries are saved with the extension *name*.sum under script_summary_byetirehistory and script_summary_bynote.
- --ntrain (optional): Train on only the first n documents. Default is all documents.
- --ntest (optional): Summarize only the first n documents. Default is all documents.
```
> python noteevents_lexrank.py --saveto /data/lily/sn482/NOTEEVENTS_summaries
```
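
For intuition, the two modes differ only in what counts as a document: each note on its own, or all of a patient's notes taken together. A rough sketch of that grouping, assuming a NOTEEVENTS.csv export with SUBJECT_ID and TEXT columns and NLTK for sentence splitting (the script itself works against the MIMIC-III database and may differ):

```
import pandas as pd
from nltk.tokenize import sent_tokenize  # requires nltk's punkt data

notes = pd.read_csv("NOTEEVENTS.csv")  # placeholder path

# "By note": each individual note is one document.
by_note = [sent_tokenize(text) for text in notes["TEXT"]]

# "By entire history": all notes of one patient form one document.
by_history = [
    sent_tokenize(" ".join(group["TEXT"]))
    for _, group in notes.groupby("SUBJECT_ID")
]
```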

## Summarizers Package

```
summarizers
|_ lexrank.py
|_ /evaluate
    |_ folder2rouge.py
```

### Lexrank

***Initializing***

**documents:** A list of documents, where each document has been parsed into a list of sentences.

**stopwords:** Set of commonly used words that the algorithm ignores. The default is the English stopword set from the NLTK corpus.

**threshold:** Threshold for the Lexrank algorithm. Default is 0.03.

***Getting the summary***

**sentences:** The document to summarize, parsed into a list of sentences.

**summary_size:** Number of sentences to include in the summary. Default is 1.

#### Usage

```
from nltk.corpus import stopwords

from summarizers import Lexrank

documents = [['This is the first document sentence 1', 'This is the first document sentence 2'],
             ['This is the second document sentence 1', 'This is the second document sentence 2']]

sentences = ['sentence 1', 'sentence 2']

# Stopwords are ignored by the algorithm; here, the English set from the NLTK corpus.
stop_words = set(stopwords.words('english'))

lxr = Lexrank(documents, stop_words)
summary = lxr.get_summary(sentences, summary_size=10, threshold=.1)
```
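
In practice the documents usually start as raw text, so they must be split into sentences first. A minimal sketch of that preprocessing, assuming a directory of plain-text files and NLTK's sent_tokenize for sentence splitting (neither is required by the package):

```
import os

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

from summarizers import Lexrank

source_dir = "/path/to/source"  # placeholder directory of plain-text files

# Split each raw document into a list of sentences.
documents = []
for fname in sorted(os.listdir(source_dir)):
    with open(os.path.join(source_dir, fname)) as f:
        documents.append(sent_tokenize(f.read()))

# Train on the whole collection, then summarize the first document.
lxr = Lexrank(documents, set(stopwords.words('english')))
summary = lxr.get_summary(documents[0], summary_size=3)
```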

### ROUGE Scores

NOTE: The number of newlines in a summary file must match the number of newlines in the corresponding target file.

**summaries_dir:** Directory path containing the generated summaries, in the format DOC_NAME.sum.

**reference_dir:** Directory path containing the reference summaries, in the format DOC_NAME.tgt.

**saveto:** (optional) File path for saving the ROUGE score results.

#### Usage
```
from summarizers.evaluate import folder2rouge

summaries_dir = "/path/to/summaries"        # contains DOC_NAME.sum files
reference_dir = "/path/to/references"       # contains DOC_NAME.tgt files
saveto_path = "/path/to/rouge_results.txt"  # optional output file

rouge = folder2rouge(summaries_dir, reference_dir)
rouge.run(saveto=saveto_path)
```
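
Because files2rouge compares files line by line, a mismatch in line counts between a .sum file and its .tgt file will cause errors. A small sanity check along these lines (a hypothetical helper, not part of the package) can catch that before scoring:

```
import os

def check_line_counts(summaries_dir, reference_dir):
    """Warn about .sum/.tgt pairs whose line counts differ."""
    for fname in sorted(os.listdir(summaries_dir)):
        if not fname.endswith(".sum"):
            continue
        sum_path = os.path.join(summaries_dir, fname)
        tgt_path = os.path.join(reference_dir, fname[:-4] + ".tgt")
        with open(sum_path) as f_sum, open(tgt_path) as f_tgt:
            n_sum = len(f_sum.readlines())
            n_tgt = len(f_tgt.readlines())
        if n_sum != n_tgt:
            print(f"{fname}: {n_sum} summary lines vs {n_tgt} reference lines")
```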

### References

Lexrank is largely derived from https://github.com/crabcamp/lexrank

folder2rouge imports files2rouge from https://github.com/pltrdy/files2rouge

The bert script uses https://github.com/dmmiller612/bert-extractive-summarizer