Unsupervised Extractive Summarization

Implements a summarizers package for text summarization and evaluation. The package includes Lexrank and folder2rouge. Also includes scripts for producing batch summaries: Lexrank summaries on PubMed and NOTEEVENTS data, and summaries from the bert-extractive-summarizer PyPI package on PubMed data.

Setup

After following the EHRKit installation instructions, install the dependencies for this subpackage.

cd EHRKit/Sanya/
pip install -r requirements.txt

For folder2rouge, first set up files2rouge by following the instructions at https://github.com/pltrdy/files2rouge

Scripts

generate_summaries.py generates Lexrank summaries using this summarizers package for a directory of source documents.
- --train: Takes in the directory path containing training documents, which are used for calculating idf scores.
- --saveto: Takes in the directory path for saving the generated summaries (saved with the extension name.sum)
- --test (optional): Takes in the directory path containing documents for which to produce summaries. The default is the documents in the train directory.
- --ntrain (optional): Train on only the first n documents. Default is all documents.
- --ntest (optional): Produce summaries for only the first n documents. Default is all documents.
- --threshold (optional): Sets threshold for Lexrank algorithm. Default is 0.03.
- --size (optional): Sets size (number of sentences) of summaries produced. Default is 1.

> python generate_summaries.py --train /data/lily/jmg277/nc_text/source/ --saveto /data/lily/sn482/summaries
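
The options above map naturally onto an argparse parser. The following is a minimal sketch of such a parser, not the script's actual source: the function names build_parser and parse_args are hypothetical, and only the flag names and defaults are taken from the list above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented generate_summaries.py options."""
    parser = argparse.ArgumentParser(description="Generate Lexrank summaries")
    parser.add_argument("--train", required=True,
                        help="directory of documents used to compute idf scores")
    parser.add_argument("--saveto", required=True,
                        help="directory where name.sum files are written")
    parser.add_argument("--test", default=None,
                        help="directory of documents to summarize")
    parser.add_argument("--ntrain", type=int, default=None,
                        help="train on only the first n documents")
    parser.add_argument("--ntest", type=int, default=None,
                        help="summarize only the first n documents")
    parser.add_argument("--threshold", type=float, default=0.03)
    parser.add_argument("--size", type=int, default=1)
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    if args.test is None:
        # Documented default: summarize the training documents themselves.
        args.test = args.train
    return args
```

Calling parse_args(["--train", "src", "--saveto", "out"]) leaves --test pointing at the train directory, matching the documented default.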

rouge_scores.py generates ROUGE scores for the summaries produced using this summarizers package.
- --summaries: Takes in the directory path containing summaries in the format name.sum.
- --references: Takes in the directory path containing the reference summaries, in the format name.tgt. Each .tgt file must have the same number of newlines as the corresponding .sum file.
- --saveto (optional): Takes in the file path for saving the ROUGE score results.

> python rouge_scores.py --summaries /data/lily/sn482/summaries --references /data/lily/sn482/reference_abstracts/
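
Because each .tgt file must have the same number of newlines as its .sum file, it can help to check the two directories before scoring. The helper below is a sketch of such a check and is not part of the package: newline_mismatches is a hypothetical name, and it assumes that a summary and its reference share the same base name.

```python
from pathlib import Path


def newline_mismatches(summaries_dir, references_dir):
    """Return base names whose .sum/.tgt pair differs in newline count or lacks a .tgt."""
    mismatched = []
    for sum_path in sorted(Path(summaries_dir).glob("*.sum")):
        tgt_path = Path(references_dir) / (sum_path.stem + ".tgt")
        if not tgt_path.is_file():
            # A summary without a reference cannot be scored.
            mismatched.append(sum_path.stem)
            continue
        n_sum = sum_path.read_text(encoding="utf-8").count("\n")
        n_tgt = tgt_path.read_text(encoding="utf-8").count("\n")
        if n_sum != n_tgt:
            mismatched.append(sum_path.stem)
    return mismatched
```

An empty list means every summary has a reference with a matching newline count.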

/scripts/bert_summaries.py generates BERT extractive summaries for a directory of source documents using the bert-extractive-summarizer PyPI package:
- --source: Takes in the directory path containing source documents for which to produce summaries.
- --saveto: Takes in the directory path for saving the generated summaries (saved with the extension name.sum)
- --n (optional): Produce summaries for only the first n documents. Default is all documents.
- --ratio (optional): Sets the ratio of summary sentences to source-document sentences. Default is 0.05.

> python bert_pubmed.py --source /data/lily/jmg277/nc_text_body/source --saveto /data/lily/sn482/pubmed_summaries/bert_script_summaries/
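
The batch behaviour described above (read a source directory, optionally keep the first n documents, write one .sum file per document) can be sketched as follows. This is an illustration under assumptions, not the script itself: batch_summarize and first_sentence are hypothetical names, "first n" is assumed to mean the first n files in sorted name order, and the output name is assumed to be the source file's base name plus .sum.

```python
from pathlib import Path


def batch_summarize(source_dir, saveto_dir, summarize, n=None):
    """Summarize the first n files of source_dir, writing one .sum file per document."""
    out_dir = Path(saveto_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = sorted(p for p in Path(source_dir).iterdir() if p.is_file())
    if n is not None:
        paths = paths[:n]
    written = []
    for path in paths:
        text = path.read_text(encoding="utf-8")
        out_path = out_dir / (path.stem + ".sum")
        out_path.write_text(summarize(text), encoding="utf-8")
        written.append(out_path.name)
    return written


def first_sentence(text):
    """Stand-in summarizer so the sketch runs without a model."""
    return text.split(".")[0].strip() + "."
```

With bert-extractive-summarizer installed, the stand-in could be swapped for a callable such as lambda text: model(text, ratio=0.05), where model = Summarizer() is imported from the summarizer module.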

/scripts/noteevents_lexrank.py generates Lexrank summaries, both by note and by entire patient history, for notes in the MIMIC-III database using this summarizers package.
- --saveto: Takes in the directory path for saving the generated summaries. Saved with the extension name.sum under script_summary_byetirehistory and script_summary_bynote.
- --ntrain (optional): Train on only the first n documents. Default is all documents.
- --ntest (optional): Produce summaries for only the first n documents. Default is all documents.

> python noteevents_lexrank.py --saveto /data/lily/sn482/NOTEEVENTS_summaries
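
The difference between summarizing by note and by entire patient history comes down to how the notes are grouped before summarization. The sketch below shows one way to build both inputs from (subject_id, note_text) pairs, such as the SUBJECT_ID and TEXT columns of the MIMIC-III NOTEEVENTS table. The function group_notes is hypothetical, and joining a patient's notes with newlines is an assumption rather than the script's confirmed behaviour.

```python
from collections import defaultdict


def group_notes(rows):
    """Split (subject_id, note_text) rows into per-note and per-patient inputs."""
    by_note = []
    by_patient = defaultdict(list)
    for subject_id, text in rows:
        by_note.append(text)
        by_patient[subject_id].append(text)
    # Join each patient's notes, in the order given, into one history document.
    by_history = {sid: "\n".join(texts) for sid, texts in by_patient.items()}
    return by_note, by_history
```

Each entry of by_note would be summarized on its own, while each value of by_history would be summarized as a single document.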

Summarizers Package

summarizers
|_ lexrank.py
|_ /evaluate
    |_ folder2rouge.py

Lexrank

Initializing

documents: A list of documents where each document has been parsed into a list of sentences.

stopwords: Set of commonly used words that the algorithm ignores. The default is the English stopword set from the NLTK corpus.

threshold: Threshold for the Lexrank algorithm. Default is 0.03.
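
The documents argument expects every document to be split into sentences already. A minimal sketch of producing that shape from raw strings is shown here; split_sentences and to_documents are hypothetical helpers, and the regex splitter is a deliberately naive assumption (a tokenizer such as NLTK's would handle abbreviations and similar cases better).

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into a list of stripped, non-empty sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def to_documents(texts):
    """Turn raw strings into the list-of-sentence-lists shape described above."""
    return [split_sentences(t) for t in texts]
```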

Getting the summary

sentences: The document to summarize, parsed into a list of sentences.

summary_size: Number of sentences to include in the summary. Default is 1.

Usage

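from nltk.corpus import stopwords as nltk_stopwords
from summarizers import Lexrank

# Training documents: each document is a list of sentences.
documents = [['This is the first document sentence 1', 'This is the first document sentence 2'],
             ['This is the second document sentence 1', 'This is the second document sentence 2']]

# The document to summarize, already split into sentences.
sentences = ['sentence 1', 'sentence 2']

# English stopword set from NLTK (the nltk 'stopwords' corpus must be downloaded).
stopwords = set(nltk_stopwords.words('english'))

lxr = Lexrank(documents, stopwords)
summary = lxr.get_summary(sentences, summary_size=10, threshold=.1)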
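
To make the roles of the training documents, the threshold and the summary size concrete, here is a simplified pure-Python sketch of thresholded LexRank. It is not this package's implementation: the function names, whitespace tokenization, the smoothed idf formula and the fixed iteration count are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(sentence, stopwords=frozenset()):
    return [w for w in sentence.lower().split() if w not in stopwords]


def idf_scores(documents, stopwords=frozenset()):
    """Smoothed inverse document frequency over the training documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update({w for s in doc for w in tokenize(s, stopwords)})
    n_docs = len(documents)
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


def cosine(tf_a, tf_b, idf, default_idf):
    """idf-weighted cosine similarity between two term-frequency counters."""
    def weight(tf, w):
        return tf[w] * idf.get(w, default_idf)
    dot = sum(weight(tf_a, w) * weight(tf_b, w) for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(weight(tf_a, w) ** 2 for w in tf_a))
    norm_b = math.sqrt(sum(weight(tf_b, w) ** 2 for w in tf_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_summary(sentences, idf, summary_size=1, threshold=0.03,
                    stopwords=frozenset(), iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    tfs = [Counter(tokenize(s, stopwords)) for s in sentences]
    default_idf = max(idf.values(), default=1.0)  # unseen words are treated as rare
    # Connect sentences whose similarity exceeds the threshold.
    adj = [[1.0 if cosine(tfs[i], tfs[j], idf, default_idf) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    # Row-normalize into a Markov transition matrix.
    rows = []
    for row in adj:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration approximates the stationary distribution.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [sum(scores[i] * rows[i][j] for i in range(n)) for j in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:summary_size]]
```

In this sketch a higher threshold removes weaker links from the sentence graph, so only sentences that are strongly similar to many others rank highly.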

ROUGE Scores

NOTE: The number of newlines in a summary file must match the number of newlines in the target file.

summaries_dir: Directory path which contains the generated summaries with the format DOC_NAME.sum

reference_dir: Directory path which contains the reference summaries with the format DOC_NAME.tgt

saveto: (optional) File path to save the ROUGE score

Usage

from summarizers.evaluate import folder2rouge

rouge = folder2rouge(summaries_dir, reference_dir)
rouge.run(saveto=saveto_path)
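
As a rough illustration of what a ROUGE score measures, the sketch below computes ROUGE-1 (unigram overlap) precision, recall and F1 for one summary and one reference. It is not how files2rouge computes its scores: rouge_1 is a hypothetical helper that assumes lowercased whitespace tokenization and applies no stemming or other preprocessing.

```python
from collections import Counter


def rouge_1(summary, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((sum_counts & ref_counts).values())
    precision = overlap / max(sum(sum_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```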

References

Lexrank is largely derived from https://github.com/crabcamp/lexrank

folder2rouge imports files2rouge from https://github.com/pltrdy/files2rouge

The BERT script uses https://github.com/dmmiller612/bert-extractive-summarizer