--- a
+++ b/README.md
@@ -0,0 +1,101 @@
+# MIMIC-SPARQL
+This repository provides the official MIMIC-SPARQL dataset implementation for the following paper: [Knowledge Graph-based Question Answering with Electronic Health Records](https://arxiv.org/abs/2010.09394), accepted at Machine Learning for Healthcare (MLHC) 2021.
+## Example
+```
+NLQ: how many patients were born before the year 2060?
+
+SQL: select count ( distinct patients."subject_id" ) from patients where patients."dob_year" < "2060"
+
+SPARQL: select ( count ( distinct ?subject_id ) as ?agg ) where { ?subject_id </dob_year> ?dob_year. filter( ?dob_year < 2060 ). }
+```
+
+## Prerequisites
+
+1. MIMIC-III
+https://mimic.physionet.org/
+2. MIMICSQL
+Paper title: Text-to-SQL Generation for Question Answering on Electronic Medical Records.
+Dataset and code: https://github.com/wangpinggl/TREQS
+3. ENV
+```
+python 3.6
+networkx
+rdflib
+pandas
+numpy
+sqlite3 # part of the Python standard library since Python 2.5, so no manual installation is needed
+requests
+```
+ Set up the environment using pip:
+```bash
+pip install networkx rdflib pandas numpy requests
+```
+
+## Datasets
+
+1. __MIMICSQL*__
+MIMICSQL* is an extended version of MIMICSQL. The database consists of 9 tables from MIMIC-III.
+2. __MIMIC-SPARQL__
+MIMIC-SPARQL is a graph-based counterpart of MIMICSQL*. The knowledge graph of this dataset has 173,096 triples, and the max hop is 5.
+
+## Guide for creating MIMICSQL* and MIMIC-SPARQL
+0. Prepare MIMIC-III and make mimic.db from MIMICSQL
+1. Build mimicsql* database from mimicsql database
+2. Build mimic-sparql knowledge graph from mimicsql* database
+3. Convert mimicsql SQL query to mimicsql* SQL query
+4. Convert mimicsql* SQL query to mimic-sparql SPARQL query
+
+
+### 0. Prepare MIMIC-III and mimic.db from MIMICSQL
+First, you need access to the MIMIC-III data. This requires certification from https://mimic.physionet.org/
+Then build `mimic.db`, which is required for the next step, by following the README.md at https://github.com/wangpinggl/TREQS
+
+### 1. Build mimicsql* database from mimicsql database
+First, save `mimic.db` under the `mimicsql/evaluation/mimic_db` path.
+Then set the current directory to the project root folder, mimic-sparql.
+```
+python build_mimicsqlstar_db/build_mimicstar_db_from_mimicsql_db.py
+```
+This builds the MIMICSQL* database and produces `mimicsqlstar.db`.
+### 2. Build mimic-sparql knowledge graph from mimicsql* database
+Set the current directory to the project root folder, mimic-sparql.
+To build mimic-sparql* from mimicsql*:
+```
+python build_mimicsparql_kg/build_complex_kg_from_mimicsqlstar_db.py
+```
+To build mimic-sparql from mimicsql:
+```
+python build_mimicsparql_kg/build_simple_kg_from_mimicsql_db.py
+```
+This builds the MIMIC-SPARQL knowledge graphs and produces `mimic_sparqlstar_kg.xml` and `mimic_sparql_kg.xml`.
+### 3. Convert mimicsql SQL query to mimicsql* SQL query
+Set the current directory to the project root folder, mimic-sparql.
+```
+python convert_mimicsql2sql_dataset.py --dataset_type natural --execution False
+python convert_mimicsql2sql_dataset.py --dataset_type template --execution False
+```
+If `--execution` is set to True, the execution results of the two queries (before and after conversion) are compared with each other.
+### 4. Convert mimicsql* SQL query to mimic-sparql SPARQL query
+Set the current directory to the project root folder, mimic-sparql.
+```
+python convert_sql2sparql_dataset.py --dataset_type natural --complex True --execution False
+python convert_sql2sparql_dataset.py --dataset_type natural --complex False --execution False
+
+python convert_sql2sparql_dataset.py --dataset_type template --complex True --execution False
+python convert_sql2sparql_dataset.py --dataset_type template --complex False --execution False
+```
+The `--complex` option selects either the simplified schema (mimic-sparql from mimicsql) or the original schema (mimic-sparql* from mimicsql*).
+
+## Citation
+```
+@inproceedings{pmlr-v149-park21a,
+  title = {Knowledge Graph-based Question Answering with Electronic Health Records},
+  author = {Park, Junwoo and Cho, Youngwoo and Lee, Haneol and Choo, Jaegul and Choi, Edward},
+  booktitle = {Proceedings of the 6th Machine Learning for Healthcare Conference (MLHC)},
+  pages = {36--53},
+  year = {2021},
+  volume = {149},
+  publisher = {PMLR}
+}
+```
+This BibTeX entry will be updated once the paper is published in PMLR.