# Medknow

This project is a submodule designed to be used in the creation of a disease-specific knowledge base, but it can also be used as a standalone module in other projects. It focuses on extracting **biomedical entities** and the **relations** between them from free text, and on structuring them in a **graph** database in a way that makes extracting new knowledge and inferring hidden relations easier.

This project has been designed with modularity in mind, allowing the implementation and fast integration of new extractors, such as [ReVerb](http://reverb.cs.washington.edu/) for relation extraction and [MetaMap](https://metamap.nlm.nih.gov/) for concept extraction. Those two are currently being developed alongside the already implemented extractor based on [SemRep](https://semrep.nlm.nih.gov/).

Currently, the main features of this project are (some still in progress):
* Different kinds of input sources: free text, already extracted relations, concepts etc.
* A variety of knowledge extractors working in a pipeline: **SemRep**, **MetaMap**, **ReVerb**
* Multiple persistence options: saving enriched documents to file, saving entities and relations to .csv, and populating **Neo4j**. **MongoDB** can also be used for sentence fetching and as an alternative to saving .json files.

## Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Knowledge Extraction
This project is based around concept and relation extraction from text using [SemRep](https://semrep.nlm.nih.gov/). Follow the instructions on their website to set up a copy on your local machine.

**Note**: You will also have to install **MetaMap** for SemRep to work.

### Neo4j
If you'd like to persist results in Neo4j, you will have to pre-install it on your local machine. More details are available on their [website](https://neo4j.com/).

### MongoDB
If you'd like to keep track of the sentences as found in each document/article, in order to retrieve them later, you will have to pre-install MongoDB on your local machine. More details are available on their [website](https://www.mongodb.com/).

### YAJL
In order to deal with big .json files as streams, this system uses the [ijson](https://pypi.python.org/pypi/ijson/) module, and more specifically its Python bindings for the [yajl](http://lloyd.github.io/yajl/) JSON parser, so you should install yajl as well.
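
As a rough illustration of what streaming buys you, the sketch below iterates over the documents of a json file one at a time instead of loading the whole file. It is not taken from this project's code: the helper name `iter_documents` and the sample data are made up for the example, and the outer field name `documents` is only the example value used later in this README. If ijson or yajl is missing, the sketch falls back to the standard `json` module, which does load everything into memory.

```python
import io
import json

try:
    # Prefer the yajl2 backend mentioned above; the import fails if
    # ijson or the yajl C library is not installed.
    import ijson.backends.yajl2 as ijson
except (ImportError, OSError):
    ijson = None  # fall back to the standard library below


def iter_documents(stream, docfield="documents"):
    """Yield the documents stored under `docfield`, one at a time.

    Hypothetical helper, written only for this example.
    """
    if ijson is not None:
        # "<docfield>.item" selects each element of that top-level array.
        for doc in ijson.items(stream, docfield + ".item"):
            yield doc
    else:
        # Non-streaming fallback: reads the whole file into memory.
        for doc in json.load(stream)[docfield]:
            yield doc


# Tiny in-memory stand-in for a large .json file.
sample = io.BytesIO(b'{"documents": [{"pmid": "1"}, {"pmid": "2"}]}')
ids = [doc["pmid"] for doc in iter_documents(sample)]
print(ids)
```

With a real file you would pass `open(path, "rb")` instead of the in-memory sample.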

### Python Modules
The required modules are listed in *requirements.txt*. You can either install them individually or, better yet, use [pip](https://pip.pypa.io/en/stable/) to install them all at once. After cloning/downloading the project folder, execute:
```
pip install -r requirements.txt
```
(You may be asked for admin/sudo rights.)

### Usage
The functionalities offered by this module are wrapped in a **Pipeline**, broken down into three phases. These are:
1. *Input*: What type of input do we expect to deal with? Where do we read it from? Which specific fields in .csv or .json files must we take into account?
2. *Transformations*: What type of transformations should be applied to the input provided? Enrich the input document with concepts and relations using SemRep? MetaMap + ReVerb? Transform an edge list between already existing entities into the correct structure for populating the Neo4j db?
3. *Output*: What to do with the results? Save the enriched file as .json? Output .csv files for use by the neo4j import-tool? Directly create/update the Neo4j db? Also save the processed sentences in MongoDB for later usage?
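
The three phases above can be pictured with the minimal sketch below. It is only a conceptual illustration, not this module's actual implementation: `run_pipeline` and the reader, transformer and writer callables are hypothetical names invented for the example, and the "semrep" step here is a stand-in that just adds an empty `relations` field.

```python
def run_pipeline(settings, readers, transformers, writers):
    """Run input -> transformations -> output as selected in `settings`."""
    # Phase 1 (Input): exactly one reader, chosen by the 'inp' value.
    data = readers[settings["inp"]]()
    # Phase 2 (Transformations): apply every transformation set to True.
    for name, enabled in settings["trans"].items():
        if enabled:
            data = transformers[name](data)
    # Phase 3 (Output): hand the result to every output set to True.
    for name, enabled in settings["out"].items():
        if enabled:
            writers[name](data)
    return data


saved = []  # stands in for a file or database
result = run_pipeline(
    {"inp": "json",
     "trans": {"semrep": True, "metamap": False},
     "out": {"json": True}},
    readers={"json": lambda: [{"text": "Aspirin treats headache."}]},
    transformers={
        # Placeholder extractors: real ones would call external tools.
        "semrep": lambda docs: [dict(d, relations=[]) for d in docs],
        "metamap": lambda docs: docs,
    },
    writers={"json": saved.append},
)
print(result)
```

Keeping each phase behind a plain callable is one way to get the modularity described earlier: a new extractor only has to be added to the `transformers` mapping.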

All of these choices are parameterized in **settings.yaml**, following the .yaml structure. The outline of the available parameters is the following:
- **pipeline**: Mainly *True/False* values for specific keys regarding the previously presented phases, denoting which functions will be executed.
- **load**: Variables regarding the paths of SemRep and the input files, as well as the key fields in .json and .csv files where text and other information is stored.
- **apis**: API keys used for specific services.
- **neo4j**: Details regarding the connection to an existing Neo4j instance.
- **mongo_sentences**: Details regarding the connection to a MongoDB collection (if it does not exist, it will be created).
- **cache_path**: Path to the .json file which is used as a long-term cache when fetching mappings of entities to CUIs (e.g. DRUGBANK-ID -> UMLS_CUI).
- **out**: Variables and paths regarding the generated results.

Details on each variable are found in settings.yaml. An overview of the available keys and values is presented here:

**pipeline**:
- *inp*: What kind of input we expect. It specifies which part of the 'load' section to read from. The following values are currently supported:
  - **json**: Used for json from the harvester and for enriched json generated by this module.
  - **edges**: A field containing edges-relations is expected to be found in the file. Used for DOID, DRUGBANK, MESH etc. relations.
  - **med_rec**: Intended for medical records, but in practice it handles any delimited file.
  - **mongo**: Used to read a collection of documents from mongo instead of a json file.
  - **delete**: Used when we want to delete edges from a specific resource. The unwanted edges are identified by the **resource** value in the *neo4j* field.
- *trans*: What kind of transformations-extractions to do:
  - **metamap**: True/False. Whether to extract entities using MetaMap. TODO: ! MERGE Entities and Threshold !
  - **reverb**: True/False. Whether to extract relations using ReVerb. TODO: ! Map Entities to UMLS CONCEPTS IN SENTENCE !
  - **semrep**: True/False. The main functionality. Whether to use SemRep to extract relations and entities from text. !! It is meaningful only for json and med_rec, as edges are not supposed to have a text field. !!
  - **get_concepts_from_edges**: True/False. ! This is for edges files only ! Whether some kind of transformation should be applied to the entities found as subjects-objects in the edges file (e.g. fetch concepts from CUIs, from DRUGBANK unique ids etc.).
- *out*: Where to write the output:
  - **json**: True/False. Save the intermediate json generated after all the transformations/extractions are done, before updating the database.
  - **csv**: True/False. Create the corresponding node and edge files, to be used by the command-line neo4j import-tool. Not very useful for the time being.
  - **neo4j**: True/False. Create/update the neo4j graph with the entities and relations found in the json generated by the trans steps, or in the **pre-enriched** json of the 'json' or 'edges' input given at the start.
  - **mongo_sentences**: True/False. Whether to index the processed sentences in mongo.
  - **mongo**: True/False. Whether to save the enriched json file in mongo.
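
To make the constraints in the **pipeline** overview concrete, here is a small sketch that checks a pipeline section for the two combinations the text flags as meaningless. It assumes the section has already been loaded into a plain Python dict with the key names listed above; `check_pipeline` is a hypothetical helper written for this example and is not part of the module.

```python
VALID_INPUTS = {"json", "edges", "med_rec", "mongo", "delete"}


def check_pipeline(pipeline):
    """Return a list of problems found in a 'pipeline' section."""
    inp = pipeline["inp"]
    trans = pipeline.get("trans", {})
    problems = []
    if inp not in VALID_INPUTS:
        problems.append("unknown inp value: %s" % inp)
    # SemRep works on text, and edges are not supposed to have a text field.
    if trans.get("semrep") and inp == "edges":
        problems.append("semrep needs text, which edges input does not have")
    # Fetching concepts for subjects/objects applies to edges files only.
    if trans.get("get_concepts_from_edges") and inp != "edges":
        problems.append("get_concepts_from_edges is for edges input only")
    return problems


pipeline = {
    "inp": "edges",
    "trans": {"metamap": False, "reverb": False, "semrep": True,
              "get_concepts_from_edges": True},
    "out": {"json": True, "csv": False, "neo4j": True,
            "mongo_sentences": False, "mongo": False},
}
problems = check_pipeline(pipeline)
print(problems)
```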

**load**:
- *path*:
  - **metamap**: Path to the MetaMap binary.
  - **reverb**: Path to the ReVerb binary.
  - **semrep**: Path to the SemRep binary.
- *med_rec*: If the value of 'inp' in the pipeline is not **med_rec**, the following values are ignored.
  - **inp_path**: Path to the delimited file.
  - **textfield**: Name of the column where the text is located (e.g. MedicalDiagnosis).
  - **sep**: Delimiter value (e.g. \t).
  - **idfield**: Name of the column where the ids are found (e.g. patient_id).
- *json*: If the value of 'inp' in the pipeline is not **json**, the following values are ignored.
  - **inp_path**: Path to the json file.
  - **docfield**: Outer field of the json file where the documents/articles are located (e.g. documents).
  - **textfield**: Name of the field to read text from (e.g. abstractText).
  - **idfield**: Name of the field where the ids are found (e.g. pmid).
  - **labelfield**: Field where the label of the document is located (e.g. title).
- *edges*: If the value of 'inp' in the pipeline is not **edges**, the following values are ignored.
  - **inp_path**: Path to the edges file.
  - **edge_field**: Name of the outer field where the relations-edges are found (e.g. relations).
  - **sub_type**: Type of the subject in the relations. Currently supporting Entity, Article and any new type of nodes.
  - **obj_type**: Type of the object in the relations. Currently supporting Entity, Article and any new type of nodes.
  - **sub_source**: What type of source is needed to transform the subject entity. Currently supporting: UMLS when the entities are CUIs, [MSH, DRUGBANK, .. and the rest from the UMLS REST mapping, used accordingly], [Article, Text and None when no transformation is needed on the given subject entity].
  - **obj_source**: What type of source is needed to transform the object entity. Currently supporting: UMLS when the entities are CUIs, [MSH, DRUGBANK, .. and the rest from the UMLS REST mapping, used accordingly], [Article, Text and None when no transformation is needed on the given object entity].
- *mongo*: If the value of 'inp' in the pipeline is not **mongo**, the following values are ignored.
  - **mongodb**: Full URI of the database to read the documents from. If a user/pass is required, pass it here like *mongodb://user:pass@host:port*.
  - **db**: The name of the database.
  - **collection**: The name of the collection.
  - **docfield**: Outer field (e.g. documents) under which the documents/articles are placed when loaded and passed to the pipeline.
  - **inp_path**: For printing purposes only. A label that identifies the collection from which the data is read.
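
The relation between the pipeline 'inp' value and the **load** sub-sections can be sketched as follows. This assumes the settings are available as a nested Python dict mirroring the keys above; `active_load_section`, the file name `articles.json` and the sample document are hypothetical and exist only for the example.

```python
def active_load_section(settings):
    """Return the 'load' sub-section selected by the pipeline 'inp' value."""
    inp = settings["pipeline"]["inp"]
    section = settings["load"].get(inp)
    if section is None:
        # e.g. inp == "delete", which has no 'load' sub-section in the overview
        raise KeyError("no 'load' section for inp=%r" % inp)
    return section


settings = {
    "pipeline": {"inp": "json"},
    "load": {
        "json": {"inp_path": "articles.json", "docfield": "documents",
                 "textfield": "abstractText", "idfield": "pmid",
                 "labelfield": "title"},
        # Ignored here, because 'inp' is not med_rec.
        "med_rec": {"inp_path": "records.tsv", "textfield": "MedicalDiagnosis",
                    "sep": "\t", "idfield": "patient_id"},
    },
}

cfg = active_load_section(settings)
# The *field values name the keys to look up in each input document.
doc = {"pmid": "123", "title": "A title", "abstractText": "Some abstract."}
doc_id, text = doc[cfg["idfield"]], doc[cfg["textfield"]]
print(doc_id, text)
```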

**apis**: API keys used when calling different services.
- **biont**: BioPortal API key, used for fetching URI info of a concept. Not currently in use.
- **umls**: UMLS REST API key. Useful only when 'inp' in the pipeline is **edges** and **get_concepts_from_edges** is True.

**neo4j**: Variables for the connection to an existing and running neo4j graph. If **neo4j** is False in the pipeline, the following are ignored.
- **host**: Database url (e.g. localhost).
- **port**: Port number (e.g. 7474).
- **user**: Username (e.g. neo4j).
- **password**: Password (e.g. admin).
- **resource**: The resource from which the edges have been generated. This is also used together with the **delete** value of *inp* in the pipeline, when we want to delete this kind of edges for provenance reasons.

**mongo_sentences**: Variables for the connection to an existing and running MongoDB. If **mongo_sentences** is False in the pipeline, the following are ignored. If the requested db or collection does not exist, it will be created.
- **uri**: Full URI needed to connect to the db (e.g. mongodb://user:pass@localhost:27017).
- **db**: Name of the database.
- **collection**: Name of the collection.

**out**: Each of the following sections is used only if the corresponding key in the pipeline 'out' field is True. Otherwise it is ignored.
- *json*:
  - **out_path**: Path where the generated json will be saved.
  - **json_doc_field**: Name of the outer field containing the enriched-transformed articles-relations (so far mostly documents or relations, depending on whether we process 'json' (articles) or 'edges' (relations)). It is best to use the same name as the outer field of the 'json' or 'edges' input accordingly.
  - **json_text_field**: For 'articles' or input that has text, the name of the field to save the text to (e.g. text).
  - **json_id_field**: For 'articles' or a collection of documents, the name of the field to save their id to (e.g. id).
  - **json_label_field**: For 'articles' or a collection of documents, the name of the field to save their label to (e.g. title).
  - **sent_prefix**: For 'articles' or input that has text, the prefix to be used in the sentence-id generation procedure (e.g. abstract/fullbody).
- *csv*:
  - **out_path**: Path where the nodes and edges .csv files will be saved.
- *neo4j*:
  - **out_path**: Used only for printing where the save will be performed (e.g. localhost:7474). To configure access to neo4j, change the variables in the **neo4j** section, not this one!
- *mongo*:
  - **mongodb**: Full URI of the database to write the enriched json to. If a user/pass is required, pass it here like *mongodb://user:pass@host:port*.
  - **db**: The name of the database.
  - **collection**: The name of the collection.
  - **out_path**: For printing purposes only. A label that identifies the collection to which the output is written.
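
As a sketch of how the True/False flags under the pipeline 'out' field relate to the sections above, the snippet below collects the configuration of every enabled output. It assumes a nested dict with the key names from this overview; `active_outputs` and all the values (paths, database names) are hypothetical and not taken from the module.

```python
def active_outputs(settings):
    """Map every enabled output to the section that configures it."""
    selected = {}
    for name, enabled in settings["pipeline"]["out"].items():
        if not enabled:
            continue
        # Most outputs are configured under 'out'; mongo_sentences has
        # its own top-level section in the overview above.
        section = settings["out"].get(name)
        if section is None:
            section = settings.get(name, {})
        selected[name] = section
    return selected


settings = {
    "pipeline": {"out": {"json": True, "csv": False, "neo4j": False,
                         "mongo_sentences": True, "mongo": False}},
    "mongo_sentences": {"uri": "mongodb://localhost:27017",
                        "db": "sentences_db", "collection": "sentences"},
    "out": {
        "json": {"out_path": "enriched.json", "json_doc_field": "documents",
                 "json_text_field": "text", "json_id_field": "id",
                 "json_label_field": "title", "sent_prefix": "abstract"},
        "csv": {"out_path": "csv_out/"},
    },
}

outputs = active_outputs(settings)
print(sorted(outputs))
```

Sections whose flag is False, such as *csv* here, are simply never looked at, which matches the "otherwise it is ignored" rule.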


#### !!!! CONFIGURE SETTINGS.YAML BEFORE RUNNING THE SCRIPT !!!!
Finally, after configuring the settings to match your needs, simply run:

```
python test.py
```

## Tests

Currently no tests are supported.

## Questions/Errors

Bougiatiotis Konstantinos, NCSR ‘DEMOKRITOS’ E-mail: bogas.ko@gmail.com