Diff of /README.md [000000] .. [944daf]

a b/README.md
# Medknow
This project is a submodule designed for building a disease-specific knowledge base, but it can also be used as a standalone module in other projects. It focuses on extracting **biomedical entities** and the **relations** between them from free (natural-language) text and structuring them in a **graph** database, which makes it easier to extract new knowledge and infer hidden relations.

This project has been designed with modularity in mind, allowing the implementation and fast integration of new extractors, such as [ReVerb](http://reverb.cs.washington.edu/) for relation extraction and [MetaMap](https://metamap.nlm.nih.gov/) for concept extraction. Those two are currently under development alongside the already implemented extractor based on [SemRep](https://semrep.nlm.nih.gov/).

Currently, the main features of this project are (some still in progress):
* Different kinds of input sources: free text, already extracted relations, concepts, etc.
* A variety of knowledge extractors working in a pipeline: **SemRep**, **MetaMap**, **ReVerb**
* Multiple persistence options: saving enriched documents to file, exporting entities and relations to .csv, or populating **Neo4j**. **MongoDB** can also be used for sentence fetching and as an alternative to saving .json files.

## Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Knowledge Extraction
This project is built around concept and relation extraction from text using [SemRep](https://semrep.nlm.nih.gov/). Follow the instructions on its website to set up a copy on your local machine.
**Note**: You will also have to install **MetaMap** for SemRep to work.

### Neo4j
If you'd like to persist results in Neo4j, you will have to install it on your local machine first. More details are available on their [website](https://neo4j.com/).

### MongoDB
If you'd like to keep track of the sentences found in each document/article so that you can retrieve them later, you will have to install MongoDB on your local machine first. More details are available on their [website](https://www.mongodb.com/).

### YAJL
To process big .json files as streams, this system uses the [ijson](https://pypi.python.org/pypi/ijson/) module, and more specifically its Python bindings for the [yajl](http://lloyd.github.io/yajl/) JSON parser, so you should install yajl as well.
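As a rough illustration of why streaming helps, the sketch below iterates over the documents of a large .json file one at a time instead of loading the whole file into memory. It is not the project's own code: the `stream_documents` helper and the outer `documents` field are assumptions made for the example, and it falls back to the standard `json` module when ijson is not installed.

```python
import json

try:
    import ijson  # streaming parser; picks a yajl-based backend when one is available
except ImportError:
    ijson = None  # fall back to loading the whole file at once


def stream_documents(path, docfield="documents"):
    """Yield the items of the top-level `docfield` array one by one."""
    if ijson is not None:
        with open(path, "rb") as handle:
            # "<field>.item" addresses each element of the array stored in <field>
            for doc in ijson.items(handle, docfield + ".item"):
                yield doc
    else:
        with open(path, "r", encoding="utf-8") as handle:
            for doc in json.load(handle)[docfield]:
                yield doc
```

With ijson available, only one document is held in memory at a time, which is the point of using a streaming parser for big input files.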

### Python Modules
The required modules are listed in *requirements.txt*. You can either install them individually or, better yet, use [pip](https://pip.pypa.io/en/stable/) to install them all at once by executing:
```
pip install -r requirements.txt
```
after cloning/downloading the project folder (you may be asked for admin/sudo rights).

### Usage
The functionalities offered by this module are wrapped in a **Pipeline**, broken down into three phases. These are:
1. *Input*: What type of input do we expect to deal with? Where should it be read from? Which specific fields in .csv or .json files must be taken into account?
2. *Transformations*: What type of transformations should be applied to the provided input? Enrich the input document with concepts and relations using SemRep? MetaMap + ReVerb? Transform an edge-list between already existing entities into the correct structure for populating the Neo4j db?
3. *Output*: What to do with the results? Save the enriched file in .json? Output .csv files for use by the neo4j import-tool? Directly create/update the Neo4j db? Also save the processed sentences in MongoDB for later usage?
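The three phases above can be pictured as three functions chained together. The following is only a minimal sketch of that idea, not the module's actual implementation: every function name is hypothetical, the input document is hard-coded, and the "extractors" merely record which ones were enabled.

```python
def read_input(settings):
    """Input phase: return a list of documents (hard-coded for the sketch)."""
    return [{"id": "doc1", "text": "Aspirin treats headache."}]


def transform(docs, settings):
    """Transformation phase: note which extractors are switched on for each document."""
    enabled = [name for name, on in settings["trans"].items() if on]
    for doc in docs:
        doc["extractors_run"] = list(enabled)
    return docs


def write_output(docs, settings):
    """Output phase: hand the documents to every enabled output target."""
    targets = [name for name, on in settings["out"].items() if on]
    return {target: docs for target in targets}


def run_pipeline(settings):
    """Run the three phases in order."""
    docs = read_input(settings)
    docs = transform(docs, settings)
    return write_output(docs, settings)


# Illustrative settings; the key names follow the 'pipeline' section of settings.yaml.
settings = {
    "inp": "json",
    "trans": {"semrep": True, "metamap": False, "reverb": False},
    "out": {"json": True, "neo4j": False},
}
result = run_pipeline(settings)
```

Keeping each phase behind its own function is what makes it cheap to plug in a new extractor or output target without touching the other phases.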

All of these choices are parameterized in **settings.yaml**, following the .yaml structure. The available parameters are outlined below:
- **pipeline**: Mainly *True/False* values for specific keys regarding the previously presented phases, denoting which functions will be executed.
- **load**: Variables regarding the paths of SemRep and the input files, as well as the key fields in .json and .csv files where text and other information is stored.
- **apis**: API keys used for specific services.
- **neo4j**: Details regarding the connection to an existing Neo4j instance.
- **mongo_sentences**: Details regarding the connection to a mongodb collection (if it does not exist, it will be created).
- **cache_path**: Path to a .json file which is used as a long-term cache when fetching mappings of entities to CUIs (e.g. DRUGBANK-ID -> UMLS_CUI).
- **out**: Variables and paths regarding the generated results.
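To make the role of **cache_path** more concrete, here is a minimal sketch of a .json-backed cache for entity-to-CUI mappings. It is an assumption about how such a cache could work, not the module's code: the helper names are hypothetical, and `fetch` stands in for whatever remote lookup would be used on a cache miss.

```python
import json
import os


def load_cache(cache_path):
    """Read the cache file if it exists, otherwise start with an empty cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def save_cache(cache, cache_path):
    """Write the cache back to disk so it survives between runs."""
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle, indent=2)


def get_cui(entity_id, cache, fetch):
    """Return the CUI for `entity_id`, calling `fetch` only on a cache miss."""
    if entity_id not in cache:
        cache[entity_id] = fetch(entity_id)
    return cache[entity_id]
```

A typical run would call `load_cache` once at start-up, resolve ids through `get_cui`, and call `save_cache` at the end, so that repeated runs do not ask the remote service for the same mapping twice.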

Details on each variable can be found in settings.yaml itself. An overview of the available keys and values is presented here:

**pipeline**:
- *inp*: What kind of input we expect. Specifies which part of the 'load' section to read from. Currently supports the following values:
    - **json**: Used for json from the harvester and for enriched json generated by this module.
    - **edges**: A field containing edges-relations is expected to be found in the file. Used for DOID, DRUGBANK, MESH etc. relations.
    - **med_rec**: Intended for medical records, but in practice it handles any delimited file.
    - **mongo**: Used to read a collection of documents from mongo instead of a json file.
    - **delete**: Used when we want to delete the edges coming from a specific resource. The unwanted edges are denoted by the **resource** value in the *neo4j* field.
- *trans*: What kind of transformations-extractions to do:
    - **metamap**: True/False. Whether to extract entities using metamap. TODO: ! MERGE Entities and Threshold !
    - **reverb**: True/False. Whether to extract relations using reverb. TODO: ! Map Entities to UMLS CONCEPTS IN SENTENCE !
    - **semrep**: True/False. The main functionality. Whether to use SemRep to extract relations and entities from text. !! It is meaningful only for json and med_rec, as edges are not supposed to have a text field. !!
    - **get_concepts_from_edges**: True/False. ! This is for edges files only ! Whether some kind of transformation should be applied to the entities found as subjects-objects in the edges file (e.g. fetch concepts from cuis, from DRUGBANK unique ids etc.)
- *out*: Where to write the output:
    - **json**: True/False. Save the intermediate json generated after all the transformations/extractions are done, before updating the database.
    - **csv**: True/False. Create the corresponding node and edge files, to be used by the command-line neo4j import-tool. Not very useful for the time being.
    - **neo4j**: True/False. Create/update the neo4j graph with the entities and relations found in the json generated by the trans steps, or in the **pre-enriched** json of the 'json' or 'edges' input given at the start.
    - **mongo_sentences**: True/False. Whether to save and index the processed sentences in mongo.
    - **mongo**: True/False. Whether to save the enriched json file in mongo.
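The constraints stated above (SemRep needs a text field, **get_concepts_from_edges** applies to edges files only) can be checked before running anything. The sketch below mirrors the **pipeline** section as a plain Python dict and reports conflicting choices; the `check_pipeline` helper and its messages are hypothetical and not part of the module.

```python
# The 'pipeline' section of settings.yaml, written as a Python dict for illustration.
pipeline = {
    "inp": "edges",
    "trans": {"metamap": False, "reverb": False, "semrep": True,
              "get_concepts_from_edges": True},
    "out": {"json": True, "csv": False, "neo4j": True,
            "mongo_sentences": False, "mongo": False},
}

VALID_INPUTS = {"json", "edges", "med_rec", "mongo", "delete"}


def check_pipeline(pipeline):
    """Return a list of human-readable problems found in the pipeline section."""
    problems = []
    inp = pipeline["inp"]
    trans = pipeline["trans"]
    if inp not in VALID_INPUTS:
        problems.append("unknown inp value: %s" % inp)
    # Edges are not supposed to carry a text field, so SemRep has nothing to read.
    if trans.get("semrep") and inp == "edges":
        problems.append("semrep needs a text field, which edges input does not have")
    # Concept fetching is described as being for edges files only.
    if trans.get("get_concepts_from_edges") and inp != "edges":
        problems.append("get_concepts_from_edges applies to edges input only")
    return problems


problems = check_pipeline(pipeline)
```

Here `problems` contains one entry, because **semrep** is switched on for an **edges** input; setting it to False makes the list empty.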

**load**:
- *path*:
    - **metamap**: Path to the metamap binary.
    - **reverb**: Path to the reverb binary.
    - **semrep**: Path to the semrep binary.
- *med_rec*: The following values are relevant only when the pipeline 'inp' value is **med_rec**.
    - **inp_path**: Path to the delimited file.
    - **textfield**: Name of the column where the text is located (e.g. MedicalDiagnosis).
    - **sep**: Delimiter value (e.g. \t).
    - **idfield**: Name of the column where the ids are found (e.g. patient_id).
- *json*: The following values are relevant only when the pipeline 'inp' value is **json**.
    - **inp_path**: Path to the json file.
    - **docfield**: Outer field of the json file where the documents/articles are located (e.g. documents).
    - **textfield**: Name of the field to read text from (e.g. abstractText).
    - **idfield**: Name of the field where the ids are found (e.g. pmid).
    - **labelfield**: Field where the label of the document is found (e.g. title).
- *edges*: The following values are relevant only when the pipeline 'inp' value is **edges**.
    - **inp_path**: Path to the edges file.
    - **edge_field**: Name of the outer field where the relations-edges are found (e.g. relations).
    - **sub_type**: Type of the subject in the relations. Currently supporting Entity, Article and any new type of nodes.
    - **obj_type**: Type of the object in the relations. Currently supporting Entity, Article and any new type of nodes.
    - **sub_source**: What type of source is needed to transform the subject entity. Currently supporting: UMLS when the entities are cuis, [MSH, DRUGBANK, .. and the rest from the umls rest mapping used accordingly], [Article, Text and None when no transformation is needed on the subject entity given].
    - **obj_source**: What type of source is needed to transform the object entity. Currently supporting: UMLS when the entities are cuis, [MSH, DRUGBANK, .. and the rest from the umls rest mapping used accordingly], [Article, Text and None when no transformation is needed on the object entity given].
- *mongo*: The following values are relevant only when the pipeline 'inp' value is **mongo**.
    - **mongodb**: Full uri of the DB to read the documents from. If a user/pass is required, pass it here like *https://user:pass@host:port*
    - **db**: The name of the database.
    - **collection**: The name of the collection.
    - **docfield**: Outer field under which the documents/articles are placed (e.g. documents) when loaded and passed to the pipeline.
    - **inp_path**: For printing purposes only. A label that identifies the collection the data is read from.
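As an illustration of how the *med_rec* keys fit together, the sketch below reads a delimited file using a **sep**, **textfield** and **idfield**. It assumes the file has a header row naming the columns; the `read_delimited` helper and the two sample rows are made up for the example and do not come from the project.

```python
import csv
import io


def read_delimited(handle, textfield, idfield, sep="\t"):
    """Yield one {'id': ..., 'text': ...} dict per row of a delimited file with a header."""
    reader = csv.DictReader(handle, delimiter=sep)
    for row in reader:
        yield {"id": row[idfield], "text": row[textfield]}


# In-memory stand-in for the file at 'inp_path'; the content is purely illustrative.
sample = io.StringIO(
    "patient_id\tMedicalDiagnosis\n"
    "17\tType 2 diabetes mellitus\n"
    "42\tChronic migraine\n"
)
records = list(read_delimited(sample, textfield="MedicalDiagnosis", idfield="patient_id"))
```

Each resulting record carries only the id and the text, which is all a text-based extractor needs from a delimited input.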

**apis**: API keys used when calling different services.
  - **biont**: BioPortal api key for fetching uri info of a concept. Not currently in use.
  - **umls**: UMLS REST api key. Useful only when the 'inp' in pipeline is **edges** and **get_concepts_from_edges** is True.

**neo4j**: Variables for connecting to an existing and running neo4j graph. If **neo4j** is False in the pipeline, the following don't matter.
  - **host**: Database url (e.g. localhost).
  - **port**: Port number (e.g. 7474).
  - **user**: Username (e.g. neo4j).
  - **password**: Password (e.g. admin).
  - **resource**: The resource from which the edges have been generated. This is also used together with the *delete* value of 'inp' in the pipeline, when we want to delete this kind of edges for provenance reasons.

**mongo_sentences**: Variables for connecting to an existing and running mongodb. If **mongo_sentences** is False in the pipeline, the following don't matter. Also, if the wanted db-collection does not exist, it will be created.
  - **uri**: Full uri needed to connect to the db (e.g. mongodb://user:pass@localhost:27017).
  - **db**: Name of the database.
  - **collection**: Name of the collection.

**out**: Each of the following sections is used only if the corresponding key in the pipeline 'out' field is True. Otherwise, it doesn't matter.
- *json*:
    - **out_path**: Path where the generated json will be saved.
    - **json_doc_field**: Name of the outer field containing the enriched-transformed articles-relations (so far mostly documents or relations, according to whether we have 'json' (articles) or 'edges' (relations) to process). It is better to use the same name as the outer field of the 'json' or 'edges' input, accordingly.
    - **json_text_field**: For 'articles' or input that has text, the name of the field to save the text to (e.g. text).
    - **json_id_field**: For 'articles' or a collection of documents, the name of the field to save their id (e.g. id).
    - **json_label_field**: For 'articles' or a collection of documents, the name of the field to save their label (e.g. title).
    - **sent_prefix**: For 'articles' or input that has text, the prefix to be used in the sentence-id generation procedure (e.g. abstract/fullbody).
- *csv*:
    - **out_path**: Path where the nodes and edges .csv files will be saved.
- *neo4j*:
    - **out_path**: This is just for printing purposes, to report that the save will be performed in 'out_path'. Change the variables in the **neo4j** section if you want to configure access to neo4j, not this! (e.g. localhost:7474)
- *mongo*:
    - **mongodb**: Full uri of the DB for writing the enriched json file. If a user/pass is required, pass it here like *https://user:pass@host:port*
    - **db**: The name of the database.
    - **collection**: The name of the collection.
    - **out_path**: For printing purposes only. A label that identifies the collection the output is written to.
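To show how the *json* output keys relate to each other, the sketch below wraps a processed document using the configured field names and builds sentence ids from **sent_prefix**. Both helpers are hypothetical, and the `<doc id>_<prefix>_<n>` sentence-id layout in particular is an assumption: the module's real id format is not described here.

```python
def sentence_id(doc_id, sent_prefix, index):
    """Build a sentence id; the '<doc id>_<prefix>_<n>' layout is an assumption."""
    return "%s_%s_%d" % (doc_id, sent_prefix, index)


def build_output(docs, out_json):
    """Wrap documents using the field names from the 'out' -> 'json' settings."""
    wrapped = []
    for doc in docs:
        wrapped.append({
            out_json["json_id_field"]: doc["id"],
            out_json["json_text_field"]: doc["text"],
            out_json["json_label_field"]: doc["label"],
            "sentence_ids": [
                sentence_id(doc["id"], out_json["sent_prefix"], i)
                for i in range(len(doc["sentences"]))
            ],
        })
    return {out_json["json_doc_field"]: wrapped}


# Values taken from the examples given in the key descriptions.
out_json = {
    "json_doc_field": "documents",
    "json_text_field": "text",
    "json_id_field": "id",
    "json_label_field": "title",
    "sent_prefix": "abstract",
}
docs = [{"id": "doc1", "text": "First. Second.", "label": "A title",
         "sentences": ["First.", "Second."]}]
output = build_output(docs, out_json)
```

Using the same outer field name for input and output, as recommended for **json_doc_field**, lets the generated file be fed back into the pipeline as pre-enriched 'json' input.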


#### !!!! CONFIGURE SETTINGS.YAML BEFORE RUNNING THE SCRIPT !!!!
Finally, after configuring settings.yaml to match your needs, simply run:

```
python test.py
```
## Tests

Currently, no tests are provided.

## Questions/Errors

Bougiatiotis Konstantinos, NCSR ‘DEMOKRITOS’ E-mail: bogas.ko@gmail.com