Diff of /README.md [000000] .. [944daf]

--- a
+++ b/README.md
@@ -0,0 +1,147 @@
+# Medknow
+This project is a submodule designed for building a disease-specific knowledge base, but it can also be used as a standalone module in other projects. It focuses on extracting **biomedical entities** and the **relations** between them from free natural-language text, and on structuring them in a **graph** database in a way that makes extracting new knowledge and inferring hidden relations easier.
+
+This project has been designed with modularity in mind, allowing new extractors to be implemented and integrated quickly, such as [ReVerb](http://reverb.cs.washington.edu/) for relation extraction and [MetaMap](https://metamap.nlm.nih.gov/) for concept extraction. Those two are currently being developed alongside the already implemented extractor based on [SemRep](https://semrep.nlm.nih.gov/).
+
+Currently, the main features of this project are (some still in progress):
+* Different kinds of input sources: free text, already extracted relations, concepts etc.
+* A variety of knowledge extractors working in a pipeline: **SemRep**, **MetaMap**, **ReVerb**
+* Multiple persistence options: saving enriched documents to file, saving entities and relations to .csv, or populating **Neo4j**. **MongoDB** can also be used for sentence fetching and as an alternative to saving .json files.
+
+## Getting Started
+These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
+
+### Knowledge Extraction
+This project is built around concept and relation extraction from text using [SemRep](https://semrep.nlm.nih.gov/). Follow the instructions on their website to set up a copy on your local machine.
+**Note**: You will have to also install **MetaMap** for SemRep to work.
+
+### Neo4j
+If you'd like to persist results in Neo4j, you will have to pre-install it on your local machine. More details are available on their [website](https://neo4j.com/).
+
+### MongoDB
+If you'd like to keep track of the sentences as found in each document/article so that you can retrieve them later, you will have to pre-install MongoDB on your local machine. More details are available on their [website](https://www.mongodb.com/).
+
+### YAJL
+In order to deal with big .json files as streams, this system uses the [ijson](https://pypi.python.org/pypi/ijson/) module and, more specifically, the Python bindings for the [yajl](http://lloyd.github.io/yajl/) JSON parser, so you should install yajl as well.
+
+### Python Modules
+The needed modules are listed in *requirements.txt*. You can either install them individually or, better yet, use [pip](https://pip.pypa.io/en/stable/) to install them all at once by executing:
+```
+pip install -r requirements.txt
+```
+after cloning/downloading the project folder.
+(You may be asked for admin/sudo rights.)
+
+### Usage
+The functionalities offered by this module are wrapped in a **Pipeline**, broken down into three phases. These are:
+1. *Input*: What type of input do we expect to deal with? Where to read it from? Specific fields in .csv or .json files that we must take into account etc.
+2. *Transformations*: What type of transformations to apply to the input provided? Enrich the input document with concepts and relations using SemRep? MetaMap + ReVerb? Transform an edge-list between already existing entities into the correct structure for populating the Neo4j db?
+3. *Output*: What to do with the results? Save enriched file in .json? Output .csv files for use by the neo4j import-tool? Directly create/update the Neo4j db? Also, save sentences processed in MongoDB for later usage?
+
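As a rough illustration of the three phases above, the following minimal Python sketch chains an input reader, a list of transformations and a list of output writers. All names here (`run_pipeline`, `read_input`, `add_length`, `saved`) are hypothetical and are not taken from this module's code; the toy transformation merely stands in for a real extractor such as SemRep.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


def run_pipeline(
    read_input: Callable[[], List[Document]],
    transforms: List[Callable[[Document], Document]],
    writers: List[Callable[[List[Document]], None]],
) -> List[Document]:
    """Run the input, transformation and output phases in order."""
    docs = read_input()                       # 1. Input
    for transform in transforms:              # 2. Transformations
        docs = [transform(doc) for doc in docs]
    for write in writers:                     # 3. Output
        write(docs)
    return docs


# Toy stand-ins for a real reader, extractor and writer (assumptions).
def read_input() -> List[Document]:
    return [{"id": "1", "text": "Aspirin treats headache."}]


def add_length(doc: Document) -> Document:
    # Returns an enriched copy instead of mutating the input document.
    return {**doc, "n_chars": len(str(doc["text"]))}


saved: List[Document] = []
result = run_pipeline(read_input, [add_length], [saved.extend])
```

Keeping each phase as a plain callable is one way to let new extractors or writers be plugged in without touching the driver function.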
+All of these choices are parameterized in **settings.yaml**, following the .yaml structure. The outline of the available parameters is the following:
+- **pipeline**: Mainly *True/False* values for specific keys regarding the previously presented phases, denoting which functions will be executed.
+- **load**: Variables regarding the paths of SemRep and the input files, as well as the key fields in .json and .csv files where text and other information is stored
+- **apis**: API keys used for specific services.
+- **neo4j**: Details regarding the connection to an existing Neo4j instance
+- **mongo_sentences**: Details regarding the connection to a mongodb collection (if it does not exist, it will be created)
+- **cache_path**: Path to .json file which is used as a long-term cache when fetching mappings of entities to CUIs (e.g. DRUGBANK-ID -> UMLS_CUI)
+- **out**: Variables and paths regarding the generated results.
+
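The **cache_path** entry above describes a .json file used as a long-term cache for entity-to-CUI mappings. A minimal sketch of that idea, using only the standard library, might look as follows. The class name `JsonFileCache`, the `source:id` key format and the `fake_lookup` function are assumptions made for illustration, and the identifier and CUI values are made-up placeholders.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple


class JsonFileCache:
    """Tiny persistent key-value cache backed by a single .json file."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.data: Dict[str, str] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.data = json.load(handle)

    def get(self, source: str, ident: str,
            fetch: Callable[[str, str], str]) -> str:
        """Return the cached mapping, calling `fetch` only on a cache miss."""
        key = f"{source}:{ident}"
        if key not in self.data:
            self.data[key] = fetch(source, ident)
            # Persist immediately so the mapping survives across runs.
            with open(self.path, "w", encoding="utf-8") as handle:
                json.dump(self.data, handle)
        return self.data[key]


calls: List[Tuple[str, str]] = []


def fake_lookup(source: str, ident: str) -> str:
    # Stand-in for a remote mapping service; the returned CUI is made up.
    calls.append((source, ident))
    return "C0000000"


cache_file = os.path.join(tempfile.mkdtemp(), "cache.json")
first = JsonFileCache(cache_file).get("DRUGBANK", "DB00000", fake_lookup)
second = JsonFileCache(cache_file).get("DRUGBANK", "DB00000", fake_lookup)
```

The second lookup creates a fresh cache object, reads the file written by the first one and never calls `fake_lookup`.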
+Details on each variable are found in the settings.yaml. An overview of the available keys-values is presented here:
+**pipeline**:
+- *inp*: What kind of input do we expect. Will specify what part of the 'load' section to read from. Currently supporting the following values:
+    - **json**: Used for json from the harvester and enriched json generated from this module.
+    - **edges**: A field containing edges-relations is expected to be found in the file. Used for DOID, DRUGBANK, MESH etc. relations.
+    - **med_rec**: Intended for medical records, but in practice it handles any delimited file.
+    - **mongo**: Used to read a collection of documents from mongo instead of a json file.
+    - **delete**: In case we want to delete edges from a specific resource. The unwanted edges are denoted by the **resource** value in the *neo4j* field.
+- *trans*: What kind of transformations-extractions to do:
+    - **metamap**: True/False. If we want to extract entities using metamap. TODO: ! MERGE Entities and Threshold !
+    - **reverb**: True/False. If we want to extract relations using reverb. TODO: ! Map Entities to UMLS CONCEPTS IN SENTENCE!
+    - **semrep**: True/False. The main functionality. If we want to use SEMREP to extract relations and entities from text. !! It is meaningful only for json and med_rec, as edges are not supposed to have text field. !!
+    - **get_concepts_from_edges**: True/False. ! This is for edges file only ! If we want some kind of transformation to be applied to the entities found as subjects-objects in the edges file (e.g. fetch concepts from cuis, from DRUGBANK unique ids etc.)
+- *out*: Where to write the output
+    - **json**: True/False. Save the intermediate json generated after all the transformations/extractions are done, before updating the database.
+    - **csv**: True/False. Create the corresponding node and edge files, to be used by the command-line neo4j import-tool. Not very useful for the time being.
+    - **neo4j**: True/False. Create/Update the neo4j graph with the entities and relations found in the json generated from the trans steps or the **pre-enriched** json of 'json' or 'edges' input given at the start.
+    - **mongo_sentences**: True/False. If you want to index the processed sentences in a mongo collection.
+    - **mongo**: True/False. If you want to save the enriched json file in mongo.
+
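The constraints stated in the **pipeline** keys above (the allowed *inp* values, semrep needing a text field, get_concepts_from_edges applying to edges only) could be checked before a run with a small helper such as the one sketched here. The function `check_pipeline` is hypothetical and not part of this module; it assumes the pipeline section has already been loaded into a plain Python dict.

```python
from typing import Any, Dict, List

# Input kinds listed in the 'inp' description above.
VALID_INPUTS = {"json", "edges", "med_rec", "mongo", "delete"}


def check_pipeline(pipeline: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: List[str] = []
    inp = pipeline.get("inp")
    trans = pipeline.get("trans") or {}
    if inp not in VALID_INPUTS:
        problems.append(f"unknown inp value: {inp!r}")
    if trans.get("semrep") and inp == "edges":
        problems.append("semrep needs a text field, which edges input lacks")
    if trans.get("get_concepts_from_edges") and inp != "edges":
        problems.append("get_concepts_from_edges applies to edges input only")
    return problems


ok = check_pipeline({"inp": "json", "trans": {"semrep": True}})
bad = check_pipeline({"inp": "edges", "trans": {"semrep": True}})
unknown = check_pipeline({"inp": "xml", "trans": {}})
```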
+**load**:
+  - *path*:
+    - **metamap**: Path to metamap binary.
+    - **reverb**: Path to reverb binary.
+    - **semrep**: Path to semrep binary.
+  - *med_rec*: If the value in pipeline 'inp' is not **med_rec** the following values are irrelevant for the task at hand.
+    - **inp_path**: Path to delimited file.
+    - **textfield**: Name of the column where the text is located (e.g. MedicalDiagnosis).
+    - **sep**: Delimiter value (e.g. \t).
+    - **idfield**: Name of the column where the ids are found (e.g. patient_id).
+  - *json*: If the value in pipeline 'inp' is not **json** the following values are irrelevant for the task at hand.
+    - **inp_path**: Path to json file.
+    - **docfield**: Outer field of the json file where the documents/articles are located (e.g. documents).
+    - **textfield**: Name of the field to read text from (e.g. abstractText).
+    - **idfield**: Name of the field where the ids are found (e.g. pmid).
+    - **labelfield**: Field where the label of the document is situated (e.g. title).
+- *edges*: If the value in pipeline 'inp' is not **edges** the following values are irrelevant for the task at hand.
+    - **inp_path**: Path to edges file.
+    - **edge_field**: Name of the outer field where the relations-edges are found (e.g. relations).
+    - **sub_type**: Type of the subject in the relations. Currently supporting Entity, Article and any new type of nodes.
+    - **obj_type**: Type of the object in the relations. Currently supporting Entity, Article and any new type of nodes.
+    - **sub_source**: What type of source is needed to transform the subject entity. Currently supporting: UMLS when the entities are cuis, [MSH, DRUGBANK, .. and the rest from the umls rest mapping used accordingly], [Article, Text and None when no transformation is needed on the subject entity given].
+    - **obj_source**: What type of source is needed to transform the object entity. Currently supporting: UMLS when the entities are cuis, [MSH, DRUGBANK, .. and the rest from the umls rest mapping used accordingly], [Article, Text and None when no transformation is needed on the object entity given].
+- *mongo*:  If the value in pipeline 'inp' is not **mongo** the following values are irrelevant for the task at hand.
+    - **mongodb**: Full DB uri for reading the documents. If a user/pass is required, pass it here like *https://user:pass@host:port*
+    - **db**: The name of the database.
+    - **collection**: The name of the collection.
+    - **docfield**: Outer field of the json file where the documents/articles are located (e.g. documents) when loaded and passed to the pipeline
+    - **inp_path**: For printing purposes only. Something to understand the collection from which we read the data
+
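To make the role of the *json* keys in the **load** section more concrete, here is a minimal sketch of how **docfield**, **idfield**, **labelfield** and **textfield** could be used to pull documents out of a json input. The function `iter_documents` and the sample record are assumptions for illustration; the field values are the examples given above.

```python
import json
from typing import Dict, Iterator, Tuple


def iter_documents(raw: str,
                   settings: Dict[str, str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, label, text) for every document under the outer docfield."""
    data = json.loads(raw)
    for doc in data[settings["docfield"]]:
        yield (
            str(doc[settings["idfield"]]),
            doc.get(settings["labelfield"], ""),   # label may be missing
            doc[settings["textfield"]],
        )


# Example values taken from the field descriptions above.
json_settings = {
    "docfield": "documents",
    "textfield": "abstractText",
    "idfield": "pmid",
    "labelfield": "title",
}

# Made-up sample input, only for demonstration.
sample = json.dumps(
    {"documents": [{"pmid": 1, "title": "Example", "abstractText": "Some text."}]}
)
docs = list(iter_documents(sample, json_settings))
```

For very large files this in-memory `json.loads` call would be replaced by a streaming parser, which is the reason ijson is listed among the requirements.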
+**apis**: API keys used when calling different services
+  - **biont**: Bioportal api for fetching uri info of a concept. Not currently in use.
+  - **umls**: UMLS REST api key. Useful only when the 'inp' in pipeline is **edges** and **get_concepts_from_edges** is True.
+
+**neo4j**: Variables for connection to an existing and running neo4j graph. If **neo4j** is False in the pipeline the following don't matter.
+  - **host**: Database url (e.g. localhost).
+  - **port**: Port number (e.g. 7474).
+  - **user**: Username (e.g. neo4j).
+  - **password**: Password (e.g. admin).
+  - **resource**: The resource from which the edges have been generated. This is also used together with the *delete* value of *inp* in the pipeline, when we want to delete these kinds of edges for provenance reasons.
+ 
+**mongo_sentences**: Variables for connection to an existing and running mongodb. If **mongo_sentences** is False in the pipeline the following don't matter. Also, if the wanted db-collection does not exist, it will be created.
+  - **uri**: Full uri needed to connect to the db (e.g. mongodb://user:pass@localhost:27017).
+  - **db**: Name of the database. 
+  - **collection**: Name of the collection.
+
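When the **uri** contains a username and password, special characters such as `@` or `/` have to be percent-encoded in a MongoDB connection string. A minimal sketch of building such a uri with the standard library follows; the helper name `build_mongo_uri` and the credentials are hypothetical.

```python
from typing import Optional
from urllib.parse import quote_plus


def build_mongo_uri(host: str, port: int,
                    user: Optional[str] = None,
                    password: Optional[str] = None) -> str:
    """Build a mongodb:// uri, percent-encoding the credentials if given."""
    if user is None:
        return f"mongodb://{host}:{port}"
    credentials = f"{quote_plus(user)}:{quote_plus(password or '')}"
    return f"mongodb://{credentials}@{host}:{port}"


# Made-up credentials; '@' and '/' in the password must be escaped.
uri = build_mongo_uri("localhost", 27017, "user", "p@ss/word")
anonymous_uri = build_mongo_uri("localhost", 27017)
```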
+**out**: Each of the following sections is used only if the corresponding key in the pipeline 'out' field is True. Otherwise, it is ignored.
+- *json*:
+    - **out_path**: path where the generated json will be saved.
+    - **json_doc_field**: Name of the outer field containing the enriched-transformed articles-relations (so far mostly documents or relations, according to whether we have 'json' (articles) or 'edges' (relations) to process). It is better to use the same name as the outer field of the 'json' or 'edges' input accordingly.
+    - **json_text_field**: For 'articles' or input that has text, the name of the field to save the text to (e.g. text).
+    - **json_id_field**: For 'articles' or collection of documents, the name of the field to save their id (e.g. id).
+    - **json_label_field**: For 'articles' or collection of documents, the name of the field to save their label (e.g. title).
+    - **sent_prefix**: For 'articles' or input that has text, the prefix to be used in the sentence-id generation procedure (e.g. abstract/fullbody).
+- *csv*:
+    - **out_path**: path where the nodes and edges .csvs will be saved. 
+- *neo4j*:
+    - **out_path**: This is used only for printing where the save will be performed. Change the variables in the **neo4j** section if you want to configure access to neo4j, not this! (e.g. localhost:7474)
+- *mongo*:
+    - **mongodb**: Full DB uri for writing the enriched json file. If a user/pass is required, pass it here like *https://user:pass@host:port*
+    - **db**: The name of the database.
+    - **collection**: The name of the collection.
+    - **out_path**: For printing purposes only. Something to understand the collection in which we write the output.
+
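For the *csv* output, the Neo4j command-line import tool expects header fields such as `:ID`, `:LABEL`, `:START_ID`, `:END_ID` and `:TYPE`. The sketch below writes one node file and one edge file in that style to in-memory buffers. The column layout, the function names and the sample records are assumptions and do not necessarily match the files this module produces.

```python
import csv
import io
from typing import Dict, Iterable, TextIO


def write_nodes(nodes: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write a node file with the header fields used by the import tool."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id:ID", "label", ":LABEL"])
    for node in nodes:
        writer.writerow([node["id"], node["label"], node["type"]])


def write_edges(edges: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write a relationship file linking node ids by relation type."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for edge in edges:
        writer.writerow([edge["sub"], edge["obj"], edge["rel"]])


# Made-up sample records; StringIO stands in for files under out_path.
nodes_csv, edges_csv = io.StringIO(), io.StringIO()
write_nodes(
    [
        {"id": "n1", "label": "first node", "type": "Entity"},
        {"id": "n2", "label": "second node", "type": "Entity"},
    ],
    nodes_csv,
)
write_edges([{"sub": "n1", "obj": "n2", "rel": "TREATS"}], edges_csv)
```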
+
+#### !!!! CONFIGURE SETTINGS.YAML BEFORE RUNNING THE SCRIPT !!!!
+Finally, after configuring it to match your needs, simply run:
+
+```
+python test.py
+```
+## Tests
+
+Currently no tests are supported.
+
+## Questions/Errors
+
+Bougiatiotis Konstantinos, NCSR ‘DEMOKRITOS’ E-mail: bogas.ko@gmail.com