Diff of /README.md [000000] .. [62e09a]

--- a
+++ b/README.md
@@ -0,0 +1,132 @@
+# gsec (Generalized Sequencing Classifier)
+### USC Center for Artificial Intelligence in Society (CAIS++) and Smith Computational Genomics Lab Collaboration.
+### *Under Construction*
+
+# Requirements:
+1. Entrez utilities (esearch, etc.)
+2. SRA Toolkit (`fastq-dump` must be on your `PATH`; it should not report "not found")
+3. Python 3
+4. After pulling the repo, enter the `gsec/utils` folder in the install location
+and run `make` to compile the k-mer counting code.
+5. Clone the repo, enter the directory, and run `pip install .`
+
+# Highly recommended:
+Disable SRA download caching: follow the instructions at
+https://standage.github.io/that-darn-cache-configuring-the-sra-toolkit.html or
+run `vdb-config --interactive`, go to the cache tab, and uncheck "enable local
+file-caching" if it is checked.
+
+# Streaming and piping functions:
+Compile with `g++ stream_kmers.cpp -o stream_kmers`.
+
+On its own, `stream_kmers` takes two parameters: the first is the k value
+and the second is the maximum number of reads to count.
+The next planned feature is automatic stopping once the k-mer count
+frequencies converge.
+
+In a pipeline, usage looks like this: `fastq-dump --skip-technical \
+--split-spot -Z SRR5149059 | ./stream_kmers 6 100 > out.txt`, which counts
+6-mers with a limit of 100 reads. The specified SRR is downloaded but never
+saved to disk; only the counts file is saved. If you already have an `.sra`
+file, you can run the pipeline on that as well. Streaming gives a speed boost
+in both scenarios.
+
+# Usage
+`gsec train --pos-strat bisulfite-seq --pos-org 'homo sapiens' --neg-strat \
+wgs --neg-org 'homo sapiens' -k 6 -l 10000 -n 100` downloads and builds a 
+classifier to distinguish the positive set from the negative set based on 
+k-mer counts of up to k = 6. The limit flag `-l` means that only the first
+10,000 reads from each FASTQ file are processed, and `-n` specifies the number
+of files. Keep in mind that some of the attempted downloads may fail.
+The SRR accession numbers of failed downloads are recorded in `errors.txt`,
+which is generated in your current directory. In addition, a
+`model_summary.txt` is generated in your working directory. The counts files
+and models are stored inside the gsec install location.
+The final model is saved as a `.pkl` file, which can be loaded back into
+Python.
+
+A unique ID is assigned to each run of `gsec train`, starting with `1`. For
+run `1`, the positive and negative data are saved under the `1/positive` and
+`1/negative` folders respectively, and the model is saved as `1.pkl`.
+
+
+`python gsec-train.py --pos-strat --pos-org --neg-strat --neg-org -k -l -n`
+- pos-strat: strategy for positive set
+- pos-org: organism for positive set
+- neg-strat: strategy for negative set
+- neg-org: organism for negative set
+- k: maximum k-mer size to count
+- l: maximum number of reads to use from each file
+- n: number of SRRs to count for each target (if fewer files match the
+query, all matching files are downloaded)
+
+# Project Structure
+```bash
+.
+├── errors.txt
+├── gsec
+│   ├── gsec_apply.py
+│   ├── gsec.py
+│   ├── gsec_train.py
+│   ├── __init__.py
+│   ├── model_building
+│   │   ├── create_model.py
+│   │   ├── create_model_utils.py
+│   │   ├── data
+│   │   │   └── 1
+│   │   │       ├── negative
+│   │   │       │   ├── ERR3523441.txt
+│   │   │       │   ├── ERR3523442.txt
+│   │   │       │   ├── ERR3523446.txt
+│   │   │       │   ├── SRR10000063.txt
+│   │   │       │   ├── SRR10000103.txt
+│   │   │       │   ├── SRR10000110.txt
+│   │   │       │   └── ...
+│   │   │       └── positive
+│   │   │           ├── ERR3445822.txt
+│   │   │           ├── ERR3674488.txt
+│   │   │           ├── ERR3674489.txt
+│   │   │           ├── ERR3674493.txt
+│   │   │           ├── SRR11348073.txt
+│   │   │           ├── ...
+│   │   │           ├── SRR11494766.txt
+│   │   │           └── SRR8836050.txt
+│   │   ├── __init__.py
+│   │   ├── ModelRunner.py
+│   │   └── __pycache__
+│   │       ├── create_model.cpython-37.pyc
+│   │       ├── create_model_utils.cpython-37.pyc
+│   │       ├── __init__.cpython-37.pyc
+│   │       └── ModelRunner.cpython-37.pyc
+│   ├── models
+│   │   └── 1.pkl
+│   ├── model_test.py  # for debugging
+│   ├── __pycache__
+│   │   ├── gsec_apply.cpython-37.pyc
+│   │   ├── gsec.cpython-37.pyc
+│   │   ├── gsec_train.cpython-37.pyc
+│   │   └── __init__.cpython-37.pyc
+│   └── utils
+│       ├── countkmers.cpp
+│       ├── csv_utils.py
+│       ├── __init__.py
+│       ├── Makefile
+│       ├── __pycache__
+│       │   ├── csv_utils.cpython-37.pyc
+│       │   └── __init__.cpython-37.pyc
+│       ├── stream_kmers
+│       └── stream_kmers.cpp
+├── gsec.egg-info
+│   ├── dependency_links.txt
+│   ├── entry_points.txt
+│   ├── PKG-INFO
+│   ├── requires.txt
+│   ├── SOURCES.txt
+│   └── top_level.txt
+├── MANIFEST.in
+├── models.csv
+├── model_summary.txt
+├── README.md
+└── setup.py
+```
+
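The counting step that the README describes for `stream_kmers` (a k value plus a cap on the number of reads) can be sketched in Python as follows. This is an illustration only, not the repository's C++ implementation: the function name `count_kmers`, the assumption of 4-line FASTQ records, and the choice to skip k-mers containing bases other than A, C, G, T are all hypothetical, and the real tool's output format is not shown in this diff.

```python
from collections import Counter
from itertools import islice


def count_kmers(fastq_lines, k, max_reads):
    """Count k-mers in the first `max_reads` reads of a FASTQ line stream.

    Hypothetical helper: assumes 4-line FASTQ records (header, sequence,
    separator, qualities), as produced by `fastq-dump -Z`.
    """
    counts = Counter()
    # The sequence is the second line of every 4-line record.
    sequences = islice(fastq_lines, 1, None, 4)
    # Stop after `max_reads` reads, mirroring the read limit parameter.
    for seq in islice(sequences, max_reads):
        seq = seq.strip().upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: ignore k-mers with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts
```

Because the function accepts any iterable of lines, it could be fed `sys.stdin` at the end of a `fastq-dump ... -Z` pipe, so that reads are counted as they arrive and nothing but the counts needs to be kept.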