# gsec (Generalized Sequencing Classifier)
### USC Center for Artificial Intelligence in Society (CAIS++) and Smith Computational Genomics Lab Collaboration.
### *Under Construction*

# Requirements:
1. Entrez utilities (esearch, etc.)
2. SRA toolkit (`fastq-dump` on the command line should not report "not found")
3. Python 3
4. After pulling the repo, enter the `gsec/utils` folder in the install location and type `make` to compile the k-mer counting function.
5. Clone the repo, enter the directory, and run `pip install .`

# Highly recommended:
Disable SRA download caching: follow the instructions at
https://standage.github.io/that-darn-cache-configuring-the-sra-toolkit.html, or
run `vdb-config --interactive`, go to the cache tab, and uncheck "enable local
file-caching" if it is still checked.

# Streaming and piping functions:
Usage: compile with `g++ stream_kmers.cpp -o stream_kmers`.

On its own, `stream_kmers` takes two parameters: the first is the k value and
the second is the maximum number of reads to count. The next planned feature
is automatic stopping once the k-mer count frequencies converge.

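To make the two parameters concrete, here is a minimal Python sketch of counting k-mers over the first reads of a FASTQ stream. It is an illustration only, not the `stream_kmers.cpp` implementation: the function name `count_kmers`, the 4-line FASTQ record assumption, and the choice to skip windows containing non-ACGT bases are all assumptions.

```python
from collections import Counter
from itertools import islice


def count_kmers(fastq_lines, k, max_reads):
    """Count k-mers in the first `max_reads` reads of a FASTQ stream.

    `fastq_lines` is any iterable of lines (a file object, sys.stdin, a list).
    Assumes standard 4-line FASTQ records (hypothetical simplification).
    """
    counts = Counter()
    # The sequence is the second line of every 4-line record.
    sequences = islice(fastq_lines, 1, None, 4)
    for seq in islice(sequences, max_reads):
        seq = seq.strip().upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: windows with ambiguous bases (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts
```

Calling `count_kmers(sys.stdin, 6, 100)` would mirror the `6 100` arguments used in this section, reading at most 100 reads from standard input.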
In a pipeline, usage looks like this:
`fastq-dump --skip-technical --split-spot -Z SRR5149059 | ./stream_kmers 6 100 > out.txt`
counts k-mers with k=6 and a limit of 100 reads. The specified SRR is streamed
from the download and never saved to disk; only the counts file is saved. If
you already have an `.sra` file, you can run the pipeline on that as well.
Streaming gives a speed boost in both scenarios.

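The same pipeline can be driven from Python with `subprocess`, which may be convenient when looping over many accessions. This is a sketch under assumptions: the helper names `pipeline_commands` and `stream_counts` are hypothetical, and it assumes `fastq-dump` is on the `PATH` and the compiled `stream_kmers` binary is in the current directory.

```python
import subprocess


def pipeline_commands(accession, k, max_reads, counter="./stream_kmers"):
    """Return the argv lists for the two stages of the pipeline."""
    dump = ["fastq-dump", "--skip-technical", "--split-spot", "-Z", accession]
    count = [counter, str(k), str(max_reads)]
    return dump, count


def stream_counts(accession, k, max_reads, out_path):
    """Pipe fastq-dump into stream_kmers and write the counts to out_path."""
    dump_cmd, count_cmd = pipeline_commands(accession, k, max_reads)
    with open(out_path, "w") as out:
        dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
        count = subprocess.Popen(count_cmd, stdin=dump.stdout, stdout=out)
        # Close our copy of the pipe so fastq-dump receives SIGPIPE
        # if the counter exits early (e.g. after reaching the read limit).
        dump.stdout.close()
        count.wait()
        dump.wait()
    return count.returncode
```

For example, `stream_counts("SRR5149059", 6, 100, "out.txt")` corresponds to the shell command above. Passing argv lists instead of a shell string avoids quoting problems with accession names.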
# Usage
`gsec train --pos-strat bisulfite-seq --pos-org 'homo sapiens' --neg-strat wgs --neg-org 'homo sapiens' -k 6 -l 10000 -n 100`
downloads the data and builds a classifier to distinguish the positive set
from the negative set based on k-mer counts for k up to 6. The limit flag `-l`
means that only the first 10k reads from each fastq are processed, and `-n`
specifies the number of files. Keep in mind that some of the attempted
downloads fail. The SRR identification numbers of the failed downloads are
recorded in `errors.txt`, which is generated in your current directory. In
addition, a `model_summary.txt` is generated in your working directory. The
counts files and models are stored inside the gsec install location
(`model_building/data` and `models` in the project structure below).
The final model is saved as a `.pkl` file, which can be loaded back into
Python.

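Loading the saved model back into Python can be done with the standard `pickle` module. The sketch below is an assumption about usage rather than part of gsec: `load_model` is a hypothetical helper, and the type of the unpickled object depends on how gsec built the model.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled model object from `path`.

    Only unpickle files you trust: pickle can execute arbitrary code
    while loading.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

For instance, `model = load_model("gsec/models/1.pkl")` would load the model from the location shown in the project structure. The libraries that were used to train the model generally need to be installed for unpickling to succeed.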
A unique ID is assigned to each run of `gsec train`, starting with `1`. For
run `1`, the data is saved under the `1/positive` and `1/negative` folders
respectively, and the model is saved as `1.pkl`.

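The per-run layout can be expressed as a small path helper. This is a sketch that assumes the directory names shown in the project structure (`model_building/data/<id>` and `models/<id>.pkl` under the `gsec` package folder); `run_paths` is a hypothetical name, not a gsec function.

```python
from pathlib import Path


def run_paths(gsec_root, run_id):
    """Return the data folders and model file for one `gsec train` run.

    Assumes the layout shown in the project structure section.
    """
    root = Path(gsec_root)
    data_dir = root / "model_building" / "data" / str(run_id)
    return {
        "positive": data_dir / "positive",
        "negative": data_dir / "negative",
        "model": root / "models" / f"{run_id}.pkl",
    }
```

With this, `run_paths("gsec", 1)["model"]` points at `gsec/models/1.pkl`, and the `positive` and `negative` entries point at the folders holding the per-SRR counts files.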
`python gsec-train.py --pos-strat --pos-org --neg-strat --neg-org -k -l -n`
- pos-strat: strategy for positive set
- pos-org: organism for positive set
- neg-strat: strategy for negative set
- neg-org: organism for negative set
- k: maximum size of k-mer to count
- l: limit on the number of reads to use
- n: number of SRRs to count for each target (if fewer files match the query, all matching files are downloaded)

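The flags listed above map naturally onto an `argparse` parser. The following is an illustrative sketch, not the parser gsec actually uses: marking every flag as required and typing `-k`, `-l`, and `-n` as integers are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented `gsec train` flags."""
    parser = argparse.ArgumentParser(prog="gsec train")
    parser.add_argument("--pos-strat", required=True,
                        help="strategy for positive set")
    parser.add_argument("--pos-org", required=True,
                        help="organism for positive set")
    parser.add_argument("--neg-strat", required=True,
                        help="strategy for negative set")
    parser.add_argument("--neg-org", required=True,
                        help="organism for negative set")
    # Assumption: the numeric flags are integers and all are mandatory.
    parser.add_argument("-k", type=int, required=True,
                        help="maximum size of k-mer to count")
    parser.add_argument("-l", type=int, required=True,
                        help="limit on the number of reads to use")
    parser.add_argument("-n", type=int, required=True,
                        help="number of SRRs to count for each target")
    return parser
```

Parsing the example command's arguments yields attributes such as `args.pos_strat`, `args.neg_org`, `args.k`, `args.l`, and `args.n`, since `argparse` converts dashes in long option names to underscores.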
# Project Structure
```bash
.
├── errors.txt
├── gsec
│   ├── gsec_apply.py
│   ├── gsec.py
│   ├── gsec_train.py
│   ├── __init__.py
│   ├── model_building
│   │   ├── create_model.py
│   │   ├── create_model_utils.py
│   │   ├── data
│   │   │   └── 1
│   │   │       ├── negative
│   │   │       │   ├── ERR3523441.txt
│   │   │       │   ├── ERR3523442.txt
│   │   │       │   ├── ERR3523446.txt
│   │   │       │   ├── SRR10000063.txt
│   │   │       │   ├── SRR10000103.txt
│   │   │       │   ├── SRR10000110.txt
│   │   │       │   ├── ...
│   │   │       └── positive
│   │   │           ├── ERR3445822.txt
│   │   │           ├── ERR3674488.txt
│   │   │           ├── ERR3674489.txt
│   │   │           ├── ERR3674493.txt
│   │   │           ├── SRR11348073.txt
│   │   │           ├── ...
│   │   │           ├── SRR11494766.txt
│   │   │           └── SRR8836050.txt
│   │   ├── __init__.py
│   │   ├── ModelRunner.py
│   │   └── __pycache__
│   │       ├── create_model.cpython-37.pyc
│   │       ├── create_model_utils.cpython-37.pyc
│   │       ├── __init__.cpython-37.pyc
│   │       └── ModelRunner.cpython-37.pyc
│   ├── models
│   │   ├── 1.pkl
│   ├── model_test.py  <<<--- is for debugging
│   ├── __pycache__
│   │   ├── gsec_apply.cpython-37.pyc
│   │   ├── gsec.cpython-37.pyc
│   │   ├── gsec_train.cpython-37.pyc
│   │   └── __init__.cpython-37.pyc
│   └── utils
│       ├── countkmers.cpp
│       ├── csv_utils.py
│       ├── __init__.py
│       ├── Makefile
│       ├── __pycache__
│       │   ├── csv_utils.cpython-37.pyc
│       │   └── __init__.cpython-37.pyc
│       ├── stream_kmers
│       └── stream_kmers.cpp
├── gsec.egg-info
│   ├── dependency_links.txt
│   ├── entry_points.txt
│   ├── PKG-INFO
│   ├── requires.txt
│   ├── SOURCES.txt
│   └── top_level.txt
├── MANIFEST.in
├── models.csv
├── model_summary.txt
├── README.md
└── setup.py
```