# gsec (Generalized Sequencing Classifier)
### USC Center for Artificial Intelligence in Society (CAIS++) and Smith Computational Genomics Lab Collaboration.
### *Under Construction*

# Requirements:
1. Entrez Direct utilities (`esearch`, etc.)
2. SRA Toolkit (`fastq-dump` must be on your `PATH`; running it on the command line should not report "not found")
3. Python 3
4. After pulling the repo, enter the `gsec/utils` folder in the install location
and type `make` to compile the k-mer counting function.
5. Clone the repo, enter the directory, and run `pip install .`
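
A quick way to confirm that the command-line requirements are reachable is to look them up on `PATH`. The snippet below is a minimal sketch and not part of gsec; the helper name `missing_tools` and the tool list are assumptions based on the requirements listed above.

```python
import shutil

# Tool names taken from the requirements list above; `make` is needed to
# build the k-mer counter. This list is an assumption, not a gsec constant.
REQUIRED_TOOLS = ["esearch", "fastq-dump", "make"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If anything is reported as missing, install it or adjust `PATH` before running `make` and `pip install .`.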

# Highly recommended:
Disable SRA download caching. Either follow the instructions at
https://standage.github.io/that-darn-cache-configuring-the-sra-toolkit.html or
run `vdb-config --interactive`, go to the cache tab, and uncheck "enable local
file-caching" if it is checked.

# Streaming and piping functions:
Compile with `g++ stream_kmers.cpp -o stream_kmers`.

On its own, `stream_kmers` takes two parameters: the first is the k value and
the second is the maximum number of reads to count. The next feature to
implement is automatic stopping based on convergence of k-mer count
frequencies.
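
To make the two parameters concrete, here is a minimal Python sketch of counting k-mers with a read limit. It is not the code in `stream_kmers.cpp`: the function name `count_kmers`, the assumption of four-line FASTQ records, and the choice to skip k-mers containing `N` are all illustrative assumptions.

```python
from collections import Counter
from itertools import islice


def count_kmers(fastq_lines, k, max_reads):
    """Count k-mers in at most `max_reads` reads of FASTQ-formatted lines."""
    counts = Counter()
    # Assumption: four-line FASTQ records, so the sequence is the 2nd line
    # of every record (indices 1, 5, 9, ...).
    sequences = islice(fastq_lines, 1, None, 4)
    for seq in islice(sequences, max_reads):
        seq = seq.strip().upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: ignore ambiguous bases
                counts[kmer] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage sketch: python count_kmers.py 6 100 < reads.fastq
    k_value, read_limit = int(sys.argv[1]), int(sys.argv[2])
    for kmer, count in sorted(count_kmers(sys.stdin, k_value, read_limit).items()):
        print(kmer, count)
```

Because the function only consumes as many lines as it needs, it stops reading its input once the read limit is reached.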

In a pipeline, usage looks like this:
`fastq-dump --skip-technical --split-spot -Z SRR5149059 | ./stream_kmers 6 100 > out.txt`
counts 6-mers with a limit of 100 reads. The specified SRR is streamed but
never saved to disk; only the counts file is saved. If you already have an
`.sra` file, you can run the pipeline on that as well. Streaming gives a speed
boost in both scenarios.
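
The same pipeline can be driven from Python with the standard `subprocess` module. This is a sketch under assumptions, not how gsec itself invokes the tools: the helper names are hypothetical, and `./stream_kmers` is assumed to be the binary compiled as described above.

```python
import subprocess


def fastq_dump_command(accession):
    """Build the fastq-dump command from the README for one accession."""
    return ["fastq-dump", "--skip-technical", "--split-spot", "-Z", accession]


def stream_kmers_command(k, max_reads, binary="./stream_kmers"):
    """Build the stream_kmers command: k value first, then the read limit."""
    return [binary, str(k), str(max_reads)]


def run_pipeline(accession, k, max_reads, out_path):
    """Pipe fastq-dump into stream_kmers and write the counts to out_path."""
    with open(out_path, "w") as out:
        dump = subprocess.Popen(fastq_dump_command(accession),
                                stdout=subprocess.PIPE)
        counter = subprocess.Popen(stream_kmers_command(k, max_reads),
                                   stdin=dump.stdout, stdout=out)
        # Close our copy of the pipe so fastq-dump is signalled when the
        # counter exits after reaching its read limit.
        dump.stdout.close()
        counter.wait()
        dump.wait()
    return counter.returncode
```

Calling `run_pipeline("SRR5149059", 6, 100, "out.txt")` corresponds to the shell example; it requires both tools to be installed and network access.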

# Usage
`gsec train --pos-strat bisulfite-seq --pos-org 'homo sapiens' --neg-strat wgs --neg-org 'homo sapiens' -k 6 -l 10000 -n 100`
downloads the data and builds a classifier to distinguish the positive set from
the negative set based on k-mer counts for k up to 6. The limit flag `-l` means
that only the first 10k reads from each fastq are processed, and `-n` specifies
the number of files to use for each set. Keep in mind that some of the
attempted downloads fail. The SRR identification numbers of failed downloads
are recorded in the `errors.txt` file, which is generated in your current
directory. In addition, a `model_summary.txt` is generated in your working
directory. The counts files and models are stored in a specified location in
the gsec install location. The final model is saved as a `.pkl` file, which can
be loaded back into Python.
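
Loading the saved model back into Python might look like the sketch below. It assumes the `.pkl` file was written with the standard `pickle` module; the helper name `load_model` is hypothetical, and the type of the unpickled object is not specified in this README.

```python
import pickle


def load_model(path):
    """Load a pickled object (for example a trained model) from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

The libraries used during training generally need to be importable when unpickling, and pickle files should only be loaded from trusted sources.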

A unique ID is assigned to each run of `gsec train`, starting with `1`. For
run `1`, the positive and negative data are saved under the `1/positive` and
`1/negative` folders, respectively, and the model is saved as `1.pkl`.
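
The layout for a given run ID can be expressed as a small helper. This is a sketch only: the function `run_paths` is hypothetical, and the `model_building/data` and `models` subfolders are taken from the project structure listing in this README, so they are assumptions about any other install.

```python
from pathlib import Path


def run_paths(install_dir, run_id):
    """Return the expected data and model paths for one `gsec train` run."""
    root = Path(install_dir)
    # Assumption: folder names follow the project structure listing.
    data_dir = root / "model_building" / "data" / str(run_id)
    return {
        "positive": data_dir / "positive",
        "negative": data_dir / "negative",
        "model": root / "models" / f"{run_id}.pkl",
    }
```

For example, `run_paths("gsec", 1)["model"]` gives `gsec/models/1.pkl`.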

`python gsec-train.py --pos-strat --pos-org --neg-strat --neg-org -k -l -n`
- pos-strat: strategy for the positive set
- pos-org: organism for the positive set
- neg-strat: strategy for the negative set
- neg-org: organism for the negative set
- k: maximum k-mer size to count
- l: limit on the number of reads to use
- n: number of SRRs to count for each target (if fewer files match the query,
  all matching files are downloaded)
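
The flags above map naturally onto an `argparse` parser. The following is a sketch of such a parser, not the parser gsec ships: which flags are required, their types, and the help strings are assumptions drawn from the descriptions in this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented `gsec train` flags."""
    parser = argparse.ArgumentParser(prog="gsec train")
    parser.add_argument("--pos-strat", required=True,
                        help="strategy for the positive set")
    parser.add_argument("--pos-org", required=True,
                        help="organism for the positive set")
    parser.add_argument("--neg-strat", required=True,
                        help="strategy for the negative set")
    parser.add_argument("--neg-org", required=True,
                        help="organism for the negative set")
    # Assumption: the numeric flags are integers and all are required.
    parser.add_argument("-k", type=int, required=True,
                        help="maximum k-mer size to count")
    parser.add_argument("-l", type=int, required=True,
                        help="limit on the number of reads to use")
    parser.add_argument("-n", type=int, required=True,
                        help="number of SRRs to count for each target")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

With this sketch, `--pos-strat` is available as `args.pos_strat` and `-k` as `args.k` after parsing.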

# Project Structure
```bash
.
├── errors.txt
├── gsec
│   ├── gsec_apply.py
│   ├── gsec.py
│   ├── gsec_train.py
│   ├── __init__.py
│   ├── model_building
│   │   ├── create_model.py
│   │   ├── create_model_utils.py
│   │   ├── data
│   │   │   └── 1
│   │   │       ├── negative
│   │   │       │   ├── ERR3523441.txt
│   │   │       │   ├── ERR3523442.txt
│   │   │       │   ├── ERR3523446.txt
│   │   │       │   ├── SRR10000063.txt
│   │   │       │   ├── SRR10000103.txt
│   │   │       │   ├── SRR10000110.txt
│   │   │       │   ├── ...
│   │   │       └── positive
│   │   │           ├── ERR3445822.txt
│   │   │           ├── ERR3674488.txt
│   │   │           ├── ERR3674489.txt
│   │   │           ├── ERR3674493.txt
│   │   │           ├── SRR11348073.txt
│   │   │           ├── ...
│   │   │           ├── SRR11494766.txt
│   │   │           └── SRR8836050.txt
│   │   ├── __init__.py
│   │   ├── ModelRunner.py
│   │   └── __pycache__
│   │       ├── create_model.cpython-37.pyc
│   │       ├── create_model_utils.cpython-37.pyc
│   │       ├── __init__.cpython-37.pyc
│   │       └── ModelRunner.cpython-37.pyc
│   ├── models
│   │   ├── 1.pkl
│   ├── model_test.py  <<<--- is for debugging
│   ├── __pycache__
│   │   ├── gsec_apply.cpython-37.pyc
│   │   ├── gsec.cpython-37.pyc
│   │   ├── gsec_train.cpython-37.pyc
│   │   └── __init__.cpython-37.pyc
│   └── utils
│       ├── countkmers.cpp
│       ├── csv_utils.py
│       ├── __init__.py
│       ├── Makefile
│       ├── __pycache__
│       │   ├── csv_utils.cpython-37.pyc
│       │   └── __init__.cpython-37.pyc
│       ├── stream_kmers
│       └── stream_kmers.cpp
├── gsec.egg-info
│   ├── dependency_links.txt
│   ├── entry_points.txt
│   ├── PKG-INFO
│   ├── requires.txt
│   ├── SOURCES.txt
│   └── top_level.txt
├── MANIFEST.in
├── models.csv
├── model_summary.txt
├── README.md
└── setup.py
```