a/README.md b/README.md
1
# exSEEK
1
# exSEEK
2
2
3
exSEEK is an integrated computational framework to discover and evaluate exRNA biomarkers for liquid biopsy.
3
exSEEK is an integrated computational framework to discover and evaluate exRNA biomarkers for liquid biopsy.
4
4
5
![Pipeline of Tutorial](tutorial/pipeline.png)
5
6
7
The exSEEK framework consists of:
6
The exSEEK framework consists of:
8
+ Pre_processing:
7
+ Pre_processing:
9
   
8
   
10
   + Building index with various types of genomes and annotations. [`exseek build-index`]
9
   + Building index with various types of genomes and annotations. [`exseek build-index`]
11
   + Quality control and removing adaptors. [`exseek quality_control`] [`exseek cutadapt`] [`exseek quality_control_clean`]
10
   + Quality control and removing adaptors. [`exseek quality_control`] [`exseek cutadapt`] [`exseek quality_control_clean`]
12
   + Sequential mapping for small/long RNA-seq. [`exseek mapping`]
11
   + Sequential mapping for small/long RNA-seq. [`exseek mapping`]
13
12
14
+ Main function:
13
+ Main function:
15
   
14
   
16
   + Peak calling for recurring fragments of long RNAs. [`exseek bigwig`] [`exseek call_domains`]
15
   + Peak calling for recurring fragments of long RNAs. [`exseek bigwig`] [`exseek call_domains`]
17
   + Counting expression matrix. [`exseek count_matrix`]
16
   + Counting expression matrix. [`exseek count_matrix`]
18
   + Normalization and batch removal. [`exseek normalization`]
17
   + Normalization and batch removal. [`exseek normalization`]
19
   + Feature selection and classification. [`exseek feature_selection`]
18
   + Feature selection and classification. [`exseek feature_selection`]
20
   + Biomarker evaluation. [`exseek feature_selection`]
19
   + Biomarker evaluation. [`exseek feature_selection`]
21
20
22
Table of Contents:
21
Table of Contents:
23
22
24
* [Installation](#installation)
23
* [Installation](#installation)
25
* [Usage](#usage)
24
* [Usage](#usage)
26
  * [Index preparing](#index-preparing)
25
  * [Index preparing](#index-preparing)
27
  * [Small RNA-seq mapping](#small-rna-seq-mapping)
26
  * [Small RNA-seq mapping](#small-rna-seq-mapping)
28
  * [Peak calling](#peak-calling)
27
  * [Peak calling](#peak-calling)
29
  * [Long RNA-seq mapping](#long-rna-seq-mapping)
28
  * [Long RNA-seq mapping](#long-rna-seq-mapping)
30
  * [Counting expression matrix](#counting-expression-matrix)
29
  * [Counting expression matrix](#counting-expression-matrix)
31
  * [Normalization and batch removal](#normalization-and-batch-removal)
30
  * [Normalization and batch removal](#normalization-and-batch-removal)
32
  * [Feature selection and biomarker evaluation](#feature-selection-and-biomarker-evaluation)
31
  * [Feature selection and biomarker evaluation](#feature-selection-and-biomarker-evaluation)
33
* [Copyright and License Information](#copyright-and-license-information)
32
* [Copyright and License Information](#copyright-and-license-information)
34
* [Citation](#citation)
33
* [Citation](#citation)
35
* [Tutorial](tutorial/)
34
* [Tutorial](tutorial/)
36
35
37
36
38
## Installation
37
## Installation
39
38
40
For easy installation, you can use the [exSEEK image](https://hub.docker.com/r/ltbyshi/exseek) of [docker](https://docs.docker.com/) with all dependencies installed:
39
For easy installation, you can use the [exSEEK image](https://hub.docker.com/r/ltbyshi/exseek) of [docker](https://docs.docker.com/) with all dependencies installed:
41
```bash
40
```bash
42
docker pull ltbyshi/exseek
41
docker pull ltbyshi/exseek
43
```
42
```
44
43
45
All required software and packages are already installed in docker, so there are no more requirements. To test the installation and get information about the command-line interface of exSEEK, you can execute:
44
All required software and packages are already installed in docker, so there are no more requirements. To test the installation and get information about the command-line interface of exSEEK, you can execute:
46
```bash
45
```bash
47
docker run --rm -it -v $PWD:/workspace -w /workspace ltbyshi/exseek exseek.py -h
46
docker run --rm -it -v $PWD:/workspace -w /workspace ltbyshi/exseek exseek.py -h
48
```
47
```
49
The -v flag mounts the current working directory `$PWD` into the `/workspace` in docker image, so you can easily check the output files in `/workspace` directory after exiting docker.
48
The -v flag mounts the current working directory `$PWD` into the `/workspace` in docker image, so you can easily check the output files in `/workspace` directory after exiting docker.
50
49
51
You can create a bash script named `exseek` and set the script executable: 
50
You can create a bash script named `exseek` and set the script executable: 
52
```bash
51
```bash
53
#! /bin/bash
52
#! /bin/bash
54
docker run --rm -it -v $PWD:/workspace -w /workspace ltbyshi/exseek exseek.py "$@"
53
docker run --rm -it -v $PWD:/workspace -w /workspace ltbyshi/exseek exseek.py "$@"
55
```
54
```
56
After adding the file to one of the directories in the `$PATH` variable, you can simply run: `exseek`.
55
After adding the file to one of the directories in the `$PATH` variable, you can simply run: `exseek`.
57
56
58
57
59
A helper message is shown:
58
A helper message is shown:
60
```bash
59
```bash
61
usage: exseek.py [-h] --dataset DATASET [--config-dir CONFIG_DIR] [--cluster]
60
usage: exseek.py [-h] --dataset DATASET [--config-dir CONFIG_DIR] [--cluster]
62
                 [--cluster-config CLUSTER_CONFIG]
61
                 [--cluster-config CLUSTER_CONFIG]
63
                 [--cluster-command CLUSTER_COMMAND] [--singularity]  
62
                 [--cluster-command CLUSTER_COMMAND] [--singularity]  
64
                 {build_index,quality_control,cutadapt,quality_control_clean,mapping,
63
                 {build_index,quality_control,cutadapt,quality_control_clean,mapping,
65
                 call_domains,count_matrix,normalization,feature_selection}
64
                 call_domains,count_matrix,normalization,feature_selection}
66
65
67
exseek main program
66
exseek main program
68
67
69
positional arguments:
68
positional arguments:
70
  {build_index,quality_control,cutadapt,quality_control_clean,mapping,
69
  {build_index,quality_control,cutadapt,quality_control_clean,mapping,
71
  call_domains,count_matrix,normalization,feature_selection}
70
  call_domains,count_matrix,normalization,feature_selection}
72
71
73
optional arguments:
72
optional arguments:
74
  -h, --help                                    show this help message and exit
73
  -h, --help                                    show this help message and exit
75
  --dataset DATASET, -d DATASET                 dataset name
74
  --dataset DATASET, -d DATASET                 dataset name
76
  --config-dir CONFIG_DIR, -c CONFIG_DIR        directory for configuration files
75
  --config-dir CONFIG_DIR, -c CONFIG_DIR        directory for configuration files
77
  --cluster                                     submit to cluster
76
  --cluster                                     submit to cluster
78
  --cluster-config CLUSTER_CONFIG               cluster configuration file ({config_dir}/cluster.yaml by default)
77
  --cluster-config CLUSTER_CONFIG               cluster configuration file ({config_dir}/cluster.yaml by default)
79
  --cluster-command CLUSTER_COMMAND             command for submitting job to cluster (default read from
78
  --cluster-command CLUSTER_COMMAND             command for submitting job to cluster (default read from
80
                                                {config_dir}/cluster_command.txt
79
                                                {config_dir}/cluster_command.txt
81
  --singularity                                 use singularity
80
  --singularity                                 use singularity
82
```
81
```
83
82
84
The basic usage of exSEEK is:
83
The basic usage of exSEEK is:
85
```bash
84
```bash
86
exseek ${step_name} -d ${dataset}
85
exseek ${step_name} -d ${dataset}
87
```
86
```
88
87
89
> **Note:**
88
 **Note:**
90
> * `${step_name}` is one of the step listed in 'positional arguments'.
89
 * `${step_name}` is one of the step listed in 'positional arguments'.
91
> * `${dataset}` is the name of your dataset that should match the prefix of your configuration file described in the following section.
90
 * `${dataset}` is the name of your dataset that should match the prefix of your configuration file described in the following section.
92
91
93
92
94
93
95
## Usage
94
## Usage
96
95
97
You can use the provided `example_data` to run exSEEK:
96
You can use the provided `example_data` to run exSEEK:
98
```bash
97
```bash
99
cp /apps/example_data /workspace
98
cp /apps/example_data /workspace
100
```
99
```
101
100
102
The `example_data` folder has the following structure:
101
The `example_data` folder has the following structure:
103
```
102
```
104
example_data/
103
example_data/
105
├── config
104
├── config
106
|   ├── example.yaml
105
|   ├── example.yaml
107
|   ├── default_config.yaml
106
|   ├── default_config.yaml
108
│   └── machine_learning.yaml
107
│   └── machine_learning.yaml
109
108
110
├── data
109
├── data
111
│   └── example
110
│   └── example
112
|       ├── fastq
111
|       ├── fastq
113
│       ├── batch_info.txt
112
│       ├── batch_info.txt
114
│       ├── compare_groups.yaml
113
│       ├── compare_groups.yaml
115
│       ├── sample_classes.txt
114
│       ├── sample_classes.txt
116
│       └── sample_ids.txt
115
│       └── sample_ids.txt
117
└── output
116
└── output
118
    └── example
117
    └── example
119
        └── ...
118
        └── ...
120
```
119
```
121
120
122
> **Note:**
121
 **Note:**
123
> * `config/example.yaml`: configuration file with frequently adjusted parameters, such as file paths and mapping parameters.
122
 * `config/example.yaml`: configuration file with frequently adjusted parameters, such as file paths and mapping parameters.
124
> * `config/default_config.yaml`: configuration file with additional detailed parameters for each step. The default file is not supposed to be changed. If you want to adjust parameters contained in this file, it is **recommended** to add your adjusted parameters in `config/example.yaml`.
123
 * `config/default_config.yaml`: configuration file with additional detailed parameters for each step. The default file is not supposed to be changed. If you want to adjust parameters contained in this file, it is **recommended** to add your adjusted parameters in `config/example.yaml`.
125
> * `config/machine_learning.yaml`: configuration file with parameters used for feature selection and classification steps. It is **recommended** to add your adjusted parameters in `config/example.yaml` if you want to adjust parameters contained in this file.
124
 * `config/machine_learning.yaml`: configuration file with parameters used for feature selection and classification steps. It is **recommended** to add your adjusted parameters in `config/example.yaml` if you want to adjust parameters contained in this file.
126
> * `data/example/batch_info.txt`: table of batch information.
125
 * `data/example/batch_info.txt`: table of batch information.
127
> * `data/example/compare_groups.yaml`: table for definition of positive and negative samples.
126
 * `data/example/compare_groups.yaml`: table for definition of positive and negative samples.
128
> * `data/example/sample_classes.txt`: table of sample labels.
127
 * `data/example/sample_classes.txt`: table of sample labels.
129
> * `output/example/`: output folder.
128
 * `output/example/`: output folder.
130
129
131
---
130
---
132
131
133
### Index preparing
132
### Index preparing
134
133
135
exSEEK docker contains a variety of commonly used genomes and annotations. Besides of RNA types extracted from GENCODE V27, exSEEK can also analyze rRNA from NCBI refSeq 109, miRNA from miRBase, piRNA from piRNABank, circRNA from circBase, lncRNA and TUCP from mitranscriptome, repeats from UCSC Genome Browser (rmsk) and promoter and enhancer from ChromHMM tracks. You can download these `.fa`, `.gtf` and `.bed` files here and build index with these genomes and annotations.
134
exSEEK docker contains a variety of commonly used genomes and annotations. Besides of RNA types extracted from GENCODE V27, exSEEK can also analyze rRNA from NCBI refSeq 109, miRNA from miRBase, piRNA from piRNABank, circRNA from circBase, lncRNA and TUCP from mitranscriptome, repeats from UCSC Genome Browser (rmsk) and promoter and enhancer from ChromHMM tracks. You can download these `.fa`, `.gtf` and `.bed` files here and build index with these genomes and annotations.
136
```bash
135
```bash
137
136
138
```
137
```
139
138
140
For mapping small RNA-seq, the index of each transcript type can be built with bowtie2, and the index used for mapping long RNA-seq can be built with STAR. You can get these index by executing:
139
For mapping small RNA-seq, the index of each transcript type can be built with bowtie2, and the index used for mapping long RNA-seq can be built with STAR. You can get these index by executing:
141
```bash
140
```bash
142
exseek build_index -d example
141
exseek build_index -d example
143
```
142
```
144
It might take hours to generate the index. It is **recommended** to specify the number of threads in `config/example.yaml` file by adding `threads: N`, or you can simply add `-j N` parameter in the exseek command.
143
It might take hours to generate the index. It is **recommended** to specify the number of threads in `config/example.yaml` file by adding `threads: N`, or you can simply add `-j N` parameter in the exseek command.
145
The output folder is `genome/hg38/index/`.
144
The output folder is `genome/hg38/index/`.
146
145
147
The detailed information for each transcript type is in `genome/hg38/transcript_table/` directory.
146
The detailed information for each transcript type is in `genome/hg38/transcript_table/` directory.
148
The summary for transcript types is listed below:
147
The summary for transcript types is listed below:
149
| RNA type | Number of transcripts |
148
| RNA type | Number of transcripts |
150
| :--- | :--- |
149
| :--- | :--- |
151
| rRNA | 37 |
150
| rRNA | 37 |
152
| mature_miRNA | 2656 |
151
| mature_miRNA | 2656 |
153
| miRNA | 1917 |
152
| miRNA | 1917 |
154
| piRNA | 23410 |
153
| piRNA | 23410 |
155
| snoRNA | 943 |
154
| snoRNA | 943 |
156
| snRNA | 1900 |
155
| snRNA | 1900 |
157
| srpRNA | 680 |
156
| srpRNA | 680 |
158
| tRNA | 649 |
157
| tRNA | 649 |
159
| mRNA | 19836 |
158
| mRNA | 19836 |
160
| lncRNA | 15778 |
159
| lncRNA | 15778 |
161
| TUCP | 3730 |
160
| TUCP | 3730 |
162
| Y_RNA | 756 |
161
| Y_RNA | 756 |
163
| univec | 6093 |
162
| univec | 6093 |
164
| spikein_long | 92 |
163
| spikein_long | 92 |
165
| spikein_small | 52 |
164
| spikein_small | 52 |
166
165
167
---
166
---
168
167
169
### Small RNA-seq mapping
168
### Small RNA-seq mapping
170
169
171
#### Quality control \(before adaptor removal\)
170
#### Quality control \(before adaptor removal\)
172
You can check reads quality with FastQC by running:
171
You can check reads quality with FastQC by running:
173
172
174
```bash
173
```bash
175
exseek quality_control -d example
174
exseek quality_control -d example
176
```
175
```
177
> **Note:**
176
 **Note:**
178
> * The detailed results for each sample are in folder `output/example/fastqc/`. 
177
 * The detailed results for each sample are in folder `output/example/fastqc/`. 
179
> * You can quickly check the summary results for all samples with the `fastqc.txt` file in `output/example/summary/fastqc_data/multiqc_fastqc.txt`.
178
 * You can quickly check the summary results for all samples with the `fastqc.txt` file in `output/example/summary/fastqc_data/multiqc_fastqc.txt`.
180
179
181
#### Remove adapter
180
#### Remove adapter
182
exSEEK removes reads adaptor with cutadapt software. You can change the adaptor sequences in `config/example.yaml` file.
181
exSEEK removes reads adaptor with cutadapt software. You can change the adaptor sequences in `config/example.yaml` file.
183
182
184
```bash
183
```bash
185
exseek cutadapt -d example
184
exseek cutadapt -d example
186
```
185
```
187
> **Note:**
186
 **Note:**
188
> * You can check the additional parameters for cutadapt in `config/default_config.yaml` file. 
187
 * You can check the additional parameters for cutadapt in `config/default_config.yaml` file. 
189
> * You can check the adaptor revmoval summary with `output/example/summary/cutadapt.txt` file.
188
 * You can check the adaptor revmoval summary with `output/example/summary/cutadapt.txt` file.
190
189
191
#### Quality control \(after adapter removal\)
190
#### Quality control \(after adapter removal\)
192
```bash
191
```bash
193
exseek quality_control_clean -d example
192
exseek quality_control_clean -d example
194
```
193
```
195
> **Note:**
194
 **Note:**
196
> * The detailed results for each sample are in folder `output/example/fastqc_clean/`. 
195
 * The detailed results for each sample are in folder `output/example/fastqc_clean/`. 
197
> * You can quickly check the summary results for all samples with the `fastqc.txt` file in `output/example/summary/fastqc_clean_data/multiqc_fastqc.txt`.
196
 * You can quickly check the summary results for all samples with the `fastqc.txt` file in `output/example/summary/fastqc_clean_data/multiqc_fastqc.txt`.
198
197
199
#### Update sequential mapping order
198
#### Update sequential mapping order
200
exSEEK allows user-defined sequential mapping, which is particularly useful for small RNA-seq samples because short RNA reads are more likely to be mapped to multiple locations. The default mapping order is set as `rna_types` variable in `config/default_config.yaml`:
199
exSEEK allows user-defined sequential mapping, which is particularly useful for small RNA-seq samples because short RNA reads are more likely to be mapped to multiple locations. The default mapping order is set as `rna_types` variable in `config/default_config.yaml`:
201
```yaml
200
```yaml
202
rna_types: [univec, rRNA, lncRNA, mature_miRNA, miRNA, mRNA, 
201
rna_types: [univec, rRNA, lncRNA, mature_miRNA, miRNA, mRNA, 
203
  piRNA, snoRNA, snRNA, srpRNA, tRNA, tucpRNA, Y_RNA]
202
  piRNA, snoRNA, snRNA, srpRNA, tRNA, tucpRNA, Y_RNA]
204
```
203
```
205
204
206
You can change the mapping order based on the confidence of each RNA type in your samples by adding a `rna_types` variable in `config/example.yaml`. For example, you can add spike-in sequences as the first RNA type:
205
You can change the mapping order based on the confidence of each RNA type in your samples by adding a `rna_types` variable in `config/example.yaml`. For example, you can add spike-in sequences as the first RNA type:
207
```yaml
206
```yaml
208
rna_types: [spikein_small, univec, rRNA, lncRNA, mature_miRNA, miRNA, 
207
rna_types: [spikein_small, univec, rRNA, lncRNA, mature_miRNA, miRNA, 
209
  mRNA, piRNA, snoRNA, snRNA, srpRNA, tRNA, tucpRNA, Y_RNA]
208
  mRNA, piRNA, snoRNA, snRNA, srpRNA, tRNA, tucpRNA, Y_RNA]
210
```
209
```
211
210
212
#### Add new reference sequence
211
#### Add new reference sequence
213
212
214
If a new RNA type is added, you should also add a sequence file in FASTA format: `${genome_dir}/fasta/${rna_type}.fa`. Then build a FASTA index \(`${genome_dir}/fasta/${rna_type}.fa.fai`\):
213
If a new RNA type is added, you should also add a sequence file in FASTA format: `${genome_dir}/fasta/${rna_type}.fa`. Then build a FASTA index \(`${genome_dir}/fasta/${rna_type}.fa.fai`\):
215
```bash
214
```bash
216
samtools faidx ${genome_dir}/fasta/${rna_type}.fa
215
samtools faidx ${genome_dir}/fasta/${rna_type}.fa
217
```
216
```
218
217
219
Then build a bowtie2 index \(`${genome_dir}/index/bowtie2/${rna_type}`\):
218
Then build a bowtie2 index \(`${genome_dir}/index/bowtie2/${rna_type}`\):
220
```bash
219
```bash
221
bowtie2-build ${genome_dir}/fasta/${rna_type}.fa ${genome_dir}/index/bowtie2/${rna_type}
220
bowtie2-build ${genome_dir}/fasta/${rna_type}.fa ${genome_dir}/index/bowtie2/${rna_type}
222
```
221
```
223
222
224
#### Mapping
223
#### Mapping
225
exSEEK provides bowtie2 for mapping small RNA-seq. You can specify the `paired_end` parameter as `false` or `true` in `config/example.yaml`. It is **recommended** to specify the number of threads in `config/example.yaml` file by adding `threads_mapping: N`, or you can simply add `-j N` parameter in the exseek command. The other parameters for mapping can be found in `config/default_config.yaml`.
224
exSEEK provides bowtie2 for mapping small RNA-seq. You can specify the `paired_end` parameter as `false` or `true` in `config/example.yaml`. It is **recommended** to specify the number of threads in `config/example.yaml` file by adding `threads_mapping: N`, or you can simply add `-j N` parameter in the exseek command. The other parameters for mapping can be found in `config/default_config.yaml`.
226
```bash
225
```bash
227
exseek mapping -d example
226
exseek mapping -d example
228
```
227
```
229
228
230
> **Note:**
229
 **Note:**
231
> * Make sure that the parameter `small_rna` is ***`True`*** in `config/example.yaml`.
230
 * Make sure that the parameter `small_rna` is ***`True`*** in `config/example.yaml`.
232
> * The output folder `output/example/gbam` contains genome bam files.
231
 * The output folder `output/example/gbam` contains genome bam files.
233
> * The output folder `output/example/tbam` contains transcriptome bam files for all types of RNA.
232
 * The output folder `output/example/tbam` contains transcriptome bam files for all types of RNA.
234
> * The output folders `output/example/stats/mapped_read_length*/` contain the summary of read length distribution for each RNA type.
233
 * The output folders `output/example/stats/mapped_read_length*/` contain the summary of read length distribution for each RNA type.
235
> * The output file `output/example/summary/read_counts.txt` is the summary of read counts mapped to each RNA type for all samples.
234
 * The output file `output/example/summary/read_counts.txt` is the summary of read counts mapped to each RNA type for all samples.
236
235
237
---
236
---
238
237
239
### Peak calling
238
### Peak calling
240
exSEEK provides local maximum-based peak calling methods for identifying recurring fragments (which are recurrently detected among samples,defined as domains) of long exRNAs (such as mRNA, srpRNA, and lncRNA). These called domains can be combined into the expression matrix and serve as potential biomarker candidates.
239
exSEEK provides local maximum-based peak calling methods for identifying recurring fragments (which are recurrently detected among samples,defined as domains) of long exRNAs (such as mRNA, srpRNA, and lncRNA). These called domains can be combined into the expression matrix and serve as potential biomarker candidates.
241
```bash
240
```bash
242
exseek bigwig -d example
241
exseek bigwig -d example
243
exseek call_domains -d example
242
exseek call_domains -d example
244
```
243
```
245
244
246
**Notes:**
245
**Notes:**
247
* Domain calling parameters in `config/default_config.yaml`:
246
* Domain calling parameters in `config/default_config.yaml`:
248
> * `bin_size: 20`: size of bins for calculating read coverage.
247
 * `bin_size: 20`: size of bins for calculating read coverage.
249
> * `cov_threshold: 0.05`: The proportion of samples that have the called peak. Peaks with cov_threshold higher than 0.05 are defined as domains. 
248
 * `cov_threshold: 0.05`: The proportion of samples that have the called peak. Peaks with cov_threshold higher than 0.05 are defined as domains. 
250
249
251
* Output files:
250
* Output files:
252
> * `output/example/domains_localmax_recurrence/recurrence.bed` contains all recurring peaks (domains).
251
 * `output/example/domains_localmax_recurrence/recurrence.bed` contains all recurring peaks (domains).
253
> * `output/example/domains_localmax/domains.bed` contains filtered (domains shorter than 10nt are filtered out) and merged domains. 
252
 * `output/example/domains_localmax/domains.bed` contains filtered (domains shorter than 10nt are filtered out) and merged domains. 
254
253
255
The `recurrence.bed` file looks like this:
254
The `recurrence.bed` file looks like this:
256
| Transcript ID | TransStart | TransEnd | X | Frequency | Strand |
255
| Transcript ID | TransStart | TransEnd | X | Frequency | Strand |
257
| :--- | :--- | :--- | :--- | :--- | :--- |
256
| :--- | :--- | :--- | :--- | :--- | :--- |
258
| ENST00000365118.2 | 0 | 30 | X | 8 | + |
257
| ENST00000365118.2 | 0 | 30 | X | 8 | + |
259
| ENST00000365223.1 | 0 | 61 | X | 12 | + |
258
| ENST00000365223.1 | 0 | 61 | X | 12 | + |
260
| ENST00000365436.1 | 69 | 92 | X | 2 | + |
259
| ENST00000365436.1 | 69 | 92 | X | 2 | + |
261
| ENST00000366365.2 | 236 | 261 | X | 1 | + |
260
| ENST00000366365.2 | 236 | 261 | X | 1 | + |
262
261
263
The `domains.bed` file looks like this:
262
The `domains.bed` file looks like this:
264
| Transcript ID | TransStart | TransEnd | filtered_merged Peak_ID | Weighted_ave_frequency | Strand |
263
| Transcript ID | TransStart | TransEnd | filtered_merged Peak_ID | Weighted_ave_frequency | Strand |
265
| :--- | :--- | :--- | :--- | :--- | :--- |
264
| :--- | :--- | :--- | :--- | :--- | :--- |
266
| ENST00000006015.3 | 1506 | 1523 | peak_1 | 14.2353 | + |
265
| ENST00000006015.3 | 1506 | 1523 | peak_1 | 14.2353 | + |
267
| ENST00000006015.3 | 1971 |1986 | peak_2 | 10 | + |
266
| ENST00000006015.3 | 1971 |1986 | peak_2 | 10 | + |
268
| ENST00000008938.4 | 20 | 35 | peak_3 | 7 | + |
267
| ENST00000008938.4 | 20 | 35 | peak_3 | 7 | + |
269
| ENST00000025301.3 | 8580 | 8597 | peak_4 | 37.2353 | + |
268
| ENST00000025301.3 | 8580 | 8597 | peak_4 | 37.2353 | + |
270
| ENST00000192788.5 | 2649 | 2665 | peak_5 | 72.5625 | + |
269
| ENST00000192788.5 | 2649 | 2665 | peak_5 | 72.5625 | + |
271
270
272
---
271
---
273
272
274
### Long RNA-seq mapping
273
### Long RNA-seq mapping
275
274
276
The methods for long RNA-seq mapping are very similar to **Small RNA-seq mapping**. You can use the above command lines for long RNA-seq by setting ***`small_rna`*** to ***`False`*** in file `config/example.yaml`. It is **recommended** to specify the number of threads in `config/example.yaml` file by adding `threads_mapping: N`, or you can simply add `-j N` parameter in the exseek command. There is no peak calling step for long RNA-seq datasets because recurring fragment (domain) is not a distinctive feature of extracellular long RNA-seq datasets. 
275
The methods for long RNA-seq mapping are very similar to **Small RNA-seq mapping**. You can use the above command lines for long RNA-seq by setting ***`small_rna`*** to ***`False`*** in file `config/example.yaml`. It is **recommended** to specify the number of threads in `config/example.yaml` file by adding `threads_mapping: N`, or you can simply add `-j N` parameter in the exseek command. There is no peak calling step for long RNA-seq datasets because recurring fragment (domain) is not a distinctive feature of extracellular long RNA-seq datasets. 
277
276
278
---
277
---
279
278
280
### Counting expression matrix
279
### Counting expression matrix
281
exSEEK use featureCounts for counting expression matrix. 
280
exSEEK use featureCounts for counting expression matrix. 
282
```bash
281
```bash
283
exseek.py count_matrix -d example
282
exseek.py count_matrix -d example
284
```
283
```
285
284
286
**Notes:**
285
**Notes:**
287
* For small RNA-seq, the ouput folder `output/example/count_matrix/` contains 4 types of expression matrix:
286
* For small RNA-seq, the ouput folder `output/example/count_matrix/` contains 4 types of expression matrix:
288
> | Name | Transcript type |
287
 | Name | Transcript type |
289
> | :--- | :--- |
288
 | :--- | :--- |
290
> | transcript.txt | all full_length transcripts |
289
 | transcript.txt | all full_length transcripts |
291
> | transcript_mirna.txt | only miRNA |
290
 | transcript_mirna.txt | only miRNA |
292
> | long_fragments.txt | recurring peaks (domain) |
291
 | long_fragments.txt | recurring peaks (domain) |
293
> | mirna_and_long_fragments.txt | miRNA and recurring peaks (domain) |
292
 | mirna_and_long_fragments.txt | miRNA and recurring peaks (domain) |
294
293
295
* For long RNA-seq, the ouput folder `output/example/count_matrix/` contains 2 types of expression matrix:
294
* For long RNA-seq, the ouput folder `output/example/count_matrix/` contains 2 types of expression matrix:
296
> | Name | Transcript type |
295
 | Name | Transcript type |
297
> | :--- | :--- |
296
 | :--- | :--- |
298
> | featurecounts.txt | genome_long_rna |
297
 | featurecounts.txt | genome_long_rna |
299
> | circRNA.txt | circRNA |
298
 | circRNA.txt | circRNA |
300
299
301
---
300
---
302
301
303
### Normalization and batch removal
302
### Normalization and batch removal
304
303
305
exSEEK supports 5 normalization methods and 4 batch removal methods in `config/example.yaml`:
304
exSEEK supports 5 normalization methods and 4 batch removal methods in `config/example.yaml`:
306
```yaml
305
```yaml
307
normalization_method: ["TMM", "RLE", "CPM", "CPM_top", "null"]
306
normalization_method: ["TMM", "RLE", "CPM", "CPM_top", "null"]
308
batch_removal_method: ["null", "ComBat", "limma", "RUV", "null"]
307
batch_removal_method: ["null", "ComBat", "limma", "RUV", "null"]
309
count_method: [transcript, transcript_mirna, long_fragments, mirna_and_long_fragments, featurecounts(for long RNA-seq)]
308
count_method: [transcript, transcript_mirna, long_fragments, mirna_and_long_fragments, featurecounts(for long RNA-seq)]
310
batch_index: 1
309
batch_index: 1
311
```
310
```
312
311
313
You can get the normalized expression matrix generated by any combinations of normalization and batch removal methods by executing:
312
You can get the normalized expression matrix generated by any combinations of normalization and batch removal methods by executing:
314
```bash
313
```bash
315
exseek normalization -d example
314
exseek normalization -d example
316
```
315
```
317
316
318
> **Notes:**
317
 **Notes:**
319
> * When the method name is set to `null`, the step is skipped.
318
 * When the method name is set to `null`, the step is skipped.
320
> * You can specify the expression matrix type to be normalized via the `count_method` variable.
319
 * You can specify the expression matrix type to be normalized via the `count_method` variable.
321
> * `batch_index` is the column index of `data/example/batch_info.txt` to be used for ComBat batch removal.
320
 * `batch_index` is the column index of `data/example/batch_info.txt` to be used for ComBat batch removal.
322
> * The name pattern of **output** files in folder `output/example/matrix_processing` is: **`filter.null.Norm_${normalization_method}.Batch_${batch_removal_method}_${batch_index}.${count_method}.txt`**.
321
 * The name pattern of **output** files in folder `output/example/matrix_processing` is: **`filter.null.Norm_${normalization_method}.Batch_${batch_removal_method}_${batch_index}.${count_method}.txt`**.
323
322
324
You can choose the best combination methods based on the ***UCA*** score and then ***mKNN*** score, which is summarized in folder: `output/example/select_preprocess_method/uca_score/` and `output/example/select_preprocess_method/knn_score`.
323
You can choose the best combination methods based on the ***UCA*** score and then ***mKNN*** score, which is summarized in folder: `output/example/select_preprocess_method/uca_score/` and `output/example/select_preprocess_method/knn_score`.
325
324
326
The UCA metric quantifies the separation of samples from different biological groups, while the mKNN metric measures the uniformity of the distribution of samples from different batches. For a perfectly corrected expression matrix, both the ***UCA*** score and the ***mKNN*** score approach ***1***.
325
The UCA metric quantifies the separation of samples from different biological groups, while the mKNN metric measures the uniformity of the distribution of samples from different batches. For a perfectly corrected expression matrix, both the ***UCA*** score and the ***mKNN*** score approach ***1***.
327
326
328
The UCA score files look like this:
327
The UCA score files look like this:
329
| preprocess_method | uca_score |
328
| preprocess_method | uca_score |
330
| :--- | :--- |
329
| :--- | :--- |
331
| filter.null.Norm_CPM_top.Batch_limma_1 | 0.578 |
330
| filter.null.Norm_CPM_top.Batch_limma_1 | 0.578 |
332
| filter.null.Norm_CPM.Batch_limma_1 | 0.563 |
331
| filter.null.Norm_CPM.Batch_limma_1 | 0.563 |
333
| filter.null.Norm_CPM_top.Batch_ComBat_1 | 0.563 |
332
| filter.null.Norm_CPM_top.Batch_ComBat_1 | 0.563 |
334
| filter.null.Norm_CPM_top.Batch_RUV_1 | 0.564 | 
333
| filter.null.Norm_CPM_top.Batch_RUV_1 | 0.564 | 
335
334
336
And the mKNN score files look like this:
335
And the mKNN score files look like this:
337
| preprocess_method | knn_score |
336
| preprocess_method | knn_score |
338
| :--- | :--- |
337
| :--- | :--- |
339
| filter.null.Norm_CPM_top.Batch_limma_1 | 0.940 |
338
| filter.null.Norm_CPM_top.Batch_limma_1 | 0.940 |
340
| filter.null.Norm_CPM.Batch_limma_1 | 0.936 |
339
| filter.null.Norm_CPM.Batch_limma_1 | 0.936 |
341
| filter.null.Norm_CPM_top.Batch_ComBat_1 | 0.936 |
340
| filter.null.Norm_CPM_top.Batch_ComBat_1 | 0.936 |
342
| filter.null.Norm_CPM_top.Batch_RUV_1 | 0.927 | 
341
| filter.null.Norm_CPM_top.Batch_RUV_1 | 0.927 | 
343
342
344
***Alternatively, you can simply get the best-performance combination method (the highest averaged UCA and mKNN score) listed in `output/example/select_preprocess_method/combined_score/${count_method}/selected_methods.txt`.***
343
***Alternatively, you can simply get the best-performance combination method (the highest averaged UCA and mKNN score) listed in `output/example/select_preprocess_method/combined_score/${count_method}/selected_methods.txt`.***
345
344
346
After deciding the most proper combination of normalization and batch removal methods, you can specify the exact normalization and batch removal method by adjusting `normalization_method` and `batch_removal_method` parameters in `config/sample.yaml`, and the generated matrix will be used for the next step.
345
After deciding the most proper combination of normalization and batch removal methods, you can specify the exact normalization and batch removal method by adjusting `normalization_method` and `batch_removal_method` parameters in `config/sample.yaml`, and the generated matrix will be used for the next step.
347
346
348
---
347
---
349
348
350
### Feature selection and biomarker evaluation
349
### Feature selection and biomarker evaluation
351
350
352
This step identifies and evaluates exRNA biomarker panels selected by various combinations of feature selection methods and machine learning classifiers. 
351
This step identifies and evaluates exRNA biomarker panels selected by various combinations of feature selection methods and machine learning classifiers. 
353
352
354
exSEEK supported feature selection and classification methods:
353
exSEEK supported feature selection and classification methods:
355
```yaml
354
```yaml
356
selector: [DiffExp_TTest, RandomForest, LogRegL1, LogRegL2, SIS, ReliefF, SURF, MultiSURF]
355
selector: [DiffExp_TTest, RandomForest, LogRegL1, LogRegL2, SIS, ReliefF, SURF, MultiSURF]
357
356
358
classifier: [LogRegL2, RandomForest, RBFSVM, DecisionTree, MLP]
357
classifier: [LogRegL2, RandomForest, RBFSVM, DecisionTree, MLP]
359
```
358
```
360
359
361
You can evaluate all combinations of feature selection and classification methods based on the cross-validation results by running:
360
You can evaluate all combinations of feature selection and classification methods based on the cross-validation results by running:
362
```bash
361
```bash
363
exseek feature_selection -d example
362
exseek feature_selection -d example
364
```
363
```
365
364
366
> **Note:**
365
 **Note:**
367
> * You can adjust the maximum number of selected features `n_features_to_select` in `config/example.yaml`.
366
 * You can adjust the maximum number of selected features `n_features_to_select` in `config/example.yaml`.
368
> * You can setup the comparison groups for classification in `data/config/compare_groups.yaml`.
367
 * You can setup the comparison groups for classification in `data/config/compare_groups.yaml`.
369
> * The detailed parameters of machine learning can be found in `config/default_congfig.yaml`. 
368
 * The detailed parameters of machine learning can be found in `config/default_congfig.yaml`. 
370
> * The cross-validation results and trained models for individual combinations are in this directory:
369
 * The cross-validation results and trained models for individual combinations are in this directory:
371
**`output/example/cross_validation/filter.null.Norm_${normalization_method}.Batch_${batch_removal_method}_${batch_index}.${count_method}/${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}`**.
370
**`output/example/cross_validation/filter.null.Norm_${normalization_method}.Batch_${batch_removal_method}_${batch_index}.${count_method}/${compare_group}/${classifier}.${n_select}.${selector}.${fold_change_filter_direction}`**.
372
> * Selected features (biomarker panels) for each model can be found in `features.txt` in the above-mentioned directory.
371
 * Selected features (biomarker panels) for each model can be found in `features.txt` in the above-mentioned directory.
373
372
374
Three summary files will be generated in this step:
373
Three summary files will be generated in this step:
375
```bash
374
```bash
376
 output/example/summary/cross_validation/metrics.test.txt
375
 output/example/summary/cross_validation/metrics.test.txt
377
 output/example/summary/cross_validation/metrics.train.txt
376
 output/example/summary/cross_validation/metrics.train.txt
378
 output/sxample/summary/cross_validation/feature_stability.txt
377
 output/sxample/summary/cross_validation/feature_stability.txt
379
 ```
378
 ```
380
379
381
You can choose the most proper combination and its identified features (biomarker panel) base on ***ROC_AUC*** and ***feature stability*** score summarized in the above three files.
380
You can choose the most proper combination and its identified features (biomarker panel) base on ***ROC_AUC*** and ***feature stability*** score summarized in the above three files.
382
381
383
The `metrics.*.txt` file looks like:
382
The `metrics.*.txt` file looks like:
384
383
385
| classifier | n_features | selector | fold_change_direction | compare_group | filter_method | imputation | normalization | batch_removal | count_method | preprocess_method | split | accuracy | average_precision | f1_score | precision | recall | roc_auc |
384
| classifier | n_features | selector | fold_change_direction | compare_group | filter_method | imputation | normalization | batch_removal | count_method | preprocess_method | split | accuracy | average_precision | f1_score | precision | recall | roc_auc |
386
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
385
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
387
| LogRegL2 | 5 | MaxFeatures_RandomForest | any | Normal-HCC | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_1 | 1 | 0.928 | 0.916 | 0.800 | 1.000 | 0.666 | 0.969 |
386
| LogRegL2 | 5 | MaxFeatures_RandomForest | any | Normal-HCC | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_1 | 1 | 0.928 | 0.916 | 0.800 | 1.000 | 0.666 | 0.969 |
388
| LogRegL2 | 5 | DiffExp_TTest| any | Normal-HCC | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_10 | 1 | 0.928 | 0.743 | 0.800 | 1.000 | 0.666 | 0.696 |
387
| LogRegL2 | 5 | DiffExp_TTest| any | Normal-HCC | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_10 | 1 | 0.928 | 0.743 | 0.800 | 1.000 | 0.666 | 0.696 |
389
388
390
The `feature_stability.txt` file looks like:
389
The `feature_stability.txt` file looks like:
391
390
392
| classifier | n_features | selector | fold_change_direction | compare_group | filter_method | imputation | normalization | batch_removal | count_method | preprocess_method | feature_stability
391
| classifier | n_features | selector | fold_change_direction | compare_group | filter_method | imputation | normalization | batch_removal | count_method | preprocess_method | feature_stability
393
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
392
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
394
| LogRegL2 | 5 | DiffExp_TTest | any | Normal-HCC | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_1 | 0.450 |
393
| LogRegL2 | 5 | DiffExp_TTest | any | Normal-HCC | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_1 | 0.450 |
395
| RBFSVM | 5 | DiffExp_TTest | any | Normal-stage_A | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_1 | 0.473 |
394
| RBFSVM | 5 | DiffExp_TTest | any | Normal-stage_A | filter | null | Norm_RLE | Batch_limma_1 | mirna_and_domains_rna | filter.null.Norm_RLE.Batch_limma_1 | 0.473 |
396
 
395
 
397
396
398
397
399
## Copyright and License Information
398
## Copyright and License Information
400
399
401
Copyright (C) 2019 Tsinghua University, Beijing, China 
400
Copyright (C) 2019 Tsinghua University, Beijing, China 
402
401
403
This program is licensed with commercial restriction use license. Please see the [LICENSE](https://github.com/lulab/exSEEK_docs/blob/master/LICENSE) file for details.
402
This program is licensed with commercial restriction use license. Please see the [LICENSE](https://github.com/lulab/exSEEK_docs/blob/master/LICENSE) file for details.