# ML-Final-Project

## Introduction

The inspiration for this project comes from the publication: Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126698/)

The data used for this analysis is Illumina HiSeq 2500 RNA-seq data, available from the NCBI SRA under project ID SRP117020. Samples were obtained from 130 patients diagnosed with non-small cell lung cancer (NSCLC).
The samples cover a range of poorly to well-differentiated adenocarcinomas and squamous cell cancers.

The reference genome used for alignment can be found here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/.

### Project Overview

```mermaid
graph TD;
    RNA_seq_FASTQs-->Reference_Genome_Alignment_HISAT2;
    RNA_seq_FASTQs-->FastQC;
    Reference_Genome_Alignment_HISAT2-->Transcript_Quantification/Merging_StringTie;
    Transcript_Quantification/Merging_StringTie-->Differential_Expression_DESeq2;
    Differential_Expression_DESeq2-->Pathway_Analysis_KOBAS;
    Differential_Expression_DESeq2-->SCLC/NSCLC_Gene_Classification;
    Pathway_Analysis_KOBAS-->SCLC/NSCLC_Gene_Classification;
    Differential_Expression_DESeq2-->Dataset_Annotation;
    Pathway_Analysis_KOBAS-->Dataset_Annotation;
    Dataset_Annotation-->Data_Preprocessing;
    Data_Preprocessing-->Machine_Learning_Models;
    Machine_Learning_Models-->Logistic_Regression;
    Machine_Learning_Models-->Random_Forest;
    Machine_Learning_Models-->Gradient_Boost;
    Machine_Learning_Models-->AdaBoost;
    Machine_Learning_Models-->KNN;
```

#### Data Preparation (Part I)

Preparing the data for the ML algorithms involves a pipeline of several bioinformatics tools:
```mermaid
graph TD;
    SRA_accessions.txt-->SRA-Toolkits/prefetch;
    SRA-Toolkits/prefetch-->.sra_files;
    .sra_files-->fastq-dump;
    fastq-dump-->untrimmed_fastqs;
    untrimmed_fastqs-->FastQC;
    FastQC-->MultiQC;
    MultiQC-->untrimmed_quality_report;
    untrimmed_fastqs-->fastp;
    fastp-->trimmed_fastq_files;
    trimmed_fastq_files-->FastQC2;
    FastQC2-->MultiQC2;
    MultiQC2-->trimmed_quality_report;
    trimmed_fastq_files-->HISAT2;
```

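The diagram includes FastQC and MultiQC steps, both before and after trimming. A minimal sketch of how those two steps could be run is shown here. The function name `run_qc` and the directory names in the example calls are assumptions for illustration, not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch: run FastQC on every FASTQ file in a directory, then aggregate
# the individual reports with MultiQC. run_qc is a hypothetical helper.

run_qc() {
    local in_dir="$1"    # directory holding FASTQ files
    local out_dir="$2"   # directory that will receive the reports
    local files=()
    local f

    # Collect plain and gzipped FASTQ files; skip patterns that match nothing.
    for f in "$in_dir"/*.fastq "$in_dir"/*.fastq.gz "$in_dir"/*.fq "$in_dir"/*.fq.gz; do
        [ -e "$f" ] && files+=("$f")
    done
    if [ "${#files[@]}" -eq 0 ]; then
        echo "No FASTQ files found in $in_dir" >&2
        return 1
    fi

    mkdir -p "$out_dir"
    # FastQC writes one report per input file into out_dir.
    fastqc --outdir "$out_dir" "${files[@]}" || return 1
    # MultiQC scans out_dir and writes a single aggregated report there.
    multiqc "$out_dir" --outdir "$out_dir"
}

# Example calls (hypothetical directory names):
#   run_qc FASTQ_SAMPLES qc_untrimmed
#   run_qc FASTQ_SAMPLES_TRIMMED qc_trimmed
```

Using one function for both the untrimmed and trimmed reads keeps the two quality reports directly comparable.
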
* To retrieve all samples with the SRA Toolkit's `prefetch`:
```
# Download every accession listed (one per line) in SRA_accessions.txt
while read -r accession; do
    prefetch "$accession"
done < SRA_accessions.txt
```

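With many accessions, a download can fail partway through. The sketch below lists accessions whose `.sra` file is missing or empty. It assumes the `<accession>/<accession>.sra` layout that the conversion loop in this README expects; depending on toolkit version and settings the file name may differ. The function name `missing_accessions` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report accessions from a list that have no non-empty .sra file.
# Assumes prefetch wrote <accession>/<accession>.sra in the current directory.

missing_accessions() {
    local list="$1"      # file with one accession per line
    local accession

    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r accession || [ -n "$accession" ]; do
        [ -z "$accession" ] && continue          # skip blank lines
        if [ ! -s "$accession/$accession.sra" ]; then
            echo "$accession"                    # missing or zero-length download
        fi
    done < "$list"
}

# Example: retry only the accessions that are still missing.
#   missing_accessions SRA_accessions.txt > retry.txt
```
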
* To convert all paired sample .sra files to fastq files:
```
# prefetch creates one directory per accession (SRR*/) containing <accession>.sra
for dir in SRR*/; do
    echo "Processing $dir"
    # --split-files writes the reads to separate _1/_2 files; --gzip compresses them
    fastq-dump --split-files --gzip "${dir}${dir%/}.sra"
done
```

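Because the trimming step works on read pairs, it can help to confirm that every `_1` file has a matching `_2` file before continuing. This is a minimal sketch under the assumption that the conversion produced `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`; the function name `unpaired_samples` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the sample names whose _2 mate file is missing.

unpaired_samples() {
    local dir="${1:-.}"   # directory to scan, defaults to the current one
    local r1 r2

    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"        # expected mate file
        [ -e "$r2" ] || basename "$r1" _1.fastq.gz
    done
    return 0
}

# Example: an empty result means every sample is properly paired.
#   unpaired_samples .
```
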
* To trim all fastq samples:
```
# Create the output directory if it does not exist yet
mkdir -p ../FASTQ_SAMPLES_TRIMMED
# fastq-dump --gzip produces .fastq.gz files, which fastp reads directly
for file in *_1.fastq.gz; do
    base=$(basename "$file" "_1.fastq.gz")
    fastp -i "${base}_1.fastq.gz" -I "${base}_2.fastq.gz" -o "../FASTQ_SAMPLES_TRIMMED/${base}_1_trimmed.fq" -O "../FASTQ_SAMPLES_TRIMMED/${base}_2_trimmed.fq"
done
```

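By default fastp writes its report to `fastp.json` and `fastp.html` in the working directory, so each sample in a loop overwrites the previous report. The sketch below shows one way to keep a report per sample using fastp's `--json` and `--html` options. The function name `trim_pair` and the report file names are assumptions; MultiQC has a fastp module that can read such JSON reports.

```shell
#!/usr/bin/env bash
# Sketch: trim one read pair with fastp and keep a per-sample report.
# trim_pair is a hypothetical helper; output names mirror the loop in this README.

trim_pair() {
    local r1="$1"        # path to the _1 FASTQ file (.fastq or .fastq.gz)
    local out_dir="$2"   # directory for trimmed reads and reports
    local dir ext base r2

    dir=$(dirname "$r1")
    case "$r1" in
        *_1.fastq.gz) ext="fastq.gz" ;;
        *_1.fastq)    ext="fastq" ;;
        *) echo "Unexpected file name: $r1" >&2; return 1 ;;
    esac
    base=$(basename "$r1" "_1.$ext")
    r2="$dir/${base}_2.$ext"
    [ -e "$r2" ] || { echo "Missing mate for $r1" >&2; return 1; }

    mkdir -p "$out_dir"
    fastp -i "$r1" -I "$r2" \
          -o "$out_dir/${base}_1_trimmed.fq" -O "$out_dir/${base}_2_trimmed.fq" \
          --json "$out_dir/${base}_fastp.json" --html "$out_dir/${base}_fastp.html"
}

# Example:
#   for f in *_1.fastq.gz; do trim_pair "$f" ../FASTQ_SAMPLES_TRIMMED; done
```
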
* To align reads to the reference genome using the `alignment.sh` script:
```
chmod +x alignment.sh
./alignment.sh
```
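
The contents of the alignment script are not shown here. As a rough illustration of what a HISAT2 alignment step for one trimmed read pair can look like, here is a hypothetical sketch. It is not the repository's script: the function name `align_sample`, the index prefix, the thread count, and the use of `samtools sort` to produce a sorted BAM are all assumptions. The `--dta` option asks HISAT2 to report alignments tailored for transcript assemblers such as StringTie, which is the next stage in the project overview.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of aligning one trimmed read pair with HISAT2.
# Assumes a HISAT2 index already exists and samtools is installed.

align_sample() {
    local index="$1"         # HISAT2 index prefix (assumed to be built already)
    local r1="$2"            # path to <sample>_1_trimmed.fq
    local out_dir="$3"       # directory for sorted BAM files
    local threads="${4:-4}"  # alignment threads, defaults to 4
    local base r2

    base=$(basename "$r1" _1_trimmed.fq)
    r2="$(dirname "$r1")/${base}_2_trimmed.fq"
    [ -e "$r2" ] || { echo "Missing mate for $r1" >&2; return 1; }

    mkdir -p "$out_dir"
    # Stream the SAM output straight into samtools to avoid a large temporary file.
    hisat2 -p "$threads" --dta -x "$index" -1 "$r1" -2 "$r2" \
        | samtools sort -o "$out_dir/${base}.sorted.bam" -
}

# Example (hypothetical index prefix and output directory):
#   for f in FASTQ_SAMPLES_TRIMMED/*_1_trimmed.fq; do
#       align_sample grch38_index "$f" ALIGNMENTS
#   done
```
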