The inspiration for this project comes from the publication: Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126698/)
The data used for this analysis is Illumina Hiseq2500 RNA-seq data found at NCBI SRA under the project ID SRP117020. Samples were obtained from 130 patients diagnosed with non-small cell lung cancer (NSCLC).
The data contains sequences with distribution of poor to well differentiated adenocarcinomas and squamous cell cancers
The reference genome used for alignment can be found here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/.
graph TD;
RNA_seq_FASTQs-->Reference_Genome_Alignment_HISAT2;
RNA_seq_FASTQs-->FastQC;
Reference_Genome_Alignment_HISAT2-->Transcript_Quantification/Merging_StringTie;
Transcript_Quantification/Merging_StringTie-->Differential_Expression_DESeq2;
Differential_Expression_DESeq2-->Pathway_Analysis_KOBAS;
Differential_Expression_DESeq2-->SCLC/NSCLC_Gene_Classification;
Pathway_Analysis_KOBAS-->SCLC/NSCLC_Gene_Classification;
Differential_Expression_DESeq2-->Dataset_Annotation;
Pathway_Analysis_KOBAS-->Dataset_Annotation;
Dataset_Annotation-->Data_Preprocessing;
Data_Preprocessing-->Maching_Learning_Models;
Maching_Learning_Models-->Logistic_Regression;
Maching_Learning_Models-->Random_Forest;
Maching_Learning_Models-->Gradient_Boost;
Maching_Learning_Models-->AdaBoost;
Maching_Learning_Models-->KNN;
Preparation of the data for the ML algorithms includes a pipeline containing various bioinformatics tools:
graph TD;
SRA_accessions.txt-->SRA-Toolkits/prefetch;
SRA-Toolkits/prefetch-->.sra_files;
.sra_files-->fastq-dump;
fastq-dump-->untrimmed_fastqs;
untrimmed_fastqs-->FastQC;
FastQC-->MultiQC
MultiQC-->untrimmed_quality_report
untrimmed_fastqs-->fastp;
fastp-->trimmed_fastq_files;
trimmed_fastq_files-->FastQC2;
FastQC2-->MultiQC2;
MultiQC2-->trimmed_quality_report
trimmed_fastq_files-->HISAT2;
while read accession; do
prefetch "$accession"
done < SRA_accessions.txt`
for dir in SRR*/; do
echo "Processing $dir"
fastq-dump --split-files --gzip "$dir/${dir%/}.sra"
done
for file in *_1.fastq; do
base=$(basename "$file" "_1.fastq")
fastp -i "${base}_1.fastq" -I "${base}_2.fastq" -o "../FASTQ_SAMPLES_TRIMMED/${base}_1_trimmed.fq" -O "../FASTQ_SAMPLES_TRIMMED/${base}_2_trimmed.fq"
done
alignments.sh
script:chmod +x alignment.sh
./alignment.sh