All required datasets are under 'datasets' or '5-Datasets' directories. For large data files that cannot be uploaded to GitHub, they are under FigShare doi:10.6084/m9.figshare.19126814.
All scripts are under 'scripts' directory. For each step, follow the corresponding R (or perl) script.
Install required R packages:
install.packages("pacman")
install.packages("gtools")
install.packages("verification")
install.packages("doParallel")
install.packages("foreach")
install.packages("magrittr")
install.packages("tibble")
install.packages("data.table")
install.packages("dplyr")
install.packages("reshape2")
install.packages("WGCNA")
install.packages("mediation")
install.packages("tidyverse")
install.packages("ranger")
Dimensionality reduction for metagenomic data was performed using ssGSEA algorithm. Two options are:
Input: metagenome.gct (a z-score normalized abundance table in gctx format with KEGG KO ids in rows and samples in columns), KEGG_modules.gmt (KEGG gene-module mapping file in gmt format)
Script: Rscript ssgsea-cli.R -i metagenome.gct -o 1_metagenome -d KEGG_modules.gmt --minoverlap 2
Output: 1_metagenome-combined.gct, 1_metagenome-fdr-pvalues.gct, 1_metagenome-parameters.gct, 1_metagenome-scores.gct (Both 1_metagenome-scores.gct and 1_metagenome-combined.gct contain the module-level metagenome profile)
Dimensionality reduction for metabolomic data was performed using WGCNA.
Input: metabolome.txt (a z-score normalized abundance table with compound ID in rows and samples in column)
Script: 1.WGCNA_metabolomics.r
Output: 1_metaB.module_assign.txt, 1_metaB.module_eigengene.txt
Dimensionality reduction for host transcriptomic data was performed using WGCNA.
Input: transcriptome.txt (a vst normalized gene expression table with gene symbol in rows and samples in column)
Script: 1.WGCNA_transcriptomics.r
Output: 1_hostT.module_assign.txt, 1_hostT.module_eigengene.txt
The dimensionality reduced multi-omic data profiles (1_metagenome-combined.gct, 1_metaB.module_eigengene.txt, 1_hostT.module_eigengene.txt) was generated to be used for downstream analyses.
Differential metagenomic modules were obtained by: 1) obtaining effect size of each KOs in association with disease in a general linear model adjusting demographic confounders, 2) ranking the features by their effect sizes, and 3) comparing the ranks of features within or outside each module in a Wilcoxon rank-sum test. The differential modules were then associated with sputum neutrophil or eosinophil percentages, and assigned as 'NEU' or 'EOS' if specifically significantly correlated with sputum neutrophil or eosinophil percentages.
Input: metagenome.gct, KEGG_modules.gmt, metadata.txt (containing a column named SampleID, variables of confounders, and a column indicating disease state), meta.mediation.NEU.txt or meta.mediation.EOS.txt (containing a column named SampleID, variables of confounders, a column indicating NEU or EOS)
Script: 2.significant.metaG.modules.r
Output: 2_significant_metaG_modules.RData (a R data file containing the lists of metagenomic modules associated with COPD and specifically correlated with NEU or EOS)
Differential MetaB, HostT and HostP modules/features were obtained by associating the modules/features with COPD in a general linear model adjusting demographic confounders. The differential modules were then associated with sputum neutrophil or eosinophil percentages, and assigned as 'NEU' or 'EOS' if specifically significantly correlated with sputum neutrophil or eosinophil percentages.
Input: 1_metaB.module_eigengene.txt (module-level metabolome profile output from step 1), metadata.txt, meta.mediation.NEU.txt or meta.mediation.EOS.txt
Script: 2.significant.metaB.modules.r
Output: 2_significant_metaB_modules.RData (a R data file containing the list of metabolomic modules associated with COPD and specifically correlated with NEU or EOS)
Input: 1_hostT.module_eigengene.txt (module-level transcriptome profile output from step 1), metadata.txt, meta.mediation.NEU.txt or meta.mediation.EOS.txt
Script: 2.significant.hostT.modules.r
Output: 2_significant_hostT_modules.RData (a R data file containing the list of metabolomic modules associated with COPD and specifically correlated with NEU or EOS)
Input: sputum_proteome.txt (feature-level proteome profile), metadata.txt, meta.mediation.NEU.txt or meta.mediation.EOS.txt
Script: 2.significant.hostP.modules.r
Output: 2_significant_hostP_modules.RData (a R data file containing the list of protein features associated with COPD and specifically correlated with NEU or EOS)
This step generates MetaG, MetaB, HostT modules and HostP features differentially abundant between COPD and healthy controls, and specifically associated with sputum neutrophilic or eosinophilic inflammation.
Mediation analysis was performed sequentially from MetaG-MetaB, MetaB-HostT, HostT-HostP with sputum neutrophil or eosinophil percentage, respectively. Take the analysis for neutrophil as an example:
Input: 1_metaG-combined.gct (module-level metagenome profile output from step 1), 1_metaB.module_eigengene.txt, meta.mediation.NEU.txt, 2_significant_metaG_modules.RData (significant metaG modules output from step 2), 2_significant_metaB_modules.RData (significant metaB modules output from step 2)
Script: 3.mediation_metaG.2.metaB.r
Output: 3_MetaG_affects_NEU_through_MetaB.txt (metaG-metaB-NEU mediation analysis results, including P-value and proportion of mediation effects)
Input: 1_metaB.module_eigengene.txt, 1_hostT.module_eigengene.txt, meta.mediation.NEU.txt, 2_significant_metaB_modules.RData (significant metaB modules output from step 2), 2_significant_hostT_modules.RData (significant hostT modules output from step 2)
Script: 3.mediation_metaB.2.hostT.r
Output: 3_MetaB_affects_NEU_through_HostT.txt (metaB-hostT-NEU mediation analysis results, including P-value and proportion of mediation effects)
Input: 1_hostT.module_eigengene.txt, sputum_proteome.txt, meta.mediation.NEU.txt, 2_significant_hostT_modules.RData (significant hostT modules output from step 2), 2_significant_hostP_features.RData (significant hostP features output from step 2)
Script: 3.mediation_hostT.2.hostP.r
Output: 3_HostT_affects_NEU_through_HostP.txt (hostT-hostP-NEU mediation analysis results, including P-value and proportion of mediation effects)
MetaG-MetaB: Biological links were identified if genes in the MetaG module were involved in metabolic reaction for the metabolites in the MetaB module, or if the metabolites in the MetaB module were present in the MetaG module, based on KEGG and MetaCyc databases.
Input: ko01000.keg (downloaded from KEGG), metacyc_reactions.txt (downloaded and reformatted from MetaCyc)
Script: 4.KO2cmpd.r
Output: KO2EC.list.RData (R data file linking KO IDs and EC), EC2CMPD.lists.RData (R data file linking EC IDs and compounds), KO2CMPD.lists.RData (R data file linking KO IDs and compounds)
1. Generate metabolite sdf file from SMILES string list for MetaCyc compounds:
obabel metacyc.smi –O metacyc.sdf --gen3D
2. Generate fastsearch index for the metacyc.sdf file:
obabel metacyc.sdf -ofs
3. For each metabolite in metabolomic data generate its own sdf file:
obabel metabo_$i.smi -O metabo_$i.sdf --gen3D
4. Search metabolites against metacyc.fs index to get top 10 hits:
obabel metacyc.fs –otxt –s metabo_$i.sdf –at 10 –aa > metabo_$i.results
Input: cmpd2metabo.txt (curated matching information between metabolite IDs and MetaCyc compound IDs), KO2CMPD.lists.RData (generated from above step)
Script: 4.KO2metabo.r
Output: KO2METABO.lists.RData (R data file linking KO IDs to metabolite IDs)
Input: 3_MetaG_affects_NEU_through_MetaB.txt (significant MetaG-MetaB links obtained in step 3), KEGG_modules.tab (KEGG Module-KO mapping file), KO2METABO.lists.RData (R data file linking KO IDs to metabolite IDs generated from above step), 1_metaB_module_assign.txt, metabolome.txt, metagenome.gct, Metabo.KEGGModule.match.txt (file storing KEGG compound ID-module mapping information)
Script: 4.MetaG.MetaB.link.r
Output: 4_MetaG.MetaB.modules.NEU.linked.txt (linked MetaG-MetaB modules and the MetaG and MetaB features that link the two modules)
Input: 3_MetaB_affects_NEU_through_HostT.txt (significant MetaB-HostT links obtained in step 3), metabolome.txt, 1_metaB.module_assign.txt, transcriptome.txt, 1_hostT.module_assign.txt, metabo2CIDm.txt (compound-CIDm ID mapping file manually curated based on STITCH database), all_CIDm_targets.txt (CIDm ID-host target gene mapping file manually curated based on STITCH database)
Script: 4.MetaB.HostT.link.r
Output: 4_MetaB.HostT.modules.NEU.linked.txt (linked MetaB-HostT modules and the MetaB and HostT features that link the two modules)
Input: 3_HostT_affects_NEU_through_HostP.txt (significant HostT-HostP links obtained in step 3), transcriptome.txt, sputum_proteome.txt, 1_hostT.module_assign.txt, protein_info.txt, human_pathway.gmt (KEGG/MetaBase pathway-gene ID mapping file)
Script: 4.HostT.HostP.link.r
Output: 4_HostT.module_HostP.protein.NEU.linked.txt (linked MetaB-HostT modules and the gene/protein ID that links the two modules)
Leave-one-species-out analysis was performed to identify driver taxa for the MetaG-MetaB associations by recalculating module-level associations with each species (using bin-based or gene-based taxonomy) iteratively excluded one at a time.
generating files of speciesX-combined.gct for the dimensionality reduced MetaG profiles with speciesX left out.
Generate metagenomic KO-level profiles with each species removed one at a time.
Input: geneDepth.txt (gene-level abundance file with the first column being gene IDs), ko.txt (KO-gene mapping file), bin_membership.txt (gene-bin mapping file), bin_species.txt (species-bin mapping file)
Script: 5.LOSO.KO.r
Output: 5_LOSO_ko.abund (directory containing all ko.abund_rm.speciesX.gct files, one for each species removed)
Input: 4_MetaG.MetaB.modules.NEU.linked.txt (the linked MetaG-MetaB modules obtained in step 4), 1_metaG-combined.gct, 1_metaB.module_eigengene.txt, LOSO_metaG_DR (containing all dimensionality reduced MetaG profiles named as speciesX-combined.gct, one for each species excluded)
Script: 5.LOSO.delta.r
Output: 5_LOSO_NEU.delta.spearman.r.txt (containing the delta spearman's rho when each species was iteratively removed)
Input: 5_LOSO_ko.abund (output in last step), metagenome.gct (null KO-level profile), metadata.txt
Script: 5.KOSO.KO.zscore.r
Output: 5_LOSO.KO.zscore.txt (the KO by species matrix table with z-scores)
Perform random forest analysis using each linked MetaG-MetaB-HostT set to predict sputum neutrophil or eosinophil percentage. Take neutrophil as an example:
We first aggregated MetaG-MetaB and MetaB-HostT links to the full-path of MetaG-MetaB-HostT, by linking up 'KO-metabolite-host gene' feature-level information.
Then we performed a random forest regression between each linked set of MetaG-MetaB-HostT modules and NEU or EOS, and ranked the module sets by model performance scores.
Input: 4_MetaG.MetaB.modules.NEU.linked.txt, 4_MetaB.HostT.modules.NEU.linked.txt, meta.mediation.NEU.txt or meta.mediation.EOS.txt, 1_metaG-combined.gct, 1_metaB.module_eigengene.txt, 1_hostT.module_eigengene.txt
Script: 6.random_forest.r
Output: 6_NEU_prediction.performance_byLinks.rf.txt (containing RMSE, RSQ and MAE scores for each linked MetaG-MetaB-HostT set)