scPanel

scPanel selects a sparse gene panel from responsive cell population(s) for patient-level classification in single cell RNA sequencing (scRNA-seq) data.

This repository is the official code implementation of the paper scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets, published in Briefings in Bioinformatics.

🔬 Key Features

Identify cell populations that are responding to the perturbation (e.g., disease, drugs)
Selects a minimal set of genes that can discriminate two different status in the selected cell population(s)
Trains patient-level ML/DL classifiers that can predict patients in two different status

Specifically, patients are split into training and testing sets. In the training set, cell populations responsive to perturbations are scored by quantifying how well each cell population is separated between two conditions. With the selected population, Support Vector Machine Recursive Feature Elimination (SVM-RFE) is applied to identify a minimal number of genes with high predictive power. The number of genes in the panel is automatically decided in a data-driven way to avoid bias from manual inspection. Using the selected cell population(s) and corresponding gene panel(s), scPanel constructs a patient-level classifier with the training data and evaluates its performance in the testing data to validate the power of identified genes. All the data splitting involved in scPanel is done at the patient level so that the importance of the selected cell population, genes, and the performance of corresponding classifiers are genearalizable to all patients.

💡 Why scPanel is better:

Cost-Effective: Reduces the cost of sequencing with a small number of genes needed for assay
Automated: Decides the number of genes in the panel automatically in a data-driven manner
Generalizable: Ensures robust and transferable results by patient-level splitting and evaluation
Deep Learning Enabled: Supports advanced deep classifiers, i.e., Graph Attention Networks (GATs), for capturing robust gene representations
Interoperable: Fully compatible with Scanpy/Anndata framework

⚙️ Documentation

Documentation is being actively updated. Check the current version (23/08/2024) here:
📘 [[scPanel Documentation]](https://scpanel.readthedocs.io/en/latest/autoapi/scpanel/index.html)

Method Overview

scPanel follows a three-step pipeline:

Identify responsive cell population
Identify a sparse gene panel
Patient-level classification

Usage

📦 Installation

You can install scPanel directly via pip:

pip install scpanel

🧬 Input scRNA-seq data

Quality control and preprocess data using standard workflow.
Annotate cell populations.
Input AnnData Object to scPanel.

🚀 Functions in scPanel

preprocess(adata, ct_col, y_col, pt_col, class_map)
standardize metadata
split_train_test(adata, out_dir, min_cells=20, min_samples=3, test_pt_size=0.2, random_state=3467, verbose=0)
split patients into 1) training set for cell type selection, gene panel identification, and classifiers training, 2) testing set to evaluate the performance of classifiers and validate the predictive power of the gene panel.
cell_type_score(adata_train_dict, out_dir, ncpus, n_iterations, sample_n_cell, n_iterations=100, verbose=False)
calculate cell type responsive score (AUC) for each cell population annotated.
plot_cell_type_score(AUC, AUC_all)
visualize cell type responsive score (AUC)
select_celltype(adata_train_dict, out_dir, celltype_selected)
prepare anndata for gene panel selection
split_n_folds(adata_train, nfold, out_dir=None, random_state=2349)
split data into multiple folds for gene selection
gene_score(adata_train, train_index_list, val_index_list, sample_weight_list, metric, out_dir, ncpus, step=0.03, verbose = False)
scoring genes by their predictive power
decide_k(adata_train, n_genes_plot=100)
automatically decide the number of genes selected for patient classification
plot_gene_score(adata_train, n_genes_plot = 200, width=5, height=4, k=None)
visualize the gene score
select_gene(adata_train, top_n_feat, out_dir=None, step=0.03, n_genes_plot=100, verbose=0)
select the top K (returned by decide_k) genes from the training set
transform_adata(adata_train, adata_test_dict, selected_gene)
subset the training and testing set with the selected cell population and genes
models_train(adata_train_final, search_grid, out_dir=None, param_grid=None)
train classifiers with LR, KNN, RF, SVM, GAT
models_predict(clfs, adata_test_final, out_dir=None)
predict the probabilities of cells in the testing set
pt_pred(adata_test_final, cell_pred_col = 'median_pred_score', num_bootstrap=None)
predict the patient label in the testing set
plot_roc_curve(adata_test_final, sample_id, cell_pred_col, ncols = 4, hspace = 0.25, wspace = None, ax = None, scatter_kws = None, legend_kws = None)
visulize the aggregation of cell-level probabilities to patient-level label using area under the curve
plot_violin(adata, cell_pred_col = 'median_pred_score', dot_size = 2, ax=None, palette=None, xticklabels_color=False, text_kws={})
visualize the patient-level prediction

Citation

If you use scPanel in your work, please cite the scPanel publication:

Xie, Yi, et al. "scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets." Briefings in Bioinformatics 25.6 (2024): bbae482.

@article{xie2024scpanel,
  title={scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets},
  author={Xie, Yi and Yang, Jianfei and Ouyang, John F and Petretto, Enrico},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={6},
  pages={bbae482},
  year={2024},
  publisher={Oxford University Press}
}