Diff of /README.md [000000] .. [d90ecf]

# scPanel

scPanel selects **a sparse gene panel** from responsive cell population(s) for **patient-level classification** in single-cell RNA sequencing (**scRNA-seq**) data.

This repository is the official code implementation of the paper [scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets](https://academic.oup.com/bib/article/25/6/bbae482/7796623), published in *Briefings in Bioinformatics*.

### 🔬 Key Features

- Identifies **cell populations** that respond to a perturbation (e.g., disease, drugs)

- Selects a **minimal set of genes** that discriminates between two conditions in the selected cell population(s)

- Trains **patient-level ML/DL classifiers** that assign patients to one of two conditions

![Framework Overview](./framework.png)

Specifically, patients are split into training and testing sets. In the training set, each cell population is scored for its responsiveness to the perturbation by quantifying how well its cells are separated between the two conditions. Within the selected population(s), Support Vector Machine Recursive Feature Elimination (SVM-RFE) is applied to identify a minimal set of genes with high predictive power. The number of genes in the panel is decided automatically in a data-driven way to avoid bias from manual inspection. Using the selected cell population(s) and corresponding gene panel(s), scPanel trains a patient-level classifier on the training data and evaluates it on the testing data to validate the predictive power of the identified genes. All data splitting in scPanel is done at the patient level, so that the importance of the selected cell population(s) and genes, and the performance of the corresponding classifiers, are generalizable to all patients.
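
To make the patient-level splitting concrete, here is a minimal sketch in plain Python. It is not scPanel's implementation: the function name `split_patients`, the per-condition stratification, and the toy data are assumptions made for illustration. The point it shows is that patients, not cells, are assigned to the training or testing set, so no patient contributes cells to both.

```python
import random
from collections import defaultdict


def split_patients(patient_labels, test_fraction=0.2, seed=0):
    """Split patient IDs into train/test sets, stratified by condition.

    patient_labels: dict mapping patient ID -> condition label.
    Returns (train_ids, test_ids) as sets of patient IDs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_label[label].append(pid)

    train_ids, test_ids = set(), set()
    for label, pids in by_label.items():
        rng.shuffle(pids)
        # Hold out at least one patient per condition for testing.
        n_test = max(1, round(len(pids) * test_fraction))
        test_ids.update(pids[:n_test])
        train_ids.update(pids[n_test:])
    return train_ids, test_ids


# Toy metadata (hypothetical): one (patient, condition) pair per cell.
cells = [("P1", "control"), ("P1", "control"), ("P2", "control"),
         ("P3", "control"), ("P4", "disease"), ("P4", "disease"),
         ("P5", "disease"), ("P6", "disease")]
patient_labels = dict(cells)  # each patient has a single condition
train_ids, test_ids = split_patients(patient_labels, test_fraction=0.34)

# Cells follow their patient, so no patient appears in both sets.
train_cells = [c for c in cells if c[0] in train_ids]
test_cells = [c for c in cells if c[0] in test_ids]
```

Splitting at the cell level instead would let cells from the same patient land in both sets, which tends to inflate test performance.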

### 💡 Why scPanel is better:

- **Cost-Effective**: Reduces sequencing cost by requiring only a small number of genes for the assay

- **Automated**: Decides the number of genes in the panel automatically in a data-driven manner

- **Generalizable**: Ensures robust and transferable results through patient-level splitting and evaluation

- **Deep Learning Enabled**: Supports advanced deep classifiers, i.e., Graph Attention Networks (GATs), for capturing robust gene representations

- **Interoperable**: Fully compatible with the Scanpy/AnnData framework

### ⚙️ Documentation
Documentation is being actively updated. Check the current version (23/08/2024) here:  
📘 [[scPanel Documentation]](https://scpanel.readthedocs.io/en/latest/autoapi/scpanel/index.html)

## Method Overview

scPanel follows a three-step pipeline:

1. Identify responsive cell population

2. Identify a sparse gene panel

3. Patient-level classification
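
As an illustration of step 1, the sketch below ranks cell populations by an AUC that measures how well classifier scores separate cells from the two conditions. It is a plain-Python sketch, not scPanel's code: `auc_score`, the population names, and the scores are hypothetical, and scPanel's own scoring procedure may differ in how the scores are produced.

```python
def auc_score(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted scores, higher means more likely condition 1.
    labels: 1 for one condition, 0 for the other.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one cell from each condition")
    # Count correctly ordered positive/negative pairs; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out classifier scores for cells of two populations.
populations = {
    "monocytes": ([0.9, 0.8, 0.7, 0.2, 0.1, 0.3], [1, 1, 1, 0, 0, 0]),
    "b_cells": ([0.6, 0.4, 0.5, 0.5, 0.4, 0.6], [1, 1, 1, 0, 0, 0]),
}

# Populations with higher AUC separate the two conditions better.
ranking = sorted(
    ((name, auc_score(s, y)) for name, (s, y) in populations.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

An AUC near 1 means the population's cells are well separated between conditions, while an AUC near 0.5 means the scores carry little information about the condition.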

## Usage

### 📦 Installation

You can install `scPanel` directly via pip:

```bash
pip install scpanel
```

### 🧬 Input scRNA-seq data

1. Quality-control and preprocess the data using a standard workflow.

2. Annotate cell populations.

3. Pass the AnnData object to scPanel.

### 🚀 Functions in scPanel

- `preprocess`(adata, ct_col, y_col, pt_col, class_map)

  - standardize metadata

- `split_train_test`(adata, out_dir, min_cells=20, min_samples=3, test_pt_size=0.2, random_state=3467, verbose=0)

  - split patients into 1) a training set for cell type selection, gene panel identification, and classifier training, and 2) a testing set to evaluate the performance of the classifiers and validate the predictive power of the gene panel

- `cell_type_score`(adata_train_dict, out_dir, ncpus, sample_n_cell, n_iterations=100, verbose=False)

  - calculate the cell type responsive score (AUC) for each annotated cell population

- `plot_cell_type_score`(AUC, AUC_all)

  - visualize the cell type responsive score (AUC)

- `select_celltype`(adata_train_dict, out_dir, celltype_selected)

  - prepare the AnnData object for gene panel selection

- `split_n_folds`(adata_train, nfold, out_dir=None, random_state=2349)

  - split the data into multiple folds for gene selection

- `gene_score`(adata_train, train_index_list, val_index_list, sample_weight_list, metric, out_dir, ncpus, step=0.03, verbose=False)

  - score genes by their predictive power

- `decide_k`(adata_train, n_genes_plot=100)

  - automatically decide the number of genes selected for patient classification

- `plot_gene_score`(adata_train, n_genes_plot=200, width=5, height=4, k=None)

  - visualize the gene score

- `select_gene`(adata_train, top_n_feat, out_dir=None, step=0.03, n_genes_plot=100, verbose=0)

  - select the top K genes (K as returned by `decide_k`) from the training set

- `transform_adata`(adata_train, adata_test_dict, selected_gene)

  - subset the training and testing sets to the selected cell population and genes

- `models_train`(adata_train_final, search_grid, out_dir=None, param_grid=None)

  - train classifiers with LR, KNN, RF, SVM, GAT

- `models_predict`(clfs, adata_test_final, out_dir=None)

  - predict the probabilities of cells in the testing set

- `pt_pred`(adata_test_final, cell_pred_col='median_pred_score', num_bootstrap=None)

  - predict the patient label in the testing set

- `plot_roc_curve`(adata_test_final, sample_id, cell_pred_col, ncols=4, hspace=0.25, wspace=None, ax=None, scatter_kws=None, legend_kws=None)

  - visualize the aggregation of cell-level probabilities into a patient-level label using the area under the curve

- `plot_violin`(adata, cell_pred_col='median_pred_score', dot_size=2, ax=None, palette=None, xticklabels_color=False, text_kws={})

  - visualize the patient-level prediction
129
## Citation
130
131
If you use `scPanel` in your work, please cite the `scPanel` publication:
132
133
Xie, Yi, et al. "scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets." Briefings in Bioinformatics 25.6 (2024): bbae482.
134
135
```
136
@article{xie2024scpanel,
137
  title={scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets},
138
  author={Xie, Yi and Yang, Jianfei and Ouyang, John F and Petretto, Enrico},
139
  journal={Briefings in Bioinformatics},
140
  volume={25},
141
  number={6},
142
  pages={bbae482},
143
  year={2024},
144
  publisher={Oxford University Press}
145
}
146
```