ComBat [Johnson2007]_ is one of the most widely used tool for batch effect correction for microarray transcriptomic data. :func:`pycombat_norm` is a Python implementation of ComBat. Strictly following the same mathematical framework, :func:`pycombat_norm` has results similar to those of ComBat in terms of batch effects correction. Additionally, :func:`pycombat_norm` is as fast, if not faster, than the original implementation in R.
This minimal usage example illustrates how to use pycombat_norm in a default setting, and shows some results on ovarian cancer data. The data we use is freely available on NCBI Gene Expression Omnibus, namely:
- GSE18520
- GSE66957
- GSE69428
The corresponding expression files are stored on InMoose repository in the data subfolder.
# import libraries from inmoose.pycombat import pycombat_norm import pandas as pd import matplotlib.pyplot as plt # prepare data # the datasets are dataframes where: # the indices correspond to the gene names # the column names correspond to the sample names # Any number (>=2) of datasets can be treated at once dataset_1 = pd.read_pickle("data/GSE18520.pickle") dataset_2 = pd.read_pickle("data/GSE66957.pickle") dataset_3 = pd.read_pickle("data/GSE69428.pickle") # merge all three datasets into a single one, keeping only common genes df_expression = pd.concat([dataset_1,dataset_2,dataset_3],join="inner",axis=1) # plot raw data plt.boxplot(df_expression) plt.show()
Gene expression by sample in the raw data (colored by dataset).
# generate the list of batches datasets = [dataset_1, dataset_2, dataset_3] batch = [j for j,ds in enumerate(datasets) for _ in range(len(ds.columns))] # run pycombat_norm df_corrected = pycombat_norm(df_expression, batch) # visualize results plt.boxplot(df_corrected) plt.show()
Gene expression by sample in the batch effect-corrected data (colored by dataset).
The data we used for the example above contain tumor samples and healthy samples. A simple PCA on the raw expression data shows that, instead of clustering by sample type, data cluster by dataset.
PCA on the raw expression data, colored by tumor sample (blue and yellow) and healthy sample (pink).
However, after correcting batch effects with pycombat_norm, the same PCA shows two clusters, corresponding respectively to tumor and healthy samples.
PCA on the batch effect-corrected expression data, colored by tumor sample (blue and yellow) and healthy sample (pink).