[ea0fd6]: / docs / source / pycombatnorm.rst

Download this file

102 lines (73 with data), 3.3 kB

pycombat_norm: batch effect corrections for microarray data

ComBat [Johnson2007]_ is one of the most widely used tool for batch effect correction for microarray transcriptomic data. :func:`pycombat_norm` is a Python implementation of ComBat. Strictly following the same mathematical framework, :func:`pycombat_norm` has results similar to those of ComBat in terms of batch effects correction. Additionally, :func:`pycombat_norm` is as fast, if not faster, than the original implementation in R.

Minimal usage example

This minimal usage example illustrates how to use pycombat_norm in a default setting, and shows some results on ovarian cancer data. The data we use is freely available on NCBI Gene Expression Omnibus, namely:

  • GSE18520
  • GSE66957
  • GSE69428

The corresponding expression files are stored on InMoose repository in the data subfolder.

# import libraries
from inmoose.pycombat import pycombat_norm
import pandas as pd
import matplotlib.pyplot as plt

# prepare data
# the datasets are dataframes where:
   # the indices correspond to the gene names
   # the column names correspond to the sample names
# Any number (>=2) of datasets can be treated at once
dataset_1 = pd.read_pickle("data/GSE18520.pickle")
dataset_2 = pd.read_pickle("data/GSE66957.pickle")
dataset_3 = pd.read_pickle("data/GSE69428.pickle")

# merge all three datasets into a single one, keeping only common genes
df_expression = pd.concat([dataset_1,dataset_2,dataset_3],join="inner",axis=1)

# plot raw data
plt.boxplot(df_expression)
plt.show()
Distribution of raw data

Gene expression by sample in the raw data (colored by dataset).

# generate the list of batches
datasets = [dataset_1, dataset_2, dataset_3]
batch = [j for j,ds in enumerate(datasets) for _ in range(len(ds.columns))]

# run pycombat_norm
df_corrected = pycombat_norm(df_expression, batch)

# visualize results
plt.boxplot(df_corrected)
plt.show()
Distribution of corrected data

Gene expression by sample in the batch effect-corrected data (colored by dataset).

Biological Insight

The data we used for the example above contain tumor samples and healthy samples. A simple PCA on the raw expression data shows that, instead of clustering by sample type, data cluster by dataset.

PCA for the raw data

PCA on the raw expression data, colored by tumor sample (blue and yellow) and healthy sample (pink).

However, after correcting batch effects with pycombat_norm, the same PCA shows two clusters, corresponding respectively to tumor and healthy samples.

PCA for data corrected for batch effects

PCA on the batch effect-corrected expression data, colored by tumor sample (blue and yellow) and healthy sample (pink).

Code documentation