|
a |
|
b/docs/source/diffexp.rst |
|
|
1 |
================================ |
|
|
2 |
Differential Expression Analysis |
|
|
3 |
================================ |
|
|
4 |
|
|
|
5 |
InMoose offers a Python port of the well-known R Bioconductor packages: |
|
|
6 |
|
|
|
7 |
- `DESeq2 package <https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_ [Love2014]_ in module :doc:`deseq`. |
|
|
8 |
- `edgeR package <https://bioconductor.org/packages/release/bioc/html/edgeR.html>`_ [Chen2016]_ in module :doc:`edgepy`. |
|
|
9 |
- `limma package <https://bioconductor.org/packages/release/bioc/html/limma.html>`_ [Ritchie2015]_. in module :doc:`limma`. |
|
|
10 |
|
|
|
11 |
Note that not all features of the R packages are necessarily ported. Extending |
|
|
12 |
the functionality of these modules will be based on user requests, so do not |
|
|
13 |
hesitate to open an issue if your favorite feature is missing. |
|
|
14 |
|
|
|
15 |
In addition, InMoose provides a meta-analysis feature to combine the results |
|
|
16 |
from different differential expression analysis tools. |
|
|
17 |
|
|
|
18 |
Please refer to [Colange2024]_ for a detailed comparison of InMoose |
|
|
19 |
implementation with the original R implementations. |
|
|
20 |
|
|
|
21 |
Differential Expression Meta-Analysis |
|
|
22 |
===================================== |
|
|
23 |
|
|
|
24 |
.. currentmodule:: inmoose.diffexp |
|
|
25 |
|
|
|
26 |
We illustrate the differential expression meta-analysis capabilities of InMoose |
|
|
27 |
along two approaches: |
|
|
28 |
|
|
|
29 |
- the Aggregate Data (AD) approach consists in running classical differential |
|
|
30 |
expression tools on individual cohorts then combining the results through |
|
|
31 |
*e.g.* random-effect models. |
|
|
32 |
- the Individual Sample Data (ISD) consists in merging individual cohorts into a |
|
|
33 |
large meta-cohort, accounting for batch effects to eliminate inter-cohort |
|
|
34 |
biases, then running a classical differential expression analysis on the |
|
|
35 |
resulting meta-cohort. |
|
|
36 |
|
|
|
37 |
We start by simulating RNA-Seq data, using the :mod:`sim` module of InMoose. |
|
|
38 |
|
|
|
39 |
.. repl:: |
|
|
40 |
import numpy as np |
|
|
41 |
import pandas as pd |
|
|
42 |
from inmoose.sim import sim_rnaseq |
|
|
43 |
|
|
|
44 |
# number of genes |
|
|
45 |
N = 10000 |
|
|
46 |
# number of samples |
|
|
47 |
M = 2000 |
|
|
48 |
assert M % 10 == 0 |
|
|
49 |
P = M // 10 # 10% of M, helper variable |
|
|
50 |
|
|
|
51 |
# 3 batches: 20% 30% 50% of the samples |
|
|
52 |
batch = (2 * P) * [0] + (3 * P) * [1] + (5 * P) * [2] |
|
|
53 |
batch = np.array([f"batch{b}" for b in batch]) |
|
|
54 |
batch0 = batch == "batch0" |
|
|
55 |
batch1 = batch == "batch1" |
|
|
56 |
batch2 = batch == "batch2" |
|
|
57 |
|
|
|
58 |
# 2 condition groups |
|
|
59 |
# - group 1: 50% batch 1, 33% batch 2, 60% batch 3 |
|
|
60 |
# - group 2: 50% batch 1, 67% batch 2, 40% batch 3 |
|
|
61 |
group = P * [0] + P * [1] + P * [0] + (2 * P) * [1] + (2 * P) * [0] + (3 * P) * [1] |
|
|
62 |
group = np.array([f"group{g}" for g in group]) |
|
|
63 |
assert len(batch) == M and len(group) == M |
|
|
64 |
|
|
|
65 |
# store clinical metadata (i.e. batch and group) as a DataFrame |
|
|
66 |
clinical = pd.DataFrame({"batch": batch, "group": group}) |
|
|
67 |
clinical.index = [f"sample{i}" for i in range(M)] |
|
|
68 |
|
|
|
69 |
# simulate data |
|
|
70 |
# random_state passes a seed to the PRNG for reproducibility |
|
|
71 |
counts = sim_rnaseq(N, M, batch=batch, group=group, random_state=42).T |
|
|
72 |
|
|
|
73 |
We then run the two meta-analysis approaches on the obtained data. |
|
|
74 |
|
|
|
75 |
.. repl:: |
|
|
76 |
from inmoose.deseq2 import DESeq, DESeqDataSet |
|
|
77 |
from inmoose.diffexp import meta_de |
|
|
78 |
from inmoose.pycombat import pycombat_seq |
|
|
79 |
|
|
|
80 |
# run AD meta-analysis on batches 1, 2 and 3 |
|
|
81 |
# first run the differential expression analysis on each batch individually |
|
|
82 |
cohorts = [DESeqDataSet(counts[b], clinical.loc[b], design="~ group") |
|
|
83 |
for b in [batch0, batch1, batch2]] |
|
|
84 |
individual_de = [DESeq(c).results() for c in cohorts] |
|
|
85 |
# then aggregate the results |
|
|
86 |
ad = meta_de([de for de in individual_de]) |
|
|
87 |
|
|
|
88 |
# run ISD meta-analysis on merged batches |
|
|
89 |
# first correct batch effects to properly merge batches |
|
|
90 |
# transpositions account for the difference of formats for pycombat_seq and deseq2 |
|
|
91 |
harmonized_counts = pycombat_seq(counts.T, batch).T |
|
|
92 |
# then run the differential expression analysis on the merged batches |
|
|
93 |
isd = DESeq(DESeqDataSet(harmonized_counts, clinical, design="~ group")).results() |
|
|
94 |
|
|
|
95 |
We can now compare the results obtained by the two approaches. |
|
|
96 |
|
|
|
97 |
.. repl-quiet:: |
|
|
98 |
from matplotlib import rcParams |
|
|
99 |
# repl default config replaces '.' by '-' in the savefig.directory :/ |
|
|
100 |
rcParams['savefig.directory'] = rcParams['savefig.directory'].replace("readthedocs-org", "readthedocs.org") |
|
|
101 |
|
|
|
102 |
.. repl:: |
|
|
103 |
import matplotlib.pyplot as plt |
|
|
104 |
from scipy.stats import pearsonr |
|
|
105 |
|
|
|
106 |
ax = plt.gca() |
|
|
107 |
ax.axline((0,0), slope=1, ls="--", lw=0.3, c=".4") |
|
|
108 |
ax.scatter(isd["log2FoldChange"], ad["combined logFC"], s=3) |
|
|
109 |
corr = pearsonr(isd["log2FoldChange"], ad["combined logFC"]).statistic |
|
|
110 |
ax.annotate(f"Pearson correlation: {100*corr:.1f}%", (2, -4), size="small") |
|
|
111 |
ax.set_title("ISD", loc="center") |
|
|
112 |
ax.set_ylabel("AD", loc="center") |
|
|
113 |
|
|
|
114 |
plt.show() |
|
|
115 |
|
|
|
116 |
It is possible to combine results obtained from different tools, as long as the |
|
|
117 |
results of the differential expression analysis are stored as |
|
|
118 |
:class:`DEResults`. All three modules :doc:`limma`, :doc:`edgepy` and |
|
|
119 |
:doc:`deseq` return sub-classes of :class:`DEResults`, thus allowing users to |
|
|
120 |
perform cross-technology meta-analysis (*e.g.* by combining results from |
|
|
121 |
:doc:`limma` with results from :doc:`deseq`). |
|
|
122 |
|
|
|
123 |
References |
|
|
124 |
========== |
|
|
125 |
|
|
|
126 |
.. [Colange2024] M. Colange, G. Appé, L. Meunier, S. Weill, A. Nordor, A. |
|
|
127 |
Behdenna. 2024. Differential Expression Analysis with InMoose, the Integrated |
|
|
128 |
Multi-Omic Open-Source Environment in Python. *Bioarxiv*. |
|
|
129 |
:doi:`10.1101/2024.11.14.623578` |
|
|
130 |
|
|
|
131 |
Code documentation |
|
|
132 |
================== |
|
|
133 |
|
|
|
134 |
.. autosummary:: |
|
|
135 |
:toctree: generated/ |
|
|
136 |
|
|
|
137 |
~DEResults.DEResults |
|
|
138 |
|
|
|
139 |
meta_de |