Diff of /docs/source/diffexp.rst [000000] .. [ea0fd6]

Switch to unified view

a b/docs/source/diffexp.rst
1
================================
2
Differential Expression Analysis
3
================================
4
5
InMoose offers a Python port of the well-known R Bioconductor packages:
6
7
- `DESeq2 package <https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_ [Love2014]_ in module :doc:`deseq`.
8
- `edgeR package <https://bioconductor.org/packages/release/bioc/html/edgeR.html>`_ [Chen2016]_ in module :doc:`edgepy`.
9
- `limma package <https://bioconductor.org/packages/release/bioc/html/limma.html>`_ [Ritchie2015]_. in module :doc:`limma`.
10
11
Note that not all features of the R packages are necessarily ported. Extending
12
the functionality of these modules will be based on user requests, so do not
13
hesitate to open an issue if your favorite feature is missing.
14
15
In addition, InMoose provides a meta-analysis feature to combine the results
16
from different differential expression analysis tools.
17
18
Please refer to [Colange2024]_ for a detailed comparison of InMoose
19
implementation with the original R implementations.
20
21
Differential Expression Meta-Analysis
22
=====================================
23
24
.. currentmodule:: inmoose.diffexp
25
26
We illustrate the differential expression meta-analysis capabilities of InMoose
27
along two approaches:
28
29
- the Aggregate Data (AD) approach consists in running classical differential
30
  expression tools on individual cohorts then combining the results through
31
  *e.g.* random-effect models.
32
- the Individual Sample Data (ISD) consists in merging individual cohorts into a
33
  large meta-cohort, accounting for batch effects to eliminate inter-cohort
34
  biases, then running a classical differential expression analysis on the
35
  resulting meta-cohort.
36
37
We start by simulating RNA-Seq data, using the :mod:`sim` module of InMoose.
38
39
.. repl::
40
   import numpy as np
41
   import pandas as pd
42
   from inmoose.sim import sim_rnaseq
43
44
   # number of genes
45
   N = 10000
46
   # number of samples
47
   M = 2000
48
   assert M % 10 == 0
49
   P = M // 10  # 10% of M, helper variable
50
51
   # 3 batches: 20% 30% 50% of the samples
52
   batch = (2 * P) * [0] + (3 * P) * [1] + (5 * P) * [2]
53
   batch = np.array([f"batch{b}" for b in batch])
54
   batch0 = batch == "batch0"
55
   batch1 = batch == "batch1"
56
   batch2 = batch == "batch2"
57
58
   # 2 condition groups
59
   #   - group 1: 50% batch 1, 33% batch 2, 60% batch 3
60
   #   - group 2: 50% batch 1, 67% batch 2, 40% batch 3
61
   group = P * [0] + P * [1] + P * [0] + (2 * P) * [1] + (2 * P) * [0] + (3 * P) * [1]
62
   group = np.array([f"group{g}" for g in group])
63
   assert len(batch) == M and len(group) == M
64
65
   # store clinical metadata (i.e. batch and group) as a DataFrame
66
   clinical = pd.DataFrame({"batch": batch, "group": group})
67
   clinical.index = [f"sample{i}" for i in range(M)]
68
69
   # simulate data
70
   # random_state passes a seed to the PRNG for reproducibility
71
   counts = sim_rnaseq(N, M, batch=batch, group=group, random_state=42).T
72
73
We then run the two meta-analysis approaches on the obtained data.
74
75
.. repl::
76
   from inmoose.deseq2 import DESeq, DESeqDataSet
77
   from inmoose.diffexp import meta_de
78
   from inmoose.pycombat import pycombat_seq
79
80
   # run AD meta-analysis on batches 1, 2 and 3
81
   # first run the differential expression analysis on each batch individually
82
   cohorts = [DESeqDataSet(counts[b], clinical.loc[b], design="~ group")
83
              for b in [batch0, batch1, batch2]]
84
   individual_de = [DESeq(c).results() for c in cohorts]
85
   # then aggregate the results
86
   ad = meta_de([de for de in individual_de])
87
88
   # run ISD meta-analysis on merged batches
89
   # first correct batch effects to properly merge batches
90
   # transpositions account for the difference of formats for pycombat_seq and deseq2
91
   harmonized_counts = pycombat_seq(counts.T, batch).T
92
   # then run the differential expression analysis on the merged batches
93
   isd = DESeq(DESeqDataSet(harmonized_counts, clinical, design="~ group")).results()
94
95
We can now compare the results obtained by the two approaches.
96
97
.. repl-quiet::
98
   from matplotlib import rcParams
99
   # repl default config replaces '.' by '-' in the savefig.directory :/
100
   rcParams['savefig.directory'] = rcParams['savefig.directory'].replace("readthedocs-org", "readthedocs.org")
101
102
.. repl::
103
   import matplotlib.pyplot as plt
104
   from scipy.stats import pearsonr
105
106
   ax = plt.gca()
107
   ax.axline((0,0), slope=1, ls="--", lw=0.3, c=".4")
108
   ax.scatter(isd["log2FoldChange"], ad["combined logFC"], s=3)
109
   corr = pearsonr(isd["log2FoldChange"], ad["combined logFC"]).statistic
110
   ax.annotate(f"Pearson correlation: {100*corr:.1f}%", (2, -4), size="small")
111
   ax.set_title("ISD", loc="center")
112
   ax.set_ylabel("AD", loc="center")
113
114
   plt.show()
115
116
It is possible to combine results obtained from different tools, as long as the
117
results of the differential expression analysis are stored as
118
:class:`DEResults`.  All three modules :doc:`limma`, :doc:`edgepy` and
119
:doc:`deseq` return sub-classes of :class:`DEResults`, thus allowing users to
120
perform cross-technology meta-analysis (*e.g.* by combining results from
121
:doc:`limma` with results from :doc:`deseq`).
122
123
References
124
==========
125
126
.. [Colange2024] M. Colange, G. Appé, L. Meunier, S. Weill, A. Nordor, A.
127
   Behdenna. 2024. Differential Expression Analysis with InMoose, the Integrated
128
   Multi-Omic Open-Source Environment in Python. *Bioarxiv*.
129
   :doi:`10.1101/2024.11.14.623578`
130
131
Code documentation
132
==================
133
134
.. autosummary::
135
   :toctree: generated/
136
137
   ~DEResults.DEResults
138
139
   meta_de