a b/docs/source/deseq.rst
1
======
2
deseq2
3
======
4
5
.. currentmodule:: inmoose.deseq2
6
7
This module is a partial port of the R Bioconductor `DESeq2 package
8
<https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_ [Love2014]_.
9
10
.. repl::
11
   from inmoose.deseq2 import *
12
   import patsy
13
   import numpy as np
14
   import matplotlib.pyplot as plt
15
   import pandas as pd
16
   import scipy
17
18
19
.. repl-quiet::
20
   pd.options.display.max_columns = 10
21
   from matplotlib import rcParams
22
   # repl default config replaces '.' by '-' in the savefig.directory :/
23
   rcParams['savefig.directory'] = rcParams['savefig.directory'].replace("readthedocs-org", "readthedocs.org")
24
25
Quick start
26
===========
27
28
Here we show the most basic steps for a differential expression analysis. There
29
are a variety of steps upstream of DESeq2 that result in the generation of
30
counts or estimated counts for each sample, which we will discuss in the
31
sections below. This code chunk assumes that you have a count matrix called
32
:code:`cts` and a table of sample information called :code:`coldata`.  The
33
:code:`design` indicates how to model the samples, here, that we want to measure
34
the effect of the condition, controlling for batch differences. The two factor
35
variables :code:`batch` and :code:`condition` should  be columns of
36
:code:`coldata`:
37
38
>>> dds = DESeqDataSet(countData = cts,
39
...                    clinicalData = coldata,
40
...                    design = "~ batch + condition")
41
>>> dds = DESeq(dds)
42
>>> # list the coefficients
43
>>> dds.resultsNames()
44
>>> res = dds.results(name = "condition_trt_vs_untrt")
45
>>> # or to shrink the log fold changes association with condition
46
>>> res = dds.lfcShrink(coef = "condition_trt_vs_untrt", type = "apeglm")
47
48
49
Input data
50
==========
51
52
Why un-normalized counts?
53
-------------------------
54
55
As input, the DESeq2 package expects count data as obtained, *e.g.* from RNA-seq
56
or another high-throughput sequencing experiment, in the form of a matrix of
57
integer values. The value in the :math:`i`-th row and the :math:`j`-th column of
58
the matrix tells how many reads can be assigned to gene :math:`i` in sample
59
:math:`j`.  Analogously, for other types of assays, the rows of the matrix might
60
correspond *e.g.* to binding regions (with ChIP-Seq) or peptide sequences (with
61
quantitative mass spectrometry). We will list method for obtaining count
62
matrices in sections below.
63
64
The values in the matrix should be un-normalized counts or estimated counts of
65
sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq).
66
The `RNA-seq workflow <http://www.bioconductor.org/help/workflows/rnaseqGene/>`_
67
describes multiple techniques for preparing such count matrices.  It is
68
important to provide count matrices as input for DESeq2's statistical model
69
[Love2014]_ to hold, as only the count values allow assessing the measurement
70
precision correctly. The DESeq2 model internally corrects for library size, so
71
transformed or normalized values such as counts scaled by library size should
72
not be used as input.
73
74
The DESeqDataSet
75
----------------
76
77
The object class used by the DESeq2 package to store the read counts and the
78
intermediate estimated quantities during statistical analysis is the
79
:class:`~DESeqDataSet.DESeqDataSet`, which will usually be represented in the
80
code here as an object :code:`dds`.
81
82
A technical detail is that the :class:`~DESeqDataSet.DESeqDataSet` class extends
83
the :class:`AnnData` class of the `anndata
84
<https://github.com/scverse/anndata>`_ package.
85
86
A :class:`~DESeqDataSet.DESeqDataSet` object must have an associated **design
87
matrix**.  The design matrix expresses the variables which will be used in
88
modeling. The matrix is built from a formula starting with a tilde (~) followed
89
by the variables with plus signs between them (it will be coerced into a formula
90
if it is not already). The design can be changed later, however then all
91
differential analysis steps should be repeated, as the design formula is used to
92
estimate the dispersions and to estimate the log2 fold changes of the model.
93
94
.. note::
95
   In order to benefit from the default settings of the package, you should put
96
   the variable of interest at the end of the formula and make sure the control
97
   level is the first level.
98
99
We will now show 2 ways of constructing a :class:`~DESeqDataSet.DESeqDataSet`,
100
depending on what pipeline was used upstream of DESeq2 to generated counts or
101
estimated counts:
102
103
  1. From a count matrix :ref:`countmat`
104
  2. From an :class:`AnnData` object :ref:`ad`
105
106
.. note::
107
   The original R package allows to build a :class:`~DESeqDataSet.DESeqDataSet`
108
   from transcript abundance files and htseq count files, but those features
109
   have not yet been ported into `inmoose`.
110
111
.. _countmat:
112
113
Count matrix input
114
------------------
115
116
The constructor :meth:`.DESeqDataSet.__init__` can be used if you already have a
117
matrix of read counts prepared from another source. Another method for quickly
118
producing count matrices from alignment files is the :code:`featureCounts`
119
function [Liao2013]_ in the `Rsubread
120
<http://bioconductor.org/packages/Rsubread>`_ package.  To use the constructor
121
of :meth:`.DESeqDataSet.__init__`, the user should provide the counts matrix,
122
the information about the samples (the rows of the count matrix) as a data
123
frame, and the design formula.
124
125
To demonstrate the construction of :class:`~DESeqDataSet.DESeqDataSet` from a
126
count matrix, we will read in count data from the `pasilla
127
<http://bioconductor.org/packages/pasilla>`_ package, available in `inmoose`. We
128
read in a count matrix, which we will name :code:`cts`, and the sample
129
information table, which we will name :code:`sample_data`.  Further below we
130
describe how to extract these objects from, *e.g.* :code:`featureCounts`
131
output:
132
133
.. repl::
134
135
   import importlib.resources
136
   from inmoose.utils import Factor
137
138
   data_dir = importlib.resources.files("inmoose.data.pasilla")
139
   pasCts = data_dir.joinpath("pasilla_gene_counts.tsv")
140
   pasAnno = data_dir.joinpath("pasilla_sample_annotation.csv")
141
   cts = pd.read_csv(pasCts, sep='\t', index_col=0)
142
   sample_data = pd.read_csv(pasAnno, index_col=0)
143
   sample_data = sample_data[["condition", "type"]]
144
   sample_data["condition"] = Factor(sample_data["condition"])
145
   sample_data["type"] = Factor(sample_data["type"])
146
147
We examine the count matrix and sample data to see if they are consistent in
148
terms of sample order:
149
150
.. repl::
151
   cts.head(2)
152
   sample_data
153
154
Note that these are not in the same order with respect to samples!
155
156
It is absolutely critical that the rows of the count matrix and the rows of
157
the sample data (information about samples) are in the same order.  DESeq2 will
158
not make guesses as to which row of the count matrix belongs to which row of
159
the sample data, these must be provided to DESeq2 already in consistent order.
160
161
As they are not in the correct order as given, we need to re-arrange one or the
162
other so that they are consistent in terms of sample order (if we do not, later
163
functions would produce an error). We additionally need to chop off the `"fb"`
164
of the row names of :code:`sample_data`, so the naming is consistent:
165
166
.. repl::
167
168
   sample_data.index = [i[:-2] for i in sample_data.index]
169
   all(sample_data.index.isin(cts.columns))
170
   all(sample_data.index == cts.columns)
171
   sample_data = sample_data.reindex(cts.columns)
172
   all(sample_data.index == cts.columns)
173
174
If you have used the :code:`featureCounts` function [Liao2013]_ in the `Rsubread
175
<http://bioconductor.org/packages/Rsubread>`_ package, the matrix of read counts
176
can be directly provided from the `"counts"` element in the list output.  The
177
count matrix and sample data can typically be read into Python from flat files
178
using import functions from :code:`pandas` or :code:`numpy`.
179
180
With the count matrix, :code:`cts`, and the sample information,
181
:code:`sample_data`, we can construct a :class:`~DESeqDataSet.DESeqDataSet`:
182
183
.. repl::
184
185
   dds = DESeqDataSet(countData = cts.T,
186
                      clinicalData = sample_data,
187
                      design = "~ condition")
188
   dds
189
190
If you have additional feature data, it can be added to the
191
:class:`~DESeqDataSet.DESeqDataSet` by adding to the metadata columns of a newly
192
constructed object. (Here we add redundant data just for demonstration, as the
193
gene names are already the rownames of the :code:`dds`.):
194
195
.. repl::
196
197
   featureData = dds.var.index
198
   dds.var["featureData"] = featureData
199
   dds.var.head()
200
201
202
.. _ad:
203
204
:class:`AnnData` input
205
----------------------
206
207
If one has already created or obtained an :class:`AnnData`, it can be easily
208
input into DESeq2 as follows. First we load the module containing the `airway`
209
dataset:
210
211
.. repl::
212
213
   from inmoose.data.airway import airway
214
   ad = airway()
215
216
The constructor function below shows the generation of a
217
:class:`~DESeqDataSet.DESeqDataSet` from an :class:`AnnData` :code:`ad`:
218
219
.. repl::
220
221
   ddsAD = DESeqDataSet(ad, design = "~ cell + dex")
222
   ddsAD
223
224
225
Pre-filtering
226
-------------
227
228
While it is not necessary to pre-filter low count genes before running the
229
DESeq2 functions, there are two reasons which make pre-filtering useful: by
230
removing rows in which there are very few reads, we reduce the memory size of
231
the :code:`dds` data object, and we increase the speed of the transformation and
232
testing functions within DESeq2. It can also improve visualizations, as features
233
with no information for differential expression are not plotted.
234
235
Here we perform a minimal pre-filtering to keep only rows that have at least 10
236
reads total. Note that more strict filtering to increase power is
237
*automatically* applied via :ref:`independent filtering<indfilt>` on the mean of
238
normalized counts within the :meth:`.DESeqDataSet.results` function:
239
240
.. repl::
241
242
   keep = dds.counts().sum(axis=1) >= 10
243
   dds = dds[keep,:]
244
245
Alternatively, a popular filter is to ensure at least :code:`X` samples with a
246
count of 10 or more, where :code:`X` can be chosen as the sample size of the
247
smallest group of samples:
248
249
>>> keep = (dds.counts() >= 10).sum(axis=1) >= X
250
>>> dds = dds[keep,:]
251
252
.. _factorLevels:
253
254
Note on factor levels
255
---------------------
256
257
By default, Python will choose a *reference level* for factors based on
258
alphabetical order, it chooses the first value as the reference. Then, if you 
259
never tell the DESeq2 functions which level you want to compare against 
260
(*e.g.* which level represents the control group), the comparisons will 
261
be based on the alphabetical order of the levels. There are two
262
solutions: you can either explicitly tell :meth:`.DESeqDataSet.results` which
263
comparison to make using the :code:`contrast` argument (this will be shown
264
later), or you can explicitly set the factors levels by specifying the desired 
265
reference value first. In order to see the change of reference levels reflected 
266
in the results names, you need to either run :func:`DESeq` or 
267
:func:`nbinomWaldTest` / :func:`nbinomLRT` after the re-leveling operation.  
268
Setting the factor levels can be done with the:code:`reorder_categories` function:
269
270
.. repl::
271
272
   dds.obs["condition"] = dds.obs["condition"].cat.reorder_categories(["untreated", "treated"])
273
274
If you need to subset the columns of a :class:`~DESeqDataSet.DESeqDataSet`,
275
*i.e.* when removing certain samples from the analysis, it is possible that all
276
the samples for one or more levels of a variable in the design formula would be
277
removed. In this case, the :code:`remove_unused_categories` function can be used
278
to remove those levels which do not have samples in the current
279
:class:`~DESeqDataSet.DESeqDataSet`:
280
281
.. repl::
282
283
   dds.obs["condition"] = dds.obs["condition"].cat.remove_unused_categories()
284
285
Collapsing technical replicates
286
-------------------------------
287
288
DESeq2 provides a function :func:`collapseReplicates` which can assist in
289
combining the counts from technical replicates into single columns of the count
290
matrix. The term *technical replicate* implies multiple sequencing runs of the
291
same library.  You should not collapse biological replicates using this
292
function.  See the manual page for an example of the use of
293
:func:`collapseReplicates`.
294
295
About the pasilla dataset
296
-------------------------
297
298
We continue with the :doc:`/pasilla` data constructed from the count matrix
299
method above. This data set is from an experiment on *Drosophila melanogaster*
300
cell cultures and investigated the effect of RNAi knock-down of the splicing
301
factor *pasilla* [Brooks2011]_.  The detailed transcript of the production of
302
the `pasilla <http://bioconductor.org/packages/pasilla>`_ data is provided in
303
the vignette of the data package `pasilla
304
<http://bioconductor.org/packages/pasilla>`_.
305
306
.. _de:
307
308
Differential expression analysis
309
================================
310
311
The standard differential expression analysis steps are wrapped into a single
312
function :func:`DESeq`. The estimation steps performed by this function are
313
described :ref:`below<theory>`, in the manual page for :func:`DESeq` and in the
314
Methods section of the DESeq2 publication [Love2014]_.
315
316
Results tables are generated using the function :meth:`.DESeqDataSet.results`,
317
which extracts a results table with log2 fold changes, *p*-values and adjusted
318
*p*-values. With no additional arguments to :meth:`.DESeqDataSet.results`, the
319
log2 fold change and Wald test *p*-value will be for the **last variable** in
320
the design formula, and if this is a factor, the comparison will be the **last
321
level** of this variable over the **reference level** (see previous :ref:`note
322
on factor levels<factorlevels>`).  However, the order of the variables of the
323
design does not matter so long as the user specifies the comparison to build a
324
results table for, using the :code:`name` or :code:`contrast` arguments of
325
:meth:`.DESeqDataSet.results`.
326
327
Details about the comparison are printed to the console, directly above the
328
results table. The text, `condition treated vs untreated`, tells you that the
329
estimates are of the logarithmic fold change log2(treated/untreated):
330
331
.. repl::
332
333
   dds.design = "~ condition"
334
   dds = DESeq(dds)
335
   res = dds.results()
336
   res.head()
337
338
Note that we could have specified the coefficient or contrast we want to build a
339
results table for, using either of the following equivalent commands:
340
341
>>> res = dds.results(name="condition_treated_vs_untreated")
342
>>> res = dds.results(contrast=["condition","treated","untreated"])
343
344
One exception to the equivalence of these two commands, is that, using
345
:code:`contrast` will additionally set to 0 the estimated LFC in a comparison of
346
two groups, where all of the counts in the two groups are equal to 0 (while
347
other groups have positive counts). As this may be a desired feature to have the
348
LFC in these cases set to 0, one can use :code:`contrast` to build these results
349
tables.  More information about extracting specific coefficients from a fitted
350
:class:`~DESeqDataSet.DESeqDataSet` object can be found in the help page
351
:meth:`.DESeqDataSet.results`.  The use of the :code:`contrast` argument is also
352
further discussed :ref:`below<contrasts>`.
353
354
.. _lfcShrink:
355
356
Log fold change shrinkage for visualization and ranking
357
-------------------------------------------------------
358
359
Shrinkage of effect size (LFC estimates) is useful for visualization and ranking
360
of genes. To shrink the LFC, we pass the :code:`dds` object to the function
361
:func:`lfcShrink`. Below we specify to use the :code:`apeglm` method for effect
362
size shrinkage [Zhu2018]_, which improves on the previous estimator.
363
364
We provide the :code:`dds` object and the name or number of the coefficient we
365
want to shrink, where the number refers to the order of the coefficient as it
366
appears in :code:`dds.resultsNames()`:
367
368
>>> dds.resultsNames()
369
>>> resLFC = dds.lfcShrink(coef="condition_treated_vs_untreated", type="apeglm")
370
>>> resLFC
371
372
Shrinkage estimation is discussed more in a :ref:`later section<shrink>`.
373
374
*p*-values and adjusted *p*-values
375
----------------------------------
376
377
We can order our results table by the smallest *p*-value:
378
379
.. repl::
380
   resOrdered = res.sort_values(by="pvalue")
381
382
We can summarize some basic tallies using the :meth:`.DESeqResults.summary`
383
function:
384
385
.. repl::
386
   print(res.summary())
387
388
How many adjusted *p*-values were less than 0.1:
389
390
.. repl::
391
   (res.padj < 0.1).sum()
392
393
The :meth:`.DESeqDataSet.results` function contains a number of arguments to
394
customize the results table which is generated. You can read about these
395
arguments by looking up the documentation of :meth:`.DESeqDataSet.results`.
396
Note that the :meth:`.DESeqDataSet.results` function automatically performs
397
independent filtering based on the mean of normalized counts for each gene,
398
optimizing the number of genes which will have an adjusted *p*-value below a
399
given FDR cutoff, :code:`alpha`.  Independent filtering is further discussed
400
:ref:`below<indfilt>`.  By default the argument :code:`alpha` is set to 0.1.  If
401
the adjusted *p*-value cutoff will be a value other than 0.1, :code:`alpha`
402
should be set to that value:
403
404
.. repl::
405
   res05 = dds.results(alpha=0.05)
406
   print(res05.summary())
407
   (res05.padj < 0.05).sum()
408
409
.. _IHW:
410
411
..
412
  Independent hypothesis weighting
413
  --------------------------------
414
415
  A generalization of the idea of *p*-value filtering is to *weight* hypotheses to
416
  optimize power. A Bioconductor package, `IHW
417
  <http://bioconductor.org/packages/IHW>`_, is available that implements the
418
  method of *Independent Hypothesis Weighting* [Ignatiadis2016]_.  Here we show
419
  the use of *IHW* for *p*-value adjustment of DESeq2 results.  For more details,
420
  please see the vignette of the `IHW <http://bioconductor.org/packages/IHW>`_
421
  package. The *IHW* result object is stored in the metadata::
422
423
    # (unevaluated code chunk)
424
    library("IHW")
425
    resIHW <- results(dds, filterFun=ihw)
426
    summary(resIHW)
427
    sum(resIHW$padj < 0.1, na.rm=TRUE)
428
    metadata(resIHW)$ihwResult
429
430
  .. note::
431
     If the results of independent hypothesis weighting are used in published
432
     research, please cite:
433
434
      Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W. (2016)
435
      Data-driven hypothesis weighting increases detection power in genome-scale multiple testing.
436
      *Nature Methods*, **13**:7.
437
      :doi:`http://dx.doi.org/10.1038/nmeth.3885`
438
439
  For advanced users, note that all the values calculated by the DESeq2 package
440
  are stored in the :class:`~DESeqDataSet.DESeqDataSet` object or the
441
  :class:`DESeqResults` object, and access to these values is discussed
442
  :ref:`below<access>`.
443
444
Exploring and exporting results
445
===============================
446
447
MA-plot
448
-------
449
450
In DESeq2, the function :meth:`.DESeqResults.plotMA` shows the log2 fold changes
451
attributable to a given variable over the mean of normalized counts for all the
452
samples in the :class:`~DESeqDataSet.DESeqDataSet`.  Points will be colored red
453
if the adjusted *p*-value is less than 0.1.  Points which fall out of the window
454
are plotted as open triangles pointing either up or down:
455
456
.. repl::
457
   res.plotMA(ylim=[-2,2])
458
   plt.show()
459
460
It is more useful visualize the MA-plot for the shrunken log2 fold changes,
461
which remove the noise associated with log2 fold changes from low count genes
462
without requiring arbitrary filtering thresholds:
463
464
.. repl::
465
   resLFC.plotMA(ylim=[-2,2])
466
   plt.show()
467
468
..
469
  After calling :meth:`.DESeqResults.plotMA`, one can use the function
470
  :func:`identify` to interactively detect the row number of individual genes by
471
  clicking on the plot.  One can then recover the gene identifiers by saving the
472
  resulting indices::
473
474
    idx = identify(res.baseMean, res.log2FoldChange)
475
    res.index[idx]
476
477
.. _shrink:
478
479
Alternative shrinkage estimators
480
--------------------------------
481
482
The moderated log fold changes proposed by [Love2014]_ use a normal prior
483
distribution, centered on zero and with a scale that is fit to the data. The
484
shrunken log fold changes are useful for ranking and visualization, without the
485
need for arbitrary filters on low count genes. The normal prior can sometimes
486
produce too strong of shrinkage for certain datasets. In DESeq2 version 1.18, we
487
include two additional adaptive shrinkage estimators, available via the
488
:code:`type` argument of :func:`lfcShrink`.
489
490
The options for :code:`type` are:
491
492
* :code:`apeglm` is the adaptive t prior shrinkage estimator from the `apeglm
493
  <http://bioconductor.org/packages/apeglm>`_ package [Zhu2018]_. As of version
494
  1.28.0, it is the default estimator.
495
* :code:`ashr` is the adaptive shrinkage estimator from the `ashr
496
  <https://github.com/stephens999/ashr>`_ package [Stephens2016]_.  Here DESeq2
497
  uses the ashr option to fit a mixture of Normal distributions to form the
498
  prior, with :code:`method="shrinkage"`.
499
* :code:`normal`: is the the original DESeq2 shrinkage estimator, an adaptive
500
  Normal distribution as prior.
501
502
If the shrinkage estimator :code:`apeglm` is used in published research, please
503
cite [Zhu2018]_.
504
505
If the shrinkage estimator :code:`ashr` is used in published research, please
506
cite [Stephens2016]_.
507
508
In the LFC shrinkage code above, we specified
509
:code:`coef="condition_treated_vs_untreated"`. We can also just specify the
510
coefficient by the order that it appears in :code:`dds.resultsNames()`, in this
511
case :code:`coef=2`. For more details explaining how the shrinkage estimators
512
differ, and what kinds of designs, contrasts and output is provided by each, see
513
the :ref:`extended section on shrinkage estimators<moreshrink>`:
514
515
>>> dds.resultsNames()
516
>>> # because we are interested in treated vs untreated, we set 'coef=2'
517
>>> resNorm = dds.lfcShrink(coef=2, type="normal")
518
>>> resAsh = dds.lfcShrink(coef=2, type="ashr")
519
520
..
521
  ```{r fig.width=8, fig.height=3}
522
  par(mfrow=c(1,3), mar=c(4,4,2,1))
523
  xlim <- c(1,1e5); ylim <- c(-3,3)
524
  plotMA(resLFC, xlim=xlim, ylim=ylim, main="apeglm")
525
  plotMA(resNorm, xlim=xlim, ylim=ylim, main="normal")
526
  plotMA(resAsh, xlim=xlim, ylim=ylim, main="ashr")
527
  ```
528
529
..
530
  .. note::
531
     We have sped up the :code:`apeglm` method so it takes roughly about the same
532
     amount of time as :code:`normal`, e.g. ~5 seconds for the :code:`pasilla`
533
     dataset of ~10,000 genes and 7 samples.  If fast shrinkage estimation of LFC
534
     is needed, **but the posterior standard deviation is not needed**, setting
535
     :code:`apeMethod="nbinomC"` will produce a ~10x speedup, but the
536
     :code:`lfcSE` column will be returned with :code:`NA`.  A variant of this
537
     fast method, :code:`apeMethod="nbinomC*"` includes random starts.
538
539
540
.. note::
541
   If there is unwanted variation present in the data (*e.g.* batch effects) it
542
   is always recommended to correct for this, which can be accommodated in
543
   DESeq2 by including in the design any known batch variables or by using
544
   functions/packages such as :func:`.pycombat_seq`, :code:`svaseq` in `sva
545
   <http://bioconductor.org/packages/sva>`_ [Leek2014]_ or the :code:`RUV`
546
   functions in `RUVSeq <http://bioconductor.org/packages/RUVSeq>`_ [Risso2014]_
547
   to estimate variables that capture the unwanted variation.  In addition, the
548
   ashr developers have a `specific method <https://github.com/dcgerard/vicar>`_
549
   for accounting for unwanted variation in combination with ashr [Gerard2020]_.
550
551
..
552
  Plot counts
553
  -----------
554
555
  It can also be useful to examine the counts of reads for a single gene across
556
  the groups. A simple function for making this plot is :func:`plotCounts`, which
557
  normalizes counts by the estimated size factors (or normalization factors if
558
  these were used) and adds a pseudocount of 1/2 to allow for log scale plotting.
559
  The counts are grouped by the variables in :code:`intgroup`, where more than one
560
  variable can be specified. Here we specify the gene which had the smallest
561
  *p*-value from the results table created above. You can select the gene to plot
562
  by rowname or by numeric index::
563
564
    plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
565
566
  For customized plotting, an argument :code:`returnData` specifies that the
567
  function should only return a data frame for plotting with :code:`ggplot`::
568
569
    d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition",
570
                    returnData=TRUE)
571
    library("ggplot2")
572
    ggplot(d, aes(x=condition, y=count)) +
573
      geom_point(position=position_jitter(w=0.1,h=0)) +
574
      scale_y_log10(breaks=c(25,100,400))
575
576
577
More information on results columns
578
-----------------------------------
579
580
Information about which variables and tests were used can be found by inspecting
581
the attribute :code:`description` on the :class:`~results.DESeqResults` object:
582
583
.. repl::
584
   res.description
585
586
For a particular gene, a log2 fold change of -1 for :code:`condition treated vs
587
untreated` means that the treatment induces a multiplicative change in observed
588
gene expression level of :math:`2^{-1} = 0.5` compared to the untreated
589
condition. If the variable of interest is continuous-valued, then the reported
590
log2 fold change is per unit of change of that variable.
591
592
.. _pvaluesNA:
593
594
.. admonition:: Note on *p*-values set to :code:`NA`
595
596
   Some values in the results table can be set to :code:`NA` for one of the
597
   following reasons:
598
599
   * If within a row, all samples have zero counts, the :code:`baseMean` column
600
     will be zero, and the log2 fold change estimates, *p*-value and adjusted
601
     *p*-value will all be set to :code:`NA`.
602
   * If a row contains a sample with an extreme count outlier then the *p*-value
603
     and adjusted *p*-value will be set to :code:`NA`.  These outlier counts are
604
     detected by Cook's distance.  Customization of this outlier filtering and
605
     description of functionality for replacement of outlier counts and
606
     refitting is described :ref:`below<outlier>`.
607
   * If a row is filtered by automatic independent filtering, for having a low
608
     mean normalized count, then only the adjusted *p*-value will be set to
609
     :code:`NA`.  Description and customization of independent filtering is
610
     described :ref:`below<indfilt>`.
611
612
..
613
  Rich visualization and reporting of results
614
  -------------------------------------------
615
616
  **ReportingTools** An HTML report of the results with plots and
617
  sortable/filterable columns can be generated using the `ReportingTools
618
  <http://bioconductor.org/packages/ReportingTools>`_ package on a
619
  :class:`~DESeqDataSet.DESeqDataSet` that has been processed by the :func:`DESeq`
620
  function.  For a code example, see the *RNA-seq differential expression*
621
  vignette at the `ReportingTools
622
  <http://bioconductor.org/packages/ReportingTools>`_ page, or the manual page for
623
  the :meth:`publish` method for the :class:`~DESeqDataSet.DESeqDataSet` class.
624
625
  **regionReport** An HTML and PDF summary of the results with plots can also be
626
  generated using the `regionReport
627
  <http://bioconductor.org/packages/regionReport>`_ package. The
628
  :func:`DESeq2Report` function should be run on a
629
  :class:`~DESeqDataSet.DESeqDataSet` that has been processed by the :func:`DESeq`
630
  function.  For more details see the manual page for :func:`DESeq2Report` and an
631
  example vignette in the `regionReport
632
  <http://bioconductor.org/packages/regionReport>`_ package.
633
634
  **Glimma** Interactive visualization of DESeq2 output, including MA-plots (also
635
  called MD-plot) can be generated using the `Glimma
636
  <http://bioconductor.org/packages/Glimma>`_ package. See the manual page for
637
  *glMDPlot.DESeqResults*.
638
639
  **pcaExplorer** Interactive visualization of DESeq2 output, including PCA plots,
640
  boxplots of counts and other useful summaries can be generated using the
641
  `pcaExplorer <http://bioconductor.org/packages/pcaExplorer>`_ package. See the
642
  *Launching the application* section of the package vignette.
643
644
  **iSEE** Provides functions for creating an interactive Shiny-based graphical
645
  user interface for exploring data stored in SummarizedExperiment objects,
646
  including row- and column-level metadata. Particular attention is given to
647
  single-cell data in a SingleCellExperiment object with visualization of
648
  dimensionality reduction results.  `iSEE
649
  <https://bioconductor.org/packages/iSEE>`_ is on Bioconductor.  An example
650
  wrapper function for converting a :class:`~DESeqDataSet.DESeqDataSet` to a
651
  SingleCellExperiment object for use with *iSEE* can be found at the following
652
  gist, written by Federico Marini:
653
654
  * <https://gist.github.com/federicomarini/4a543eebc7e7091d9169111f76d59de1>
655
656
  **DEvis** DEvis is a powerful, integrated solution for the analysis of
657
  differential expression data. This package includes an array of tools for
658
  manipulating and aggregating data, as well as a wide range of customizable
659
  visualizations, and project management functionality that simplify RNA-Seq
660
  analysis and provide a variety of ways of exploring and analyzing data.  *DEvis*
661
  can be found on `CRAN <https://cran.r-project.org/package=DEVis>` and `GitHub
662
  <https://github.com/price0416/DEvis>`.
663
664
665
Exporting results to CSV files
666
------------------------------
667
668
A plain-text file of the results can be exported using the method
669
:meth:`.DESeqResults.to_csv`. We suggest using a descriptive file name
670
indicating the variable and level which were tested:
671
672
.. repl::
673
   resOrdered.to_csv("condition_treated_results.csv")
674
675
Exporting only the results which pass an adjusted *p*-value threshold can be
676
accomplished by subsetting the results:
677
678
.. repl::
679
   resSig = resOrdered[resOrdered.padj < 0.1]
680
   resSig.head()
681
682
683
Multi-factor designs
684
====================
685
686
Experiments with more than one factor influencing the counts can be analyzed
687
using design formula that include the additional variables.  In fact, DESeq2 can
688
analyze any possible experimental design that can be expressed with fixed
689
effects terms (multiple factors, designs with interactions, designs with
690
continuous variables, splines, and so on are all possible).
691
692
By adding variables to the design, one can control for additional variation in
693
the counts. For example, if the condition samples are balanced across
694
experimental batches, by including the :code:`batch` factor to the design, one
695
can increase the sensitivity for finding differences due to :code:`condition`.
696
There are multiple ways to analyze experiments when the additional variables are
697
of interest and not just controlling factors (see :ref:`section on
698
interactions<interactions>`).
699
700
**Experiments with many samples**: in experiments with many samples (e.g. 50,
701
100, etc.) it is highly likely that there will be technical variation affecting
702
the observed counts. Failing to model this additional technical variation will
703
lead to spurious results. Many methods exist that can be used to model technical
704
variation, which can be easily included in the DESeq2 design to control for
705
technical variation which estimating effects of interest. See the `RNA-seq
706
workflow <http://www.bioconductor.org/help/workflows/rnaseqGene>`_ for examples
707
of using RUV or SVA in combination with DESeq2.  For more details on why it is
708
important to control for technical variation in large sample experiments, see
709
the following `thread
710
<https://twitter.com/mikelove/status/1513468597288452097>`_, also archived `here
711
<https://htmlpreview.github.io/?https://github.com/frederikziebell/science_tweetorials/blob/master/DESeq2_many_samples.html>`_
712
by Frederik Ziebell.
713
714
The data in the :doc:`/pasilla` module have a condition of interest (the column
715
:code:`condition`), as well as information on the type of sequencing which was
716
performed (the column :code:`type`), as we can see below:
717
718
.. repl::
719
   dds.obs
720
721
We create a copy of the :class:`~DESeqDataSet.DESeqDataSet`, so that we can
722
rerun the analysis using a multi-factor design:
723
724
.. repl::
725
   ddsMF = dds.copy()
726
727
We change the categories of :code:`type` so it only contains letters.  Be
728
careful when changing level names to use the same order as the current
729
categories:
730
731
.. repl::
732
   import re
733
   dds.obs["type"].cat.categories
734
   dds.obs["type"].cat.rename_categories([re.sub("-.*", "", c) for c in dds.obs["type"].cat.categories])
735
   dds.obs["type"].dtype.categories
736
737
We can account for the different types of sequencing, and get a clearer picture
738
of the differences attributable to the treatment.  As :code:`condition` is the
739
variable of interest, we put it at the end of the formula. Thus the
740
:meth:`.DESeqDataSet.results` function will by default pull the
741
:code:`condition` results unless :code:`contrast` or :code:`name` arguments are
742
specified.
743
744
Then we can re-run :func:`DESeq`:
745
746
.. repl::
747
   ddsMF.design = "~ type + condition"
748
   ddsMF = DESeq(ddsMF)
749
750
Again, we access the results using the :meth:`.DESeqDataSet.results` function:
751
752
.. repl::
753
   resMF = ddsMF.results()
754
   resMF.head()
755
756
It is also possible to retrieve the log2 fold changes, *p*-values and adjusted
757
*p*-values of variables other than the last one in the design.  While in this
758
case, :code:`type` is not biologically interesting as it indicates differences
759
across sequencing protocols, for other hypothetical designs, such as
760
:code:`~genotype + condition + genotype:condition`, we may actually be
761
interested in the difference in baseline expression across genotype, which is
762
not the last variable in the design.
763
764
In any case, the :code:`contrast` argument of the function
765
:meth:`.DESeqDataSet.results` takes a string list of length three: the name of
766
the variable, the name of the factor level for the numerator of the log2 ratio,
767
and the name of the factor level for the denominator.  The :code:`contrast`
768
argument can also take other forms, as described in the help page for
769
:meth:`.DESeqDataSet.results` and :ref:`below<contrasts>`:
770
771
.. repl::
772
   resMFType = ddsMF.results(contrast=("type", "single", "paired"))
773
   resMFType.head()
774
775
If the variable is continuous or an interaction term (see :ref:`section on
776
interactions<interactions>`) then the results can be extracted using the
777
:code:`name` argument to :meth:`.DESeqDataSet.results`, where the name is one of
778
elements returned by :code:`dds.resultsNames()`.
779
780
.. _transform:
781
782
..
783
  Data transformations and visualization
784
  ======================================
785
786
  Count data transformations
787
  --------------------------
788
789
  In order to test for differential expression, we operate on raw counts and use
790
  discrete distributions as described in the previous section on differential
791
  expression.  However for other downstream analyses -- *e.g.* for visualization
792
  or clustering -- it might be useful to work with transformed versions of the
793
  count data.
794
795
  Maybe the most obvious choice of transformation is the logarithm.  Since count
796
  values for a gene can be zero in some conditions (and non-zero in others), some
797
  advocate the use of *pseudocounts*, i.e. transformations of the form :math:`y =
798
  \log(n + n_0)`, where :math:`n` represents the count values and :math:`n_0` is a
799
  positive constant.
800
801
  In this section, we discuss two alternative approaches that offer more
802
  theoretical justification and a rational way of choosing parameters equivalent
803
  to :math:`n_0` above.  One makes use of the concept of variance stabilizing
804
  transformations (VST) [Tibshirani1988]_ [sagmb2003]_ [Anders2010]_, and the
805
  other is the *regularized logarithm* or *rlog*, which incorporates a prior on
806
  the sample differences [Love2014]_.  Both transformations produce transformed
807
  data on the log2 scale which has been normalized with respect to library size or
808
  other normalization factors.
809
810
  The point of these two transformations, the VST and the rlog, is to remove the
811
  dependence of the variance on the mean, particularly the high variance of the
812
  logarithm of count data when the mean is low. Both VST and rlog use the
813
  experiment-wide trend of variance over mean, in order to transform the data to
814
  remove the experiment-wide trend. Note that we do not require or desire that all
815
  the genes have *exactly* the same variance after transformation. Indeed, in a
816
  figure below, you will see that after the transformations the genes with the
817
  same mean do not have exactly the same standard deviations, but that the
818
  experiment-wide trend has flattened. It is those genes with row variance above
819
  the trend which will allow us to cluster samples into interesting groups.
820
821
  .. admonition:: Note on running time:
822
823
     If you have many samples (*e.g.* 100s), the :func:`rlog` function might take
824
     too long, and so the :func:`vst` function will be a faster choice.  The rlog
825
     and VST have similar properties, but the rlog requires fitting a shrinkage
826
     term for each sample and each gene which takes time. See the DESeq2 paper for
827
     more discussion on the differences [Love2014]_.
828
829
  Blind dispersion estimation
830
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
831
832
  The two functions, :func:`vst` and :func:`rlog`: have an argument :code:`blind`,
833
  for whether the transformation should be blind to the sample information
834
  specified by the design formula. When :code:`blind` equals :code:`True` (the
835
  default), the functions will re-estimate the dispersions using only an
836
  intercept.  This setting should be used in order to compare samples in a manner
837
  wholly unbiased by the information about experimental groups, for example to
838
  perform sample QA (quality assurance) as demonstrated below.
839
840
  However, blind dispersion estimation is not the appropriate choice if one
841
  expects that many or the majority of genes will have large differences in counts
842
  which are explainable by the experimental design, and one wishes to transform
843
  the data for downstream analysis. In this case, using blind dispersion
844
  estimation will lead to large estimates of dispersion, as it attributes
845
  differences due to experimental design as unwanted *noise*, and will result in
846
  overly shrinking the transformed values towards each other.  By setting
847
  :code:`blind` to :code:`False`, the dispersions already estimated will be used
848
  to perform transformations, or if not present, they will be estimated using the
849
  current design formula. Note that only the fitted dispersion estimates from
850
  mean-dispersion trend line are used in the transformation (the global dependence
851
  of dispersion on mean for the entire experiment).  So setting :code:`blind` to
852
  :code:`False` is still for the most part not using the information about which
853
  samples were in which experimental group in applying the transformation.
854
855
  Extracting transformed values
856
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
857
858
  These transformation functions return an object of class :class:`DESeqTransform`
859
  which is a subclass of :class:`AnnData`.  For ~20 samples, running on a newly
860
  created :class:`~DESeqDataSet.DESeqDataSet`, :func:`rlog` may take 30 seconds,
861
  while :func:`vst` takes less than 1 second.  The running times are shorter when
862
  using `blind=False` and if the function :func:`DESeq` has already been run,
863
  because then it is not necessary to re-estimate the dispersion values.  The
864
  matrix of normalized values is stored in the :attr:`X` attribute:
865
866
  >>> vsd = vst(dds, blind=FALSE)
867
  >>> rld = rlog(dds, blind=FALSE)
868
  >>> vsd.X.head(3)
869
870
  Variance stabilizing transformation
871
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
872
873
  Above, we used a parametric fit for the dispersion. In this case, the
874
  closed-form expression for the variance stabilizing transformation is used by
875
  the :func:`vst` function. If a local fit is used (option
876
  :code:`fitType="locfit"` to :meth:`.DESeqDataSet.estimateDispersions`) a
877
  numerical integration is used instead. The transformed data should be
878
  approximated variance stabilized and also includes correction for size factors
879
  or normalization factors. The transformed data is on the log2 scale for large
880
  counts.
881
882
  Regularized log transformation
883
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
884
885
  The function :func:`rlog`, stands for *regularized log*, transforming the
886
  original count data to the log2 scale by fitting a model with a term for each
887
  sample and a prior distribution on the coefficients which is estimated from the
888
  data. This is the same kind of shrinkage (sometimes referred to as
889
  regularization, or moderation) of log fold changes used by :func:`DESeq` and
890
  :func:`nbinomWaldTest`. The resulting data contains elements defined as:
891
892
  .. math::
893
     \log_2(q_{ij}) = \beta_{i0} + \beta_{ij}
894
895
  where :math:`q_{ij}` is a parameter proportional to the expected true
896
  concentration of fragments for gene :math:`i` and sample :math:`j` (see formula
897
  :ref:`below<theory>`), :math:`\beta_{i0}` is an intercept which does not undergo
898
  shrinkage, and :math:`\beta_{ij}` is the sample-specific effect which is shrunk
899
  toward zero based on the dispersion-mean trend over the entire dataset. The
900
  trend typically captures high dispersions for low counts, and therefore these
901
  genes exhibit higher shrinkage from the :func:`rlog`.
902
903
  Note that, as :math:`q_{ij}` represents the part of the mean value
904
  :math:`\mu_{ij}` after the size factor :math:`s_j` has been divided out, it is
905
  clear that the rlog transformation inherently accounts for differences in
906
  sequencing depth. Without priors, this design matrix would lead to a non-unique
907
  solution, however the addition of a prior on non-intercept betas allows for a
908
  unique solution to be found.
909
910
  Effects of transformations on the variance
911
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
912
913
  The figure below plots the standard deviation of the transformed data, across
914
  samples, against the mean, using the shifted logarithm transformation, the
915
  regularized log transformation and the variance stabilizing transformation.  The
916
  shifted logarithm has elevated standard deviation in the lower count range, and
917
  the regularized log to a lesser extent, while for the variance stabilized data
918
  the standard deviation is roughly constant along the whole dynamic range.
919
920
  Note that the vertical axis in such plots is the square root of the variance
921
  over all samples, so including the variance due to the experimental conditions.
922
  While a flat curve of the square root of variance over the mean may seem like
923
  the goal of such transformations, this may be unreasonable in the case of
924
  datasets with many true differences due to the experimental conditions.
925
926
  ..
927
    ```{r meansd}
928
    # this gives log2(n + 1)
929
    ntd <- normTransform(dds)
930
    library("vsn")
931
    meanSdPlot(assay(ntd))
932
    meanSdPlot(assay(vsd))
933
    meanSdPlot(assay(rld))
934
    ```
935
936
  Data quality assessment by sample clustering and visualization
937
  --------------------------------------------------------------
938
939
  Data quality assessment and quality control (*i.e.* the removal of
940
  insufficiently good data) are essential steps of any data analysis. These steps
941
  should typically be performed very early in the analysis of a new data set,
942
  preceding or in parallel to the differential expression testing.
943
944
  We define the term *quality* as *fitness for purpose*.  Our purpose is the
945
  detection of differentially expressed genes, and we are looking in particular
946
  for samples whose experimental treatment suffered from an anormality that
947
  renders the data points obtained from these particular samples detrimental to
948
  our purpose.
949
950
  Heatmap of the count matrix
951
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
952
953
  To explore a count matrix, it is often instructive to look at it as a heatmap.
954
  Below we show how to produce such a heatmap for various transformations of the
955
  data.
956
957
  TODO use clustermap instead of pheatmap
958
959
  .. repl::
960
961
     from seaborn import clustermap
962
     select = np.sort(dds.counts(normalized=True).mean(axis=0))[::-1][:20]
963
     df = dds.obs[["condition", "type"]]
964
965
  ..
966
    ```{r heatmap}
967
    library("pheatmap")
968
    select <- order(rowMeans(counts(dds,normalized=TRUE)),
969
                    decreasing=TRUE)[1:20]
970
    df <- as.data.frame(colData(dds)[,c("condition","type")])
971
    pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
972
             cluster_cols=FALSE, annotation_col=df)
973
    pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
974
             cluster_cols=FALSE, annotation_col=df)
975
    pheatmap(assay(rld)[select,], cluster_rows=FALSE, show_rownames=FALSE,
976
             cluster_cols=FALSE, annotation_col=df)
977
    ```
978
979
  Heatmap of the sample-to-sample distances
980
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
981
982
  Another use of the transformed data is sample clustering. Here, we apply the
983
  :func:`dist` function to the transpose of the transformed count matrix to get
984
  sample-to-sample distances::
985
986
    sampleDists = dist(vsd.layers.T)
987
988
  A heatmap of this distance matrix gives us an overview over similarities and
989
  dissimilarities between samples.  We have to provide a hierarchical clustering
990
  :code:`hc` to the heatmap function based on the sample distances, or else the
991
  heatmap function would calculate a clustering based on the distances between the
992
  rows/columns of the distance matrix.
993
994
  TODO
995
996
  ..
997
    ```{r figHeatmapSamples, fig.height=4, fig.width=6}
998
    library("RColorBrewer")
999
    sampleDistMatrix <- as.matrix(sampleDists)
1000
    rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
1001
    colnames(sampleDistMatrix) <- NULL
1002
    colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
1003
    pheatmap(sampleDistMatrix,
1004
             clustering_distance_rows=sampleDists,
1005
             clustering_distance_cols=sampleDists,
1006
             col=colors)
1007
    ```
1008
1009
  Principal component plot of the samples
1010
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1011
1012
  Related to the distance matrix is the PCA plot, which shows the samples in the
1013
  2D plane spanned by their first two principal components. This type of plot is
1014
  useful for visualizing the overall effect of experimental covariates and batch
1015
  effects.
1016
1017
  TODO
1018
1019
  ..
1020
    ```{r figPCA}
1021
    plotPCA(vsd, intgroup=c("condition", "type"))
1022
    ```
1023
1024
  It is also possible to customize the PCA plot using the
1025
  *ggplot* function.
1026
1027
  TODO
1028
1029
  ..
1030
    ```{r figPCA2}
1031
    pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE)
1032
    percentVar <- round(100 * attr(pcaData, "percentVar"))
1033
    ggplot(pcaData, aes(PC1, PC2, color=condition, shape=type)) +
1034
      geom_point(size=3) +
1035
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
1036
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
1037
      coord_fixed()
1038
    ```
1039
1040
Variations to the standard workflow
1041
===================================
1042
1043
Wald test individual steps
1044
--------------------------
1045
1046
The function :func:`DESeq` runs the following functions in order:
1047
1048
>>> dds = dds.estimateSizeFactors()
1049
>>> dds = dds.estimateDispersions(dds)
1050
>>> dds = nbinomWaldTest(dds)
1051
1052
Control features for estimating size factors
1053
--------------------------------------------
1054
1055
In some experiments, it may not be appropriate to assume that a minority of
1056
features (genes) are affected greatly by the condition, such that the standard
1057
median-ratio method for estimating the size factors will not provide correct
1058
inference (the log fold changes for features that were truly un-changing will
1059
not centered on zero). This is a difficult inference problem for any method, but
1060
there is an important feature that can be used: the :code:`controlGenes`
1061
argument of :meth:`.DESeqDataSet.estimateSizeFactors`. If there is any prior
1062
information about features (genes) that should not be changing with respect to
1063
the condition, providing this set of features to :code:`controlGenes` will
1064
ensure that the log fold changes for these features will be centered around 0.
1065
The paradigm then becomes:
1066
1067
>>> dds = dds.estimateSizeFactors(controlGenes=ctrlGenes)
1068
>>> dds = DESeq(dds)
1069
1070
.. _contrasts:
1071
1072
Contrasts
1073
---------
1074
1075
A contrast is a linear combination of estimated log2 fold changes, which can be
1076
used to test if differences between groups are equal to zero.  The simplest use
1077
case for contrasts is an experimental design containing a factor with three
1078
levels, say A, B and C.  Contrasts enable the user to generate results for all 3
1079
possible differences: log2 fold change of B vs A, of C vs A, and of C vs B.  The
1080
:code:`contrast` argument of :meth:`.DESeqDataSet.results` function is used to
1081
extract test results of log2 fold changes of interest, for example:
1082
1083
>>> dds.results(contrast=["condition","C","B"])
1084
1085
Log2 fold changes can also be added and subtracted by providing a list to the
1086
:code:`contrast` argument which has two elements: the names of the log2 fold
1087
changes to add, and the names of the log2 fold changes to subtract. The names
1088
used in the list should come from :code:`dds.resultsNames()`.  Alternatively, a
1089
numeric vector of the length of :code:`dds.resultsNames()` can be provided, for
1090
manually specifying the linear combination of terms. A `tutorial
1091
<https://github.com/tavareshugo/tutorial_DESeq2_contrasts>`_ describing the use
1092
of numeric contrasts for DESeq2 explains a general approach to comparing across
1093
groups of samples.  Demonstrations of the use of contrasts for various designs
1094
can be found in the examples section of the help page
1095
:meth:`.DESeqDataSet.results`.  The mathematical formula that is used to
1096
generate the contrasts can be found :ref:`below<theory>`.
1097
1098
.. _interactions:
1099
1100
Interactions
1101
------------
1102
1103
Interaction terms can be added to the design formula, in order to test, for
1104
example, if the log2 fold change attributable to a given condition is
1105
*different* based on another factor, for example if the condition effect differs
1106
across genotype.
1107
1108
Preliminary remarks
1109
^^^^^^^^^^^^^^^^^^^
1110
1111
Many users begin to add interaction terms to the design formula, when in fact a
1112
much simpler approach would give all the results tables that are desired.  We
1113
will explain this approach first, because it is much simpler to perform.  If the
1114
comparisons of interest are, for example, the effect of a condition for
1115
different sets of samples, a simpler approach than adding interaction terms
1116
explicitly to the design formula is to perform the following steps:
1117
1118
  * combine the factors of interest into a single factor with all
1119
    combinations of the original factors
1120
  * change the design to include just this factor, *e.g.* :code:`"~ group"`
1121
1122
Using this design is similar to adding an interaction term, in that it models
1123
multiple condition effects which can be easily extracted with
1124
:meth:`.DESeqDataSet.results`.  Suppose we have two factors :code:`genotype`
1125
(with values I, II, and III) and :code:`condition` (with values A and B), and we
1126
want to extract the condition effect specifically for each genotype. We could
1127
use the following approach to obtain, *e.g.* the condition effect for genotype
1128
I:
1129
1130
>>> dds.obs["group"] = Factor([f"{g}{c}" for (g,c) in zip(dds.obs["genotype"], dds.obs["condition"])])
1131
>>> dds.design = "~ group"
1132
>>> dds = DESeq(dds)
1133
>>> dds.resultsNames()
1134
>>> dds.results(contrast=["group", "IB", "IA"])
1135
1136
1137
Adding interactions to the design
1138
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1139
1140
The following two plots diagram genotype-specific condition effects, which could
1141
be modeled with interaction terms by using a design of :code:`~genotype +
1142
condition + genotype:condition`.
1143
1144
In the first plot (Gene 1), note that the condition effect is consistent across
1145
genotypes. Although condition A has a different baseline for I, II, and III, the
1146
condition effect is a log2 fold change of about 2 for each genotype.  Using a
1147
model with an interaction term :code:`genotype:condition`, the interaction terms
1148
for genotype II and genotype III will be nearly 0.
1149
1150
Here, the y-axis represents :math:`\log(n+1)`, and each group has 20 samples
1151
(black dots). A red line connects the mean of the groups within each genotype.
1152
1153
.. repl-quiet::
1154
   from inmoose.utils import rnbinom
1155
   import seaborn as sns
1156
1157
   npg = 20
1158
   mu = 2**np.array([8,10,9,11,10,12])
1159
   cond = np.repeat([np.repeat(["A","B"], npg)], 3, axis=0).flatten()
1160
   geno = np.repeat(["I","II","III"], 2*npg)
1161
   #table(cond, geno)
1162
   counts = rnbinom(6*npg, mu = np.repeat(mu, npg), size=1/.01)
1163
   d = pd.DataFrame({"log2c": np.log2(counts+1), "cond": cond, "geno": geno})
1164
1165
   def plotit(d, title):
1166
     g = sns.FacetGrid(d, col="geno")
1167
     g.map(sns.stripplot, "cond", "log2c", color="black")
1168
     g.map(sns.pointplot, "cond", "log2c", color="red")
1169
     g.fig.subplots_adjust(top=0.9)
1170
     g.fig.suptitle(title)
1171
     plt.show()
1172
1173
   plotit(d, "Gene 1")
1174
1175
In the second plot (Gene 2), we can see that the condition effect is not
1176
consistent across genotype. Here the main condition effect (the effect for the
1177
reference genotype I) is again 2. However, this time the interaction terms will
1178
be around 1 for genotype II and -4 for genotype III. This is because the
1179
condition effect is higher by 1 for genotype II compared to genotype I, and
1180
lower by 4 for genotype III compared to genotype I.  The condition effect for
1181
genotype II (or III) is obtained by adding the main condition effect and the
1182
interaction term for that genotype.  Such a plot can be made using the
1183
:func:`plotCounts` function as shown above.
1184
1185
.. repl-quiet::
1186
   mu[3] = 2**12
1187
   mu[5] = 2**8
1188
   counts = rnbinom(6*npg, mu=np.repeat(mu, npg), size=1/.01)
1189
   d2 = pd.DataFrame({"log2c": np.log2(counts+1), "cond": cond, "geno": geno})
1190
   plotit(d2, "Gene 2")
1191
1192
Now we will continue to explain the use of interactions in order to test for
1193
*differences* in condition effects. We continue with the example of condition
1194
effects across three genotypes (I, II, and III).
1195
1196
The key point to remember about designs with interaction terms is that, unlike
1197
for a design :code:`~genotype + condition`, where the condition effect
1198
represents the *overall* effect controlling for differences due to genotype, by
1199
adding :code:`genotype:condition`, the main condition effect only represents the
1200
effect of condition for the *reference level* of genotype (I, or whichever level
1201
was defined by the user as the reference level). The interaction terms
1202
:code:`genotypeII.conditionB` and :code:`genotypeIII.conditionB` give the
1203
*difference* between the condition effect for a given genotype and the condition
1204
effect for the reference genotype.
1205
1206
This genotype-condition interaction example is examined in further detail in
1207
Example 3 in the help page for :meth:`.DESeqDataSet.results`. In particular, we
1208
show how to test for differences in the condition effect across genotype, and we
1209
show how to obtain the condition effect for non-reference genotypes.
1210
1211
Time-series experiments
1212
-----------------------
1213
1214
There are a number of ways to analyze time-series experiments, depending on the
1215
biological question of interest. In order to test for any differences over
1216
multiple time points, once can use a design including the time factor, and then
1217
test using the likelihood ratio test as described in the following section,
1218
where the time factor is removed in the reduced formula. For a control and
1219
treatment time series, one can use a design formula containing the condition
1220
factor, the time factor, and the interaction of the two. In this case, using the
1221
likelihood ratio test with a reduced model which does not contain the
1222
interaction terms will test whether the condition induces a change in gene
1223
expression at any time point after the reference level time point (time 0). An
1224
example of the later analysis is provided in the `RNA-seq workflow
1225
<http://www.bioconductor.org/help/workflows/rnaseqGene>`_.
1226
1227
Likelihood ratio test
1228
---------------------
1229
1230
DESeq2 offers two kinds of hypothesis tests: the Wald test, where we use the
1231
estimated standard error of a log2 fold change to test if it is equal to zero,
1232
and the likelihood ratio test (LRT). The LRT examines two models for the counts,
1233
a *full* model with a certain number of terms and a *reduced* model, in which
1234
some of the terms of the *full* model are removed. The test determines if the
1235
increased likelihood of the data using the extra terms in the *full* model is
1236
more than expected if those extra terms are truly zero.
1237
1238
The LRT is therefore useful for testing multiple terms at once, for example
1239
testing 3 or more levels of a factor at once, or all interactions between two
1240
variables.  The LRT for count data is conceptually similar to an analysis of
1241
variance (ANOVA) calculation in linear regression, except that in the case of
1242
the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the
1243
*deviance* captures the difference in likelihood between a full and a reduced
1244
model.
1245
1246
The likelihood ratio test can be performed by specifying :code:`test="LRT"` when
1247
using the :func:`DESeq` function, and providing a reduced design formula, *e.g.*
1248
one in which a number of terms from :code:`dds.design` are removed.  The degrees
1249
of freedom for the test is obtained from the difference between the number of
1250
parameters in the two models.  A simple likelihood ratio test, if the full
1251
design was :code:`~condition` would look like:
1252
1253
>>> dds = DESeq(dds, test="LRT", reduced="~1")
1254
>>> res = dds.results()
1255
1256
If the full design contained other variables, such as a batch variable, *e.g.*
1257
:code:`~batch + condition` then the likelihood ratio test would look like:
1258
1259
>>> dds = DESeq(dds, test="LRT", reduced="~batch")
1260
>>> res = dds.results()
1261
1262
.. _moreshrink:
1263
1264
Extended section on shrinkage estimators
1265
----------------------------------------
1266
1267
Here we extend the :ref:`discussion of shrinkage estimators<shrink>`.  Below is
1268
a summary table of differences between methods available in :func:`lfcShrink`
1269
via the :code:`type` argument (and for further technical reference on use of
1270
arguments please the documentation of :func:`lfcShrink`):
1271
1272
.. list-table::
1273
   :header-rows: 1
1274
   :stub-columns: 1
1275
1276
   * - method
1277
     - :code:`apeglm` [Zhu2018]_
1278
     - :code:`ashr` [Stephens2016]_
1279
     - :code:`normal` [Love2014]_
1280
   * - Good for ranking by LFC
1281
     - ✓
1282
     - ✓
1283
     - ✓
1284
   * - Preserves size of large LFC
1285
     - ✓
1286
     - ✓
1287
     -
1288
   * - Can compute *s*-values [Stephens2016]_
1289
     - ✓
1290
     - ✓
1291
     -
1292
   * - Allows use of :code:`coef`
1293
     - ✓
1294
     - ✓
1295
     - ✓
1296
   * - Allows use of :code:`lfcThreshold`
1297
     - ✓
1298
     - ✓
1299
     - ✓
1300
   * - Allows use of :code:`contrast`
1301
     - ✓
1302
     - ✓
1303
     -
1304
   * - Can shrink interaction terms
1305
     - ✓
1306
     - ✓
1307
     -
1308
1309
Beginning with the first row, all shrinkage methods provided by DESeq2 are good
1310
for ranking genes by "effect size", that is the log2 fold change (LFC) across
1311
groups, or associated with an interaction term. It is useful to contrast ranking
1312
by effect size with ranking by a *p*-value or adjusted *p*-value associated with
1313
a null hypothesis: while increasing the number of samples will tend to decrease
1314
the associated *p*-value for a gene that is differentially expressed, the
1315
estimated effect size or LFC becomes more precise. Also, a gene can have a small
1316
*p*-value although the change in expression is not great, as long as the
1317
standard error associated with the estimated LFC is small.
1318
1319
The next two rows point out that :code:`apeglm` and :code:`ashr` shrinkage
1320
methods help to preserve the size of large LFC, and can be used to compute
1321
*s-values*. These properties are related. As noted in the :ref:`previous
1322
section<shrink>`, the original DESeq2 shrinkage estimator used a Normal
1323
distribution, with a scale that adapts to the spread of the observed LFCs.
1324
Because the tails of the Normal distribution become thin relatively quickly, it
1325
was important when we designed the method that the prior scaling is sensitive to
1326
the very largest observed LFCs. As you can read in the DESeq2 paper, under the
1327
section, "*Empirical prior estimate*", we used the top 5% of the LFCs by
1328
absolute value to set the scale of the Normal prior (we later added weighting
1329
the quantile by precision). :code:`ashr`, published in 2016, and :code:`apeglm`
1330
use wide-tailed priors to avoid shrinking large LFCs. While a typical RNA-seq
1331
experiment may have many LFCs between -1 and 1, we might consider a LFC of >4 to
1332
be very large, as they represent 16-fold increases or decreases in expression.
1333
:code:`ashr` and :code:`apeglm` can adapt to the scale of the entirety of LFCs,
1334
while not over-shrinking the few largest LFCs. The potential for over-shrinking
1335
LFC is also why DESeq2's shrinkage estimator is not recommended for designs with
1336
interaction terms.
1337
1338
What are *s-values*? This quantity proposed by [Stephens2016]_ gives the
1339
estimated rate of *false sign* among genes with equal or smaller *s*-value.
1340
[Stephens2016]_ points out they are analogous to the *q*-value of [Storey2003]_.
1341
The *s*-value has a desirable property relative to the adjusted *p*-value or
1342
*q*-value, in that it does not require supposing there be a set of null genes
1343
with LFC = 0 (the most commonly used null hypothesis). Therefore, it can be
1344
benchmarked by comparing estimated LFC and *s*-value to the "true LFC" in a
1345
setting where this can be reasonably defined. For these estimated probabilities
1346
to be accurate, the scale of the prior needs to match the scale of the
1347
distribution of effect sizes, and so the original DESeq2 shrinkage method is not
1348
really compatible with computing *s*-values.
1349
1350
The last four rows explain differences in whether coefficients or contrasts can
1351
have shrinkage applied by the various methods. All three methods can use
1352
:code:`coef` with either the name or numeric index from
1353
:code:`dds.resultsNames()` to specify which coefficient to shrink.  All three
1354
methods allow for a positive :code:`lfcThreshold` to be specified, in which
1355
case, they will return *p*-values and adjusted *p*-values or *s*-values for the
1356
LFC being greater in absolute value than the threshold (see :ref:`this
1357
section<thresh>` for :code:`normal`).  For :code:`apeglm` and :code:`ashr`,
1358
setting a threshold means that the *s*-values will give the "false sign or
1359
small" rate (FSOS) among genes with equal or small *s*-value.  We found FSOS to
1360
be a useful description for when the LFC is either the wrong sign or less than
1361
the threshold distance from 0.
1362
1363
TODO
1364
1365
..
1366
  ```{r apeThresh}
1367
  resApeT <- lfcShrink(dds, coef=2, type="apeglm", lfcThreshold=1)
1368
  plotMA(resApeT, ylim=c(-3,3), cex=.8)
1369
  abline(h=c(-1,1), col="dodgerblue", lwd=2)
1370
  ```
1371
1372
  ```{r ashThresh}
1373
  resAshT <- lfcShrink(dds, coef=2, type="ashr", lfcThreshold=1)
1374
  plotMA(resAshT, ylim=c(-3,3), cex=.8)
1375
  abline(h=c(-1,1), col="dodgerblue", lwd=2)
1376
  ```
1377
1378
Finally, :code:`normal` and :code:`ashr` can be used with arbitrary specified
1379
:code:`contrast` because :code:`normal` shrinks multiple coefficients
1380
simultaneously (:code:`apeglm` does not), and because :code:`ashr` does not
1381
estimate a vector of coefficients but models estimated coefficients and their
1382
standard errors from upstream methods (here, DESeq2's MLE).  Although
1383
:code:`apeglm` cannot be used with :code:`contrast`, we note that many designs
1384
can be easily rearranged such that what was a contrast becomes its own
1385
coefficient. In this case, the dispersion does not have to be estimated again,
1386
as the designs are equivalent, up to the meaning of the coefficients. Instead,
1387
one need only run :code:`nbinomWaldTest` to re-estimate MLE coefficients --
1388
these are necessary for :code:`apeglm` -- and then run :code:`lfcShrink`
1389
specifying the coefficient of interest in `dds.resultsNames()`.
1390
1391
We give some examples below of producing equivalent designs for use with
1392
:code:`coef`. We show how the coefficients change with :code:`patsy.dmatrix`,
1393
but the user would, for example, either change the levels of `dds.obs.condition`
1394
or replace the :attr:`.DESeqDataSet.design`, then run :func:`nbinomWaldTest`
1395
followed by :func:`lfcShrink`.
1396
1397
Three groups:
1398
1399
.. repl::
1400
   condition = Factor(["A", "A", "B", "B", "C", "C"])
1401
   patsy.dmatrix("~condition")
1402
   # to compare C vs B, make B the reference level,
1403
   # and select the last coefficient
1404
   condition = condition.reorder_categories(["B", "A", "C"])
1405
   patsy.dmatrix("~condition")
1406
1407
Three groups, compare condition effects:
1408
1409
.. repl::
1410
   grp = Factor([1,1,1,1,2,2,2,2,3,3,3,3])
1411
   cnd = Factor(["A","A","B","B","A","A","B","B","A","A","B","B"])
1412
   patsy.dmatrix("~ grp + cnd + grp:cnd")
1413
   # to compare condition effect in group 3 vs 2,
1414
   # make group 2 the reference level,
1415
   # and select the last coefficient
1416
   grp = grp.reorder_categories([2,1,3])
1417
   patsy.dmatrix("~ grp + cnd + grp:cnd")
1418
1419
Two groups, two individuals per group, compare within-individual condition
1420
effects:
1421
1422
.. repl::
1423
   grp = Factor([1,1,1,1,2,2,2,2])
1424
   ind = Factor([1,1,2,2,1,1,2,2])
1425
   cnd = Factor(["A","B","A","B","A","B","A","B"])
1426
   patsy.dmatrix("~ grp + grp:ind + grp:cnd")
1427
   # to compare condition effect across group,
1428
   # add a main effect for 'cnd',
1429
   # and select the last coefficient
1430
   patsy.dmatrix("~ grp + cnd + grp:ind + grp:cnd")
1431
1432
.. _singlecell:
1433
1434
Recommendations for single-cell analysis
1435
----------------------------------------
1436
1437
The DESeq2 developers and collaborating groups have published recommendations
1438
for the best use of DESeq2 for single-cell datasets, which have been described
1439
first in [Berge2018]_. Default values for DESeq2 were designed for bulk data and
1440
will not be appropriate for single-cell datasets. These settings and additional
1441
improvements have also been tested subsequently and published in [Zhu2018]_ and
1442
[AhlmannEltze2020]_.
1443
1444
* Use :code:`test="LRT"` for significance testing when working with single-cell
1445
  data, over the Wald test. This has been observed across multiple single-cell
1446
  benchmarks.
1447
* Set the following :func:`DESeq` arguments to these values: :code:`useT=TRUE`,
1448
  :code:`minmu=1e-6`, and :code:`minReplicatesForReplace=Inf`.  The default
1449
  setting of :code:`minmu` was benchmarked on bulk RNA-seq and is not
1450
  appropriate for single cell data when the expected count is often much less
1451
  than 1.
1452
* The default size factors are not optimal for single cell count matrices,
1453
  instead consider setting :attr:`.DESeqDataSet.sizeFactors` from
1454
  :code:`scran::computeSumFactors`.
1455
* One important concern for single-cell data analysis is the size of the
1456
  datasets and associated processing time. To address the speed concerns, DESeq2
1457
  provides an interface to `glmGamPoi
1458
  <https://bioconductor.org/packages/glmGamPoi/>`_, which implements faster
1459
  dispersion and parameter estimation routines for single-cell data
1460
  [AhlmannEltze2020]_. To use this feature, set :code:`fitType = "glmGamPoi"`.
1461
  Alternatively, one can use *glmGamPoi* as a standalone package.  This
1462
  provides the additional option to process data on-disk if the full dataset
1463
  does not fit in memory, a quasi-likelihood framework for differential testing,
1464
  and the ability to form pseudobulk samples (more details how to use
1465
  *glmGamPoi* are in its `README <https://github.com/const-ae/glmGamPoi>`_).
1466
1467
Optionally, one can consider using the `zinbwave
1468
<https://bioconductor.org/packages/zinbwave>`_ package to directly model the
1469
zero inflation of the counts, and take account of these in the DESeq2 model.
1470
This allows for the DESeq2 inference to apply to the part of the data which is
1471
not due to zero inflation. Not all single cell datasets exhibit zero inflation,
1472
and instead may just reflect low conditional estimated counts (conditional on
1473
cell type or cell state). There is example code for combining *zinbwave* and
1474
*DESeq2* package functions in the *zinbwave* vignette. We also have an example
1475
of ZINB-WaVE + DESeq2 integration using the `splatter
1476
<https://bioconductor.org/packages/splatter>`_ package for simulation at the
1477
`zinbwave-deseq2 <https://github.com/mikelove/zinbwave-deseq2>`_ GitHub
1478
repository.
1479
1480
.. _outlier:
1481
1482
Approach to count outliers
1483
--------------------------
1484
1485
RNA-seq data sometimes contain isolated instances of very large counts that are
1486
apparently unrelated to the experimental or study design, and which may be
1487
considered outliers. There are many reasons why outliers can arise, including
1488
rare technical or experimental artifacts, read mapping problems in the case of
1489
genetically differing samples, and genuine, but rare biological events. In many
1490
cases, users appear primarily interested in genes that show a consistent
1491
behavior, and this is the reason why by default, genes that are affected by such
1492
outliers are set aside by DESeq2, or if there are sufficient samples, outlier
1493
counts are replaced for model fitting.  These two behaviors are described below.
1494
1495
The :func:`DESeq` function calculates, for every gene and for every sample, a
1496
diagnostic test for outliers called *Cook's distance*. Cook's distance is a
1497
measure of how much a single sample is influencing the fitted coefficients for a
1498
gene, and a large value of Cook's distance is intended to indicate an outlier
1499
count.  The Cook's distances are stored as a matrix available in
1500
:code:`dds.layers["cooks"]`.
1501
1502
The :meth:`.DESeqDataSet.results` function automatically flags genes which
1503
contain a Cook's distance above a cutoff for samples which have 3 or more
1504
replicates.  The *p*-values and adjusted *p*-values for these genes are set to
1505
:code:`NA`.  At least 3 replicates are required for flagging, as it is difficult
1506
to judge which sample might be an outlier with only 2 replicates.  This
1507
filtering can be turned off with :code:`dds.results(cooksCutoff=FALSE)`.
1508
1509
With many degrees of freedom -- *i.e.* many more samples than number of
1510
parameters to be estimated -- it is undesirable to remove entire genes from the
1511
analysis just because their data include a single count outlier. When there are
1512
7 or more replicates for a given sample, the :func:`DESeq` function will
1513
automatically replace counts with large Cook's distance with the trimmed mean
1514
over all samples, scaled up by the size factor or normalization factor for that
1515
sample. This approach is conservative, it will not lead to false positives, as
1516
it replaces the outlier value with the value predicted by the null hypothesis.
1517
This outlier replacement only occurs when there are 7 or more replicates, and
1518
can be turned off with :code:`DESeq(dds, minReplicatesForReplace=Inf)`.
1519
1520
The default Cook's distance cutoff for the two behaviors described above depends
1521
on the sample size and number of parameters to be estimated. The default is to
1522
use the 99% quantile of the :math:`F(p,m-p)` distribution (with :math:`p` the
1523
number of parameters including the intercept and :math:`m` the number of
1524
samples).  The default for gene flagging can be modified using the
1525
:code:`cooksCutoff` argument to the :meth:`.DESeqDataSet.results` function.  For
1526
outlier replacement, :func:`DESeq` preserves the original counts in
1527
:code:`dds.counts()` saving the replacement counts as a matrix named
1528
:code:`"replaceCounts"` in :code:`dds.layers`.  Note that with continuous
1529
variables in the design, outlier detection and replacement is not automatically
1530
performed, as our current methods involve a robust estimation of within-group
1531
variance which does not extend easily to continuous covariates. However, users
1532
can examine the Cook's distances in :code:`dds.layers["cooks"]`, in order to
1533
perform manual visualization and filtering if necessary.
1534
1535
.. note::
1536
   If there are very many outliers (*e.g.* many hundreds or thousands) reported
1537
   by :code:`res.summary()`, one might consider further exploration to see if a
1538
   single sample or a few samples should be removed due to low quality.  The
1539
   automatic outlier filtering/replacement is most useful in situations which
1540
   the number of outliers is limited. When there are thousands of reported
1541
   outliers, it might make more sense to turn off the outlier
1542
   filtering/replacement (:func:`DESeq` with
1543
   :code:`minReplicatesForReplace=np.inf` and :meth:`.DESeqDataSet.results` with
1544
   :code:`cooksCutoff=FALSE`) and perform manual inspection:
1545
1546
   - first it would be advantageous to make a PCA plot as described above to
1547
     spot individual sample outliers;
1548
   - second, one can make a boxplot of the Cook's distances to see if one sample
1549
     is consistently higher than others (here this is not the case):
1550
1551
   .. repl::
1552
      sns.boxplot(pd.DataFrame(np.log10(dds.layers["cooks"].T),
1553
                               columns=dds.obs_names))
1554
      plt.xticks(rotation=25)
1555
      plt.show()
1556
1557
Dispersion plot and fitting alternatives
1558
----------------------------------------
1559
1560
Plotting the dispersion estimates is a useful diagnostic. The dispersion plot
1561
below is typical, with the final estimates shrunk from the gene-wise estimates
1562
towards the fitted estimates. Some gene-wise estimates are flagged as outliers
1563
and not shrunk towards the fitted value, (this outlier detection is described in
1564
the manual page for :func:`estimateDispersionsMAP`).  The amount
1565
of shrinkage can be more or less than seen here, depending on the sample size,
1566
the number of coefficients, the row mean and the variability of the gene-wise
1567
estimates.
1568
1569
.. repl::
1570
   dds.plotDispEsts()
1571
1572
Local or mean dispersion fit
1573
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1574
1575
A local smoothed dispersion fit is automatically substituted in the case that
1576
the parametric curve does not fit the observed dispersion mean relationship.
1577
This can be prespecified by providing the argument :code:`fitType="local"` to
1578
either :func:`DESeq` or :meth:`.DESeqDataSet.estimateDispersions`.
1579
Additionally, using the mean of gene-wise disperion estimates as the fitted
1580
value can be specified by providing the argument :code:`fitType="mean"`.
1581
1582
Supply a custom dispersion fit
1583
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1584
1585
Any fitted values can be provided during dispersion estimation, using the
1586
lower-level functions described in the manual page for
1587
:func:`estimateDispersionsGeneEst`. In the code chunk below, we store the
1588
gene-wise estimates which were already calculated and saved in the metadata
1589
column :code:`dispGeneEst`. Then we calculate the median value of the dispersion
1590
estimates above a threshold, and save these values as the fitted dispersions,
1591
using the replacement function for :attr:`.DESeqDataSet.dispersionFunction`. In
1592
the last line, the function :func:`estimateDispersionsMAP`, uses the fitted
1593
dispersions to generate maximum *a posteriori* (MAP) estimates of dispersion:
1594
1595
.. repl::
1596
   ddsCustom = dds.copy()
1597
   useForMedian = ddsCustom.var["dispGeneEst"] > 1e-7
1598
   medianDisp = np.nanmedian(ddsCustom.var["dispGeneEst"][useForMedian])
1599
   ddsCustom.setDispFunction(lambda mu: medianDisp)
1600
   ddsCustom = estimateDispersionsMAP(ddsCustom)
1601
1602
.. _indfilt:
1603
1604
Independent filtering of results
1605
--------------------------------
1606
1607
The :meth:`.DESeqDataSet.results` function of the DESeq2 package performs
1608
independent filtering by default using the mean of normalized counts as a filter
1609
statistic.  A threshold on the filter statistic is found which optimizes the
1610
number of adjusted *p*-values lower than a significance level :code:`alpha` (we
1611
use the standard variable name for significance level, though it is unrelated to
1612
the dispersion parameter :math:`\alpha`).  The theory behind independent
1613
filtering is discussed in greater detail :ref:`below<indfilttheory>`. The
1614
adjusted *p*-values for the genes which do not pass the filter threshold are set
1615
to :code:`NA`.
1616
1617
The default independent filtering is performed using the
1618
:func:`~.results.filtered_p` function of the `genefilter
1619
<http://bioconductor.org/packages/genefilter>`_ package, and all of the
1620
arguments of :func:`~.results.filtered_p` can be passed to the
1621
:meth:`.DESeqDataSet.results` function.  The filter threshold value and the
1622
number of rejections at each quantile of the filter statistic are available as
1623
metadata of the object returned by :meth:`.DESeqDataSet.results`.
1624
1625
For example, we can visualize the optimization by plotting the
1626
:code:`filterNumRej` attribute of the results object. The
1627
:meth:`.DESeqDataSet.results` function maximizes the number of rejections
1628
(adjusted *p*-value less than a significance level), over the quantiles of a
1629
filter statistic (the mean of normalized counts). The threshold chosen (vertical
1630
line) is the lowest quantile of the filter for which the number of rejections is
1631
within 1 residual standard deviation to the peak of a curve fit to the number of
1632
rejections over the filter quantiles:
1633
1634
.. repl::
1635
   res.alpha
1636
   res.filterThreshold
1637
   plt.plot(res.filterNumRej["theta"],
1638
            res.filterNumRej["numRej"],
1639
            'o', color="black")
1640
   plt.xlabel("quantiles of filter")
1641
   plt.ylabel("number of rejections")
1642
   plt.plot(res.lo_fit[:,0], res.lo_fit[:,1], color="red")
1643
   plt.axvline(x=res.filterTheta, color="black")
1644
   plt.show()
1645
1646
Independent filtering can be turned off by setting :code:`independentFiltering`
1647
to :code:`FALSE`.
1648
1649
.. repl::
1650
   resNoFilt = dds.results(independentFiltering=False)
1651
   df = pd.DataFrame({"filtering": (res.padj < .1),
1652
                      "noFiltering": (resNoFilt.padj < .1)})
1653
   df.groupby(["filtering", "noFiltering"]).size()
1654
1655
1656
.. _thresh:
1657
1658
Tests of log2 fold change above or below a threshold
1659
----------------------------------------------------
1660
1661
It is also possible to provide thresholds for constructing Wald tests of
1662
significance. Two arguments to the :meth:`.DESeqDataSet.results` function allow
1663
for threshold-based Wald tests: :code:`lfcThreshold`, which takes a numeric of a
1664
non-negative threshold value, and :code:`altHypothesis`, which specifies the
1665
kind of test.  Note that the *alternative hypothesis* is specified by the user,
1666
*i.e.* those genes which the user is interested in finding, and the test
1667
provides *p*-values for the null hypothesis, the complement of the set defined
1668
by the alternative. The :code:`altHypothesis` argument can take one of the
1669
following four values, where :math:`\beta` is the log2 fold change specified by
1670
the :code:`name` argument, and :math:`x` is the :code:`lfcThreshold`.
1671
1672
* `greaterAbs` - :math:`|\beta| > x` - tests are two-tailed
1673
* `lessAbs` - :math:`|\beta| < x` - *p*-values are the maximum of the upper and lower tests
1674
* `greater` - :math:`\beta > x`
1675
* `less` - :math:`\beta < -x`
1676
1677
The four possible values of :code:`altHypothesis` are demonstrated in the
1678
following code and visually by MA-plots in the following figures.
1679
1680
.. repl::
1681
   ylim = [-2.5, 2.5]
1682
   resGA = dds.results(lfcThreshold=.5, altHypothesis="greaterAbs")
1683
   resLA = dds.results(lfcThreshold=.5, altHypothesis="lessAbs")
1684
   resG = dds.results(lfcThreshold=.5, altHypothesis="greater")
1685
   resL = dds.results(lfcThreshold=.5, altHypothesis="less")
1686
1687
   def drawlines(ax):
1688
     ax.axhline(y=-.5, c="dodgerblue")
1689
     ax.axhline(y=.5, c="dodgerblue")
1690
     plt.show()
1691
1692
   drawlines(resGA.plotMA(ylim=ylim))
1693
   drawlines(resLA.plotMA(ylim=ylim))
1694
   drawlines(resG.plotMA(ylim=ylim))
1695
   drawlines(resL.plotMA(ylim=ylim))
1696
1697
.. _access:
1698
1699
Access to all calculated values
1700
-------------------------------
1701
1702
All row-wise calculated values (intermediate dispersion calculations,
1703
coefficients, standard errors, etc.) are stored in the
1704
:class:`~DESeqDataSet.DESeqDataSet` object, *e.g.* :code:`dds` in this vignette.
1705
These values are accessible by inspecting the :code:`var` attribute of
1706
:code:`dds`.  Descriptions of the columns are accessible through the
1707
:code:`description` attribute:
1708
1709
.. repl::
1710
   dds.var.iloc[:4,:4]
1711
   dds.var.columns
1712
   dds.var.description
1713
1714
The mean values :math:`\mu_{ij} = s_j q_{ij}` and the Cook's distances for each
1715
gene and sample are stored as matrices in the :attr:`layers` attribute:
1716
1717
.. repl::
1718
   dds.layers["mu"]
1719
   dds.layers["cooks"]
1720
1721
The dispersions :math:`\alpha_i` can be accessed with the
1722
:attr:`.DESeqDataSet.dispersions` attribute:
1723
1724
.. repl::
1725
   dds.dispersions.head()
1726
   dds.var["dispersion"].head()
1727
1728
The size factors :math:`s_j` are accessible via the
1729
:attr:`.DESeqDataSet.sizeFactors` attribute:
1730
1731
.. repl::
1732
   dds.sizeFactors
1733
1734
For advanced users, we also include a convenience function :func:`coef` for
1735
extracting the matrix :math:`[\beta_{ir}]` for all genes :math:`i` and model
1736
coefficients :math:`r`.  This function can also return a matrix of standard
1737
errors, see the documentation of :func:`coef`.  The columns of this matrix
1738
correspond to the effects returned by :meth:`.DESeqDataSet.resultsNames`.  Note
1739
that the :meth:`.DESeqDataSet.results` function is best for building results
1740
tables with *p*-values and adjusted *p*-values:
1741
1742
.. repl::
1743
   coef(dds).head()
1744
1745
The beta prior variance :math:`\sigma_r^2` is stored as an attribute of the
1746
:class:`~DESeqDataSet.DESeqDataSet`:
1747
1748
.. repl::
1749
   dds.betaPriorVar
1750
1751
General information about the prior used for log fold change shrinkage is also
1752
stored in a slot of the :class:`.DESeqResults` object. This would also contain
1753
information about what other packages were used for log2 fold change shrinkage:
1754
1755
.. repl::
1756
   resLFC.priorInfo
1757
   resNorm.priorInfo
1758
   resAsh.priorInfo
1759
1760
The dispersion prior variance :math:`\sigma_d^2` is stored as an attribute of
1761
the dispersion function:
1762
1763
.. repl::
1764
   dds.dispersionFunction
1765
   dds.dispersionFunction.dispPriorVar
1766
1767
..
1768
  The version of DESeq2 which was used to construct the
1769
  :class:`~DESeqDataSet.DESeqDataSet` object, or the version used when
1770
  :func:`DESeq` was run, is stored here:
1771
1772
  .. repl::
1773
     dds.version
1774
1775
1776
Sample-/gene-dependent normalization factors
1777
--------------------------------------------
1778
1779
In some experiments, there might be gene-dependent dependencies which vary
1780
across samples. For instance, GC-content bias or length bias might vary across
1781
samples coming from different labs or processed at different times. We use the
1782
terms *normalization factors* for a gene x sample matrix, and *size factors* for
1783
a single number per sample.  Incorporating normalization factors, the mean
1784
parameter :math:`\mu_{ij}` becomes:
1785
1786
.. math::
1787
   \mu_{ij} = NF_{ij} q_{ij}
1788
1789
with normalization factor matrix :math:`NF` having the same dimensions as the
1790
counts matrix :math:`K`. This matrix can be incorporated as shown below. We
1791
recommend providing a matrix with gene-wise geometric means of 1, so that the
1792
mean of normalized counts for a gene is close to the mean of the unnormalized
1793
counts.  This can be accomplished by dividing out the current gene geometric
1794
means:
1795
1796
>>> normFactors = normFactors / np.exp(np.mean(np.log(normFactors), axis=0))
1797
>>> dds.normalizationFactors = normFactors
1798
1799
These steps then replace :meth:`.DESeqDataSet.estimateSizeFactors` which occurs
1800
within the :func:`DESeq` function. The :func:`DESeq` function will look for
1801
pre-existing normalization factors and use these in the place of size factors
1802
(and a message will be printed confirming this).
1803
1804
..
1805
  The methods provided by the `cqn <http://bioconductor.org/packages/cqn>`_ or
1806
  `EDASeq <http://bioconductor.org/packages/EDASeq>`_ packages can help correct
1807
  for GC or length biases. They both describe in their vignettes how to create
1808
  matrices which can be used by DESeq2.  From the formula above, we see that
1809
  normalization factors should be on the scale of the counts, like size factors,
1810
  and unlike offsets which are typically on the scale of the predictors (*i.e.*
1811
  the logarithmic scale for the negative binomial GLM). At the time of writing,
1812
  the transformation from the matrices provided by these packages should be:
1813
1814
  ```{r offsetTransform, eval=FALSE}
1815
  cqnOffset <- cqnObject$glm.offset
1816
  cqnNormFactors <- exp(cqnOffset)
1817
  EDASeqNormFactors <- exp(-1 * EDASeqOffset)
1818
  ```
1819
1820
"Model matrix not full rank"
1821
----------------------------
1822
1823
While most experimental designs run easily using design formula, some design
1824
formulas can cause problems and result in the :func:`DESeq` function returning
1825
an error with the text: "the model matrix is not full rank, so the model cannot
1826
be fit as specified."  There are two main reasons for this problem: either one
1827
or more columns in the model matrix are linear combinations of other columns, or
1828
there are levels of factors or combinations of levels of multiple factors which
1829
are missing samples. We address these two problems below and discuss possible
1830
solutions.
1831
1832
Linear combinations
1833
^^^^^^^^^^^^^^^^^^^
1834
1835
The simplest case is the linear combination, or linear dependency problem, when
1836
two variables contain exactly the same information, such as in the following
1837
sample table. The software cannot fit an effect for :code:`batch` and
1838
:code:`condition`, because they produce identical columns in the model matrix.
1839
This is also referred to as *perfect confounding*. A unique solution of
1840
coefficients (the :math:`\beta_i` in the formula :ref:`below<theory>`) is not
1841
possible:
1842
1843
.. repl::
1844
   pd.DataFrame({"batch": Factor([1,1,2,2]),
1845
                 "condition": Factor(["A", "A", "B", "B"])})
1846
1847
Another situation which will cause problems is when the variables are not
1848
identical, but one variable can be formed by the combination of other factor
1849
levels. In the following example, the effect of batch 2 vs 1 cannot be fit
1850
because it is identical to a column in the model matrix which represents the
1851
condition C vs A effect:
1852
1853
.. repl::
1854
   pd.DataFrame({"batch": Factor([1,1,1,1,2,2]),
1855
                 "condition": Factor(["A", "A", "B", "B", "C", "C"])})
1856
1857
In both of these cases above, the batch effect cannot be fit and must be removed
1858
from the model formula. There is just no way to tell apart the condition effects
1859
and the batch effects. The options are either to assume there is no batch effect
1860
(which we know is highly unlikely given the literature on batch effects in
1861
sequencing datasets) or to repeat the experiment and properly balance the
1862
conditions across batches.  A balanced design would look like:
1863
1864
.. repl::
1865
   pd.DataFrame({"batch": Factor([1,1,1,2,2,2]),
1866
                 "condition": Factor(["A", "B", "C", "A", "B", "C"])})
1867
1868
.. _nested-div:
1869
1870
Group-specific condition effects, individuals nested within groups
1871
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1872
1873
Finally, there is a case where we *can* in fact perform inference, but we may
1874
need to re-arrange terms to do so. Consider an experiment with grouped
1875
individuals, where we seek to test the group-specific effect of a condition or
1876
treatment, while controlling for individual effects. The individuals are nested
1877
within the groups: an individual can only be in one of the groups, although each
1878
individual has one or more observations across condition.
1879
1880
An example of such an experiment is below:
1881
1882
.. repl::
1883
   colData = pd.DataFrame({"grp": Factor(["X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"]),
1884
                           "ind": Factor([1,1,2,2,3,3,4,4,5,5,6,6]),
1885
                           "cnd": Factor(["A","B","A","B","A","B","A","B","A","B","A","B"])})
1886
   colData
1887
1888
Note that individual (:code:`ind`) is a *factor* not a numeric. This is very
1889
important.
1890
1891
We have two groups of samples X and Y, each with three distinct individuals
1892
(labeled here 1-6). For each individual, we have conditions A and B (for
1893
example, this could be control and treated).
1894
1895
This design can be analyzed by DESeq2 but requires a bit of refactoring in order
1896
to fit the model terms. Here we will use a trick described in the `edgeR
1897
<http://bioconductor.org/packages/edgeR>`_ user guide, from the section
1898
*Comparisons Both Between and Within Subjects*.  If we try to analyze with a
1899
formula such as, :code:`"~ ind + grp*cnd"`, we will obtain an error, because the
1900
effect for group is a linear combination of the individuals.
1901
1902
However, the following steps allow for an analysis of group-specific condition
1903
effects, while controlling for differences in individual.  For object
1904
construction, you can use a simple design, such as :code:`"~ ind + cnd"`, as
1905
long as you remember to replace it before running :func:`DESeq`.  Then add a
1906
column :code:`ind_n` which distinguishes the individuals nested within a group.
1907
Here, we add this column to :code:`colData`, but in practice you would add this
1908
column to :code:`dds`:
1909
1910
.. repl::
1911
   colData["ind_n"] = Factor([1,1,2,2,3,3,1,1,2,2,3,3])
1912
   colData
1913
1914
Now we can reassign our :class:`~DESeqDataSet.DESeqDataSet` a design of
1915
:code:`"~ grp + grp:ind.n + grp:cnd"`, before we call :func:`DESeq`. This new
1916
design will result in the following model matrix:
1917
1918
.. repl::
1919
   patsy.dmatrix("~ grp + grp:ind_n + grp:cnd", colData)
1920
1921
1922
Note that, if you have unbalanced numbers of individuals in the two groups, you
1923
will have zeros for some of the interactions between :code:`grp` and
1924
:code:`ind.n`. You can remove these columns manually from the model matrix and
1925
pass the corrected model matrix to the :code:`full` argument of the
1926
:func:`DESeq` function. See example code in the next section. Note that, in this
1927
case, you will not be able to create the :class:`~DESeqDataSet.DESeqDataSet`
1928
with the design that leads to less than full rank model matrix. You can either
1929
use :code:`design="~1"` when creating the dataset object, or you can provide the
1930
corrected model matrix to the :attr:`.DESeqDataSet.design` attribute of the
1931
dataset from the start.
1932
1933
Above, the terms :code:`grpX.cndB` and :code:`grpY.cndB` give the group-specific
1934
condition effects, in other words, the condition B vs A effect for group X
1935
samples, and likewise for group Y samples. These terms control for all of the
1936
six individual effects.  These group-specific condition effects can be extracted
1937
using :meth:`.DESeqDataSet.results` with the :code:`name` argument.
1938
1939
Furthermore, :code:`grpX.cndB` and :code:`grpY.cndB` can be contrasted using the
1940
:code:`contrast` argument, in order to test if the condition effect is different
1941
across group:
1942
1943
>>> dds.results(contrast=("grpY.cndB", "grpX.cndB"))
1944
1945
1946
Levels without samples
1947
^^^^^^^^^^^^^^^^^^^^^^
1948
1949
The function :func:`patsy.dmatrix` will produce a column of zeros if a level is
1950
missing from a factor or a combination of levels is missing from an interaction
1951
of factors. The solution to the first case is to call
1952
:code:`remove_unused_categories` on the column, which will remove levels without
1953
samples. This was shown in the beginning of this vignette.
1954
1955
The second case is also solvable, by manually editing the model matrix, and then
1956
providing this to :func:`DESeq`. Here we construct an example dataset to
1957
illustrate it:
1958
1959
.. repl::
1960
   group = Factor([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3])
1961
   condition = Factor(["A","A","B","B","C","C","A","A","B","B","C","C","A","A","B","B","C","C"])
1962
   d = pd.DataFrame({"group": group, "condition": condition})[:16]
1963
   d
1964
1965
Note that if we try to estimate all interaction terms, we introduce a column
1966
with all zeros, as there are no condition C samples for group 3
1967
(:code:`np.asarray` is used to display the matrix):
1968
1969
.. repl::
1970
   m1 = patsy.dmatrix("~condition*group", d)
1971
   m1.design_info.column_names
1972
   np.asarray(m1)
1973
   all_zero = (m1 == 0).all(axis=0)
1974
   all_zero
1975
1976
We can remove this column like so:
1977
1978
.. repl::
1979
   m1 = m1[:,~all_zero]
1980
   m1
1981
1982
1983
Now this matrix :code:`m1` can be provided to the :code:`full` argument of
1984
:func:`DESeq`.  For a likelihood ratio test of interactions, a model matrix
1985
using a reduced design such as :code:`"~ condition + group"` can be given to the
1986
:code:`reduced` argument. Wald tests can also be generated instead of the
1987
likelihood ratio test, but for user-supplied model matrices, the argument
1988
:code:`betaPrior` must be set to :code:`False`.
1989
1990
.. _theory:
1991
1992
Theory behind DESeq2
1993
====================
1994
1995
The DESeq2 model
1996
----------------
1997
1998
The DESeq2 model and all the steps taken in the software are described in detail
1999
in our publication [Love2014]_, and we include the formula and descriptions in
2000
this section as well.  The differential expression analysis in DESeq2 uses a
2001
generalized linear model of the form:
2002
2003
.. math::
2004
   K_{ij} \sim \textrm{NB}(\mu_{ij}, \alpha_i)
2005
2006
   \mu_{ij} = s_j q_{ij}
2007
2008
   \log_2(q_{ij}) = x_{j.} \beta_i
2009
2010
where counts :math:`K_{ij}` for gene :math:`i`, sample :math:`j` are modeled
2011
using a negative binomial distribution with fitted mean :math:`\mu_{ij}` and a
2012
gene-specific dispersion parameter :math:`\alpha_i`.  The fitted mean is
2013
composed of a sample-specific size factor :math:`s_j` and a parameter
2014
:math:`q_{ij}` proportional to the expected true concentration of fragments for
2015
sample :math:`j`.  The coefficients :math:`\beta_i` give the log2 fold changes
2016
for gene :math:`i` for each column of the model matrix :math:`X`.  Note that the
2017
model can be generalized to use sample- and gene-dependent normalization factors
2018
:math:`s_{ij}`.
2019
2020
The dispersion parameter :math:`\alpha_i` defines the relationship between the
2021
variance of the observed count and its mean value. In other words, how far do we
2022
expected the observed count will be from the mean value, which depends both on
2023
the size factor :math:`s_j` and the covariate-dependent part :math:`q_{ij}` as
2024
defined above.
2025
2026
.. math::
2027
   \textrm{Var}(K_{ij}) = E[ (K_{ij} - \mu_{ij})^2 ] = \mu_{ij} + \alpha_i \mu_{ij}^2
2028
2029
An option in DESeq2 is to provide maximum *a posteriori* estimates of the log2
2030
fold changes in :math:`\beta_i` after incorporating a zero-centered Normal prior
2031
(:code:`betaPrior`). While previously, these moderated, or shrunken, estimates
2032
were generated by :func:`DESeq` or :func:`nbinomWaldTest` functions, they are
2033
now produced by the :func:`lfcShrink` function.  Dispersions are estimated using
2034
expected mean values from the maximum likelihood estimate of log2 fold changes,
2035
and optimizing the Cox-Reid adjusted profile likelihood, as first implemented
2036
for RNA-seq data in `edgeR <http://bioconductor.org/packages/edgeR>`_
2037
[@CR,edgeR_GLM]. The steps performed by the :func:`DESeq` function are
2038
documented in its manual page; briefly, they are:
2039
2040
  1. estimation of size factors :math:`s_j` by
2041
     :meth:`.DESeqDataSet.estimateSizeFactors`
2042
  2. estimation of dispersion :math:`\alpha_i` by
2043
     :meth:`.DESeqDataSet.estimateDispersions`
2044
  3. negative binomial GLM fitting for :math:`\beta_i` and Wald statistics by
2045
     :func:`nbinomWaldTest`
2046
2047
For access to all the values calculated during these steps, see the section
2048
:ref:`above<access>`.
2049
2050
Changes compared to R DESeq2
2051
----------------------------
2052
2053
This module is a Python port of the R package `DESeq2
2054
<https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_. The port is
2055
based on version 1.39.3, and changes to the R package since this version have
2056
not been reflected in the present module.  Also note that the port is partial:
2057
not all features may have been ported yet.  To help us prioritize our future
2058
focus, please open a "feature request" issue on `inmoose GitHub
2059
repository <https://github.com/epigenelabs/inmoose>`_.
2060
2061
The present page mirrors DESeq2 vignette, and despite our efforts, some code
2062
examples may not accurately reflect the state of the Python module.
2063
2064
The main changes in this module compared to the original DESeq2 package are as
2065
follows:
2066
2067
  - :class:`anndata.AnnData` is used as the superclass for storage of input
2068
    data, intermediate calculations and results.
2069
2070
2071
.. _changes:
2072
2073
Methods changes since the 2014 DESeq2 paper
2074
-------------------------------------------
2075
2076
.. note::
2077
2078
   The changes below are present in the R DESeq2 package, and "we" stands for
2079
   the authors of the R package, not the authors of InMoose.
2080
2081
* In version 1.18 (November 2017), we add two :ref:`alternative shrinkage
2082
  estimators<shrink>`, which can be used via :func:`lfcShrink`: an estimator
2083
  using a t prior from the apeglm packages, and an estimator with a fitted
2084
  mixture of normals prior from the ashr package.
2085
* In version 1.16 (November 2016), the log2 fold change shrinkage is no longer
2086
  default for the :func:`DESeq` and :func:`nbinomWaldTest` functions, by setting
2087
  the defaults of these to :code:`betaPrior=FALSE`, and by introducing a
2088
  separate function :func:`lfcShrink`, which performs log2 fold change shrinkage
2089
  for visualization and ranking of genes.  While for the majority of bulk
2090
  RNA-seq experiments, the LFC shrinkage did not affect statistical testing,
2091
  DESeq2 has become used as an inference engine by a wider community, and
2092
  certain sequencing datasets show better performance with the testing separated
2093
  from the use of the LFC prior. Also, the separation of LFC shrinkage to a
2094
  separate function :func:`lfcShrink` allows for easier methods development of
2095
  alternative effect size estimators.
2096
* A small change to the independent filtering routine: instead of taking the
2097
  quantile of the filter (the mean of normalized counts) which directly
2098
  *maximizes* the number of rejections, the threshold chosen is the lowest
2099
  quantile of the filter for which the number of rejections is close to the peak
2100
  of a curve fit to the number of rejections over the filter quantiles.  "Close
2101
  to" is defined as within 1 residual standard deviation.  This change was
2102
  introduced in version 1.10 (October 2015).
2103
* For the calculation of the beta prior variance, instead of matching the
2104
  empirical quantile to the quantile of a Normal distribution, DESeq2 now uses
2105
  the weighted quantile function of the Hmisc package. The weighting is
2106
  described in the manual page for :func:`nbinomWaldTest`.  The weights are the
2107
  inverse of the expected variance of log counts (as used in the diagonals of
2108
  the matrix :math:`W` in the GLM). The effect of the change is that the
2109
  estimated prior variance is robust against noisy estimates of log fold change
2110
  from genes with very small counts. This change was introduced in version 1.6
2111
  (October 2014).
2112
2113
Count outlier detection
2114
-----------------------
2115
2116
DESeq2 relies on the negative binomial distribution to make estimates and
2117
perform statistical inference on differences.  While the negative binomial is
2118
versatile in having a mean and dispersion parameter, extreme counts in
2119
individual samples might not fit well to the negative binomial. For this reason,
2120
we perform automatic detection of count outliers. We use Cook's distance, which
2121
is a measure of how much the fitted coefficients would change if an individual
2122
sample were removed [Cook1977]_. For more on the implementation of Cook's
2123
distance see the manual page for the :meth:`.DESeqDataSet.results` function.
2124
Below we plot the maximum value of Cook's distance for each row over the rank of
2125
the test statistic to justify its use as a filtering criterion.
2126
2127
.. repl::
2128
   W = res["stat"]
2129
   maxCooks = dds.layers["cooks"].max(axis=0)
2130
   idx = ~np.isnan(W)
2131
   plt.scatter(np.argsort(W[idx]), maxCooks[idx], c="grey")
2132
   plt.xlabel("rank of Wald statistic")
2133
   plt.ylabel("maximum Cook's distance per gene")
2134
   m = dds.n_obs
2135
   p = 3
2136
   plt.axhline(y = scipy.stats.f.ppf(.99, p, m-p))
2137
   plt.show()
2138
2139
2140
Contrasts
2141
---------
2142
2143
Contrasts can be calculated for a :class:`~DESeqDataSet.DESeqDataSet` object for
2144
which the GLM coefficients have already been fit using the Wald test steps
2145
(:func:`DESeq` with :code:`test="Wald"` or using :func:`nbinomWaldTest`).  The
2146
vector of coefficients :math:`\beta` is left multiplied by the contrast vector
2147
:math:`c` to form the numerator of the test statistic. The denominator is formed
2148
by multiplying the covariance matrix :math:`\Sigma` for the coefficients on
2149
either side by the contrast vector :math:`c`. The square root of this product is
2150
an estimate of the standard error for the contrast. The contrast statistic is
2151
then compared to a Normal distribution as are the Wald statistics for the DESeq2
2152
package.
2153
2154
.. math::
2155
   W = \frac{c^t \beta}{\sqrt{c^t \Sigma c}}
2156
2157
2158
Expanded model matrices
2159
-----------------------
2160
2161
For the specific combination of :func:`lfcShrink` with the type :code:`normal`
2162
and using :code:`contrast`, DESeq2 uses *expanded model matrices* to produce
2163
shrunken log2 fold change estimates where the shrinkage is independent of the
2164
choice of reference level. In all other cases, DESeq2 uses standard model
2165
matrices, as produced by :func:`patsy.dmatrix`.  The expanded model matrices
2166
differ from the standard model matrices, in that they have an indicator column
2167
(and therefore a coefficient) for each level of factors in the design formula in
2168
addition to an intercept. This is described in the DESeq2 paper. Using type
2169
:code:`normal` with :func:`coef` uses standard model matrices, as does the
2170
:code:`apeglm` shrinkage estimator.
2171
2172
.. _indfilttheory:
2173
2174
Independent filtering and multiple testing
2175
------------------------------------------
2176
2177
Filtering criteria
2178
^^^^^^^^^^^^^^^^^^
2179
2180
The goal of independent filtering is to filter out those tests from the
2181
procedure that have no, or little chance of showing significant evidence,
2182
without even looking at their test statistic. Typically, this results in
2183
increased detection power at the same experiment-wide type I error. Here, we
2184
measure experiment-wide type I error in terms of the false discovery rate.
2185
2186
A good choice for a filtering criterion is one that:
2187
2188
  1. is statistically independent from the test statistic under the null
2189
     hypothesis,
2190
  2. is correlated with the test statistic under the alternative, and
2191
  3. does not notably change the dependence structure -- if there is any --
2192
     between the tests that pass the filter, compared to the dependence
2193
     structure between the tests before filtering.
2194
2195
The benefit from filtering relies on property (2), and we will explore it
2196
further below. Its statistical validity relies on property (1) -- which is
2197
simple to formally prove for many combinations of filter criteria with test
2198
statistics -- and (3), which is less easy to theoretically imply from first
2199
principles, but rarely a problem in practice.  We refer to [Bourgon2010]_ for
2200
further discussion of this topic.
2201
2202
A simple filtering criterion readily available in the results object is the mean
2203
of normalized counts irrespective of biological condition, and so this is the
2204
criterion which is used automatically by the :meth:`.DESeqDataSet.results`
2205
function to perform independent filtering.  Genes with very low counts are not
2206
likely to see significant differences typically due to high dispersion. For
2207
example, we can plot the :math:`-\log_{10}` *p*-values from all genes over the
2208
normalized mean counts:
2209
2210
.. repl::
2211
   plt.scatter(res["baseMean"]+1, -np.log10(res["pvalue"]), c="black")
2212
   plt.xscale("log")
2213
   plt.xlabel("mean of normalized counts")
2214
   plt.ylabel("-log[10](pvalue)")
2215
   plt.ylim(0,30)
2216
   plt.show()
2217
2218
2219
Why does it work?
2220
^^^^^^^^^^^^^^^^^
2221
2222
Consider the *p*-value histogram below: it shows how the filtering ameliorates
2223
the multiple testing problem -- and thus the severity of a multiple testing
2224
adjustment -- by removing a background set of hypotheses whose *p*-values are
2225
distributed more or less uniformly in [0,1].
2226
2227
.. repl::
2228
   use = res["baseMean"] > res.filterThreshold
2229
   plt.hist(res["pvalue"][~use], bins=50, color="khaki", label="do not pass")
2230
   plt.hist(res["pvalue"][use], bins=50, color="powderblue", label="pass")
2231
   plt.legend(loc="upper right")
2232
   plt.show()
2233
2234
Histogram of *p*-values for all tests.  The area shaded in blue indicates the
2235
subset of those that pass the filtering, the area in khaki those that do not
2236
pass.
2237
2238
2239
References
2240
==========
2241
2242
.. [AhlmannEltze2020] C. Ahlmann-Eltze, W. Huber. 2020. glmGamPoi: fitting
2243
   Gamma-Poisson generalized linear models on single-cell count data.
2244
   *Bioinformatics*, 36(24). :doi:`10.1093/bioinformatics/btaa1009`
2245
2246
.. [Anders2010] S. Anders and W. Huber. 2010. Differential expression for
2247
   sequence count data. *Genome Biology*, 11:106.
2248
   :doi:`10.1186/gb-2010-11-10-r106`
2249
2250
.. [Berge2018] K. van den Berge, F. Perraudeau, C. Soneson, M.I. Loce, D. Risso,
2251
   J.P. Vert, M.D. Robinson, S. Dudoit, L. Clement. 2018. Observation weights
2252
   unlock bulk RNA-seq tools for zero inflation and single-cell applications.
2253
   *Genome Biology*, 19(24). :doi:`10.1186/s13059-018-1406-4`
2254
2255
.. [Bourgon2010] R. Bourgon, R. Gentleman, W. Huber. 2010. Independent filtering
2256
   increases detection power for high-throughput experiments. *PNAS*,
2257
   107(21):9546-51.  :doi:`10.1073/pnas.0914005107`
2258
2259
.. [Cook1977] R.D. Cook. 1977. Detection of inferential observation in linear
2260
   regression. *Technometrics*, 19(1):15-18. :doi:`10.2307/1268249`
2261
2262
.. [Gerard2020] D. Gerard, M. Stephens. 2020. Empirical Bayes shrinkage and
2263
   false discovery rate estimation, allowing for unwanted variation.
2264
   *Biostatistics*, 21(1):15-32. :doi:`10.1093/biostatistics/kxy029`
2265
2266
.. [Leek2014] J.T. Leek. 2014. svaseq: removing batch effects and other unwanted
2267
   noise from sequencing data. *Nucleic Acids Research*, 42(21).
2268
   :doi:`10.1093/nar/gku864`
2269
2270
.. [Liao2013] Y. Liao, G.K. Smyth, W. Shi. 2013. featureCounts: an efficient
2271
   general purpose program for assigning sequence reads to genomic features.
2272
   *Bioinformatics*, 30(7):923-30. :doi:`10.1093/bioinformatics/btt656`
2273
2274
.. [Love2014] M.I. Love, W. Huber, S. Anders. 2014. Moderated estimation of fold
2275
   change and dispersion for RNA-seq data with DESeq2. *Genome Biology*, 15(12),
2276
   550.  :doi:`10.1186/s13059-014-0550-8`
2277
2278
.. [McCarthy2012] D.J. McCarthy, Y. Chen, G.K. Smyth. 2012. Differential
2279
   expression analysis of multifactor RNA-Seq experiments with respect to
2280
   biological variation. *Nucleic Acids Research* 40, 4288-4297.
2281
   :doi:`10.1093/nar/gks042`
2282
2283
.. [Risso2014] D. Risso, J. Ngai. T.P. Speed, S. Dudoit. 2014. Normalization of
2284
   RNA-seq data using factor analysis of control genes or samples. *Nature
2285
   Biotechnology*, 32(9). :doi:`10.1038/nbt.2931`
2286
2287
.. [Stephens2016] M. Stephens. 2016. False discovery rates: a new deal.
2288
   *Biostatistics*, 18(2). :doi:`10.1093/biostatistics/kxw041`
2289
2290
.. [Storey2003] J. Storey. 2003. The positive false discovery rate: a Bayesian
2291
   interpretation and the *q*-value. *The Annals of Statistics*,
2292
   31(6):2013-2035.
2293
2294
.. [Wu2012] H. Wu, C. Wang, Z. Wu. 2012. A new shrinkage estimator for
2295
   dispersion improves differential detection in RNA-Seq data. *Biostatistics*.
2296
   :doi:`10.1093/biostatistics/kxs033`
2297
2298
.. [Zhu2018] A. Zhu, J.G. Ibrahim, M.I. Love. 2018. Heavy-tailed prior
2299
   distributions for sequence count data: removing the noise and preserving
2300
   large differences. *Bioinformatics*, 35(12), 2084-2092.
2301
   :doi:`10.1093/bioinformatics/bty895`
2302
2303
Code documentation
2304
==================
2305
2306
.. autosummary::
2307
   :toctree: generated/
2308
2309
   ~DESeqDataSet.DESeqDataSet
2310
   ~results.DESeqResults
2311
2312
   DESeq
2313
   collapseReplicates
2314
   estimateBetaPriorVar
2315
   estimateDispersionsFit
2316
   estimateDispersionsGeneEst
2317
   estimateDispersionsMAP
2318
   estimateDispersionsPriorVar
2319
   estimateSizeFactorsForMatrix
2320
   ~results.filtered_p
2321
   lfcShrink
2322
   makeExampleDESeqDataSet
2323
   nbinomLRT
2324
   nbinomWaldTest
2325
   ~results.p_adjust
2326
   replaceOutliers
2327
   varianceStabilizingTransformation
2328
   ~Hmisc.wtd_quantile
2329