|
a |
|
b/docs/source/deseq.rst |
|
|
1 |
====== |
|
|
2 |
deseq2 |
|
|
3 |
====== |
|
|
4 |
|
|
|
5 |
.. currentmodule:: inmoose.deseq2 |
|
|
6 |
|
|
|
7 |
This module is a partial port of the R Bioconductor `DESeq2 package |
|
|
8 |
<https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_ [Love2014]_. |
|
|
9 |
|
|
|
10 |
.. repl:: |
|
|
11 |
from inmoose.deseq2 import * |
|
|
12 |
import patsy |
|
|
13 |
import numpy as np |
|
|
14 |
import matplotlib.pyplot as plt |
|
|
15 |
import pandas as pd |
|
|
16 |
import scipy |
|
|
17 |
|
|
|
18 |
|
|
|
19 |
.. repl-quiet:: |
|
|
20 |
pd.options.display.max_columns = 10 |
|
|
21 |
from matplotlib import rcParams |
|
|
22 |
# repl default config replaces '.' by '-' in the savefig.directory :/ |
|
|
23 |
rcParams['savefig.directory'] = rcParams['savefig.directory'].replace("readthedocs-org", "readthedocs.org") |
|
|
24 |
|
|
|
25 |
Quick start |
|
|
26 |
=========== |
|
|
27 |
|
|
|
28 |
Here we show the most basic steps for a differential expression analysis. There |
|
|
29 |
are a variety of steps upstream of DESeq2 that result in the generation of |
|
|
30 |
counts or estimated counts for each sample, which we will discuss in the |
|
|
31 |
sections below. This code chunk assumes that you have a count matrix called |
|
|
32 |
:code:`cts` and a table of sample information called :code:`coldata`. The |
|
|
33 |
:code:`design` indicates how to model the samples, here, that we want to measure |
|
|
34 |
the effect of the condition, controlling for batch differences. The two factor |
|
|
35 |
variables :code:`batch` and :code:`condition` should be columns of |
|
|
36 |
:code:`coldata`: |
|
|
37 |
|
|
|
38 |
>>> dds = DESeqDataSet(countData = cts, |
|
|
39 |
... clinicalData = coldata, |
|
|
40 |
... design = "~ batch + condition") |
|
|
41 |
>>> dds = DESeq(dds) |
|
|
42 |
>>> # list the coefficients |
|
|
43 |
>>> dds.resultsNames() |
|
|
44 |
>>> res = dds.results(name = "condition_trt_vs_untrt") |
|
|
45 |
>>> # or to shrink the log fold changes association with condition |
|
|
46 |
>>> res = dds.lfcShrink(coef = "condition_trt_vs_untrt", type = "apeglm") |
|
|
47 |
|
|
|
48 |
|
|
|
49 |
Input data |
|
|
50 |
========== |
|
|
51 |
|
|
|
52 |
Why un-normalized counts? |
|
|
53 |
------------------------- |
|
|
54 |
|
|
|
55 |
As input, the DESeq2 package expects count data as obtained, *e.g.* from RNA-seq |
|
|
56 |
or another high-throughput sequencing experiment, in the form of a matrix of |
|
|
57 |
integer values. The value in the :math:`i`-th row and the :math:`j`-th column of |
|
|
58 |
the matrix tells how many reads can be assigned to gene :math:`i` in sample |
|
|
59 |
:math:`j`. Analogously, for other types of assays, the rows of the matrix might |
|
|
60 |
correspond *e.g.* to binding regions (with ChIP-Seq) or peptide sequences (with |
|
|
61 |
quantitative mass spectrometry). We will list method for obtaining count |
|
|
62 |
matrices in sections below. |
|
|
63 |
|
|
|
64 |
The values in the matrix should be un-normalized counts or estimated counts of |
|
|
65 |
sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq). |
|
|
66 |
The `RNA-seq workflow <http://www.bioconductor.org/help/workflows/rnaseqGene/>`_ |
|
|
67 |
describes multiple techniques for preparing such count matrices. It is |
|
|
68 |
important to provide count matrices as input for DESeq2's statistical model |
|
|
69 |
[Love2014]_ to hold, as only the count values allow assessing the measurement |
|
|
70 |
precision correctly. The DESeq2 model internally corrects for library size, so |
|
|
71 |
transformed or normalized values such as counts scaled by library size should |
|
|
72 |
not be used as input. |
|
|
73 |
|
|
|
74 |
The DESeqDataSet |
|
|
75 |
---------------- |
|
|
76 |
|
|
|
77 |
The object class used by the DESeq2 package to store the read counts and the |
|
|
78 |
intermediate estimated quantities during statistical analysis is the |
|
|
79 |
:class:`~DESeqDataSet.DESeqDataSet`, which will usually be represented in the |
|
|
80 |
code here as an object :code:`dds`. |
|
|
81 |
|
|
|
82 |
A technical detail is that the :class:`~DESeqDataSet.DESeqDataSet` class extends |
|
|
83 |
the :class:`AnnData` class of the `anndata |
|
|
84 |
<https://github.com/scverse/anndata>`_ package. |
|
|
85 |
|
|
|
86 |
A :class:`~DESeqDataSet.DESeqDataSet` object must have an associated **design |
|
|
87 |
matrix**. The design matrix expresses the variables which will be used in |
|
|
88 |
modeling. The matrix is built from a formula starting with a tilde (~) followed |
|
|
89 |
by the variables with plus signs between them (it will be coerced into a formula |
|
|
90 |
if it is not already). The design can be changed later, however then all |
|
|
91 |
differential analysis steps should be repeated, as the design formula is used to |
|
|
92 |
estimate the dispersions and to estimate the log2 fold changes of the model. |
|
|
93 |
|
|
|
94 |
.. note:: |
|
|
95 |
In order to benefit from the default settings of the package, you should put |
|
|
96 |
the variable of interest at the end of the formula and make sure the control |
|
|
97 |
level is the first level. |
|
|
98 |
|
|
|
99 |
We will now show 2 ways of constructing a :class:`~DESeqDataSet.DESeqDataSet`, |
|
|
100 |
depending on what pipeline was used upstream of DESeq2 to generated counts or |
|
|
101 |
estimated counts: |
|
|
102 |
|
|
|
103 |
1. From a count matrix :ref:`countmat` |
|
|
104 |
2. From an :class:`AnnData` object :ref:`ad` |
|
|
105 |
|
|
|
106 |
.. note:: |
|
|
107 |
The original R package allows to build a :class:`~DESeqDataSet.DESeqDataSet` |
|
|
108 |
from transcript abundance files and htseq count files, but those features |
|
|
109 |
have not yet been ported into `inmoose`. |
|
|
110 |
|
|
|
111 |
.. _countmat: |
|
|
112 |
|
|
|
113 |
Count matrix input |
|
|
114 |
------------------ |
|
|
115 |
|
|
|
116 |
The constructor :meth:`.DESeqDataSet.__init__` can be used if you already have a |
|
|
117 |
matrix of read counts prepared from another source. Another method for quickly |
|
|
118 |
producing count matrices from alignment files is the :code:`featureCounts` |
|
|
119 |
function [Liao2013]_ in the `Rsubread |
|
|
120 |
<http://bioconductor.org/packages/Rsubread>`_ package. To use the constructor |
|
|
121 |
of :meth:`.DESeqDataSet.__init__`, the user should provide the counts matrix, |
|
|
122 |
the information about the samples (the rows of the count matrix) as a data |
|
|
123 |
frame, and the design formula. |
|
|
124 |
|
|
|
125 |
To demonstrate the construction of :class:`~DESeqDataSet.DESeqDataSet` from a |
|
|
126 |
count matrix, we will read in count data from the `pasilla |
|
|
127 |
<http://bioconductor.org/packages/pasilla>`_ package, available in `inmoose`. We |
|
|
128 |
read in a count matrix, which we will name :code:`cts`, and the sample |
|
|
129 |
information table, which we will name :code:`sample_data`. Further below we |
|
|
130 |
describe how to extract these objects from, *e.g.* :code:`featureCounts` |
|
|
131 |
output: |
|
|
132 |
|
|
|
133 |
.. repl:: |
|
|
134 |
|
|
|
135 |
import importlib.resources |
|
|
136 |
from inmoose.utils import Factor |
|
|
137 |
|
|
|
138 |
data_dir = importlib.resources.files("inmoose.data.pasilla") |
|
|
139 |
pasCts = data_dir.joinpath("pasilla_gene_counts.tsv") |
|
|
140 |
pasAnno = data_dir.joinpath("pasilla_sample_annotation.csv") |
|
|
141 |
cts = pd.read_csv(pasCts, sep='\t', index_col=0) |
|
|
142 |
sample_data = pd.read_csv(pasAnno, index_col=0) |
|
|
143 |
sample_data = sample_data[["condition", "type"]] |
|
|
144 |
sample_data["condition"] = Factor(sample_data["condition"]) |
|
|
145 |
sample_data["type"] = Factor(sample_data["type"]) |
|
|
146 |
|
|
|
147 |
We examine the count matrix and sample data to see if they are consistent in |
|
|
148 |
terms of sample order: |
|
|
149 |
|
|
|
150 |
.. repl:: |
|
|
151 |
cts.head(2) |
|
|
152 |
sample_data |
|
|
153 |
|
|
|
154 |
Note that these are not in the same order with respect to samples! |
|
|
155 |
|
|
|
156 |
It is absolutely critical that the rows of the count matrix and the rows of |
|
|
157 |
the sample data (information about samples) are in the same order. DESeq2 will |
|
|
158 |
not make guesses as to which row of the count matrix belongs to which row of |
|
|
159 |
the sample data, these must be provided to DESeq2 already in consistent order. |
|
|
160 |
|
|
|
161 |
As they are not in the correct order as given, we need to re-arrange one or the |
|
|
162 |
other so that they are consistent in terms of sample order (if we do not, later |
|
|
163 |
functions would produce an error). We additionally need to chop off the `"fb"` |
|
|
164 |
of the row names of :code:`sample_data`, so the naming is consistent: |
|
|
165 |
|
|
|
166 |
.. repl:: |
|
|
167 |
|
|
|
168 |
sample_data.index = [i[:-2] for i in sample_data.index] |
|
|
169 |
all(sample_data.index.isin(cts.columns)) |
|
|
170 |
all(sample_data.index == cts.columns) |
|
|
171 |
sample_data = sample_data.reindex(cts.columns) |
|
|
172 |
all(sample_data.index == cts.columns) |
|
|
173 |
|
|
|
174 |
If you have used the :code:`featureCounts` function [Liao2013]_ in the `Rsubread |
|
|
175 |
<http://bioconductor.org/packages/Rsubread>`_ package, the matrix of read counts |
|
|
176 |
can be directly provided from the `"counts"` element in the list output. The |
|
|
177 |
count matrix and sample data can typically be read into Python from flat files |
|
|
178 |
using import functions from :code:`pandas` or :code:`numpy`. |
|
|
179 |
|
|
|
180 |
With the count matrix, :code:`cts`, and the sample information, |
|
|
181 |
:code:`sample_data`, we can construct a :class:`~DESeqDataSet.DESeqDataSet`: |
|
|
182 |
|
|
|
183 |
.. repl:: |
|
|
184 |
|
|
|
185 |
dds = DESeqDataSet(countData = cts.T, |
|
|
186 |
clinicalData = sample_data, |
|
|
187 |
design = "~ condition") |
|
|
188 |
dds |
|
|
189 |
|
|
|
190 |
If you have additional feature data, it can be added to the |
|
|
191 |
:class:`~DESeqDataSet.DESeqDataSet` by adding to the metadata columns of a newly |
|
|
192 |
constructed object. (Here we add redundant data just for demonstration, as the |
|
|
193 |
gene names are already the rownames of the :code:`dds`.): |
|
|
194 |
|
|
|
195 |
.. repl:: |
|
|
196 |
|
|
|
197 |
featureData = dds.var.index |
|
|
198 |
dds.var["featureData"] = featureData |
|
|
199 |
dds.var.head() |
|
|
200 |
|
|
|
201 |
|
|
|
202 |
.. _ad: |
|
|
203 |
|
|
|
204 |
:class:`AnnData` input |
|
|
205 |
---------------------- |
|
|
206 |
|
|
|
207 |
If one has already created or obtained an :class:`AnnData`, it can be easily |
|
|
208 |
input into DESeq2 as follows. First we load the module containing the `airway` |
|
|
209 |
dataset: |
|
|
210 |
|
|
|
211 |
.. repl:: |
|
|
212 |
|
|
|
213 |
from inmoose.data.airway import airway |
|
|
214 |
ad = airway() |
|
|
215 |
|
|
|
216 |
The constructor function below shows the generation of a |
|
|
217 |
:class:`~DESeqDataSet.DESeqDataSet` from an :class:`AnnData` :code:`ad`: |
|
|
218 |
|
|
|
219 |
.. repl:: |
|
|
220 |
|
|
|
221 |
ddsAD = DESeqDataSet(ad, design = "~ cell + dex") |
|
|
222 |
ddsAD |
|
|
223 |
|
|
|
224 |
|
|
|
225 |
Pre-filtering |
|
|
226 |
------------- |
|
|
227 |
|
|
|
228 |
While it is not necessary to pre-filter low count genes before running the |
|
|
229 |
DESeq2 functions, there are two reasons which make pre-filtering useful: by |
|
|
230 |
removing rows in which there are very few reads, we reduce the memory size of |
|
|
231 |
the :code:`dds` data object, and we increase the speed of the transformation and |
|
|
232 |
testing functions within DESeq2. It can also improve visualizations, as features |
|
|
233 |
with no information for differential expression are not plotted. |
|
|
234 |
|
|
|
235 |
Here we perform a minimal pre-filtering to keep only rows that have at least 10 |
|
|
236 |
reads total. Note that more strict filtering to increase power is |
|
|
237 |
*automatically* applied via :ref:`independent filtering<indfilt>` on the mean of |
|
|
238 |
normalized counts within the :meth:`.DESeqDataSet.results` function: |
|
|
239 |
|
|
|
240 |
.. repl:: |
|
|
241 |
|
|
|
242 |
keep = dds.counts().sum(axis=1) >= 10 |
|
|
243 |
dds = dds[keep,:] |
|
|
244 |
|
|
|
245 |
Alternatively, a popular filter is to ensure at least :code:`X` samples with a |
|
|
246 |
count of 10 or more, where :code:`X` can be chosen as the sample size of the |
|
|
247 |
smallest group of samples: |
|
|
248 |
|
|
|
249 |
>>> keep = (dds.counts() >= 10).sum(axis=1) >= X |
|
|
250 |
>>> dds = dds[keep,:] |
|
|
251 |
|
|
|
252 |
.. _factorLevels: |
|
|
253 |
|
|
|
254 |
Note on factor levels |
|
|
255 |
--------------------- |
|
|
256 |
|
|
|
257 |
By default, Python will choose a *reference level* for factors based on |
|
|
258 |
alphabetical order, it chooses the first value as the reference. Then, if you |
|
|
259 |
never tell the DESeq2 functions which level you want to compare against |
|
|
260 |
(*e.g.* which level represents the control group), the comparisons will |
|
|
261 |
be based on the alphabetical order of the levels. There are two |
|
|
262 |
solutions: you can either explicitly tell :meth:`.DESeqDataSet.results` which |
|
|
263 |
comparison to make using the :code:`contrast` argument (this will be shown |
|
|
264 |
later), or you can explicitly set the factors levels by specifying the desired |
|
|
265 |
reference value first. In order to see the change of reference levels reflected |
|
|
266 |
in the results names, you need to either run :func:`DESeq` or |
|
|
267 |
:func:`nbinomWaldTest` / :func:`nbinomLRT` after the re-leveling operation. |
|
|
268 |
Setting the factor levels can be done with the:code:`reorder_categories` function: |
|
|
269 |
|
|
|
270 |
.. repl:: |
|
|
271 |
|
|
|
272 |
dds.obs["condition"] = dds.obs["condition"].cat.reorder_categories(["untreated", "treated"]) |
|
|
273 |
|
|
|
274 |
If you need to subset the columns of a :class:`~DESeqDataSet.DESeqDataSet`, |
|
|
275 |
*i.e.* when removing certain samples from the analysis, it is possible that all |
|
|
276 |
the samples for one or more levels of a variable in the design formula would be |
|
|
277 |
removed. In this case, the :code:`remove_unused_categories` function can be used |
|
|
278 |
to remove those levels which do not have samples in the current |
|
|
279 |
:class:`~DESeqDataSet.DESeqDataSet`: |
|
|
280 |
|
|
|
281 |
.. repl:: |
|
|
282 |
|
|
|
283 |
dds.obs["condition"] = dds.obs["condition"].cat.remove_unused_categories() |
|
|
284 |
|
|
|
285 |
Collapsing technical replicates |
|
|
286 |
------------------------------- |
|
|
287 |
|
|
|
288 |
DESeq2 provides a function :func:`collapseReplicates` which can assist in |
|
|
289 |
combining the counts from technical replicates into single columns of the count |
|
|
290 |
matrix. The term *technical replicate* implies multiple sequencing runs of the |
|
|
291 |
same library. You should not collapse biological replicates using this |
|
|
292 |
function. See the manual page for an example of the use of |
|
|
293 |
:func:`collapseReplicates`. |
|
|
294 |
|
|
|
295 |
About the pasilla dataset |
|
|
296 |
------------------------- |
|
|
297 |
|
|
|
298 |
We continue with the :doc:`/pasilla` data constructed from the count matrix |
|
|
299 |
method above. This data set is from an experiment on *Drosophila melanogaster* |
|
|
300 |
cell cultures and investigated the effect of RNAi knock-down of the splicing |
|
|
301 |
factor *pasilla* [Brooks2011]_. The detailed transcript of the production of |
|
|
302 |
the `pasilla <http://bioconductor.org/packages/pasilla>`_ data is provided in |
|
|
303 |
the vignette of the data package `pasilla |
|
|
304 |
<http://bioconductor.org/packages/pasilla>`_. |
|
|
305 |
|
|
|
306 |
.. _de: |
|
|
307 |
|
|
|
308 |
Differential expression analysis |
|
|
309 |
================================ |
|
|
310 |
|
|
|
311 |
The standard differential expression analysis steps are wrapped into a single |
|
|
312 |
function :func:`DESeq`. The estimation steps performed by this function are |
|
|
313 |
described :ref:`below<theory>`, in the manual page for :func:`DESeq` and in the |
|
|
314 |
Methods section of the DESeq2 publication [Love2014]_. |
|
|
315 |
|
|
|
316 |
Results tables are generated using the function :meth:`.DESeqDataSet.results`, |
|
|
317 |
which extracts a results table with log2 fold changes, *p*-values and adjusted |
|
|
318 |
*p*-values. With no additional arguments to :meth:`.DESeqDataSet.results`, the |
|
|
319 |
log2 fold change and Wald test *p*-value will be for the **last variable** in |
|
|
320 |
the design formula, and if this is a factor, the comparison will be the **last |
|
|
321 |
level** of this variable over the **reference level** (see previous :ref:`note |
|
|
322 |
on factor levels<factorlevels>`). However, the order of the variables of the |
|
|
323 |
design does not matter so long as the user specifies the comparison to build a |
|
|
324 |
results table for, using the :code:`name` or :code:`contrast` arguments of |
|
|
325 |
:meth:`.DESeqDataSet.results`. |
|
|
326 |
|
|
|
327 |
Details about the comparison are printed to the console, directly above the |
|
|
328 |
results table. The text, `condition treated vs untreated`, tells you that the |
|
|
329 |
estimates are of the logarithmic fold change log2(treated/untreated): |
|
|
330 |
|
|
|
331 |
.. repl:: |
|
|
332 |
|
|
|
333 |
dds.design = "~ condition" |
|
|
334 |
dds = DESeq(dds) |
|
|
335 |
res = dds.results() |
|
|
336 |
res.head() |
|
|
337 |
|
|
|
338 |
Note that we could have specified the coefficient or contrast we want to build a |
|
|
339 |
results table for, using either of the following equivalent commands: |
|
|
340 |
|
|
|
341 |
>>> res = dds.results(name="condition_treated_vs_untreated") |
|
|
342 |
>>> res = dds.results(contrast=["condition","treated","untreated"]) |
|
|
343 |
|
|
|
344 |
One exception to the equivalence of these two commands, is that, using |
|
|
345 |
:code:`contrast` will additionally set to 0 the estimated LFC in a comparison of |
|
|
346 |
two groups, where all of the counts in the two groups are equal to 0 (while |
|
|
347 |
other groups have positive counts). As this may be a desired feature to have the |
|
|
348 |
LFC in these cases set to 0, one can use :code:`contrast` to build these results |
|
|
349 |
tables. More information about extracting specific coefficients from a fitted |
|
|
350 |
:class:`~DESeqDataSet.DESeqDataSet` object can be found in the help page |
|
|
351 |
:meth:`.DESeqDataSet.results`. The use of the :code:`contrast` argument is also |
|
|
352 |
further discussed :ref:`below<contrasts>`. |
|
|
353 |
|
|
|
354 |
.. _lfcShrink: |
|
|
355 |
|
|
|
356 |
Log fold change shrinkage for visualization and ranking |
|
|
357 |
------------------------------------------------------- |
|
|
358 |
|
|
|
359 |
Shrinkage of effect size (LFC estimates) is useful for visualization and ranking |
|
|
360 |
of genes. To shrink the LFC, we pass the :code:`dds` object to the function |
|
|
361 |
:func:`lfcShrink`. Below we specify to use the :code:`apeglm` method for effect |
|
|
362 |
size shrinkage [Zhu2018]_, which improves on the previous estimator. |
|
|
363 |
|
|
|
364 |
We provide the :code:`dds` object and the name or number of the coefficient we |
|
|
365 |
want to shrink, where the number refers to the order of the coefficient as it |
|
|
366 |
appears in :code:`dds.resultsNames()`: |
|
|
367 |
|
|
|
368 |
>>> dds.resultsNames() |
|
|
369 |
>>> resLFC = dds.lfcShrink(coef="condition_treated_vs_untreated", type="apeglm") |
|
|
370 |
>>> resLFC |
|
|
371 |
|
|
|
372 |
Shrinkage estimation is discussed more in a :ref:`later section<shrink>`. |
|
|
373 |
|
|
|
374 |
*p*-values and adjusted *p*-values |
|
|
375 |
---------------------------------- |
|
|
376 |
|
|
|
377 |
We can order our results table by the smallest *p*-value: |
|
|
378 |
|
|
|
379 |
.. repl:: |
|
|
380 |
resOrdered = res.sort_values(by="pvalue") |
|
|
381 |
|
|
|
382 |
We can summarize some basic tallies using the :meth:`.DESeqResults.summary` |
|
|
383 |
function: |
|
|
384 |
|
|
|
385 |
.. repl:: |
|
|
386 |
print(res.summary()) |
|
|
387 |
|
|
|
388 |
How many adjusted *p*-values were less than 0.1: |
|
|
389 |
|
|
|
390 |
.. repl:: |
|
|
391 |
(res.padj < 0.1).sum() |
|
|
392 |
|
|
|
393 |
The :meth:`.DESeqDataSet.results` function contains a number of arguments to |
|
|
394 |
customize the results table which is generated. You can read about these |
|
|
395 |
arguments by looking up the documentation of :meth:`.DESeqDataSet.results`. |
|
|
396 |
Note that the :meth:`.DESeqDataSet.results` function automatically performs |
|
|
397 |
independent filtering based on the mean of normalized counts for each gene, |
|
|
398 |
optimizing the number of genes which will have an adjusted *p*-value below a |
|
|
399 |
given FDR cutoff, :code:`alpha`. Independent filtering is further discussed |
|
|
400 |
:ref:`below<indfilt>`. By default the argument :code:`alpha` is set to 0.1. If |
|
|
401 |
the adjusted *p*-value cutoff will be a value other than 0.1, :code:`alpha` |
|
|
402 |
should be set to that value: |
|
|
403 |
|
|
|
404 |
.. repl:: |
|
|
405 |
res05 = dds.results(alpha=0.05) |
|
|
406 |
print(res05.summary()) |
|
|
407 |
(res05.padj < 0.05).sum() |
|
|
408 |
|
|
|
409 |
.. _IHW: |
|
|
410 |
|
|
|
411 |
.. |
|
|
412 |
Independent hypothesis weighting |
|
|
413 |
-------------------------------- |
|
|
414 |
|
|
|
415 |
A generalization of the idea of *p*-value filtering is to *weight* hypotheses to |
|
|
416 |
optimize power. A Bioconductor package, `IHW |
|
|
417 |
<http://bioconductor.org/packages/IHW>`_, is available that implements the |
|
|
418 |
method of *Independent Hypothesis Weighting* [Ignatiadis2016]_. Here we show |
|
|
419 |
the use of *IHW* for *p*-value adjustment of DESeq2 results. For more details, |
|
|
420 |
please see the vignette of the `IHW <http://bioconductor.org/packages/IHW>`_ |
|
|
421 |
package. The *IHW* result object is stored in the metadata:: |
|
|
422 |
|
|
|
423 |
# (unevaluated code chunk) |
|
|
424 |
library("IHW") |
|
|
425 |
resIHW <- results(dds, filterFun=ihw) |
|
|
426 |
summary(resIHW) |
|
|
427 |
sum(resIHW$padj < 0.1, na.rm=TRUE) |
|
|
428 |
metadata(resIHW)$ihwResult |
|
|
429 |
|
|
|
430 |
.. note:: |
|
|
431 |
If the results of independent hypothesis weighting are used in published |
|
|
432 |
research, please cite: |
|
|
433 |
|
|
|
434 |
Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W. (2016) |
|
|
435 |
Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. |
|
|
436 |
*Nature Methods*, **13**:7. |
|
|
437 |
:doi:`http://dx.doi.org/10.1038/nmeth.3885` |
|
|
438 |
|
|
|
439 |
For advanced users, note that all the values calculated by the DESeq2 package |
|
|
440 |
are stored in the :class:`~DESeqDataSet.DESeqDataSet` object or the |
|
|
441 |
:class:`DESeqResults` object, and access to these values is discussed |
|
|
442 |
:ref:`below<access>`. |
|
|
443 |
|
|
|
444 |
Exploring and exporting results |
|
|
445 |
=============================== |
|
|
446 |
|
|
|
447 |
MA-plot |
|
|
448 |
------- |
|
|
449 |
|
|
|
450 |
In DESeq2, the function :meth:`.DESeqResults.plotMA` shows the log2 fold changes |
|
|
451 |
attributable to a given variable over the mean of normalized counts for all the |
|
|
452 |
samples in the :class:`~DESeqDataSet.DESeqDataSet`. Points will be colored red |
|
|
453 |
if the adjusted *p*-value is less than 0.1. Points which fall out of the window |
|
|
454 |
are plotted as open triangles pointing either up or down: |
|
|
455 |
|
|
|
456 |
.. repl:: |
|
|
457 |
res.plotMA(ylim=[-2,2]) |
|
|
458 |
plt.show() |
|
|
459 |
|
|
|
460 |
It is more useful visualize the MA-plot for the shrunken log2 fold changes, |
|
|
461 |
which remove the noise associated with log2 fold changes from low count genes |
|
|
462 |
without requiring arbitrary filtering thresholds: |
|
|
463 |
|
|
|
464 |
.. repl:: |
|
|
465 |
resLFC.plotMA(ylim=[-2,2]) |
|
|
466 |
plt.show() |
|
|
467 |
|
|
|
468 |
.. |
|
|
469 |
After calling :meth:`.DESeqResults.plotMA`, one can use the function |
|
|
470 |
:func:`identify` to interactively detect the row number of individual genes by |
|
|
471 |
clicking on the plot. One can then recover the gene identifiers by saving the |
|
|
472 |
resulting indices:: |
|
|
473 |
|
|
|
474 |
idx = identify(res.baseMean, res.log2FoldChange) |
|
|
475 |
res.index[idx] |
|
|
476 |
|
|
|
477 |
.. _shrink: |
|
|
478 |
|
|
|
479 |
Alternative shrinkage estimators |
|
|
480 |
-------------------------------- |
|
|
481 |
|
|
|
482 |
The moderated log fold changes proposed by [Love2014]_ use a normal prior |
|
|
483 |
distribution, centered on zero and with a scale that is fit to the data. The |
|
|
484 |
shrunken log fold changes are useful for ranking and visualization, without the |
|
|
485 |
need for arbitrary filters on low count genes. The normal prior can sometimes |
|
|
486 |
produce too strong of shrinkage for certain datasets. In DESeq2 version 1.18, we |
|
|
487 |
include two additional adaptive shrinkage estimators, available via the |
|
|
488 |
:code:`type` argument of :func:`lfcShrink`. |
|
|
489 |
|
|
|
490 |
The options for :code:`type` are: |
|
|
491 |
|
|
|
492 |
* :code:`apeglm` is the adaptive t prior shrinkage estimator from the `apeglm |
|
|
493 |
<http://bioconductor.org/packages/apeglm>`_ package [Zhu2018]_. As of version |
|
|
494 |
1.28.0, it is the default estimator. |
|
|
495 |
* :code:`ashr` is the adaptive shrinkage estimator from the `ashr |
|
|
496 |
<https://github.com/stephens999/ashr>`_ package [Stephens2016]_. Here DESeq2 |
|
|
497 |
uses the ashr option to fit a mixture of Normal distributions to form the |
|
|
498 |
prior, with :code:`method="shrinkage"`. |
|
|
499 |
* :code:`normal`: is the the original DESeq2 shrinkage estimator, an adaptive |
|
|
500 |
Normal distribution as prior. |
|
|
501 |
|
|
|
502 |
If the shrinkage estimator :code:`apeglm` is used in published research, please |
|
|
503 |
cite [Zhu2018]_. |
|
|
504 |
|
|
|
505 |
If the shrinkage estimator :code:`ashr` is used in published research, please |
|
|
506 |
cite [Stephens2016]_. |
|
|
507 |
|
|
|
508 |
In the LFC shrinkage code above, we specified |
|
|
509 |
:code:`coef="condition_treated_vs_untreated"`. We can also just specify the |
|
|
510 |
coefficient by the order that it appears in :code:`dds.resultsNames()`, in this |
|
|
511 |
case :code:`coef=2`. For more details explaining how the shrinkage estimators |
|
|
512 |
differ, and what kinds of designs, contrasts and output is provided by each, see |
|
|
513 |
the :ref:`extended section on shrinkage estimators<moreshrink>`: |
|
|
514 |
|
|
|
515 |
>>> dds.resultsNames() |
|
|
516 |
>>> # because we are interested in treated vs untreated, we set 'coef=2' |
|
|
517 |
>>> resNorm = dds.lfcShrink(coef=2, type="normal") |
|
|
518 |
>>> resAsh = dds.lfcShrink(coef=2, type="ashr") |
|
|
519 |
|
|
|
520 |
.. |
|
|
521 |
```{r fig.width=8, fig.height=3} |
|
|
522 |
par(mfrow=c(1,3), mar=c(4,4,2,1)) |
|
|
523 |
xlim <- c(1,1e5); ylim <- c(-3,3) |
|
|
524 |
plotMA(resLFC, xlim=xlim, ylim=ylim, main="apeglm") |
|
|
525 |
plotMA(resNorm, xlim=xlim, ylim=ylim, main="normal") |
|
|
526 |
plotMA(resAsh, xlim=xlim, ylim=ylim, main="ashr") |
|
|
527 |
``` |
|
|
528 |
|
|
|
529 |
.. |
|
|
530 |
.. note:: |
|
|
531 |
We have sped up the :code:`apeglm` method so it takes roughly about the same |
|
|
532 |
amount of time as :code:`normal`, e.g. ~5 seconds for the :code:`pasilla` |
|
|
533 |
dataset of ~10,000 genes and 7 samples. If fast shrinkage estimation of LFC |
|
|
534 |
is needed, **but the posterior standard deviation is not needed**, setting |
|
|
535 |
:code:`apeMethod="nbinomC"` will produce a ~10x speedup, but the |
|
|
536 |
:code:`lfcSE` column will be returned with :code:`NA`. A variant of this |
|
|
537 |
fast method, :code:`apeMethod="nbinomC*"` includes random starts. |
|
|
538 |
|
|
|
539 |
|
|
|
540 |
.. note:: |
|
|
541 |
If there is unwanted variation present in the data (*e.g.* batch effects) it |
|
|
542 |
is always recommended to correct for this, which can be accommodated in |
|
|
543 |
DESeq2 by including in the design any known batch variables or by using |
|
|
544 |
functions/packages such as :func:`.pycombat_seq`, :code:`svaseq` in `sva |
|
|
545 |
<http://bioconductor.org/packages/sva>`_ [Leek2014]_ or the :code:`RUV` |
|
|
546 |
functions in `RUVSeq <http://bioconductor.org/packages/RUVSeq>`_ [Risso2014]_ |
|
|
547 |
to estimate variables that capture the unwanted variation. In addition, the |
|
|
548 |
ashr developers have a `specific method <https://github.com/dcgerard/vicar>`_ |
|
|
549 |
for accounting for unwanted variation in combination with ashr [Gerard2020]_. |
|
|
550 |
|
|
|
551 |
.. |
|
|
552 |
Plot counts |
|
|
553 |
----------- |
|
|
554 |
|
|
|
555 |
It can also be useful to examine the counts of reads for a single gene across |
|
|
556 |
the groups. A simple function for making this plot is :func:`plotCounts`, which |
|
|
557 |
normalizes counts by the estimated size factors (or normalization factors if |
|
|
558 |
these were used) and adds a pseudocount of 1/2 to allow for log scale plotting. |
|
|
559 |
The counts are grouped by the variables in :code:`intgroup`, where more than one |
|
|
560 |
variable can be specified. Here we specify the gene which had the smallest |
|
|
561 |
*p*-value from the results table created above. You can select the gene to plot |
|
|
562 |
by rowname or by numeric index:: |
|
|
563 |
|
|
|
564 |
plotCounts(dds, gene=which.min(res$padj), intgroup="condition") |
|
|
565 |
|
|
|
566 |
For customized plotting, an argument :code:`returnData` specifies that the |
|
|
567 |
function should only return a data frame for plotting with :code:`ggplot`:: |
|
|
568 |
|
|
|
569 |
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", |
|
|
570 |
returnData=TRUE) |
|
|
571 |
library("ggplot2") |
|
|
572 |
ggplot(d, aes(x=condition, y=count)) + |
|
|
573 |
geom_point(position=position_jitter(w=0.1,h=0)) + |
|
|
574 |
scale_y_log10(breaks=c(25,100,400)) |
|
|
575 |
|
|
|
576 |
|
|
|
577 |
More information on results columns |
|
|
578 |
----------------------------------- |
|
|
579 |
|
|
|
580 |
Information about which variables and tests were used can be found by inspecting |
|
|
581 |
the attribute :code:`description` on the :class:`~results.DESeqResults` object: |
|
|
582 |
|
|
|
583 |
.. repl:: |
|
|
584 |
res.description |
|
|
585 |
|
|
|
586 |
For a particular gene, a log2 fold change of -1 for :code:`condition treated vs |
|
|
587 |
untreated` means that the treatment induces a multiplicative change in observed |
|
|
588 |
gene expression level of :math:`2^{-1} = 0.5` compared to the untreated |
|
|
589 |
condition. If the variable of interest is continuous-valued, then the reported |
|
|
590 |
log2 fold change is per unit of change of that variable. |
|
|
591 |
|
|
|
592 |
.. _pvaluesNA: |
|
|
593 |
|
|
|
594 |
.. admonition:: Note on *p*-values set to :code:`NA` |
|
|
595 |
|
|
|
596 |
Some values in the results table can be set to :code:`NA` for one of the |
|
|
597 |
following reasons: |
|
|
598 |
|
|
|
599 |
* If within a row, all samples have zero counts, the :code:`baseMean` column |
|
|
600 |
will be zero, and the log2 fold change estimates, *p*-value and adjusted |
|
|
601 |
*p*-value will all be set to :code:`NA`. |
|
|
602 |
* If a row contains a sample with an extreme count outlier then the *p*-value |
|
|
603 |
and adjusted *p*-value will be set to :code:`NA`. These outlier counts are |
|
|
604 |
detected by Cook's distance. Customization of this outlier filtering and |
|
|
605 |
description of functionality for replacement of outlier counts and |
|
|
606 |
refitting is described :ref:`below<outlier>`. |
|
|
607 |
* If a row is filtered by automatic independent filtering, for having a low |
|
|
608 |
mean normalized count, then only the adjusted *p*-value will be set to |
|
|
609 |
:code:`NA`. Description and customization of independent filtering is |
|
|
610 |
described :ref:`below<indfilt>`. |
|
|
611 |
|
|
|
612 |
.. |
|
|
613 |
Rich visualization and reporting of results |
|
|
614 |
------------------------------------------- |
|
|
615 |
|
|
|
616 |
**ReportingTools** An HTML report of the results with plots and |
|
|
617 |
sortable/filterable columns can be generated using the `ReportingTools |
|
|
618 |
<http://bioconductor.org/packages/ReportingTools>`_ package on a |
|
|
619 |
:class:`~DESeqDataSet.DESeqDataSet` that has been processed by the :func:`DESeq` |
|
|
620 |
function. For a code example, see the *RNA-seq differential expression* |
|
|
621 |
vignette at the `ReportingTools |
|
|
622 |
<http://bioconductor.org/packages/ReportingTools>`_ page, or the manual page for |
|
|
623 |
the :meth:`publish` method for the :class:`~DESeqDataSet.DESeqDataSet` class. |
|
|
624 |
|
|
|
625 |
**regionReport** An HTML and PDF summary of the results with plots can also be |
|
|
626 |
generated using the `regionReport |
|
|
627 |
<http://bioconductor.org/packages/regionReport>`_ package. The |
|
|
628 |
:func:`DESeq2Report` function should be run on a |
|
|
629 |
:class:`~DESeqDataSet.DESeqDataSet` that has been processed by the :func:`DESeq` |
|
|
630 |
function. For more details see the manual page for :func:`DESeq2Report` and an |
|
|
631 |
example vignette in the `regionReport |
|
|
632 |
<http://bioconductor.org/packages/regionReport>`_ package. |
|
|
633 |
|
|
|
634 |
**Glimma** Interactive visualization of DESeq2 output, including MA-plots (also |
|
|
635 |
called MD-plot) can be generated using the `Glimma |
|
|
636 |
<http://bioconductor.org/packages/Glimma>`_ package. See the manual page for |
|
|
637 |
*glMDPlot.DESeqResults*. |
|
|
638 |
|
|
|
639 |
**pcaExplorer** Interactive visualization of DESeq2 output, including PCA plots, |
|
|
640 |
boxplots of counts and other useful summaries can be generated using the |
|
|
641 |
`pcaExplorer <http://bioconductor.org/packages/pcaExplorer>`_ package. See the |
|
|
642 |
*Launching the application* section of the package vignette. |
|
|
643 |
|
|
|
644 |
**iSEE** Provides functions for creating an interactive Shiny-based graphical |
|
|
645 |
user interface for exploring data stored in SummarizedExperiment objects, |
|
|
646 |
including row- and column-level metadata. Particular attention is given to |
|
|
647 |
single-cell data in a SingleCellExperiment object with visualization of |
|
|
648 |
dimensionality reduction results. `iSEE |
|
|
649 |
<https://bioconductor.org/packages/iSEE>`_ is on Bioconductor. An example |
|
|
650 |
wrapper function for converting a :class:`~DESeqDataSet.DESeqDataSet` to a |
|
|
651 |
SingleCellExperiment object for use with *iSEE* can be found at the following |
|
|
652 |
gist, written by Federico Marini: |
|
|
653 |
|
|
|
654 |
* <https://gist.github.com/federicomarini/4a543eebc7e7091d9169111f76d59de1> |
|
|
655 |
|
|
|
656 |
**DEvis** DEvis is a powerful, integrated solution for the analysis of |
|
|
657 |
differential expression data. This package includes an array of tools for |
|
|
658 |
manipulating and aggregating data, as well as a wide range of customizable |
|
|
659 |
visualizations, and project management functionality that simplify RNA-Seq |
|
|
660 |
analysis and provide a variety of ways of exploring and analyzing data. *DEvis* |
|
|
661 |
can be found on `CRAN <https://cran.r-project.org/package=DEVis>` and `GitHub |
|
|
662 |
<https://github.com/price0416/DEvis>`. |
|
|
663 |
|
|
|
664 |
|
|
|
665 |
Exporting results to CSV files |
|
|
666 |
------------------------------ |
|
|
667 |
|
|
|
668 |
A plain-text file of the results can be exported using the method |
|
|
669 |
:meth:`.DESeqResults.to_csv`. We suggest using a descriptive file name |
|
|
670 |
indicating the variable and level which were tested: |
|
|
671 |
|
|
|
672 |
.. repl:: |
|
|
673 |
resOrdered.to_csv("condition_treated_results.csv") |
|
|
674 |
|
|
|
675 |
Exporting only the results which pass an adjusted *p*-value threshold can be |
|
|
676 |
accomplished by subsetting the results: |
|
|
677 |
|
|
|
678 |
.. repl:: |
|
|
679 |
resSig = resOrdered[resOrdered.padj < 0.1] |
|
|
680 |
resSig.head() |
|
|
681 |
|
|
|
682 |
|
|
|
683 |
Multi-factor designs |
|
|
684 |
==================== |
|
|
685 |
|
|
|
686 |
Experiments with more than one factor influencing the counts can be analyzed |
|
|
687 |
using design formula that include the additional variables. In fact, DESeq2 can |
|
|
688 |
analyze any possible experimental design that can be expressed with fixed |
|
|
689 |
effects terms (multiple factors, designs with interactions, designs with |
|
|
690 |
continuous variables, splines, and so on are all possible). |
|
|
691 |
|
|
|
692 |
By adding variables to the design, one can control for additional variation in |
|
|
693 |
the counts. For example, if the condition samples are balanced across |
|
|
694 |
experimental batches, by including the :code:`batch` factor to the design, one |
|
|
695 |
can increase the sensitivity for finding differences due to :code:`condition`. |
|
|
696 |
There are multiple ways to analyze experiments when the additional variables are |
|
|
697 |
of interest and not just controlling factors (see :ref:`section on |
|
|
698 |
interactions<interactions>`). |
|
|
699 |
|
|
|
700 |
**Experiments with many samples**: in experiments with many samples (e.g. 50, |
|
|
701 |
100, etc.) it is highly likely that there will be technical variation affecting |
|
|
702 |
the observed counts. Failing to model this additional technical variation will |
|
|
703 |
lead to spurious results. Many methods exist that can be used to model technical |
|
|
704 |
variation, which can be easily included in the DESeq2 design to control for |
|
|
705 |
technical variation which estimating effects of interest. See the `RNA-seq |
|
|
706 |
workflow <http://www.bioconductor.org/help/workflows/rnaseqGene>`_ for examples |
|
|
707 |
of using RUV or SVA in combination with DESeq2. For more details on why it is |
|
|
708 |
important to control for technical variation in large sample experiments, see |
|
|
709 |
the following `thread |
|
|
710 |
<https://twitter.com/mikelove/status/1513468597288452097>`_, also archived `here |
|
|
711 |
<https://htmlpreview.github.io/?https://github.com/frederikziebell/science_tweetorials/blob/master/DESeq2_many_samples.html>`_ |
|
|
712 |
by Frederik Ziebell. |
|
|
713 |
|
|
|
714 |
The data in the :doc:`/pasilla` module have a condition of interest (the column |
|
|
715 |
:code:`condition`), as well as information on the type of sequencing which was |
|
|
716 |
performed (the column :code:`type`), as we can see below: |
|
|
717 |
|
|
|
718 |
.. repl:: |
|
|
719 |
dds.obs |
|
|
720 |
|
|
|
721 |
We create a copy of the :class:`~DESeqDataSet.DESeqDataSet`, so that we can |
|
|
722 |
rerun the analysis using a multi-factor design: |
|
|
723 |
|
|
|
724 |
.. repl:: |
|
|
725 |
ddsMF = dds.copy() |
|
|
726 |
|
|
|
727 |
We change the categories of :code:`type` so it only contains letters. Be |
|
|
728 |
careful when changing level names to use the same order as the current |
|
|
729 |
categories: |
|
|
730 |
|
|
|
731 |
.. repl:: |
|
|
732 |
import re |
|
|
733 |
dds.obs["type"].cat.categories |
|
|
734 |
dds.obs["type"].cat.rename_categories([re.sub("-.*", "", c) for c in dds.obs["type"].cat.categories]) |
|
|
735 |
dds.obs["type"].dtype.categories |
|
|
736 |
|
|
|
737 |
We can account for the different types of sequencing, and get a clearer picture |
|
|
738 |
of the differences attributable to the treatment. As :code:`condition` is the |
|
|
739 |
variable of interest, we put it at the end of the formula. Thus the |
|
|
740 |
:meth:`.DESeqDataSet.results` function will by default pull the |
|
|
741 |
:code:`condition` results unless :code:`contrast` or :code:`name` arguments are |
|
|
742 |
specified. |
|
|
743 |
|
|
|
744 |
Then we can re-run :func:`DESeq`: |
|
|
745 |
|
|
|
746 |
.. repl:: |
|
|
747 |
ddsMF.design = "~ type + condition" |
|
|
748 |
ddsMF = DESeq(ddsMF) |
|
|
749 |
|
|
|
750 |
Again, we access the results using the :meth:`.DESeqDataSet.results` function: |
|
|
751 |
|
|
|
752 |
.. repl:: |
|
|
753 |
resMF = ddsMF.results() |
|
|
754 |
resMF.head() |
|
|
755 |
|
|
|
756 |
It is also possible to retrieve the log2 fold changes, *p*-values and adjusted |
|
|
757 |
*p*-values of variables other than the last one in the design. While in this |
|
|
758 |
case, :code:`type` is not biologically interesting as it indicates differences |
|
|
759 |
across sequencing protocols, for other hypothetical designs, such as |
|
|
760 |
:code:`~genotype + condition + genotype:condition`, we may actually be |
|
|
761 |
interested in the difference in baseline expression across genotype, which is |
|
|
762 |
not the last variable in the design. |
|
|
763 |
|
|
|
764 |
In any case, the :code:`contrast` argument of the function |
|
|
765 |
:meth:`.DESeqDataSet.results` takes a string list of length three: the name of |
|
|
766 |
the variable, the name of the factor level for the numerator of the log2 ratio, |
|
|
767 |
and the name of the factor level for the denominator. The :code:`contrast` |
|
|
768 |
argument can also take other forms, as described in the help page for |
|
|
769 |
:meth:`.DESeqDataSet.results` and :ref:`below<contrasts>`: |
|
|
770 |
|
|
|
771 |
.. repl:: |
|
|
772 |
resMFType = ddsMF.results(contrast=("type", "single", "paired")) |
|
|
773 |
resMFType.head() |
|
|
774 |
|
|
|
775 |
If the variable is continuous or an interaction term (see :ref:`section on |
|
|
776 |
interactions<interactions>`) then the results can be extracted using the |
|
|
777 |
:code:`name` argument to :meth:`.DESeqDataSet.results`, where the name is one of |
|
|
778 |
elements returned by :code:`dds.resultsNames()`. |
|
|
779 |
|
|
|
780 |
.. _transform: |
|
|
781 |
|
|
|
782 |
.. |
|
|
783 |
Data transformations and visualization |
|
|
784 |
====================================== |
|
|
785 |
|
|
|
786 |
Count data transformations |
|
|
787 |
-------------------------- |
|
|
788 |
|
|
|
789 |
In order to test for differential expression, we operate on raw counts and use |
|
|
790 |
discrete distributions as described in the previous section on differential |
|
|
791 |
expression. However for other downstream analyses -- *e.g.* for visualization |
|
|
792 |
or clustering -- it might be useful to work with transformed versions of the |
|
|
793 |
count data. |
|
|
794 |
|
|
|
795 |
Maybe the most obvious choice of transformation is the logarithm. Since count |
|
|
796 |
values for a gene can be zero in some conditions (and non-zero in others), some |
|
|
797 |
advocate the use of *pseudocounts*, i.e. transformations of the form :math:`y = |
|
|
798 |
\log(n + n_0)`, where :math:`n` represents the count values and :math:`n_0` is a |
|
|
799 |
positive constant. |
|
|
800 |
|
|
|
801 |
In this section, we discuss two alternative approaches that offer more |
|
|
802 |
theoretical justification and a rational way of choosing parameters equivalent |
|
|
803 |
to :math:`n_0` above. One makes use of the concept of variance stabilizing |
|
|
804 |
transformations (VST) [Tibshirani1988]_ [sagmb2003]_ [Anders2010]_, and the |
|
|
805 |
other is the *regularized logarithm* or *rlog*, which incorporates a prior on |
|
|
806 |
the sample differences [Love2014]_. Both transformations produce transformed |
|
|
807 |
data on the log2 scale which has been normalized with respect to library size or |
|
|
808 |
other normalization factors. |
|
|
809 |
|
|
|
810 |
The point of these two transformations, the VST and the rlog, is to remove the |
|
|
811 |
dependence of the variance on the mean, particularly the high variance of the |
|
|
812 |
logarithm of count data when the mean is low. Both VST and rlog use the |
|
|
813 |
experiment-wide trend of variance over mean, in order to transform the data to |
|
|
814 |
remove the experiment-wide trend. Note that we do not require or desire that all |
|
|
815 |
the genes have *exactly* the same variance after transformation. Indeed, in a |
|
|
816 |
figure below, you will see that after the transformations the genes with the |
|
|
817 |
same mean do not have exactly the same standard deviations, but that the |
|
|
818 |
experiment-wide trend has flattened. It is those genes with row variance above |
|
|
819 |
the trend which will allow us to cluster samples into interesting groups. |
|
|
820 |
|
|
|
821 |
.. admonition:: Note on running time: |
|
|
822 |
|
|
|
823 |
If you have many samples (*e.g.* 100s), the :func:`rlog` function might take |
|
|
824 |
too long, and so the :func:`vst` function will be a faster choice. The rlog |
|
|
825 |
and VST have similar properties, but the rlog requires fitting a shrinkage |
|
|
826 |
term for each sample and each gene which takes time. See the DESeq2 paper for |
|
|
827 |
more discussion on the differences [Love2014]_. |
|
|
828 |
|
|
|
829 |
Blind dispersion estimation |
|
|
830 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
831 |
|
|
|
832 |
The two functions, :func:`vst` and :func:`rlog`: have an argument :code:`blind`, |
|
|
833 |
for whether the transformation should be blind to the sample information |
|
|
834 |
specified by the design formula. When :code:`blind` equals :code:`True` (the |
|
|
835 |
default), the functions will re-estimate the dispersions using only an |
|
|
836 |
intercept. This setting should be used in order to compare samples in a manner |
|
|
837 |
wholly unbiased by the information about experimental groups, for example to |
|
|
838 |
perform sample QA (quality assurance) as demonstrated below. |
|
|
839 |
|
|
|
840 |
However, blind dispersion estimation is not the appropriate choice if one |
|
|
841 |
expects that many or the majority of genes will have large differences in counts |
|
|
842 |
which are explainable by the experimental design, and one wishes to transform |
|
|
843 |
the data for downstream analysis. In this case, using blind dispersion |
|
|
844 |
estimation will lead to large estimates of dispersion, as it attributes |
|
|
845 |
differences due to experimental design as unwanted *noise*, and will result in |
|
|
846 |
overly shrinking the transformed values towards each other. By setting |
|
|
847 |
:code:`blind` to :code:`False`, the dispersions already estimated will be used |
|
|
848 |
to perform transformations, or if not present, they will be estimated using the |
|
|
849 |
current design formula. Note that only the fitted dispersion estimates from |
|
|
850 |
mean-dispersion trend line are used in the transformation (the global dependence |
|
|
851 |
of dispersion on mean for the entire experiment). So setting :code:`blind` to |
|
|
852 |
:code:`False` is still for the most part not using the information about which |
|
|
853 |
samples were in which experimental group in applying the transformation. |
|
|
854 |
|
|
|
855 |
Extracting transformed values |
|
|
856 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
857 |
|
|
|
858 |
These transformation functions return an object of class :class:`DESeqTransform` |
|
|
859 |
which is a subclass of :class:`AnnData`. For ~20 samples, running on a newly |
|
|
860 |
created :class:`~DESeqDataSet.DESeqDataSet`, :func:`rlog` may take 30 seconds, |
|
|
861 |
while :func:`vst` takes less than 1 second. The running times are shorter when |
|
|
862 |
using `blind=False` and if the function :func:`DESeq` has already been run, |
|
|
863 |
because then it is not necessary to re-estimate the dispersion values. The |
|
|
864 |
matrix of normalized values is stored in the :attr:`X` attribute: |
|
|
865 |
|
|
|
866 |
>>> vsd = vst(dds, blind=FALSE) |
|
|
867 |
>>> rld = rlog(dds, blind=FALSE) |
|
|
868 |
>>> vsd.X.head(3) |
|
|
869 |
|
|
|
870 |
Variance stabilizing transformation |
|
|
871 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
872 |
|
|
|
873 |
Above, we used a parametric fit for the dispersion. In this case, the |
|
|
874 |
closed-form expression for the variance stabilizing transformation is used by |
|
|
875 |
the :func:`vst` function. If a local fit is used (option |
|
|
876 |
:code:`fitType="locfit"` to :meth:`.DESeqDataSet.estimateDispersions`) a |
|
|
877 |
numerical integration is used instead. The transformed data should be |
|
|
878 |
approximated variance stabilized and also includes correction for size factors |
|
|
879 |
or normalization factors. The transformed data is on the log2 scale for large |
|
|
880 |
counts. |
|
|
881 |
|
|
|
882 |
Regularized log transformation |
|
|
883 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
884 |
|
|
|
885 |
The function :func:`rlog`, stands for *regularized log*, transforming the |
|
|
886 |
original count data to the log2 scale by fitting a model with a term for each |
|
|
887 |
sample and a prior distribution on the coefficients which is estimated from the |
|
|
888 |
data. This is the same kind of shrinkage (sometimes referred to as |
|
|
889 |
regularization, or moderation) of log fold changes used by :func:`DESeq` and |
|
|
890 |
:func:`nbinomWaldTest`. The resulting data contains elements defined as: |
|
|
891 |
|
|
|
892 |
.. math:: |
|
|
893 |
\log_2(q_{ij}) = \beta_{i0} + \beta_{ij} |
|
|
894 |
|
|
|
895 |
where :math:`q_{ij}` is a parameter proportional to the expected true |
|
|
896 |
concentration of fragments for gene :math:`i` and sample :math:`j` (see formula |
|
|
897 |
:ref:`below<theory>`), :math:`\beta_{i0}` is an intercept which does not undergo |
|
|
898 |
shrinkage, and :math:`\beta_{ij}` is the sample-specific effect which is shrunk |
|
|
899 |
toward zero based on the dispersion-mean trend over the entire dataset. The |
|
|
900 |
trend typically captures high dispersions for low counts, and therefore these |
|
|
901 |
genes exhibit higher shrinkage from the :func:`rlog`. |
|
|
902 |
|
|
|
903 |
Note that, as :math:`q_{ij}` represents the part of the mean value |
|
|
904 |
:math:`\mu_{ij}` after the size factor :math:`s_j` has been divided out, it is |
|
|
905 |
clear that the rlog transformation inherently accounts for differences in |
|
|
906 |
sequencing depth. Without priors, this design matrix would lead to a non-unique |
|
|
907 |
solution, however the addition of a prior on non-intercept betas allows for a |
|
|
908 |
unique solution to be found. |
|
|
909 |
|
|
|
910 |
Effects of transformations on the variance |
|
|
911 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
912 |
|
|
|
913 |
The figure below plots the standard deviation of the transformed data, across |
|
|
914 |
samples, against the mean, using the shifted logarithm transformation, the |
|
|
915 |
regularized log transformation and the variance stabilizing transformation. The |
|
|
916 |
shifted logarithm has elevated standard deviation in the lower count range, and |
|
|
917 |
the regularized log to a lesser extent, while for the variance stabilized data |
|
|
918 |
the standard deviation is roughly constant along the whole dynamic range. |
|
|
919 |
|
|
|
920 |
Note that the vertical axis in such plots is the square root of the variance |
|
|
921 |
over all samples, so including the variance due to the experimental conditions. |
|
|
922 |
While a flat curve of the square root of variance over the mean may seem like |
|
|
923 |
the goal of such transformations, this may be unreasonable in the case of |
|
|
924 |
datasets with many true differences due to the experimental conditions. |
|
|
925 |
|
|
|
926 |
.. |
|
|
927 |
```{r meansd} |
|
|
928 |
# this gives log2(n + 1) |
|
|
929 |
ntd <- normTransform(dds) |
|
|
930 |
library("vsn") |
|
|
931 |
meanSdPlot(assay(ntd)) |
|
|
932 |
meanSdPlot(assay(vsd)) |
|
|
933 |
meanSdPlot(assay(rld)) |
|
|
934 |
``` |
|
|
935 |
|
|
|
936 |
Data quality assessment by sample clustering and visualization |
|
|
937 |
-------------------------------------------------------------- |
|
|
938 |
|
|
|
939 |
Data quality assessment and quality control (*i.e.* the removal of |
|
|
940 |
insufficiently good data) are essential steps of any data analysis. These steps |
|
|
941 |
should typically be performed very early in the analysis of a new data set, |
|
|
942 |
preceding or in parallel to the differential expression testing. |
|
|
943 |
|
|
|
944 |
We define the term *quality* as *fitness for purpose*. Our purpose is the |
|
|
945 |
detection of differentially expressed genes, and we are looking in particular |
|
|
946 |
for samples whose experimental treatment suffered from an anormality that |
|
|
947 |
renders the data points obtained from these particular samples detrimental to |
|
|
948 |
our purpose. |
|
|
949 |
|
|
|
950 |
Heatmap of the count matrix |
|
|
951 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
952 |
|
|
|
953 |
To explore a count matrix, it is often instructive to look at it as a heatmap. |
|
|
954 |
Below we show how to produce such a heatmap for various transformations of the |
|
|
955 |
data. |
|
|
956 |
|
|
|
957 |
TODO use clustermap instead of pheatmap |
|
|
958 |
|
|
|
959 |
.. repl:: |
|
|
960 |
|
|
|
961 |
from seaborn import clustermap |
|
|
962 |
select = np.sort(dds.counts(normalized=True).mean(axis=0))[::-1][:20] |
|
|
963 |
df = dds.obs[["condition", "type"]] |
|
|
964 |
|
|
|
965 |
.. |
|
|
966 |
```{r heatmap} |
|
|
967 |
library("pheatmap") |
|
|
968 |
select <- order(rowMeans(counts(dds,normalized=TRUE)), |
|
|
969 |
decreasing=TRUE)[1:20] |
|
|
970 |
df <- as.data.frame(colData(dds)[,c("condition","type")]) |
|
|
971 |
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE, |
|
|
972 |
cluster_cols=FALSE, annotation_col=df) |
|
|
973 |
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE, |
|
|
974 |
cluster_cols=FALSE, annotation_col=df) |
|
|
975 |
pheatmap(assay(rld)[select,], cluster_rows=FALSE, show_rownames=FALSE, |
|
|
976 |
cluster_cols=FALSE, annotation_col=df) |
|
|
977 |
``` |
|
|
978 |
|
|
|
979 |
Heatmap of the sample-to-sample distances |
|
|
980 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
981 |
|
|
|
982 |
Another use of the transformed data is sample clustering. Here, we apply the |
|
|
983 |
:func:`dist` function to the transpose of the transformed count matrix to get |
|
|
984 |
sample-to-sample distances:: |
|
|
985 |
|
|
|
986 |
sampleDists = dist(vsd.layers.T) |
|
|
987 |
|
|
|
988 |
A heatmap of this distance matrix gives us an overview over similarities and |
|
|
989 |
dissimilarities between samples. We have to provide a hierarchical clustering |
|
|
990 |
:code:`hc` to the heatmap function based on the sample distances, or else the |
|
|
991 |
heatmap function would calculate a clustering based on the distances between the |
|
|
992 |
rows/columns of the distance matrix. |
|
|
993 |
|
|
|
994 |
TODO |
|
|
995 |
|
|
|
996 |
.. |
|
|
997 |
```{r figHeatmapSamples, fig.height=4, fig.width=6} |
|
|
998 |
library("RColorBrewer") |
|
|
999 |
sampleDistMatrix <- as.matrix(sampleDists) |
|
|
1000 |
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-") |
|
|
1001 |
colnames(sampleDistMatrix) <- NULL |
|
|
1002 |
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) |
|
|
1003 |
pheatmap(sampleDistMatrix, |
|
|
1004 |
clustering_distance_rows=sampleDists, |
|
|
1005 |
clustering_distance_cols=sampleDists, |
|
|
1006 |
col=colors) |
|
|
1007 |
``` |
|
|
1008 |
|
|
|
1009 |
Principal component plot of the samples |
|
|
1010 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
1011 |
|
|
|
1012 |
Related to the distance matrix is the PCA plot, which shows the samples in the |
|
|
1013 |
2D plane spanned by their first two principal components. This type of plot is |
|
|
1014 |
useful for visualizing the overall effect of experimental covariates and batch |
|
|
1015 |
effects. |
|
|
1016 |
|
|
|
1017 |
TODO |
|
|
1018 |
|
|
|
1019 |
.. |
|
|
1020 |
```{r figPCA} |
|
|
1021 |
plotPCA(vsd, intgroup=c("condition", "type")) |
|
|
1022 |
``` |
|
|
1023 |
|
|
|
1024 |
It is also possible to customize the PCA plot using the |
|
|
1025 |
*ggplot* function. |
|
|
1026 |
|
|
|
1027 |
TODO |
|
|
1028 |
|
|
|
1029 |
.. |
|
|
1030 |
```{r figPCA2} |
|
|
1031 |
pcaData <- plotPCA(vsd, intgroup=c("condition", "type"), returnData=TRUE) |
|
|
1032 |
percentVar <- round(100 * attr(pcaData, "percentVar")) |
|
|
1033 |
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=type)) + |
|
|
1034 |
geom_point(size=3) + |
|
|
1035 |
xlab(paste0("PC1: ",percentVar[1],"% variance")) + |
|
|
1036 |
ylab(paste0("PC2: ",percentVar[2],"% variance")) + |
|
|
1037 |
coord_fixed() |
|
|
1038 |
``` |
|
|
1039 |
|
|
|
1040 |
Variations to the standard workflow |
|
|
1041 |
=================================== |
|
|
1042 |
|
|
|
1043 |
Wald test individual steps |
|
|
1044 |
-------------------------- |
|
|
1045 |
|
|
|
1046 |
The function :func:`DESeq` runs the following functions in order: |
|
|
1047 |
|
|
|
1048 |
>>> dds = dds.estimateSizeFactors() |
|
|
1049 |
>>> dds = dds.estimateDispersions(dds) |
|
|
1050 |
>>> dds = nbinomWaldTest(dds) |
|
|
1051 |
|
|
|
1052 |
Control features for estimating size factors |
|
|
1053 |
-------------------------------------------- |
|
|
1054 |
|
|
|
1055 |
In some experiments, it may not be appropriate to assume that a minority of |
|
|
1056 |
features (genes) are affected greatly by the condition, such that the standard |
|
|
1057 |
median-ratio method for estimating the size factors will not provide correct |
|
|
1058 |
inference (the log fold changes for features that were truly un-changing will |
|
|
1059 |
not centered on zero). This is a difficult inference problem for any method, but |
|
|
1060 |
there is an important feature that can be used: the :code:`controlGenes` |
|
|
1061 |
argument of :meth:`.DESeqDataSet.estimateSizeFactors`. If there is any prior |
|
|
1062 |
information about features (genes) that should not be changing with respect to |
|
|
1063 |
the condition, providing this set of features to :code:`controlGenes` will |
|
|
1064 |
ensure that the log fold changes for these features will be centered around 0. |
|
|
1065 |
The paradigm then becomes: |
|
|
1066 |
|
|
|
1067 |
>>> dds = dds.estimateSizeFactors(controlGenes=ctrlGenes) |
|
|
1068 |
>>> dds = DESeq(dds) |
|
|
1069 |
|
|
|
1070 |
.. _contrasts: |
|
|
1071 |
|
|
|
1072 |
Contrasts |
|
|
1073 |
--------- |
|
|
1074 |
|
|
|
1075 |
A contrast is a linear combination of estimated log2 fold changes, which can be |
|
|
1076 |
used to test if differences between groups are equal to zero. The simplest use |
|
|
1077 |
case for contrasts is an experimental design containing a factor with three |
|
|
1078 |
levels, say A, B and C. Contrasts enable the user to generate results for all 3 |
|
|
1079 |
possible differences: log2 fold change of B vs A, of C vs A, and of C vs B. The |
|
|
1080 |
:code:`contrast` argument of :meth:`.DESeqDataSet.results` function is used to |
|
|
1081 |
extract test results of log2 fold changes of interest, for example: |
|
|
1082 |
|
|
|
1083 |
>>> dds.results(contrast=["condition","C","B"]) |
|
|
1084 |
|
|
|
1085 |
Log2 fold changes can also be added and subtracted by providing a list to the |
|
|
1086 |
:code:`contrast` argument which has two elements: the names of the log2 fold |
|
|
1087 |
changes to add, and the names of the log2 fold changes to subtract. The names |
|
|
1088 |
used in the list should come from :code:`dds.resultsNames()`. Alternatively, a |
|
|
1089 |
numeric vector of the length of :code:`dds.resultsNames()` can be provided, for |
|
|
1090 |
manually specifying the linear combination of terms. A `tutorial |
|
|
1091 |
<https://github.com/tavareshugo/tutorial_DESeq2_contrasts>`_ describing the use |
|
|
1092 |
of numeric contrasts for DESeq2 explains a general approach to comparing across |
|
|
1093 |
groups of samples. Demonstrations of the use of contrasts for various designs |
|
|
1094 |
can be found in the examples section of the help page |
|
|
1095 |
:meth:`.DESeqDataSet.results`. The mathematical formula that is used to |
|
|
1096 |
generate the contrasts can be found :ref:`below<theory>`. |
|
|
1097 |
|
|
|
1098 |
.. _interactions: |
|
|
1099 |
|
|
|
1100 |
Interactions |
|
|
1101 |
------------ |
|
|
1102 |
|
|
|
1103 |
Interaction terms can be added to the design formula, in order to test, for |
|
|
1104 |
example, if the log2 fold change attributable to a given condition is |
|
|
1105 |
*different* based on another factor, for example if the condition effect differs |
|
|
1106 |
across genotype. |
|
|
1107 |
|
|
|
1108 |
Preliminary remarks |
|
|
1109 |
^^^^^^^^^^^^^^^^^^^ |
|
|
1110 |
|
|
|
1111 |
Many users begin to add interaction terms to the design formula, when in fact a |
|
|
1112 |
much simpler approach would give all the results tables that are desired. We |
|
|
1113 |
will explain this approach first, because it is much simpler to perform. If the |
|
|
1114 |
comparisons of interest are, for example, the effect of a condition for |
|
|
1115 |
different sets of samples, a simpler approach than adding interaction terms |
|
|
1116 |
explicitly to the design formula is to perform the following steps: |
|
|
1117 |
|
|
|
1118 |
* combine the factors of interest into a single factor with all |
|
|
1119 |
combinations of the original factors |
|
|
1120 |
* change the design to include just this factor, *e.g.* :code:`"~ group"` |
|
|
1121 |
|
|
|
1122 |
Using this design is similar to adding an interaction term, in that it models |
|
|
1123 |
multiple condition effects which can be easily extracted with |
|
|
1124 |
:meth:`.DESeqDataSet.results`. Suppose we have two factors :code:`genotype` |
|
|
1125 |
(with values I, II, and III) and :code:`condition` (with values A and B), and we |
|
|
1126 |
want to extract the condition effect specifically for each genotype. We could |
|
|
1127 |
use the following approach to obtain, *e.g.* the condition effect for genotype |
|
|
1128 |
I: |
|
|
1129 |
|
|
|
1130 |
>>> dds.obs["group"] = Factor([f"{g}{c}" for (g,c) in zip(dds.obs["genotype"], dds.obs["condition"])]) |
|
|
1131 |
>>> dds.design = "~ group" |
|
|
1132 |
>>> dds = DESeq(dds) |
|
|
1133 |
>>> dds.resultsNames() |
|
|
1134 |
>>> dds.results(contrast=["group", "IB", "IA"]) |
|
|
1135 |
|
|
|
1136 |
|
|
|
1137 |
Adding interactions to the design |
|
|
1138 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
1139 |
|
|
|
1140 |
The following two plots diagram genotype-specific condition effects, which could |
|
|
1141 |
be modeled with interaction terms by using a design of :code:`~genotype + |
|
|
1142 |
condition + genotype:condition`. |
|
|
1143 |
|
|
|
1144 |
In the first plot (Gene 1), note that the condition effect is consistent across |
|
|
1145 |
genotypes. Although condition A has a different baseline for I, II, and III, the |
|
|
1146 |
condition effect is a log2 fold change of about 2 for each genotype. Using a |
|
|
1147 |
model with an interaction term :code:`genotype:condition`, the interaction terms |
|
|
1148 |
for genotype II and genotype III will be nearly 0. |
|
|
1149 |
|
|
|
1150 |
Here, the y-axis represents :math:`\log(n+1)`, and each group has 20 samples |
|
|
1151 |
(black dots). A red line connects the mean of the groups within each genotype. |
|
|
1152 |
|
|
|
1153 |
.. repl-quiet:: |
|
|
1154 |
from inmoose.utils import rnbinom |
|
|
1155 |
import seaborn as sns |
|
|
1156 |
|
|
|
1157 |
npg = 20 |
|
|
1158 |
mu = 2**np.array([8,10,9,11,10,12]) |
|
|
1159 |
cond = np.repeat([np.repeat(["A","B"], npg)], 3, axis=0).flatten() |
|
|
1160 |
geno = np.repeat(["I","II","III"], 2*npg) |
|
|
1161 |
#table(cond, geno) |
|
|
1162 |
counts = rnbinom(6*npg, mu = np.repeat(mu, npg), size=1/.01) |
|
|
1163 |
d = pd.DataFrame({"log2c": np.log2(counts+1), "cond": cond, "geno": geno}) |
|
|
1164 |
|
|
|
1165 |
def plotit(d, title): |
|
|
1166 |
g = sns.FacetGrid(d, col="geno") |
|
|
1167 |
g.map(sns.stripplot, "cond", "log2c", color="black") |
|
|
1168 |
g.map(sns.pointplot, "cond", "log2c", color="red") |
|
|
1169 |
g.fig.subplots_adjust(top=0.9) |
|
|
1170 |
g.fig.suptitle(title) |
|
|
1171 |
plt.show() |
|
|
1172 |
|
|
|
1173 |
plotit(d, "Gene 1") |
|
|
1174 |
|
|
|
1175 |
In the second plot (Gene 2), we can see that the condition effect is not |
|
|
1176 |
consistent across genotype. Here the main condition effect (the effect for the |
|
|
1177 |
reference genotype I) is again 2. However, this time the interaction terms will |
|
|
1178 |
be around 1 for genotype II and -4 for genotype III. This is because the |
|
|
1179 |
condition effect is higher by 1 for genotype II compared to genotype I, and |
|
|
1180 |
lower by 4 for genotype III compared to genotype I. The condition effect for |
|
|
1181 |
genotype II (or III) is obtained by adding the main condition effect and the |
|
|
1182 |
interaction term for that genotype. Such a plot can be made using the |
|
|
1183 |
:func:`plotCounts` function as shown above. |
|
|
1184 |
|
|
|
1185 |
.. repl-quiet:: |
|
|
1186 |
mu[3] = 2**12 |
|
|
1187 |
mu[5] = 2**8 |
|
|
1188 |
counts = rnbinom(6*npg, mu=np.repeat(mu, npg), size=1/.01) |
|
|
1189 |
d2 = pd.DataFrame({"log2c": np.log2(counts+1), "cond": cond, "geno": geno}) |
|
|
1190 |
plotit(d2, "Gene 2") |
|
|
1191 |
|
|
|
1192 |
Now we will continue to explain the use of interactions in order to test for |
|
|
1193 |
*differences* in condition effects. We continue with the example of condition |
|
|
1194 |
effects across three genotypes (I, II, and III). |
|
|
1195 |
|
|
|
1196 |
The key point to remember about designs with interaction terms is that, unlike |
|
|
1197 |
for a design :code:`~genotype + condition`, where the condition effect |
|
|
1198 |
represents the *overall* effect controlling for differences due to genotype, by |
|
|
1199 |
adding :code:`genotype:condition`, the main condition effect only represents the |
|
|
1200 |
effect of condition for the *reference level* of genotype (I, or whichever level |
|
|
1201 |
was defined by the user as the reference level). The interaction terms |
|
|
1202 |
:code:`genotypeII.conditionB` and :code:`genotypeIII.conditionB` give the |
|
|
1203 |
*difference* between the condition effect for a given genotype and the condition |
|
|
1204 |
effect for the reference genotype. |
|
|
1205 |
|
|
|
1206 |
This genotype-condition interaction example is examined in further detail in |
|
|
1207 |
Example 3 in the help page for :meth:`.DESeqDataSet.results`. In particular, we |
|
|
1208 |
show how to test for differences in the condition effect across genotype, and we |
|
|
1209 |
show how to obtain the condition effect for non-reference genotypes. |
|
|
1210 |
|
|
|
1211 |
Time-series experiments |
|
|
1212 |
----------------------- |
|
|
1213 |
|
|
|
1214 |
There are a number of ways to analyze time-series experiments, depending on the |
|
|
1215 |
biological question of interest. In order to test for any differences over |
|
|
1216 |
multiple time points, once can use a design including the time factor, and then |
|
|
1217 |
test using the likelihood ratio test as described in the following section, |
|
|
1218 |
where the time factor is removed in the reduced formula. For a control and |
|
|
1219 |
treatment time series, one can use a design formula containing the condition |
|
|
1220 |
factor, the time factor, and the interaction of the two. In this case, using the |
|
|
1221 |
likelihood ratio test with a reduced model which does not contain the |
|
|
1222 |
interaction terms will test whether the condition induces a change in gene |
|
|
1223 |
expression at any time point after the reference level time point (time 0). An |
|
|
1224 |
example of the later analysis is provided in the `RNA-seq workflow |
|
|
1225 |
<http://www.bioconductor.org/help/workflows/rnaseqGene>`_. |
|
|
1226 |
|
|
|
1227 |
Likelihood ratio test |
|
|
1228 |
--------------------- |
|
|
1229 |
|
|
|
1230 |
DESeq2 offers two kinds of hypothesis tests: the Wald test, where we use the |
|
|
1231 |
estimated standard error of a log2 fold change to test if it is equal to zero, |
|
|
1232 |
and the likelihood ratio test (LRT). The LRT examines two models for the counts, |
|
|
1233 |
a *full* model with a certain number of terms and a *reduced* model, in which |
|
|
1234 |
some of the terms of the *full* model are removed. The test determines if the |
|
|
1235 |
increased likelihood of the data using the extra terms in the *full* model is |
|
|
1236 |
more than expected if those extra terms are truly zero. |
|
|
1237 |
|
|
|
1238 |
The LRT is therefore useful for testing multiple terms at once, for example |
|
|
1239 |
testing 3 or more levels of a factor at once, or all interactions between two |
|
|
1240 |
variables. The LRT for count data is conceptually similar to an analysis of |
|
|
1241 |
variance (ANOVA) calculation in linear regression, except that in the case of |
|
|
1242 |
the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the |
|
|
1243 |
*deviance* captures the difference in likelihood between a full and a reduced |
|
|
1244 |
model. |
|
|
1245 |
|
|
|
1246 |
The likelihood ratio test can be performed by specifying :code:`test="LRT"` when |
|
|
1247 |
using the :func:`DESeq` function, and providing a reduced design formula, *e.g.* |
|
|
1248 |
one in which a number of terms from :code:`dds.design` are removed. The degrees |
|
|
1249 |
of freedom for the test is obtained from the difference between the number of |
|
|
1250 |
parameters in the two models. A simple likelihood ratio test, if the full |
|
|
1251 |
design was :code:`~condition` would look like: |
|
|
1252 |
|
|
|
1253 |
>>> dds = DESeq(dds, test="LRT", reduced="~1") |
|
|
1254 |
>>> res = dds.results() |
|
|
1255 |
|
|
|
1256 |
If the full design contained other variables, such as a batch variable, *e.g.* |
|
|
1257 |
:code:`~batch + condition` then the likelihood ratio test would look like: |
|
|
1258 |
|
|
|
1259 |
>>> dds = DESeq(dds, test="LRT", reduced="~batch") |
|
|
1260 |
>>> res = dds.results() |
|
|
1261 |
|
|
|
1262 |
.. _moreshrink: |
|
|
1263 |
|
|
|
1264 |
Extended section on shrinkage estimators |
|
|
1265 |
---------------------------------------- |
|
|
1266 |
|
|
|
1267 |
Here we extend the :ref:`discussion of shrinkage estimators<shrink>`. Below is |
|
|
1268 |
a summary table of differences between methods available in :func:`lfcShrink` |
|
|
1269 |
via the :code:`type` argument (and for further technical reference on use of |
|
|
1270 |
arguments please the documentation of :func:`lfcShrink`): |
|
|
1271 |
|
|
|
1272 |
.. list-table:: |
|
|
1273 |
:header-rows: 1 |
|
|
1274 |
:stub-columns: 1 |
|
|
1275 |
|
|
|
1276 |
* - method |
|
|
1277 |
- :code:`apeglm` [Zhu2018]_ |
|
|
1278 |
- :code:`ashr` [Stephens2016]_ |
|
|
1279 |
- :code:`normal` [Love2014]_ |
|
|
1280 |
* - Good for ranking by LFC |
|
|
1281 |
- ✓ |
|
|
1282 |
- ✓ |
|
|
1283 |
- ✓ |
|
|
1284 |
* - Preserves size of large LFC |
|
|
1285 |
- ✓ |
|
|
1286 |
- ✓ |
|
|
1287 |
- |
|
|
1288 |
* - Can compute *s*-values [Stephens2016]_ |
|
|
1289 |
- ✓ |
|
|
1290 |
- ✓ |
|
|
1291 |
- |
|
|
1292 |
* - Allows use of :code:`coef` |
|
|
1293 |
- ✓ |
|
|
1294 |
- ✓ |
|
|
1295 |
- ✓ |
|
|
1296 |
* - Allows use of :code:`lfcThreshold` |
|
|
1297 |
- ✓ |
|
|
1298 |
- ✓ |
|
|
1299 |
- ✓ |
|
|
1300 |
* - Allows use of :code:`contrast` |
|
|
1301 |
- ✓ |
|
|
1302 |
- ✓ |
|
|
1303 |
- |
|
|
1304 |
* - Can shrink interaction terms |
|
|
1305 |
- ✓ |
|
|
1306 |
- ✓ |
|
|
1307 |
- |
|
|
1308 |
|
|
|
1309 |
Beginning with the first row, all shrinkage methods provided by DESeq2 are good |
|
|
1310 |
for ranking genes by "effect size", that is the log2 fold change (LFC) across |
|
|
1311 |
groups, or associated with an interaction term. It is useful to contrast ranking |
|
|
1312 |
by effect size with ranking by a *p*-value or adjusted *p*-value associated with |
|
|
1313 |
a null hypothesis: while increasing the number of samples will tend to decrease |
|
|
1314 |
the associated *p*-value for a gene that is differentially expressed, the |
|
|
1315 |
estimated effect size or LFC becomes more precise. Also, a gene can have a small |
|
|
1316 |
*p*-value although the change in expression is not great, as long as the |
|
|
1317 |
standard error associated with the estimated LFC is small. |
|
|
1318 |
|
|
|
1319 |
The next two rows point out that :code:`apeglm` and :code:`ashr` shrinkage |
|
|
1320 |
methods help to preserve the size of large LFC, and can be used to compute |
|
|
1321 |
*s-values*. These properties are related. As noted in the :ref:`previous |
|
|
1322 |
section<shrink>`, the original DESeq2 shrinkage estimator used a Normal |
|
|
1323 |
distribution, with a scale that adapts to the spread of the observed LFCs. |
|
|
1324 |
Because the tails of the Normal distribution become thin relatively quickly, it |
|
|
1325 |
was important when we designed the method that the prior scaling is sensitive to |
|
|
1326 |
the very largest observed LFCs. As you can read in the DESeq2 paper, under the |
|
|
1327 |
section, "*Empirical prior estimate*", we used the top 5% of the LFCs by |
|
|
1328 |
absolute value to set the scale of the Normal prior (we later added weighting |
|
|
1329 |
the quantile by precision). :code:`ashr`, published in 2016, and :code:`apeglm` |
|
|
1330 |
use wide-tailed priors to avoid shrinking large LFCs. While a typical RNA-seq |
|
|
1331 |
experiment may have many LFCs between -1 and 1, we might consider a LFC of >4 to |
|
|
1332 |
be very large, as they represent 16-fold increases or decreases in expression. |
|
|
1333 |
:code:`ashr` and :code:`apeglm` can adapt to the scale of the entirety of LFCs, |
|
|
1334 |
while not over-shrinking the few largest LFCs. The potential for over-shrinking |
|
|
1335 |
LFC is also why DESeq2's shrinkage estimator is not recommended for designs with |
|
|
1336 |
interaction terms. |
|
|
1337 |
|
|
|
1338 |
What are *s-values*? This quantity proposed by [Stephens2016]_ gives the |
|
|
1339 |
estimated rate of *false sign* among genes with equal or smaller *s*-value. |
|
|
1340 |
[Stephens2016]_ points out they are analogous to the *q*-value of [Storey2003]_. |
|
|
1341 |
The *s*-value has a desirable property relative to the adjusted *p*-value or |
|
|
1342 |
*q*-value, in that it does not require supposing there be a set of null genes |
|
|
1343 |
with LFC = 0 (the most commonly used null hypothesis). Therefore, it can be |
|
|
1344 |
benchmarked by comparing estimated LFC and *s*-value to the "true LFC" in a |
|
|
1345 |
setting where this can be reasonably defined. For these estimated probabilities |
|
|
1346 |
to be accurate, the scale of the prior needs to match the scale of the |
|
|
1347 |
distribution of effect sizes, and so the original DESeq2 shrinkage method is not |
|
|
1348 |
really compatible with computing *s*-values. |
|
|
1349 |
|
|
|
1350 |
The last four rows explain differences in whether coefficients or contrasts can |
|
|
1351 |
have shrinkage applied by the various methods. All three methods can use |
|
|
1352 |
:code:`coef` with either the name or numeric index from |
|
|
1353 |
:code:`dds.resultsNames()` to specify which coefficient to shrink. All three |
|
|
1354 |
methods allow for a positive :code:`lfcThreshold` to be specified, in which |
|
|
1355 |
case, they will return *p*-values and adjusted *p*-values or *s*-values for the |
|
|
1356 |
LFC being greater in absolute value than the threshold (see :ref:`this |
|
|
1357 |
section<thresh>` for :code:`normal`). For :code:`apeglm` and :code:`ashr`, |
|
|
1358 |
setting a threshold means that the *s*-values will give the "false sign or |
|
|
1359 |
small" rate (FSOS) among genes with equal or small *s*-value. We found FSOS to |
|
|
1360 |
be a useful description for when the LFC is either the wrong sign or less than |
|
|
1361 |
the threshold distance from 0. |
|
|
1362 |
|
|
|
1363 |
TODO |
|
|
1364 |
|
|
|
1365 |
.. |
|
|
1366 |
```{r apeThresh} |
|
|
1367 |
resApeT <- lfcShrink(dds, coef=2, type="apeglm", lfcThreshold=1) |
|
|
1368 |
plotMA(resApeT, ylim=c(-3,3), cex=.8) |
|
|
1369 |
abline(h=c(-1,1), col="dodgerblue", lwd=2) |
|
|
1370 |
``` |
|
|
1371 |
|
|
|
1372 |
```{r ashThresh} |
|
|
1373 |
resAshT <- lfcShrink(dds, coef=2, type="ashr", lfcThreshold=1) |
|
|
1374 |
plotMA(resAshT, ylim=c(-3,3), cex=.8) |
|
|
1375 |
abline(h=c(-1,1), col="dodgerblue", lwd=2) |
|
|
1376 |
``` |
|
|
1377 |
|
|
|
1378 |
Finally, :code:`normal` and :code:`ashr` can be used with arbitrary specified |
|
|
1379 |
:code:`contrast` because :code:`normal` shrinks multiple coefficients |
|
|
1380 |
simultaneously (:code:`apeglm` does not), and because :code:`ashr` does not |
|
|
1381 |
estimate a vector of coefficients but models estimated coefficients and their |
|
|
1382 |
standard errors from upstream methods (here, DESeq2's MLE). Although |
|
|
1383 |
:code:`apeglm` cannot be used with :code:`contrast`, we note that many designs |
|
|
1384 |
can be easily rearranged such that what was a contrast becomes its own |
|
|
1385 |
coefficient. In this case, the dispersion does not have to be estimated again, |
|
|
1386 |
as the designs are equivalent, up to the meaning of the coefficients. Instead, |
|
|
1387 |
one need only run :code:`nbinomWaldTest` to re-estimate MLE coefficients -- |
|
|
1388 |
these are necessary for :code:`apeglm` -- and then run :code:`lfcShrink` |
|
|
1389 |
specifying the coefficient of interest in `dds.resultsNames()`. |
|
|
1390 |
|
|
|
1391 |
We give some examples below of producing equivalent designs for use with |
|
|
1392 |
:code:`coef`. We show how the coefficients change with :code:`patsy.dmatrix`, |
|
|
1393 |
but the user would, for example, either change the levels of `dds.obs.condition` |
|
|
1394 |
or replace the :attr:`.DESeqDataSet.design`, then run :func:`nbinomWaldTest` |
|
|
1395 |
followed by :func:`lfcShrink`. |
|
|
1396 |
|
|
|
1397 |
Three groups: |
|
|
1398 |
|
|
|
1399 |
.. repl:: |
|
|
1400 |
condition = Factor(["A", "A", "B", "B", "C", "C"]) |
|
|
1401 |
patsy.dmatrix("~condition") |
|
|
1402 |
# to compare C vs B, make B the reference level, |
|
|
1403 |
# and select the last coefficient |
|
|
1404 |
condition = condition.reorder_categories(["B", "A", "C"]) |
|
|
1405 |
patsy.dmatrix("~condition") |
|
|
1406 |
|
|
|
1407 |
Three groups, compare condition effects: |
|
|
1408 |
|
|
|
1409 |
.. repl:: |
|
|
1410 |
grp = Factor([1,1,1,1,2,2,2,2,3,3,3,3]) |
|
|
1411 |
cnd = Factor(["A","A","B","B","A","A","B","B","A","A","B","B"]) |
|
|
1412 |
patsy.dmatrix("~ grp + cnd + grp:cnd") |
|
|
1413 |
# to compare condition effect in group 3 vs 2, |
|
|
1414 |
# make group 2 the reference level, |
|
|
1415 |
# and select the last coefficient |
|
|
1416 |
grp = grp.reorder_categories([2,1,3]) |
|
|
1417 |
patsy.dmatrix("~ grp + cnd + grp:cnd") |
|
|
1418 |
|
|
|
1419 |
Two groups, two individuals per group, compare within-individual condition |
|
|
1420 |
effects: |
|
|
1421 |
|
|
|
1422 |
.. repl:: |
|
|
1423 |
grp = Factor([1,1,1,1,2,2,2,2]) |
|
|
1424 |
ind = Factor([1,1,2,2,1,1,2,2]) |
|
|
1425 |
cnd = Factor(["A","B","A","B","A","B","A","B"]) |
|
|
1426 |
patsy.dmatrix("~ grp + grp:ind + grp:cnd") |
|
|
1427 |
# to compare condition effect across group, |
|
|
1428 |
# add a main effect for 'cnd', |
|
|
1429 |
# and select the last coefficient |
|
|
1430 |
patsy.dmatrix("~ grp + cnd + grp:ind + grp:cnd") |
|
|
1431 |
|
|
|
1432 |
.. _singlecell: |
|
|
1433 |
|
|
|
1434 |
Recommendations for single-cell analysis |
|
|
1435 |
---------------------------------------- |
|
|
1436 |
|
|
|
1437 |
The DESeq2 developers and collaborating groups have published recommendations |
|
|
1438 |
for the best use of DESeq2 for single-cell datasets, which have been described |
|
|
1439 |
first in [Berge2018]_. Default values for DESeq2 were designed for bulk data and |
|
|
1440 |
will not be appropriate for single-cell datasets. These settings and additional |
|
|
1441 |
improvements have also been tested subsequently and published in [Zhu2018]_ and |
|
|
1442 |
[AhlmannEltze2020]_. |
|
|
1443 |
|
|
|
1444 |
* Use :code:`test="LRT"` for significance testing when working with single-cell |
|
|
1445 |
data, over the Wald test. This has been observed across multiple single-cell |
|
|
1446 |
benchmarks. |
|
|
1447 |
* Set the following :func:`DESeq` arguments to these values: :code:`useT=TRUE`, |
|
|
1448 |
:code:`minmu=1e-6`, and :code:`minReplicatesForReplace=Inf`. The default |
|
|
1449 |
setting of :code:`minmu` was benchmarked on bulk RNA-seq and is not |
|
|
1450 |
appropriate for single cell data when the expected count is often much less |
|
|
1451 |
than 1. |
|
|
1452 |
* The default size factors are not optimal for single cell count matrices, |
|
|
1453 |
instead consider setting :attr:`.DESeqDataSet.sizeFactors` from |
|
|
1454 |
:code:`scran::computeSumFactors`. |
|
|
1455 |
* One important concern for single-cell data analysis is the size of the |
|
|
1456 |
datasets and associated processing time. To address the speed concerns, DESeq2 |
|
|
1457 |
provides an interface to `glmGamPoi |
|
|
1458 |
<https://bioconductor.org/packages/glmGamPoi/>`_, which implements faster |
|
|
1459 |
dispersion and parameter estimation routines for single-cell data |
|
|
1460 |
[AhlmannEltze2020]_. To use this feature, set :code:`fitType = "glmGamPoi"`. |
|
|
1461 |
Alternatively, one can use *glmGamPoi* as a standalone package. This |
|
|
1462 |
provides the additional option to process data on-disk if the full dataset |
|
|
1463 |
does not fit in memory, a quasi-likelihood framework for differential testing, |
|
|
1464 |
and the ability to form pseudobulk samples (more details how to use |
|
|
1465 |
*glmGamPoi* are in its `README <https://github.com/const-ae/glmGamPoi>`_). |
|
|
1466 |
|
|
|
1467 |
Optionally, one can consider using the `zinbwave |
|
|
1468 |
<https://bioconductor.org/packages/zinbwave>`_ package to directly model the |
|
|
1469 |
zero inflation of the counts, and take account of these in the DESeq2 model. |
|
|
1470 |
This allows for the DESeq2 inference to apply to the part of the data which is |
|
|
1471 |
not due to zero inflation. Not all single cell datasets exhibit zero inflation, |
|
|
1472 |
and instead may just reflect low conditional estimated counts (conditional on |
|
|
1473 |
cell type or cell state). There is example code for combining *zinbwave* and |
|
|
1474 |
*DESeq2* package functions in the *zinbwave* vignette. We also have an example |
|
|
1475 |
of ZINB-WaVE + DESeq2 integration using the `splatter |
|
|
1476 |
<https://bioconductor.org/packages/splatter>`_ package for simulation at the |
|
|
1477 |
`zinbwave-deseq2 <https://github.com/mikelove/zinbwave-deseq2>`_ GitHub |
|
|
1478 |
repository. |
|
|
1479 |
|
|
|
1480 |
.. _outlier: |
|
|
1481 |
|
|
|
1482 |
Approach to count outliers |
|
|
1483 |
-------------------------- |
|
|
1484 |
|
|
|
1485 |
RNA-seq data sometimes contain isolated instances of very large counts that are |
|
|
1486 |
apparently unrelated to the experimental or study design, and which may be |
|
|
1487 |
considered outliers. There are many reasons why outliers can arise, including |
|
|
1488 |
rare technical or experimental artifacts, read mapping problems in the case of |
|
|
1489 |
genetically differing samples, and genuine, but rare biological events. In many |
|
|
1490 |
cases, users appear primarily interested in genes that show a consistent |
|
|
1491 |
behavior, and this is the reason why by default, genes that are affected by such |
|
|
1492 |
outliers are set aside by DESeq2, or if there are sufficient samples, outlier |
|
|
1493 |
counts are replaced for model fitting. These two behaviors are described below. |
|
|
1494 |
|
|
|
1495 |
The :func:`DESeq` function calculates, for every gene and for every sample, a |
|
|
1496 |
diagnostic test for outliers called *Cook's distance*. Cook's distance is a |
|
|
1497 |
measure of how much a single sample is influencing the fitted coefficients for a |
|
|
1498 |
gene, and a large value of Cook's distance is intended to indicate an outlier |
|
|
1499 |
count. The Cook's distances are stored as a matrix available in |
|
|
1500 |
:code:`dds.layers["cooks"]`. |
|
|
1501 |
|
|
|
1502 |
The :meth:`.DESeqDataSet.results` function automatically flags genes which |
|
|
1503 |
contain a Cook's distance above a cutoff for samples which have 3 or more |
|
|
1504 |
replicates. The *p*-values and adjusted *p*-values for these genes are set to |
|
|
1505 |
:code:`NA`. At least 3 replicates are required for flagging, as it is difficult |
|
|
1506 |
to judge which sample might be an outlier with only 2 replicates. This |
|
|
1507 |
filtering can be turned off with :code:`dds.results(cooksCutoff=FALSE)`. |
|
|
1508 |
|
|
|
1509 |
With many degrees of freedom -- *i.e.* many more samples than number of |
|
|
1510 |
parameters to be estimated -- it is undesirable to remove entire genes from the |
|
|
1511 |
analysis just because their data include a single count outlier. When there are |
|
|
1512 |
7 or more replicates for a given sample, the :func:`DESeq` function will |
|
|
1513 |
automatically replace counts with large Cook's distance with the trimmed mean |
|
|
1514 |
over all samples, scaled up by the size factor or normalization factor for that |
|
|
1515 |
sample. This approach is conservative, it will not lead to false positives, as |
|
|
1516 |
it replaces the outlier value with the value predicted by the null hypothesis. |
|
|
1517 |
This outlier replacement only occurs when there are 7 or more replicates, and |
|
|
1518 |
can be turned off with :code:`DESeq(dds, minReplicatesForReplace=Inf)`. |
|
|
1519 |
|
|
|
1520 |
The default Cook's distance cutoff for the two behaviors described above depends |
|
|
1521 |
on the sample size and number of parameters to be estimated. The default is to |
|
|
1522 |
use the 99% quantile of the :math:`F(p,m-p)` distribution (with :math:`p` the |
|
|
1523 |
number of parameters including the intercept and :math:`m` the number of |
|
|
1524 |
samples). The default for gene flagging can be modified using the |
|
|
1525 |
:code:`cooksCutoff` argument to the :meth:`.DESeqDataSet.results` function. For |
|
|
1526 |
outlier replacement, :func:`DESeq` preserves the original counts in |
|
|
1527 |
:code:`dds.counts()` saving the replacement counts as a matrix named |
|
|
1528 |
:code:`"replaceCounts"` in :code:`dds.layers`. Note that with continuous |
|
|
1529 |
variables in the design, outlier detection and replacement is not automatically |
|
|
1530 |
performed, as our current methods involve a robust estimation of within-group |
|
|
1531 |
variance which does not extend easily to continuous covariates. However, users |
|
|
1532 |
can examine the Cook's distances in :code:`dds.layers["cooks"]`, in order to |
|
|
1533 |
perform manual visualization and filtering if necessary. |
|
|
1534 |
|
|
|
1535 |
.. note:: |
|
|
1536 |
If there are very many outliers (*e.g.* many hundreds or thousands) reported |
|
|
1537 |
by :code:`res.summary()`, one might consider further exploration to see if a |
|
|
1538 |
single sample or a few samples should be removed due to low quality. The |
|
|
1539 |
automatic outlier filtering/replacement is most useful in situations which |
|
|
1540 |
the number of outliers is limited. When there are thousands of reported |
|
|
1541 |
outliers, it might make more sense to turn off the outlier |
|
|
1542 |
filtering/replacement (:func:`DESeq` with |
|
|
1543 |
:code:`minReplicatesForReplace=np.inf` and :meth:`.DESeqDataSet.results` with |
|
|
1544 |
:code:`cooksCutoff=FALSE`) and perform manual inspection: |
|
|
1545 |
|
|
|
1546 |
- first it would be advantageous to make a PCA plot as described above to |
|
|
1547 |
spot individual sample outliers; |
|
|
1548 |
- second, one can make a boxplot of the Cook's distances to see if one sample |
|
|
1549 |
is consistently higher than others (here this is not the case): |
|
|
1550 |
|
|
|
1551 |
.. repl:: |
|
|
1552 |
sns.boxplot(pd.DataFrame(np.log10(dds.layers["cooks"].T), |
|
|
1553 |
columns=dds.obs_names)) |
|
|
1554 |
plt.xticks(rotation=25) |
|
|
1555 |
plt.show() |
|
|
1556 |
|
|
|
1557 |
Dispersion plot and fitting alternatives |
|
|
1558 |
---------------------------------------- |
|
|
1559 |
|
|
|
1560 |
Plotting the dispersion estimates is a useful diagnostic. The dispersion plot |
|
|
1561 |
below is typical, with the final estimates shrunk from the gene-wise estimates |
|
|
1562 |
towards the fitted estimates. Some gene-wise estimates are flagged as outliers |
|
|
1563 |
and not shrunk towards the fitted value, (this outlier detection is described in |
|
|
1564 |
the manual page for :func:`estimateDispersionsMAP`). The amount |
|
|
1565 |
of shrinkage can be more or less than seen here, depending on the sample size, |
|
|
1566 |
the number of coefficients, the row mean and the variability of the gene-wise |
|
|
1567 |
estimates. |
|
|
1568 |
|
|
|
1569 |
.. repl:: |
|
|
1570 |
dds.plotDispEsts() |
|
|
1571 |
|
|
|
1572 |
Local or mean dispersion fit |
|
|
1573 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
1574 |
|
|
|
1575 |
A local smoothed dispersion fit is automatically substituted in the case that |
|
|
1576 |
the parametric curve does not fit the observed dispersion mean relationship. |
|
|
1577 |
This can be prespecified by providing the argument :code:`fitType="local"` to |
|
|
1578 |
either :func:`DESeq` or :meth:`.DESeqDataSet.estimateDispersions`. |
|
|
1579 |
Additionally, using the mean of gene-wise disperion estimates as the fitted |
|
|
1580 |
value can be specified by providing the argument :code:`fitType="mean"`. |
|
|
1581 |
|
|
|
1582 |
Supply a custom dispersion fit |
|
|
1583 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
1584 |
|
|
|
1585 |
Any fitted values can be provided during dispersion estimation, using the |
|
|
1586 |
lower-level functions described in the manual page for |
|
|
1587 |
:func:`estimateDispersionsGeneEst`. In the code chunk below, we store the |
|
|
1588 |
gene-wise estimates which were already calculated and saved in the metadata |
|
|
1589 |
column :code:`dispGeneEst`. Then we calculate the median value of the dispersion |
|
|
1590 |
estimates above a threshold, and save these values as the fitted dispersions, |
|
|
1591 |
using the replacement function for :attr:`.DESeqDataSet.dispersionFunction`. In |
|
|
1592 |
the last line, the function :func:`estimateDispersionsMAP`, uses the fitted |
|
|
1593 |
dispersions to generate maximum *a posteriori* (MAP) estimates of dispersion: |
|
|
1594 |
|
|
|
1595 |
.. repl:: |
|
|
1596 |
ddsCustom = dds.copy() |
|
|
1597 |
useForMedian = ddsCustom.var["dispGeneEst"] > 1e-7 |
|
|
1598 |
medianDisp = np.nanmedian(ddsCustom.var["dispGeneEst"][useForMedian]) |
|
|
1599 |
ddsCustom.setDispFunction(lambda mu: medianDisp) |
|
|
1600 |
ddsCustom = estimateDispersionsMAP(ddsCustom) |
|
|
1601 |
|
|
|
1602 |
.. _indfilt: |
|
|
1603 |
|
|
|
1604 |
Independent filtering of results |
|
|
1605 |
-------------------------------- |
|
|
1606 |
|
|
|
1607 |
The :meth:`.DESeqDataSet.results` function of the DESeq2 package performs |
|
|
1608 |
independent filtering by default using the mean of normalized counts as a filter |
|
|
1609 |
statistic. A threshold on the filter statistic is found which optimizes the |
|
|
1610 |
number of adjusted *p*-values lower than a significance level :code:`alpha` (we |
|
|
1611 |
use the standard variable name for significance level, though it is unrelated to |
|
|
1612 |
the dispersion parameter :math:`\alpha`). The theory behind independent |
|
|
1613 |
filtering is discussed in greater detail :ref:`below<indfilttheory>`. The |
|
|
1614 |
adjusted *p*-values for the genes which do not pass the filter threshold are set |
|
|
1615 |
to :code:`NA`. |
|
|
1616 |
|
|
|
1617 |
The default independent filtering is performed using the |
|
|
1618 |
:func:`~.results.filtered_p` function of the `genefilter |
|
|
1619 |
<http://bioconductor.org/packages/genefilter>`_ package, and all of the |
|
|
1620 |
arguments of :func:`~.results.filtered_p` can be passed to the |
|
|
1621 |
:meth:`.DESeqDataSet.results` function. The filter threshold value and the |
|
|
1622 |
number of rejections at each quantile of the filter statistic are available as |
|
|
1623 |
metadata of the object returned by :meth:`.DESeqDataSet.results`. |
|
|
1624 |
|
|
|
1625 |
For example, we can visualize the optimization by plotting the |
|
|
1626 |
:code:`filterNumRej` attribute of the results object. The |
|
|
1627 |
:meth:`.DESeqDataSet.results` function maximizes the number of rejections |
|
|
1628 |
(adjusted *p*-value less than a significance level), over the quantiles of a |
|
|
1629 |
filter statistic (the mean of normalized counts). The threshold chosen (vertical |
|
|
1630 |
line) is the lowest quantile of the filter for which the number of rejections is |
|
|
1631 |
within 1 residual standard deviation to the peak of a curve fit to the number of |
|
|
1632 |
rejections over the filter quantiles: |
|
|
1633 |
|
|
|
1634 |
.. repl:: |
|
|
1635 |
res.alpha |
|
|
1636 |
res.filterThreshold |
|
|
1637 |
plt.plot(res.filterNumRej["theta"], |
|
|
1638 |
res.filterNumRej["numRej"], |
|
|
1639 |
'o', color="black") |
|
|
1640 |
plt.xlabel("quantiles of filter") |
|
|
1641 |
plt.ylabel("number of rejections") |
|
|
1642 |
plt.plot(res.lo_fit[:,0], res.lo_fit[:,1], color="red") |
|
|
1643 |
plt.axvline(x=res.filterTheta, color="black") |
|
|
1644 |
plt.show() |
|
|
1645 |
|
|
|
1646 |
Independent filtering can be turned off by setting :code:`independentFiltering` |
|
|
1647 |
to :code:`FALSE`. |
|
|
1648 |
|
|
|
1649 |
.. repl:: |
|
|
1650 |
resNoFilt = dds.results(independentFiltering=False) |
|
|
1651 |
df = pd.DataFrame({"filtering": (res.padj < .1), |
|
|
1652 |
"noFiltering": (resNoFilt.padj < .1)}) |
|
|
1653 |
df.groupby(["filtering", "noFiltering"]).size() |
|
|
1654 |
|
|
|
1655 |
|
|
|
1656 |
.. _thresh: |
|
|
1657 |
|
|
|
1658 |
Tests of log2 fold change above or below a threshold |
|
|
1659 |
---------------------------------------------------- |
|
|
1660 |
|
|
|
1661 |
It is also possible to provide thresholds for constructing Wald tests of |
|
|
1662 |
significance. Two arguments to the :meth:`.DESeqDataSet.results` function allow |
|
|
1663 |
for threshold-based Wald tests: :code:`lfcThreshold`, which takes a numeric of a |
|
|
1664 |
non-negative threshold value, and :code:`altHypothesis`, which specifies the |
|
|
1665 |
kind of test. Note that the *alternative hypothesis* is specified by the user, |
|
|
1666 |
*i.e.* those genes which the user is interested in finding, and the test |
|
|
1667 |
provides *p*-values for the null hypothesis, the complement of the set defined |
|
|
1668 |
by the alternative. The :code:`altHypothesis` argument can take one of the |
|
|
1669 |
following four values, where :math:`\beta` is the log2 fold change specified by |
|
|
1670 |
the :code:`name` argument, and :math:`x` is the :code:`lfcThreshold`. |
|
|
1671 |
|
|
|
1672 |
* `greaterAbs` - :math:`|\beta| > x` - tests are two-tailed |
|
|
1673 |
* `lessAbs` - :math:`|\beta| < x` - *p*-values are the maximum of the upper and lower tests |
|
|
1674 |
* `greater` - :math:`\beta > x` |
|
|
1675 |
* `less` - :math:`\beta < -x` |
|
|
1676 |
|
|
|
1677 |
The four possible values of :code:`altHypothesis` are demonstrated in the |
|
|
1678 |
following code and visually by MA-plots in the following figures. |
|
|
1679 |
|
|
|
1680 |
.. repl:: |
|
|
1681 |
ylim = [-2.5, 2.5] |
|
|
1682 |
resGA = dds.results(lfcThreshold=.5, altHypothesis="greaterAbs") |
|
|
1683 |
resLA = dds.results(lfcThreshold=.5, altHypothesis="lessAbs") |
|
|
1684 |
resG = dds.results(lfcThreshold=.5, altHypothesis="greater") |
|
|
1685 |
resL = dds.results(lfcThreshold=.5, altHypothesis="less") |
|
|
1686 |
|
|
|
1687 |
def drawlines(ax): |
|
|
1688 |
ax.axhline(y=-.5, c="dodgerblue") |
|
|
1689 |
ax.axhline(y=.5, c="dodgerblue") |
|
|
1690 |
plt.show() |
|
|
1691 |
|
|
|
1692 |
drawlines(resGA.plotMA(ylim=ylim)) |
|
|
1693 |
drawlines(resLA.plotMA(ylim=ylim)) |
|
|
1694 |
drawlines(resG.plotMA(ylim=ylim)) |
|
|
1695 |
drawlines(resL.plotMA(ylim=ylim)) |
|
|
1696 |
|
|
|
1697 |
.. _access: |
|
|
1698 |
|
|
|
1699 |
Access to all calculated values |
|
|
1700 |
------------------------------- |
|
|
1701 |
|
|
|
1702 |
All row-wise calculated values (intermediate dispersion calculations, |
|
|
1703 |
coefficients, standard errors, etc.) are stored in the |
|
|
1704 |
:class:`~DESeqDataSet.DESeqDataSet` object, *e.g.* :code:`dds` in this vignette. |
|
|
1705 |
These values are accessible by inspecting the :code:`var` attribute of |
|
|
1706 |
:code:`dds`. Descriptions of the columns are accessible through the |
|
|
1707 |
:code:`description` attribute: |
|
|
1708 |
|
|
|
1709 |
.. repl:: |
|
|
1710 |
dds.var.iloc[:4,:4] |
|
|
1711 |
dds.var.columns |
|
|
1712 |
dds.var.description |
|
|
1713 |
|
|
|
1714 |
The mean values :math:`\mu_{ij} = s_j q_{ij}` and the Cook's distances for each |
|
|
1715 |
gene and sample are stored as matrices in the :attr:`layers` attribute: |
|
|
1716 |
|
|
|
1717 |
.. repl:: |
|
|
1718 |
dds.layers["mu"] |
|
|
1719 |
dds.layers["cooks"] |
|
|
1720 |
|
|
|
1721 |
The dispersions :math:`\alpha_i` can be accessed with the |
|
|
1722 |
:attr:`.DESeqDataSet.dispersions` attribute: |
|
|
1723 |
|
|
|
1724 |
.. repl:: |
|
|
1725 |
dds.dispersions.head() |
|
|
1726 |
dds.var["dispersion"].head() |
|
|
1727 |
|
|
|
1728 |
The size factors :math:`s_j` are accessible via the |
|
|
1729 |
:attr:`.DESeqDataSet.sizeFactors` attribute: |
|
|
1730 |
|
|
|
1731 |
.. repl:: |
|
|
1732 |
dds.sizeFactors |
|
|
1733 |
|
|
|
1734 |
For advanced users, we also include a convenience function :func:`coef` for |
|
|
1735 |
extracting the matrix :math:`[\beta_{ir}]` for all genes :math:`i` and model |
|
|
1736 |
coefficients :math:`r`. This function can also return a matrix of standard |
|
|
1737 |
errors, see the documentation of :func:`coef`. The columns of this matrix |
|
|
1738 |
correspond to the effects returned by :meth:`.DESeqDataSet.resultsNames`. Note |
|
|
1739 |
that the :meth:`.DESeqDataSet.results` function is best for building results |
|
|
1740 |
tables with *p*-values and adjusted *p*-values: |
|
|
1741 |
|
|
|
1742 |
.. repl:: |
|
|
1743 |
coef(dds).head() |
|
|
1744 |
|
|
|
1745 |
The beta prior variance :math:`\sigma_r^2` is stored as an attribute of the |
|
|
1746 |
:class:`~DESeqDataSet.DESeqDataSet`: |
|
|
1747 |
|
|
|
1748 |
.. repl:: |
|
|
1749 |
dds.betaPriorVar |
|
|
1750 |
|
|
|
1751 |
General information about the prior used for log fold change shrinkage is also |
|
|
1752 |
stored in a slot of the :class:`.DESeqResults` object. This would also contain |
|
|
1753 |
information about what other packages were used for log2 fold change shrinkage: |
|
|
1754 |
|
|
|
1755 |
.. repl:: |
|
|
1756 |
resLFC.priorInfo |
|
|
1757 |
resNorm.priorInfo |
|
|
1758 |
resAsh.priorInfo |
|
|
1759 |
|
|
|
1760 |
The dispersion prior variance :math:`\sigma_d^2` is stored as an attribute of |
|
|
1761 |
the dispersion function: |
|
|
1762 |
|
|
|
1763 |
.. repl:: |
|
|
1764 |
dds.dispersionFunction |
|
|
1765 |
dds.dispersionFunction.dispPriorVar |
|
|
1766 |
|
|
|
1767 |
.. |
|
|
1768 |
The version of DESeq2 which was used to construct the |
|
|
1769 |
:class:`~DESeqDataSet.DESeqDataSet` object, or the version used when |
|
|
1770 |
:func:`DESeq` was run, is stored here: |
|
|
1771 |
|
|
|
1772 |
.. repl:: |
|
|
1773 |
dds.version |
|
|
1774 |
|
|
|
1775 |
|
|
|
1776 |
Sample-/gene-dependent normalization factors |
|
|
1777 |
-------------------------------------------- |
|
|
1778 |
|
|
|
1779 |
In some experiments, there might be gene-dependent dependencies which vary |
|
|
1780 |
across samples. For instance, GC-content bias or length bias might vary across |
|
|
1781 |
samples coming from different labs or processed at different times. We use the |
|
|
1782 |
terms *normalization factors* for a gene x sample matrix, and *size factors* for |
|
|
1783 |
a single number per sample. Incorporating normalization factors, the mean |
|
|
1784 |
parameter :math:`\mu_{ij}` becomes: |
|
|
1785 |
|
|
|
1786 |
.. math:: |
|
|
1787 |
\mu_{ij} = NF_{ij} q_{ij} |
|
|
1788 |
|
|
|
1789 |
with normalization factor matrix :math:`NF` having the same dimensions as the |
|
|
1790 |
counts matrix :math:`K`. This matrix can be incorporated as shown below. We |
|
|
1791 |
recommend providing a matrix with gene-wise geometric means of 1, so that the |
|
|
1792 |
mean of normalized counts for a gene is close to the mean of the unnormalized |
|
|
1793 |
counts. This can be accomplished by dividing out the current gene geometric |
|
|
1794 |
means: |
|
|
1795 |
|
|
|
1796 |
>>> normFactors = normFactors / np.exp(np.mean(np.log(normFactors), axis=0)) |
|
|
1797 |
>>> dds.normalizationFactors = normFactors |
|
|
1798 |
|
|
|
1799 |
These steps then replace :meth:`.DESeqDataSet.estimateSizeFactors` which occurs |
|
|
1800 |
within the :func:`DESeq` function. The :func:`DESeq` function will look for |
|
|
1801 |
pre-existing normalization factors and use these in the place of size factors |
|
|
1802 |
(and a message will be printed confirming this). |
|
|
1803 |
|
|
|
1804 |
.. |
|
|
1805 |
The methods provided by the `cqn <http://bioconductor.org/packages/cqn>`_ or |
|
|
1806 |
`EDASeq <http://bioconductor.org/packages/EDASeq>`_ packages can help correct |
|
|
1807 |
for GC or length biases. They both describe in their vignettes how to create |
|
|
1808 |
matrices which can be used by DESeq2. From the formula above, we see that |
|
|
1809 |
normalization factors should be on the scale of the counts, like size factors, |
|
|
1810 |
and unlike offsets which are typically on the scale of the predictors (*i.e.* |
|
|
1811 |
the logarithmic scale for the negative binomial GLM). At the time of writing, |
|
|
1812 |
the transformation from the matrices provided by these packages should be: |
|
|
1813 |
|
|
|
1814 |
```{r offsetTransform, eval=FALSE} |
|
|
1815 |
cqnOffset <- cqnObject$glm.offset |
|
|
1816 |
cqnNormFactors <- exp(cqnOffset) |
|
|
1817 |
EDASeqNormFactors <- exp(-1 * EDASeqOffset) |
|
|
1818 |
``` |
|
|
1819 |
|
|
|
1820 |
"Model matrix not full rank" |
|
|
1821 |
---------------------------- |
|
|
1822 |
|
|
|
1823 |
While most experimental designs run easily using design formula, some design |
|
|
1824 |
formulas can cause problems and result in the :func:`DESeq` function returning |
|
|
1825 |
an error with the text: "the model matrix is not full rank, so the model cannot |
|
|
1826 |
be fit as specified." There are two main reasons for this problem: either one |
|
|
1827 |
or more columns in the model matrix are linear combinations of other columns, or |
|
|
1828 |
there are levels of factors or combinations of levels of multiple factors which |
|
|
1829 |
are missing samples. We address these two problems below and discuss possible |
|
|
1830 |
solutions. |
|
|
1831 |
|
|
|
1832 |
Linear combinations |
|
|
1833 |
^^^^^^^^^^^^^^^^^^^ |
|
|
1834 |
|
|
|
1835 |
The simplest case is the linear combination, or linear dependency problem, when |
|
|
1836 |
two variables contain exactly the same information, such as in the following |
|
|
1837 |
sample table. The software cannot fit an effect for :code:`batch` and |
|
|
1838 |
:code:`condition`, because they produce identical columns in the model matrix. |
|
|
1839 |
This is also referred to as *perfect confounding*. A unique solution of |
|
|
1840 |
coefficients (the :math:`\beta_i` in the formula :ref:`below<theory>`) is not |
|
|
1841 |
possible: |
|
|
1842 |
|
|
|
1843 |
.. repl:: |
|
|
1844 |
pd.DataFrame({"batch": Factor([1,1,2,2]), |
|
|
1845 |
"condition": Factor(["A", "A", "B", "B"])}) |
|
|
1846 |
|
|
|
1847 |
Another situation which will cause problems is when the variables are not |
|
|
1848 |
identical, but one variable can be formed by the combination of other factor |
|
|
1849 |
levels. In the following example, the effect of batch 2 vs 1 cannot be fit |
|
|
1850 |
because it is identical to a column in the model matrix which represents the |
|
|
1851 |
condition C vs A effect: |
|
|
1852 |
|
|
|
1853 |
.. repl:: |
|
|
1854 |
pd.DataFrame({"batch": Factor([1,1,1,1,2,2]), |
|
|
1855 |
"condition": Factor(["A", "A", "B", "B", "C", "C"])}) |
|
|
1856 |
|
|
|
1857 |
In both of these cases above, the batch effect cannot be fit and must be removed |
|
|
1858 |
from the model formula. There is just no way to tell apart the condition effects |
|
|
1859 |
and the batch effects. The options are either to assume there is no batch effect |
|
|
1860 |
(which we know is highly unlikely given the literature on batch effects in |
|
|
1861 |
sequencing datasets) or to repeat the experiment and properly balance the |
|
|
1862 |
conditions across batches. A balanced design would look like: |
|
|
1863 |
|
|
|
1864 |
.. repl:: |
|
|
1865 |
pd.DataFrame({"batch": Factor([1,1,1,2,2,2]), |
|
|
1866 |
"condition": Factor(["A", "B", "C", "A", "B", "C"])}) |
|
|
1867 |
|
|
|
1868 |
.. _nested-div: |
|
|
1869 |
|
|
|
1870 |
Group-specific condition effects, individuals nested within groups |
|
|
1871 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
1872 |
|
|
|
1873 |
Finally, there is a case where we *can* in fact perform inference, but we may |
|
|
1874 |
need to re-arrange terms to do so. Consider an experiment with grouped |
|
|
1875 |
individuals, where we seek to test the group-specific effect of a condition or |
|
|
1876 |
treatment, while controlling for individual effects. The individuals are nested |
|
|
1877 |
within the groups: an individual can only be in one of the groups, although each |
|
|
1878 |
individual has one or more observations across condition. |
|
|
1879 |
|
|
|
1880 |
An example of such an experiment is below: |
|
|
1881 |
|
|
|
1882 |
.. repl:: |
|
|
1883 |
colData = pd.DataFrame({"grp": Factor(["X","X","X","X","X","X","Y","Y","Y","Y","Y","Y"]), |
|
|
1884 |
"ind": Factor([1,1,2,2,3,3,4,4,5,5,6,6]), |
|
|
1885 |
"cnd": Factor(["A","B","A","B","A","B","A","B","A","B","A","B"])}) |
|
|
1886 |
colData |
|
|
1887 |
|
|
|
1888 |
Note that individual (:code:`ind`) is a *factor* not a numeric. This is very |
|
|
1889 |
important. |
|
|
1890 |
|
|
|
1891 |
We have two groups of samples X and Y, each with three distinct individuals |
|
|
1892 |
(labeled here 1-6). For each individual, we have conditions A and B (for |
|
|
1893 |
example, this could be control and treated). |
|
|
1894 |
|
|
|
1895 |
This design can be analyzed by DESeq2 but requires a bit of refactoring in order |
|
|
1896 |
to fit the model terms. Here we will use a trick described in the `edgeR |
|
|
1897 |
<http://bioconductor.org/packages/edgeR>`_ user guide, from the section |
|
|
1898 |
*Comparisons Both Between and Within Subjects*. If we try to analyze with a |
|
|
1899 |
formula such as, :code:`"~ ind + grp*cnd"`, we will obtain an error, because the |
|
|
1900 |
effect for group is a linear combination of the individuals. |
|
|
1901 |
|
|
|
1902 |
However, the following steps allow for an analysis of group-specific condition |
|
|
1903 |
effects, while controlling for differences in individual. For object |
|
|
1904 |
construction, you can use a simple design, such as :code:`"~ ind + cnd"`, as |
|
|
1905 |
long as you remember to replace it before running :func:`DESeq`. Then add a |
|
|
1906 |
column :code:`ind_n` which distinguishes the individuals nested within a group. |
|
|
1907 |
Here, we add this column to :code:`colData`, but in practice you would add this |
|
|
1908 |
column to :code:`dds`: |
|
|
1909 |
|
|
|
1910 |
.. repl:: |
|
|
1911 |
colData["ind_n"] = Factor([1,1,2,2,3,3,1,1,2,2,3,3]) |
|
|
1912 |
colData |
|
|
1913 |
|
|
|
1914 |
Now we can reassign our :class:`~DESeqDataSet.DESeqDataSet` a design of |
|
|
1915 |
:code:`"~ grp + grp:ind.n + grp:cnd"`, before we call :func:`DESeq`. This new |
|
|
1916 |
design will result in the following model matrix: |
|
|
1917 |
|
|
|
1918 |
.. repl:: |
|
|
1919 |
patsy.dmatrix("~ grp + grp:ind_n + grp:cnd", colData) |
|
|
1920 |
|
|
|
1921 |
|
|
|
1922 |
Note that, if you have unbalanced numbers of individuals in the two groups, you |
|
|
1923 |
will have zeros for some of the interactions between :code:`grp` and |
|
|
1924 |
:code:`ind.n`. You can remove these columns manually from the model matrix and |
|
|
1925 |
pass the corrected model matrix to the :code:`full` argument of the |
|
|
1926 |
:func:`DESeq` function. See example code in the next section. Note that, in this |
|
|
1927 |
case, you will not be able to create the :class:`~DESeqDataSet.DESeqDataSet` |
|
|
1928 |
with the design that leads to less than full rank model matrix. You can either |
|
|
1929 |
use :code:`design="~1"` when creating the dataset object, or you can provide the |
|
|
1930 |
corrected model matrix to the :attr:`.DESeqDataSet.design` attribute of the |
|
|
1931 |
dataset from the start. |
|
|
1932 |
|
|
|
1933 |
Above, the terms :code:`grpX.cndB` and :code:`grpY.cndB` give the group-specific |
|
|
1934 |
condition effects, in other words, the condition B vs A effect for group X |
|
|
1935 |
samples, and likewise for group Y samples. These terms control for all of the |
|
|
1936 |
six individual effects. These group-specific condition effects can be extracted |
|
|
1937 |
using :meth:`.DESeqDataSet.results` with the :code:`name` argument. |
|
|
1938 |
|
|
|
1939 |
Furthermore, :code:`grpX.cndB` and :code:`grpY.cndB` can be contrasted using the |
|
|
1940 |
:code:`contrast` argument, in order to test if the condition effect is different |
|
|
1941 |
across group: |
|
|
1942 |
|
|
|
1943 |
>>> dds.results(contrast=("grpY.cndB", "grpX.cndB")) |
|
|
1944 |
|
|
|
1945 |
|
|
|
1946 |
Levels without samples |
|
|
1947 |
^^^^^^^^^^^^^^^^^^^^^^ |
|
|
1948 |
|
|
|
1949 |
The function :func:`patsy.dmatrix` will produce a column of zeros if a level is |
|
|
1950 |
missing from a factor or a combination of levels is missing from an interaction |
|
|
1951 |
of factors. The solution to the first case is to call |
|
|
1952 |
:code:`remove_unused_categories` on the column, which will remove levels without |
|
|
1953 |
samples. This was shown in the beginning of this vignette. |
|
|
1954 |
|
|
|
1955 |
The second case is also solvable, by manually editing the model matrix, and then |
|
|
1956 |
providing this to :func:`DESeq`. Here we construct an example dataset to |
|
|
1957 |
illustrate it: |
|
|
1958 |
|
|
|
1959 |
.. repl:: |
|
|
1960 |
group = Factor([1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]) |
|
|
1961 |
condition = Factor(["A","A","B","B","C","C","A","A","B","B","C","C","A","A","B","B","C","C"]) |
|
|
1962 |
d = pd.DataFrame({"group": group, "condition": condition})[:16] |
|
|
1963 |
d |
|
|
1964 |
|
|
|
1965 |
Note that if we try to estimate all interaction terms, we introduce a column |
|
|
1966 |
with all zeros, as there are no condition C samples for group 3 |
|
|
1967 |
(:code:`np.asarray` is used to display the matrix): |
|
|
1968 |
|
|
|
1969 |
.. repl:: |
|
|
1970 |
m1 = patsy.dmatrix("~condition*group", d) |
|
|
1971 |
m1.design_info.column_names |
|
|
1972 |
np.asarray(m1) |
|
|
1973 |
all_zero = (m1 == 0).all(axis=0) |
|
|
1974 |
all_zero |
|
|
1975 |
|
|
|
1976 |
We can remove this column like so: |
|
|
1977 |
|
|
|
1978 |
.. repl:: |
|
|
1979 |
m1 = m1[:,~all_zero] |
|
|
1980 |
m1 |
|
|
1981 |
|
|
|
1982 |
|
|
|
1983 |
Now this matrix :code:`m1` can be provided to the :code:`full` argument of |
|
|
1984 |
:func:`DESeq`. For a likelihood ratio test of interactions, a model matrix |
|
|
1985 |
using a reduced design such as :code:`"~ condition + group"` can be given to the |
|
|
1986 |
:code:`reduced` argument. Wald tests can also be generated instead of the |
|
|
1987 |
likelihood ratio test, but for user-supplied model matrices, the argument |
|
|
1988 |
:code:`betaPrior` must be set to :code:`False`. |
|
|
1989 |
|
|
|
1990 |
.. _theory: |
|
|
1991 |
|
|
|
1992 |
Theory behind DESeq2 |
|
|
1993 |
==================== |
|
|
1994 |
|
|
|
1995 |
The DESeq2 model |
|
|
1996 |
---------------- |
|
|
1997 |
|
|
|
1998 |
The DESeq2 model and all the steps taken in the software are described in detail |
|
|
1999 |
in our publication [Love2014]_, and we include the formula and descriptions in |
|
|
2000 |
this section as well. The differential expression analysis in DESeq2 uses a |
|
|
2001 |
generalized linear model of the form: |
|
|
2002 |
|
|
|
2003 |
.. math:: |
|
|
2004 |
K_{ij} \sim \textrm{NB}(\mu_{ij}, \alpha_i) |
|
|
2005 |
|
|
|
2006 |
\mu_{ij} = s_j q_{ij} |
|
|
2007 |
|
|
|
2008 |
\log_2(q_{ij}) = x_{j.} \beta_i |
|
|
2009 |
|
|
|
2010 |
where counts :math:`K_{ij}` for gene :math:`i`, sample :math:`j` are modeled |
|
|
2011 |
using a negative binomial distribution with fitted mean :math:`\mu_{ij}` and a |
|
|
2012 |
gene-specific dispersion parameter :math:`\alpha_i`. The fitted mean is |
|
|
2013 |
composed of a sample-specific size factor :math:`s_j` and a parameter |
|
|
2014 |
:math:`q_{ij}` proportional to the expected true concentration of fragments for |
|
|
2015 |
sample :math:`j`. The coefficients :math:`\beta_i` give the log2 fold changes |
|
|
2016 |
for gene :math:`i` for each column of the model matrix :math:`X`. Note that the |
|
|
2017 |
model can be generalized to use sample- and gene-dependent normalization factors |
|
|
2018 |
:math:`s_{ij}`. |
|
|
2019 |
|
|
|
2020 |
The dispersion parameter :math:`\alpha_i` defines the relationship between the |
|
|
2021 |
variance of the observed count and its mean value. In other words, how far do we |
|
|
2022 |
expected the observed count will be from the mean value, which depends both on |
|
|
2023 |
the size factor :math:`s_j` and the covariate-dependent part :math:`q_{ij}` as |
|
|
2024 |
defined above. |
|
|
2025 |
|
|
|
2026 |
.. math:: |
|
|
2027 |
\textrm{Var}(K_{ij}) = E[ (K_{ij} - \mu_{ij})^2 ] = \mu_{ij} + \alpha_i \mu_{ij}^2 |
|
|
2028 |
|
|
|
2029 |
An option in DESeq2 is to provide maximum *a posteriori* estimates of the log2 |
|
|
2030 |
fold changes in :math:`\beta_i` after incorporating a zero-centered Normal prior |
|
|
2031 |
(:code:`betaPrior`). While previously, these moderated, or shrunken, estimates |
|
|
2032 |
were generated by :func:`DESeq` or :func:`nbinomWaldTest` functions, they are |
|
|
2033 |
now produced by the :func:`lfcShrink` function. Dispersions are estimated using |
|
|
2034 |
expected mean values from the maximum likelihood estimate of log2 fold changes, |
|
|
2035 |
and optimizing the Cox-Reid adjusted profile likelihood, as first implemented |
|
|
2036 |
for RNA-seq data in `edgeR <http://bioconductor.org/packages/edgeR>`_ |
|
|
2037 |
[@CR,edgeR_GLM]. The steps performed by the :func:`DESeq` function are |
|
|
2038 |
documented in its manual page; briefly, they are: |
|
|
2039 |
|
|
|
2040 |
1. estimation of size factors :math:`s_j` by |
|
|
2041 |
:meth:`.DESeqDataSet.estimateSizeFactors` |
|
|
2042 |
2. estimation of dispersion :math:`\alpha_i` by |
|
|
2043 |
:meth:`.DESeqDataSet.estimateDispersions` |
|
|
2044 |
3. negative binomial GLM fitting for :math:`\beta_i` and Wald statistics by |
|
|
2045 |
:func:`nbinomWaldTest` |
|
|
2046 |
|
|
|
2047 |
For access to all the values calculated during these steps, see the section |
|
|
2048 |
:ref:`above<access>`. |
|
|
2049 |
|
|
|
2050 |
Changes compared to R DESeq2 |
|
|
2051 |
---------------------------- |
|
|
2052 |
|
|
|
2053 |
This module is a Python port of the R package `DESeq2 |
|
|
2054 |
<https://bioconductor.org/packages/release/bioc/html/DESeq2.html>`_. The port is |
|
|
2055 |
based on version 1.39.3, and changes to the R package since this version have |
|
|
2056 |
not been reflected in the present module. Also note that the port is partial: |
|
|
2057 |
not all features may have been ported yet. To help us prioritize our future |
|
|
2058 |
focus, please open a "feature request" issue on `inmoose GitHub |
|
|
2059 |
repository <https://github.com/epigenelabs/inmoose>`_. |
|
|
2060 |
|
|
|
2061 |
The present page mirrors DESeq2 vignette, and despite our efforts, some code |
|
|
2062 |
examples may not accurately reflect the state of the Python module. |
|
|
2063 |
|
|
|
2064 |
The main changes in this module compared to the original DESeq2 package are as |
|
|
2065 |
follows: |
|
|
2066 |
|
|
|
2067 |
- :class:`anndata.AnnData` is used as the superclass for storage of input |
|
|
2068 |
data, intermediate calculations and results. |
|
|
2069 |
|
|
|
2070 |
|
|
|
2071 |
.. _changes: |
|
|
2072 |
|
|
|
2073 |
Methods changes since the 2014 DESeq2 paper |
|
|
2074 |
------------------------------------------- |
|
|
2075 |
|
|
|
2076 |
.. note:: |
|
|
2077 |
|
|
|
2078 |
The changes below are present in the R DESeq2 package, and "we" stands for |
|
|
2079 |
the authors of the R package, not the authors of InMoose. |
|
|
2080 |
|
|
|
2081 |
* In version 1.18 (November 2017), we add two :ref:`alternative shrinkage |
|
|
2082 |
estimators<shrink>`, which can be used via :func:`lfcShrink`: an estimator |
|
|
2083 |
using a t prior from the apeglm packages, and an estimator with a fitted |
|
|
2084 |
mixture of normals prior from the ashr package. |
|
|
2085 |
* In version 1.16 (November 2016), the log2 fold change shrinkage is no longer |
|
|
2086 |
default for the :func:`DESeq` and :func:`nbinomWaldTest` functions, by setting |
|
|
2087 |
the defaults of these to :code:`betaPrior=FALSE`, and by introducing a |
|
|
2088 |
separate function :func:`lfcShrink`, which performs log2 fold change shrinkage |
|
|
2089 |
for visualization and ranking of genes. While for the majority of bulk |
|
|
2090 |
RNA-seq experiments, the LFC shrinkage did not affect statistical testing, |
|
|
2091 |
DESeq2 has become used as an inference engine by a wider community, and |
|
|
2092 |
certain sequencing datasets show better performance with the testing separated |
|
|
2093 |
from the use of the LFC prior. Also, the separation of LFC shrinkage to a |
|
|
2094 |
separate function :func:`lfcShrink` allows for easier methods development of |
|
|
2095 |
alternative effect size estimators. |
|
|
2096 |
* A small change to the independent filtering routine: instead of taking the |
|
|
2097 |
quantile of the filter (the mean of normalized counts) which directly |
|
|
2098 |
*maximizes* the number of rejections, the threshold chosen is the lowest |
|
|
2099 |
quantile of the filter for which the number of rejections is close to the peak |
|
|
2100 |
of a curve fit to the number of rejections over the filter quantiles. "Close |
|
|
2101 |
to" is defined as within 1 residual standard deviation. This change was |
|
|
2102 |
introduced in version 1.10 (October 2015). |
|
|
2103 |
* For the calculation of the beta prior variance, instead of matching the |
|
|
2104 |
empirical quantile to the quantile of a Normal distribution, DESeq2 now uses |
|
|
2105 |
the weighted quantile function of the Hmisc package. The weighting is |
|
|
2106 |
described in the manual page for :func:`nbinomWaldTest`. The weights are the |
|
|
2107 |
inverse of the expected variance of log counts (as used in the diagonals of |
|
|
2108 |
the matrix :math:`W` in the GLM). The effect of the change is that the |
|
|
2109 |
estimated prior variance is robust against noisy estimates of log fold change |
|
|
2110 |
from genes with very small counts. This change was introduced in version 1.6 |
|
|
2111 |
(October 2014). |
|
|
2112 |
|
|
|
2113 |
Count outlier detection |
|
|
2114 |
----------------------- |
|
|
2115 |
|
|
|
2116 |
DESeq2 relies on the negative binomial distribution to make estimates and |
|
|
2117 |
perform statistical inference on differences. While the negative binomial is |
|
|
2118 |
versatile in having a mean and dispersion parameter, extreme counts in |
|
|
2119 |
individual samples might not fit well to the negative binomial. For this reason, |
|
|
2120 |
we perform automatic detection of count outliers. We use Cook's distance, which |
|
|
2121 |
is a measure of how much the fitted coefficients would change if an individual |
|
|
2122 |
sample were removed [Cook1977]_. For more on the implementation of Cook's |
|
|
2123 |
distance see the manual page for the :meth:`.DESeqDataSet.results` function. |
|
|
2124 |
Below we plot the maximum value of Cook's distance for each row over the rank of |
|
|
2125 |
the test statistic to justify its use as a filtering criterion. |
|
|
2126 |
|
|
|
2127 |
.. repl:: |
|
|
2128 |
W = res["stat"] |
|
|
2129 |
maxCooks = dds.layers["cooks"].max(axis=0) |
|
|
2130 |
idx = ~np.isnan(W) |
|
|
2131 |
plt.scatter(np.argsort(W[idx]), maxCooks[idx], c="grey") |
|
|
2132 |
plt.xlabel("rank of Wald statistic") |
|
|
2133 |
plt.ylabel("maximum Cook's distance per gene") |
|
|
2134 |
m = dds.n_obs |
|
|
2135 |
p = 3 |
|
|
2136 |
plt.axhline(y = scipy.stats.f.ppf(.99, p, m-p)) |
|
|
2137 |
plt.show() |
|
|
2138 |
|
|
|
2139 |
|
|
|
2140 |
Contrasts |
|
|
2141 |
--------- |
|
|
2142 |
|
|
|
2143 |
Contrasts can be calculated for a :class:`~DESeqDataSet.DESeqDataSet` object for |
|
|
2144 |
which the GLM coefficients have already been fit using the Wald test steps |
|
|
2145 |
(:func:`DESeq` with :code:`test="Wald"` or using :func:`nbinomWaldTest`). The |
|
|
2146 |
vector of coefficients :math:`\beta` is left multiplied by the contrast vector |
|
|
2147 |
:math:`c` to form the numerator of the test statistic. The denominator is formed |
|
|
2148 |
by multiplying the covariance matrix :math:`\Sigma` for the coefficients on |
|
|
2149 |
either side by the contrast vector :math:`c`. The square root of this product is |
|
|
2150 |
an estimate of the standard error for the contrast. The contrast statistic is |
|
|
2151 |
then compared to a Normal distribution as are the Wald statistics for the DESeq2 |
|
|
2152 |
package. |
|
|
2153 |
|
|
|
2154 |
.. math:: |
|
|
2155 |
W = \frac{c^t \beta}{\sqrt{c^t \Sigma c}} |
|
|
2156 |
|
|
|
2157 |
|
|
|
2158 |
Expanded model matrices |
|
|
2159 |
----------------------- |
|
|
2160 |
|
|
|
2161 |
For the specific combination of :func:`lfcShrink` with the type :code:`normal` |
|
|
2162 |
and using :code:`contrast`, DESeq2 uses *expanded model matrices* to produce |
|
|
2163 |
shrunken log2 fold change estimates where the shrinkage is independent of the |
|
|
2164 |
choice of reference level. In all other cases, DESeq2 uses standard model |
|
|
2165 |
matrices, as produced by :func:`patsy.dmatrix`. The expanded model matrices |
|
|
2166 |
differ from the standard model matrices, in that they have an indicator column |
|
|
2167 |
(and therefore a coefficient) for each level of factors in the design formula in |
|
|
2168 |
addition to an intercept. This is described in the DESeq2 paper. Using type |
|
|
2169 |
:code:`normal` with :func:`coef` uses standard model matrices, as does the |
|
|
2170 |
:code:`apeglm` shrinkage estimator. |
|
|
2171 |
|
|
|
2172 |
.. _indfilttheory: |
|
|
2173 |
|
|
|
2174 |
Independent filtering and multiple testing |
|
|
2175 |
------------------------------------------ |
|
|
2176 |
|
|
|
2177 |
Filtering criteria |
|
|
2178 |
^^^^^^^^^^^^^^^^^^ |
|
|
2179 |
|
|
|
2180 |
The goal of independent filtering is to filter out those tests from the |
|
|
2181 |
procedure that have no, or little chance of showing significant evidence, |
|
|
2182 |
without even looking at their test statistic. Typically, this results in |
|
|
2183 |
increased detection power at the same experiment-wide type I error. Here, we |
|
|
2184 |
measure experiment-wide type I error in terms of the false discovery rate. |
|
|
2185 |
|
|
|
2186 |
A good choice for a filtering criterion is one that: |
|
|
2187 |
|
|
|
2188 |
1. is statistically independent from the test statistic under the null |
|
|
2189 |
hypothesis, |
|
|
2190 |
2. is correlated with the test statistic under the alternative, and |
|
|
2191 |
3. does not notably change the dependence structure -- if there is any -- |
|
|
2192 |
between the tests that pass the filter, compared to the dependence |
|
|
2193 |
structure between the tests before filtering. |
|
|
2194 |
|
|
|
2195 |
The benefit from filtering relies on property (2), and we will explore it |
|
|
2196 |
further below. Its statistical validity relies on property (1) -- which is |
|
|
2197 |
simple to formally prove for many combinations of filter criteria with test |
|
|
2198 |
statistics -- and (3), which is less easy to theoretically imply from first |
|
|
2199 |
principles, but rarely a problem in practice. We refer to [Bourgon2010]_ for |
|
|
2200 |
further discussion of this topic. |
|
|
2201 |
|
|
|
2202 |
A simple filtering criterion readily available in the results object is the mean |
|
|
2203 |
of normalized counts irrespective of biological condition, and so this is the |
|
|
2204 |
criterion which is used automatically by the :meth:`.DESeqDataSet.results` |
|
|
2205 |
function to perform independent filtering. Genes with very low counts are not |
|
|
2206 |
likely to see significant differences typically due to high dispersion. For |
|
|
2207 |
example, we can plot the :math:`-\log_{10}` *p*-values from all genes over the |
|
|
2208 |
normalized mean counts: |
|
|
2209 |
|
|
|
2210 |
.. repl:: |
|
|
2211 |
plt.scatter(res["baseMean"]+1, -np.log10(res["pvalue"]), c="black") |
|
|
2212 |
plt.xscale("log") |
|
|
2213 |
plt.xlabel("mean of normalized counts") |
|
|
2214 |
plt.ylabel("-log[10](pvalue)") |
|
|
2215 |
plt.ylim(0,30) |
|
|
2216 |
plt.show() |
|
|
2217 |
|
|
|
2218 |
|
|
|
2219 |
Why does it work? |
|
|
2220 |
^^^^^^^^^^^^^^^^^ |
|
|
2221 |
|
|
|
2222 |
Consider the *p*-value histogram below: it shows how the filtering ameliorates |
|
|
2223 |
the multiple testing problem -- and thus the severity of a multiple testing |
|
|
2224 |
adjustment -- by removing a background set of hypotheses whose *p*-values are |
|
|
2225 |
distributed more or less uniformly in [0,1]. |
|
|
2226 |
|
|
|
2227 |
.. repl:: |
|
|
2228 |
use = res["baseMean"] > res.filterThreshold |
|
|
2229 |
plt.hist(res["pvalue"][~use], bins=50, color="khaki", label="do not pass") |
|
|
2230 |
plt.hist(res["pvalue"][use], bins=50, color="powderblue", label="pass") |
|
|
2231 |
plt.legend(loc="upper right") |
|
|
2232 |
plt.show() |
|
|
2233 |
|
|
|
2234 |
Histogram of *p*-values for all tests. The area shaded in blue indicates the |
|
|
2235 |
subset of those that pass the filtering, the area in khaki those that do not |
|
|
2236 |
pass. |
|
|
2237 |
|
|
|
2238 |
|
|
|
2239 |
References |
|
|
2240 |
========== |
|
|
2241 |
|
|
|
2242 |
.. [AhlmannEltze2020] C. Ahlmann-Eltze, W. Huber. 2020. glmGamPoi: fitting |
|
|
2243 |
Gamma-Poisson generalized linear models on single-cell count data. |
|
|
2244 |
*Bioinformatics*, 36(24). :doi:`10.1093/bioinformatics/btaa1009` |
|
|
2245 |
|
|
|
2246 |
.. [Anders2010] S. Anders and W. Huber. 2010. Differential expression for |
|
|
2247 |
sequence count data. *Genome Biology*, 11:106. |
|
|
2248 |
:doi:`10.1186/gb-2010-11-10-r106` |
|
|
2249 |
|
|
|
2250 |
.. [Berge2018] K. van den Berge, F. Perraudeau, C. Soneson, M.I. Loce, D. Risso, |
|
|
2251 |
J.P. Vert, M.D. Robinson, S. Dudoit, L. Clement. 2018. Observation weights |
|
|
2252 |
unlock bulk RNA-seq tools for zero inflation and single-cell applications. |
|
|
2253 |
*Genome Biology*, 19(24). :doi:`10.1186/s13059-018-1406-4` |
|
|
2254 |
|
|
|
2255 |
.. [Bourgon2010] R. Bourgon, R. Gentleman, W. Huber. 2010. Independent filtering |
|
|
2256 |
increases detection power for high-throughput experiments. *PNAS*, |
|
|
2257 |
107(21):9546-51. :doi:`10.1073/pnas.0914005107` |
|
|
2258 |
|
|
|
2259 |
.. [Cook1977] R.D. Cook. 1977. Detection of inferential observation in linear |
|
|
2260 |
regression. *Technometrics*, 19(1):15-18. :doi:`10.2307/1268249` |
|
|
2261 |
|
|
|
2262 |
.. [Gerard2020] D. Gerard, M. Stephens. 2020. Empirical Bayes shrinkage and |
|
|
2263 |
false discovery rate estimation, allowing for unwanted variation. |
|
|
2264 |
*Biostatistics*, 21(1):15-32. :doi:`10.1093/biostatistics/kxy029` |
|
|
2265 |
|
|
|
2266 |
.. [Leek2014] J.T. Leek. 2014. svaseq: removing batch effects and other unwanted |
|
|
2267 |
noise from sequencing data. *Nucleic Acids Research*, 42(21). |
|
|
2268 |
:doi:`10.1093/nar/gku864` |
|
|
2269 |
|
|
|
2270 |
.. [Liao2013] Y. Liao, G.K. Smyth, W. Shi. 2013. featureCounts: an efficient |
|
|
2271 |
general purpose program for assigning sequence reads to genomic features. |
|
|
2272 |
*Bioinformatics*, 30(7):923-30. :doi:`10.1093/bioinformatics/btt656` |
|
|
2273 |
|
|
|
2274 |
.. [Love2014] M.I. Love, W. Huber, S. Anders. 2014. Moderated estimation of fold |
|
|
2275 |
change and dispersion for RNA-seq data with DESeq2. *Genome Biology*, 15(12), |
|
|
2276 |
550. :doi:`10.1186/s13059-014-0550-8` |
|
|
2277 |
|
|
|
2278 |
.. [McCarthy2012] D.J. McCarthy, Y. Chen, G.K. Smyth. 2012. Differential |
|
|
2279 |
expression analysis of multifactor RNA-Seq experiments with respect to |
|
|
2280 |
biological variation. *Nucleic Acids Research* 40, 4288-4297. |
|
|
2281 |
:doi:`10.1093/nar/gks042` |
|
|
2282 |
|
|
|
2283 |
.. [Risso2014] D. Risso, J. Ngai. T.P. Speed, S. Dudoit. 2014. Normalization of |
|
|
2284 |
RNA-seq data using factor analysis of control genes or samples. *Nature |
|
|
2285 |
Biotechnology*, 32(9). :doi:`10.1038/nbt.2931` |
|
|
2286 |
|
|
|
2287 |
.. [Stephens2016] M. Stephens. 2016. False discovery rates: a new deal. |
|
|
2288 |
*Biostatistics*, 18(2). :doi:`10.1093/biostatistics/kxw041` |
|
|
2289 |
|
|
|
2290 |
.. [Storey2003] J. Storey. 2003. The positive false discovery rate: a Bayesian |
|
|
2291 |
interpretation and the *q*-value. *The Annals of Statistics*, |
|
|
2292 |
31(6):2013-2035. |
|
|
2293 |
|
|
|
2294 |
.. [Wu2012] H. Wu, C. Wang, Z. Wu. 2012. A new shrinkage estimator for |
|
|
2295 |
dispersion improves differential detection in RNA-Seq data. *Biostatistics*. |
|
|
2296 |
:doi:`10.1093/biostatistics/kxs033` |
|
|
2297 |
|
|
|
2298 |
.. [Zhu2018] A. Zhu, J.G. Ibrahim, M.I. Love. 2018. Heavy-tailed prior |
|
|
2299 |
distributions for sequence count data: removing the noise and preserving |
|
|
2300 |
large differences. *Bioinformatics*, 35(12), 2084-2092. |
|
|
2301 |
:doi:`10.1093/bioinformatics/bty895` |
|
|
2302 |
|
|
|
2303 |
Code documentation |
|
|
2304 |
================== |
|
|
2305 |
|
|
|
2306 |
.. autosummary:: |
|
|
2307 |
:toctree: generated/ |
|
|
2308 |
|
|
|
2309 |
~DESeqDataSet.DESeqDataSet |
|
|
2310 |
~results.DESeqResults |
|
|
2311 |
|
|
|
2312 |
DESeq |
|
|
2313 |
collapseReplicates |
|
|
2314 |
estimateBetaPriorVar |
|
|
2315 |
estimateDispersionsFit |
|
|
2316 |
estimateDispersionsGeneEst |
|
|
2317 |
estimateDispersionsMAP |
|
|
2318 |
estimateDispersionsPriorVar |
|
|
2319 |
estimateSizeFactorsForMatrix |
|
|
2320 |
~results.filtered_p |
|
|
2321 |
lfcShrink |
|
|
2322 |
makeExampleDESeqDataSet |
|
|
2323 |
nbinomLRT |
|
|
2324 |
nbinomWaldTest |
|
|
2325 |
~results.p_adjust |
|
|
2326 |
replaceOutliers |
|
|
2327 |
varianceStabilizingTransformation |
|
|
2328 |
~Hmisc.wtd_quantile |
|
|
2329 |
|