Diff of /docs/guide.rst [000000] .. [5ad33c]

Switch to unified view

a b/docs/guide.rst
1
.. highlight:: terminal
2
3
User Guide
4
==========
5
6
scVAE model count data, primarily single-cell gene transcript counts, using variational auto-encoders (:ref:`Kingma and Welling, 2014 <kingma2014>`; :ref:`Rezende et al., 2014 <rezende2014>`).
7
8
Installing scVAE
9
----------------
10
11
scVAE requires Python 3.6--3.7, which can be installed in `several ways`_, for example, using Miniconda_.
12
13
.. _several ways: https://realpython.com/installing-python/
14
.. _Miniconda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
15
16
With Python in place, scVAE can be installed using pip::
17
18
   $ python3 -m pip install scvae
19
20
Using scVAE
21
-----------
22
23
In general, scVAE is used in the following way::
24
25
   $ scvae $COMMAND $DATA_SET
26
27
where ``$COMMAND`` can be ``analyse`` (data analysis), ``train`` (model training), or ``evaluate`` (model evaluation and analysis). ``$DATA_SET`` is a path to a data set file or a short name for a data set.
28
29
By default, data are placed and cached in the subfolder ``data/``, models are saved in the subfolder ``models/``, and analyses are saved in the subfolder ``analyses/``.
30
31
In the following, the most relevant options are described. Use the help option to list all options for each command::
32
33
   $ scvae $COMMAND --help
34
35
Data sets
36
^^^^^^^^^
37
38
Several data sets are already included in scVAE:
39
40
.. Non-breaking space
41
.. |_| unicode:: 0xA0
42
   :trim:
43
   
44
* ``Macosko-MRC``: `GSE63472`_.
45
* ``10x-MBC``: `1.3 Million Brain Cells from E18 Mice`_ from `10x Genomics`_.
46
   * ``10x-MBC-20k``: 20 |_| 000 sampled cells.
47
* ``10x-PBMC-PP``: Nine data sets of `purified PBMC populations`_ from 10x Genomics as specified in :ref:`Grønbech et al. (2020) <groenbech2020>`.
48
* ``10x-PBMC-68k``: `Fresh 68k PBMCs (Donor A)`_ from 10x Genomics.
49
* ``TCGA-RSEM``: `"transcript expression RNAseq - TOIL RSEM expected_count"`_ data set from the `TCGA Pan-Cancer (PANCAN)`_ cohort.
50
51
.. _GSE63472: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63472
52
.. _1.3 Million Brain Cells from E18 Mice: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons
53
.. _10x Genomics: https://www.10xgenomics.com
54
.. _purified PBMC populations: https://support.10xgenomics.com/single-cell-gene-expression/datasets/
55
.. _Fresh 68k PBMCs (Donor A): https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a
56
.. _"transcript expression RNAseq - TOIL RSEM expected_count": https://xenabrowser.net/datapages/?dataset=tcga_expected_count&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
57
.. _TCGA Pan-Cancer (PANCAN): https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443
58
59
Data sets will be cached in the data directory, which defaults to ``data/``. This can be changed using the option ``--data-directory`` (or ``-D``).
60
61
Be aware that it might take some time to load and preprocess the data the first time for large data sets. Also note that to load and analyse the ``10x-MBC`` data set, 47 GB of memory is required (32 GB for the original data set in sparse representation and 15 GB for the reconstructed test set in dense representation).
62
63
The default model can be trained on, for example, the ``10x-PBMC-PP`` data set like this::
64
65
   $ scvae train 10x-PBMC-PP
66
67
Custom data sets
68
""""""""""""""""
69
70
scVAE can read `Loom`_ files, and it can read a dense data matrix from a TSV file or a sparse one from a HDF5 file without further configuration. As an example, a data set in Loom format can be imported and modelled in the following way::
71
72
   $ scvae train data_set.loom
73
74
.. _Loom: https://loompy.org
75
76
The TSV files can be compressed using gzip, but each row should represent a cell or sample and each column a gene (for the reverse case, see below). If a header row and/or a header column are provided, they are used as gene IDs/names and/or cell/sample names, respectively.
77
78
For Loom files, scVAE follows `Loompy's conventions`_: each column represent a cell or sample and each row a gene. Cell or sample names are specified using the column attribute ``CellID`` (or just ``Cell``), and the row attribute ``Gene`` is used for gene names.
79
80
.. _Loompy's conventions: http://linnarssonlab.org/loompy/conventions/index.html
81
82
HDF5 files should include a single directory containing arrays for the sparse matrix (with names as for SciPy's CSR/CSC sparse matrix format: ``data``, ``indices``, ``indptr``, ``shape``). Arrays for example/cell and feature/gene names are also supported with a varieity of naming conventions: for example, ``barcodes``, ``cells``, ``samples``, ``examples`` for the former; ``genes`` and ``features`` for the latter. If either name array or both are present, scVAE will try to orient the matrix to match their dimensions.
83
84
scVAE also supports the following formats (supplied using the ``--format`` option):
85
86
* ``10x``: Output format for 10x Genomics's Cell Ranger.
87
* ``gtex``: Format for data sets from `GTEx`_.
88
* ``matrix_ebf``: (gzip compressed) TSV file with cells/samples/examples as rows and gene/features as columns (examples-by-features).
89
* ``matrix_fbe``: (gzip compressed) TSV file with gene/features as rows and cells/samples/examples as columns (features-by-examples).
90
91
.. _GTEx: https://gtexportal.org/home/index.html
92
93
The last of these formats can be used to read a TSV file, which is in reverse order of the default case::
94
95
   $ scvae train data_set.tsv.gz --format matrix_fbe
96
97
Using the Loom format, included cell types and batch indices can also be imported without further configuration by using the column attributes "ClusterName" and "BatchID", respectively.
98
99
Cell types for other formats can be imported in TSV format by instead providing a `JSON`_ file with a ``values`` field with the filename for the read counts, a ``labels`` field with the filename the cell types, and ``format`` field with the format.
100
101
.. _JSON: https://en.wikipedia.org/wiki/JSON
102
103
A JSON file for a GTEx data set would look like this:
104
105
.. code-block:: json
106
107
   {
108
      "values": "GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz",
109
      "labels": "GTEx_v7_Annotations_SampleAttributesDS.txt",
110
      "format": "gtex"
111
   }
112
113
Naming this file ``gtex.json``, the GTEx data set can then be imported and modelled::
114
115
   $ scvae train gtex.json
116
117
Withheld data
118
"""""""""""""
119
120
Any data set can be split into a training, a validation, and a test set using the ``--split-data-set`` option::
121
122
   $ scvae train $DATA_SET --split-data-set
123
124
Then, the training set is used to train the model, the validation set is used for early stopping as well as finding the best model parameters, and the test set is used when evaluating the model.
125
126
The data set can be split either randomly (``random``) or in the sequence in which it already is [#sequence]_ (``sequential``). This is done by specifying either value using the option ``--splitting-method``::
127
128
   $ scvae train $DATA_SET --split-data-set --splitting-method random
129
130
The fraction of the data set used for the training and validation sets is set using the option ``--splitting-fraction``::
131
132
   $ scvae train $DATA_SET --split-data-set --splitting-fraction 0.9
133
134
This option also determines the fraction of the training and validation sets used when training a model. The above command will then split the data sets into training, validation, and test sets using a :math:`81 \, \%`- :math:`9 \, \%`-:math:`10 \, \%` split.
135
136
Training a model
137
^^^^^^^^^^^^^^^^
138
139
The command ``train`` is used to train a model on a data set::
140
141
   $ scvae train $DATA_SET
142
143
By default, a VAE model with a Poisson likelihood function, two-dimensional latent variable, and one hidden layer of 100 units will be trained on the specified data set for 200 epochs with a learning rate of :math:`10^{-4}`.
144
145
The default model can be changed by using the following options:
146
147
* ``-m``: The model type, either ``VAE`` or ``GMVAE``.
148
* ``-r``: Likelihood function (or reconstruction distribution):
149
   * ``poisson``,
150
   * ``negative_binomial``,
151
   * ``zero_inflated_poisson``,
152
   * ``zero_inflated_negative_binomial``,
153
   * ``constrained_poisson``,
154
   * ``bernoulli``,
155
   * ``gaussian``, and
156
   * ``log_normal``.
157
* ``-k``: The threshold for modelling low counts using discrete probabilities and high counts using a shifted likelihood function (denoted by :math:`k_\text{max}` in :ref:`Grønbech et al., 2020 <groenbech2020>`). This turns the likelihood function into a corresponding piecewise categorical likehood function.
158
* ``-q``: The latent prior distribution. For the VAE model, this can only be a normal isotropic Gaussian distribution (``gaussian``) or one with unit variance (``unit_variance_gaussian``). For the GMVAE model, this can either be a Gaussian-mixture model with a diagonal covariance matrix (``gaussian_mixture``) or a full covariance matrix (``full_covariance_gaussian_mixture``). Note that a full covariance matrix should only be used for simpler GMVAE models.
159
* ``--prior-probabilites-method``: Method for how to set the mixture coefficients for the latent prior distribution of the GMVAE model. They can be fixed to either uniform values (``uniform``) or inferred values from labelled data (``infer``), or they can be learnt by the model (``learn``).
160
* ``-l``: The dimension of the latent variable.
161
* ``-H``: The number of hidden units in each layer separated by spaces. For example, ``-H 200 100`` will make both the inference (encoder) and the generative (decoder) networks two-layered with the first inference layer and the last generative layer consisting of 200 hidden units and the last inference layer and the first generative layer consisting of 100 hidden units.
162
* ``-K``: The number of components for the GMVAE (if possible, this is inferred from labelled data, but it can be overridden using this option).
163
* ``-w``: The number of epochs during the start of training with a linear weight on the KL divergence (the warm-up optimisation scheme described in :ref:`Grønbech et al., 2020 <groenbech2020>`). This weight is gradually increased linearly from 0 to 1 for this number of epochs.
164
* ``--batch-correction``: Perform batch correction if batch indices are available in data set (currently only possible with Loom data sets).
165
166
The training procedure can be changed using the following options (only applicable to the ``train`` command):
167
168
* ``-e``: The number of epochs to train the model.
169
* ``--learning-rate``: The learning rate of the model. The model is optimised using the Adam optimisation algorithm (:ref:`Kingma and Ba, 2015 <kingma2015>`).
170
171
A GMVAE model with a negative binomial likelihood function, a 100-dimensional latent variable, two hidden layers of each 100 units, and 200 epochs using the warm-up scheme is trained for 500 epochs on the ``10x-PBMC-PP`` data set like this::
172
173
   $ scvae train 10x-PBMC-PP -m GMVAE -l 100 -H 100 100 -w 200 -e 500
174
175
Trained models are saved to the subdirectory ``models/`` by default. This can be changed using the option ``--models-directory`` (or ``-M``).
176
177
Evaluating a model
178
^^^^^^^^^^^^^^^^^^
179
180
The command ``evaluate`` is used to evaluate a model on a data set::
181
182
   $ scvae evalaute $DATA_SET
183
184
Note the model has to have been trained already on the same data set.
185
186
The model is specified in the same way as when training the model, and the model will be evaluated at the last epoch to which it was trained. If withheld data were used, the model will also be evaluated at the early-stopping epoch and epoch with the most optimal marginal log-likelihood lower bound (if available). A number of analyses are conducted of the models and results, and these saved in the subdirectory ``analyses/``. This can be changed using the option ``--analyses-directory`` (or ``-A``). If you want the tool to perform all available analyses, you can use this option and argument: ``--included-analyses all``.
187
188
Cells can be clustered and cell types can be predicted using the option ``--prediction-method``. Currently only *k*-means clustering (``kmeans``) is supported. The GMVAE clusters cells and predict cell types using its built-in density-based clustering by default.
189
190
To visualise the data sets or latent spaces thereof, these are decomposed using a decomposition method. By default, this method is PCA. This can be changed using the option ``--decomposition-methods``, and as the name implies, multiple methods can be specified: PCA (``pca``), ICA (``ica``), SVD (``svd``), and *t*-SNE (``tsne``).
191
192
Decompositions of the data sets and of the latent values as well as predictions and the latent values themselves are also saved to compressed TSV files in the same directory.
193
194
The GMVAE model trained in the previous section is evaluated with PCA and *t*-SNE decomposition methods like this::
195
196
   $ scvae evaluate 10x-PBMC-PP -m GMVAE -l 100 -H 100 100 -w 200 --decomposition-methods pca tsne
197
198
Examples
199
^^^^^^^^
200
201
To reproduce the main results from :ref:`Grønbech et al. (2020) <groenbech2020>`, you can run the following commands:
202
203
* Combined PBMC data set from 10x Genomics::
204
205
      $ scvae train 10x-PBMC-PP --split-data-set -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 -e 500
206
      $ scvae evaluate 10x-PBMC-PP --split-data-set -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 --decomposition-methods pca tsne
207
208
* TCGA data set::
209
210
      $ scvae train TCGA-RSEM --map-features --feature-selection keep_highest_variances 5000 --split-data-set -m GMVAE -r negative_binomial -l 50 -H 1000 1000 -e 500
211
      $ scvae evaluate TCGA-RSEM --map-features --feature-selection keep_highest_variances 5000 --split-data-set -m GMVAE -r negative_binomial -l 50 -H 1000 1000 --decomposition-methods pca tsne
212
213
Tutorial
214
--------
215
216
Say you have a data set consisting of:
217
218
* single-cell transcript counts a file called ``"transcript_counts.tsv.gz"`` with genes as rows and cells as columns, and
219
* associated cell types in file called ``"cell_types.tsv"``.
220
221
To load these, you make a JSON file with the following contents:
222
223
.. code-block:: json
224
225
   {
226
      "values": "transcript_counts.tsv.gz",
227
      "labels": "cell_types.tsv",
228
      "format": "matrix_fbe"
229
   }
230
231
(See :ref:`Custom data sets` for more loading options.)
232
233
You then save the JSON file in the same directory as the TSV files with a memorable name like ``"data_set.json"``.
234
235
To load and split this data set with scVAE and train a GMVAE model with a Poisson distribution on the training set, you run the following command in the same directory::
236
237
   $ scvae train data_set.json --split-data-set -m GMVAE -r poisson
238
239
(See :ref:`Training a model` for more model options.)
240
241
You evaluate this model on the test set using the following command::
242
243
   $ scvae evaluate data_set.json --split-data-set -m GMVAE -r poisson
244
245
The resulting plots are saved in a subfolder called ``"analyses"``. If you want *t*-SNE plots, you use this command instead::
246
247
   $ scvae evaluate data_set.json --split-data-set -m GMVAE -r poisson --decomposition-methods tsne
248
249
----
250
251
.. [#sequence] With the first part becoming the training set, the second part the validation set, and the remaining part the test set.