# Getting Started with Normalisation

All the functions below assume you've already arranged your data in the format
expected by `mixOmics`.  In other words, you should have samples in rows and
features in columns, as either a data frame, matrix, or other table-like data
structure.
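
For example, a minimal table in the expected orientation might be constructed
as follows (the values and names are purely illustrative):

```R
# Samples in rows, features (e.g. OTUs) in columns
otu.counts <- matrix(c(12, 0, 7,
                       3, 5, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("sample1", "sample2"),
                                     c("otu1", "otu2", "otu3")))
```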

The data used to demonstrate each function is demo data from the `mixOmics`
package, so `mixOmics` should be installed and loaded in order to try out
these functions.

## Low Count Removal

`low.count.removal()` removes features from the data which are unlikely to
contribute to the fit of a model because they show low counts/expression
relative to the rest of the data.  The higher the percentage provided, the
more features are discarded.

```R
data(Koren.16S)
dim(Koren.16S$data.raw)
## [1]  43 980

normalised <- low.count.removal(Koren.16S$data.raw, 0.03)
dim(normalised)
## [1]  43 816
```
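
Here, `0.03` is the percentage cutoff.  As a rough sketch of the idea, the
filter published in the `mixOmics` case studies keeps each feature whose share
of the grand total of counts exceeds the cutoff; the OmicsFold implementation
may differ in detail:

```R
# Keep features contributing more than 0.03% of all counts -- a sketch of
# the idea, not the OmicsFold source
keep <- which(colSums(Koren.16S$data.raw) * 100 /
                sum(colSums(Koren.16S$data.raw)) > 0.03)
filtered <- Koren.16S$data.raw[, keep]
```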

## Total Sum Scaling

`normalise.tss()` normalises count data sample-by-sample, to a scale of 0..1,
using Total Sum Scaling (TSS).  This accounts for differences in sequencing
depth between samples.  After this transformation, each sample will sum to
1.0, so the values for each feature are relative.  Values can be offset from
zero by providing the optional `offset` argument.  In the example below, we
compare the TSS function from OmicsFold with the pre-normalised TSS data in
the Koren 16S data set.

```R
data(Koren.16S)
Koren.16S$data.TSS[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650

# Now apply our own TSS normalisation to the raw data
normalised <- normalise.tss(Koren.16S$data.raw)
normalised[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650
```
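
Because TSS simply divides each count by its sample's total, the
transformation can also be sketched in a line of base R:

```R
# Divide each row (sample) by its own total; every row then sums to 1
manual.tss <- sweep(Koren.16S$data.raw, 1, rowSums(Koren.16S$data.raw), "/")
rowSums(manual.tss)[1:3]
## Feces659 Feces309 Mouth599
##        1        1        1
```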

## Cumulative Sum Scaling

`normalise.css()` applies cumulative sum scaling (CSS) normalisation to count
data to correct for inter-sample sequencing depth.  It is an alternative to
total sum scaling and relies on the implementation in `metagenomeSeq`.  It
reformats `mixOmics` input data so that it can be processed by
`metagenomeSeq`, then converts the CSS-normalised output back to the
`mixOmics` input format.  `metagenomeSeq` defines this normalisation approach
as follows:

> Calculates each column's quantile and calculates the sum up to and including
> that quantile.

Below is an example of applying this normalisation to the same Koren 16S data
set as was used in the TSS example above:

```R
data(Koren.16S)
Koren.16S$data.raw[1:3, 3:5]
##            410908   177792  4294607
## Feces659        1       99        1
## Feces309        1        1        1
## Mouth599        2        1        1

# Now apply our CSS normalisation to the raw data
normalised <- normalise.css(Koren.16S$data.raw)
normalised[1:3, 3:5]
##            410908   177792  4294607
## Feces659 1.187222 6.993638 1.187222
## Feces309 1.179016 1.179016 1.179016
## Mouth599 1.711633 1.096030 1.096030
```

Here we see that the lowest counts of 1 for each feature/sample show much
less variance under CSS scaling than under TSS scaling.
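
For reference, a rough sketch of the direct `metagenomeSeq` calls that such a
wrapper builds on (`metagenomeSeq` stores counts with features in rows, hence
the transposes; the exact steps inside `normalise.css()` may differ):

```R
library(metagenomeSeq)

# metagenomeSeq expects features in rows, so transpose the mixOmics input
mr <- newMRexperiment(t(Koren.16S$data.raw))
mr <- cumNorm(mr, p = cumNormStat(mr))

# Extract the CSS-normalised (log2-scaled) counts and restore the
# samples-in-rows layout that mixOmics expects
css.manual <- t(MRcounts(mr, norm = TRUE, log = TRUE))
```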

## Logit

`normalise.logit()` provides normalisation based on the [logit
function](https://en.wikipedia.org/wiki/Logit), which transforms 0.5 to zero;
values below 0.5 become negative and values above 0.5 become positive.  The
magnitude of the transformed value grows rapidly towards the extremes,
reaching negative/positive infinity at 0.0 and 1.0 respectively.  This can be
a useful transformation for values on the 0..1 scale, bringing them back into
Euclidean space after TSS normalisation.  Below is an example of transforming
values in this way.

```R
data(Koren.16S)
Koren.16S$data.TSS[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650

# Now apply our logit normalisation to the TSS data
normalised <- normalise.logit(Koren.16S$data.TSS)
normalised[1:3, 3:5]
##                410908       177792      4294607
## Feces659    -8.124447    -3.499869    -8.124447
## Feces309    -7.972466    -7.972466    -7.972466
## Mouth599    -7.803027    -8.496378    -8.496378
```

As can be seen, this adds more distance between values, which can be
beneficial for model fitting.  In addition, values which are virtually zero
are pushed heavily towards very negative values.  If any values to be
transformed are exactly 0.0 or 1.0, the logit function will generate infinite
values, which are inappropriate for modelling.  For this reason, OmicsFold
provides a second, empirical function, `normalise.logit.empirical()`, which
moves measurements away from 0.0 and 1.0 on a per-feature basis, avoiding the
generation of infinite values.
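
For reference, logit(p) = log(p / (1 - p)), which base R exposes as
`qlogis()`.  The sketch below assumes `normalise.logit.empirical()` takes the
same single data argument as `normalise.logit()`:

```R
# The logit transform maps 0.5 to zero and diverges at the boundaries
qlogis(0.5)   # 0
qlogis(0.25)  # -1.098612
qlogis(0)     # -Inf -- why exact zeros are a problem

# Avoid infinities when the data contain exact 0.0 or 1.0 values
normalised <- normalise.logit.empirical(Koren.16S$data.TSS)
```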

## Centered Log-Ratio

`normalise.clr()` applies the centered log-ratio (CLR) transformation to the
data, where each measurement is divided by the geometric mean of the
measurements and the log of that ratio is returned.  Taking the log of this
ratio centers the data about zero.  This transformation can be used with
sum-scaled OTU data as an alternative to the logit transformation above.
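
Concretely, the CLR of a vector of proportions is the log of each value over
the geometric mean of the vector.  A minimal hand-rolled illustration (not
the OmicsFold implementation):

```R
# CLR: log of each value over the geometric mean of the vector
clr.sketch <- function(x) {
  log(x / exp(mean(log(x))))
}

clr.sketch(c(0.1, 0.2, 0.3, 0.4))
## [1] -0.7945135 -0.1013663  0.3040988  0.5917809

# The transformed values are centered: they sum (and average) to zero
sum(clr.sketch(c(0.1, 0.2, 0.3, 0.4)))  # effectively zero
```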

`normalise.clr()` is a convenience wrapper around the
[`logratio.transfo()`](https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/logratio.transfo)
function from the `mixOmics` package, maintaining the structure of the input
data after the transformation is applied.  Logs of zero produce infinite
values, so if the data being normalised contain zeros, a small offset can be
provided to prevent this problem.  The following is an example of applying
the CLR transformation with and without a small offset to the Koren 16S TSS
data.  The offset is unnecessary for this data, as it contains no zero
values, but the comparison shows that a small offset barely changes the
results for non-zero values.

```R
data(Koren.16S)
Koren.16S$data.TSS[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650

# Now apply our CLR normalisation to the TSS data
normalised.1 <- normalise.clr(Koren.16S$data.TSS)
normalised.1[1:3, 3:5]
##                410908       177792      4294607
## Feces659   -0.3861923    4.2089276   -0.3861923
## Feces309   -0.3387886   -0.3387886   -0.3387886
## Mouth599    0.5184764   -0.1746708   -0.1746708

# Re-apply the CLR normalisation with a small offset
normalised.2 <- normalise.clr(Koren.16S$data.TSS, offset = 0.000001)
normalised.2[1:3, 3:5]
##                410908       177792      4294607
## Feces659   -0.3856632    4.2061195   -0.3856632
## Feces309   -0.3383694   -0.3383694   -0.3383694
## Mouth599    0.5164012   -0.1743060   -0.1743060
```

Where the intention is to apply the centered log-ratio to non-OTU data, the
function above should be avoided, as it applies an inter-sample
normalisation.  For this purpose OmicsFold provides a related function,
`normalise.clr.within.features()`, which ensures that the means used to
center the log-ratio are calculated within each feature instead.  This can be
more appropriate for non-OTU data.
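
Usage mirrors `normalise.clr()`.  In this sketch, `expression.data` is a
hypothetical samples-by-features matrix of non-OTU measurements, and the
`offset` argument is assumed to be available as in `normalise.clr()`:

```R
# CLR with the centering mean computed within each feature (column) rather
# than within each sample; expression.data is a hypothetical placeholder
normalised <- normalise.clr.within.features(expression.data, offset = 0.000001)
```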