# Getting Started with Normalisation

All the functions below assume you've already arranged your data in the format
expected by `mixOmics`. In other words, you should have samples in rows and
features in columns, as either a data frame, matrix or other table-like data
structure.
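
For example, a small input in the expected orientation might be built like this
(the sample and feature names here are purely illustrative):

```R
# Three samples (rows) by four features (columns)
counts <- matrix(c(12, 0,  3, 41,
                    7, 2,  0, 18,
                   25, 1,  9,  2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Sample1", "Sample2", "Sample3"),
                                 c("FeatA", "FeatB", "FeatC", "FeatD")))
```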

Data loaded to demonstrate each function is demo data from the `mixOmics`
package, so that package should be installed and loaded in order to try out
these functions.
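
For example, a minimal setup sketch (`mixOmics` is distributed via
Bioconductor; this assumes OmicsFold itself is already installed):

```R
# Install mixOmics from Bioconductor if it isn't already available
if (!requireNamespace("mixOmics", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install("mixOmics")
}

library(mixOmics)
library(OmicsFold)
```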

## Low Count Removal

`low.count.removal()` removes features from the data which are unlikely to
contribute to the fit of a model because they show low counts/expression
relative to the rest of the data. The higher the percentage provided, the more
features will be discarded.

```R
data(Koren.16S)
dim(Koren.16S$data.raw)
## [1] 43 980

normalised <- low.count.removal(Koren.16S$data.raw, 0.03)
dim(normalised)
## [1] 43 816
```

## Total Sum Scaling

`normalise.tss()` normalises count data sample-by-sample, to a scale of 0..1,
using Total Sum Scaling (TSS). This accounts for differences in sequencing
depth between samples. After this transformation, each sample will sum to 1.0
and the values for each feature will be relative. Values can be offset from
zero by providing the optional `offset` argument. In the example below, we
compare the TSS function from OmicsFold with the pre-normalised TSS data in the
Koren 16S data set.

```R
data(Koren.16S)
Koren.16S$data.TSS[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650

# Now apply our own TSS normalisation to the raw data
normalised <- normalise.tss(Koren.16S$data.raw)
normalised[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650
```
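
As a quick sanity check, each row of the normalised output should now total
1.0:

```R
# Every sample (row) should sum to 1, up to floating point error
range(rowSums(normalised))
```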

## Cumulative Sum Scaling

`normalise.css()` applies cumulative sum scaling (CSS) normalisation to count
data to correct for inter-sample sequencing depth. This is an alternative to
using total sum scaling and relies on the implementation provided by
`metagenomeSeq`. It reformats `mixOmics` input data so that it can be processed
by `metagenomeSeq` and then converts the CSS normalised output back to
`mixOmics` input. The definition for this normalisation approach according to
`metagenomeSeq` is as follows:

> Calculates each column's quantile and calculates the sum up to and including
> that quantile.

Below is an example of applying this normalisation to the same Koren 16S data
set as was used in the TSS example above:

```R
data(Koren.16S)
Koren.16S$data.raw[1:3, 3:5]
##          410908 177792 4294607
## Feces659      1     99       1
## Feces309      1      1       1
## Mouth599      2      1       1

# Now apply our CSS normalisation to the raw data
normalised <- normalise.css(Koren.16S$data.raw)
normalised[1:3, 3:5]
##            410908   177792  4294607
## Feces659 1.187222 6.993638 1.187222
## Feces309 1.179016 1.179016 1.179016
## Mouth599 1.711633 1.096030 1.096030
```

Here we see that the lowest counts of 1 for each feature/sample show much less
variance under CSS scaling than under TSS scaling.
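
As a rough check, the spread of the normalised values at the positions where
the raw count is exactly 1 can be compared directly. This snippet assumes the
CSS-normalised matrix from the example above is still in scope as `normalised`,
and that `Koren.16S$data.TSS` corresponds element-wise to the raw data:

```R
# Positions in the raw data holding a count of exactly one
ones <- as.matrix(Koren.16S$data.raw) == 1

# Relative spread (coefficient of variation) of those values under each
# normalisation; expect a much smaller value for CSS than for TSS
tss.ones <- as.matrix(Koren.16S$data.TSS)[ones]
css.ones <- as.matrix(normalised)[ones]
sd(tss.ones) / mean(tss.ones)
sd(css.ones) / mean(css.ones)
```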

## Logit

`normalise.logit()` provides normalisation based on the [logit
function](https://en.wikipedia.org/wiki/Logit), which transforms 0.5 to zero:
values below 0.5 become negative and values above 0.5 become positive. The
scale of that negative or positive value is exponential, reaching
negative/positive infinity at 0.0 and 1.0 respectively. This can be a useful
transformation for values on the 0..1 scale, bringing them back into Euclidean
space after TSS normalisation.
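
For intuition, base R exposes the same mapping as `qlogis()`; these values
follow directly from the definition of the logit:

```R
# The logit is zero at 0.5, symmetric about it, and diverges at the extremes
qlogis(c(0.01, 0.25, 0.5, 0.75, 0.99))
## [1] -4.595120 -1.098612  0.000000  1.098612  4.595120
```

Below is an example of transforming TSS-normalised values in this way: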

```R
data(Koren.16S)
Koren.16S$data.TSS[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650

# Now apply our logit normalisation to the TSS data
normalised <- normalise.logit(Koren.16S$data.TSS)
normalised[1:3, 3:5]
##             410908    177792   4294607
## Feces659 -8.124447 -3.499869 -8.124447
## Feces309 -7.972466 -7.972466 -7.972466
## Mouth599 -7.803027 -8.496378 -8.496378
```

As can be seen, this adds more distance between values, which can be beneficial
for model fitting. In addition, values which are virtually zero will be pushed
heavily towards a very negative value. If any values to be transformed are
exactly 0.0 or 1.0, the logit function will generate infinite values, which are
inappropriate for modelling. For this reason, a second, empirical function is
provided by OmicsFold, `normalise.logit.empirical()`, which moves measurements
away from 0.0 and 1.0 on a per-feature basis, avoiding the generation of
infinite values.
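
Usage is otherwise the same; a minimal sketch, assuming it accepts the same
input as `normalise.logit()`:

```R
# Safe even where some values are exactly 0.0 or 1.0 (signature assumed to
# mirror normalise.logit())
normalised <- normalise.logit.empirical(Koren.16S$data.TSS)
```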

## Centered Log-Ratio

`normalise.clr()` applies the centered log-ratio (CLR) transformation to the
data, where each measurement is divided by the geometric mean of the
measurements and the log of that ratio is returned. This centers the
log-transformed data about zero. This transformation can be used with
sum-scaled OTU data as an alternative to the logit transformation above.
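
Equivalently, the log of that ratio is the log of each value minus the mean of
the logs. A minimal sketch of the arithmetic for a single positive vector,
purely for illustration (`clr.sketch` is hypothetical; `normalise.clr()` itself
delegates to `mixOmics`, as described below):

```R
# CLR of a positive vector: log of each value over the geometric mean of the
# vector, which is the same as centering the logs about their mean
clr.sketch <- function(x) log(x) - mean(log(x))

clr.sketch(c(1, 10, 100))
## [1] -2.302585  0.000000  2.302585
```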

This CLR function is a convenience wrapper around the
[`logratio.transfo()`](https://www.rdocumentation.org/packages/mixOmics/versions/6.3.2/topics/logratio.transfo)
function from the `mixOmics` package, maintaining the structure of the input
data after the transformation is applied. Logs of zero produce infinite values,
so if the data being normalised contains zero values, a small offset can be
provided to prevent this problem. The following is an example of applying the
CLR transformation with and without a small offset to the Koren 16S TSS data.
The offset is unnecessary for this data as there are no zero values, but the
results are similar for non-zero values.

```R
data(Koren.16S)
Koren.16S$data.TSS[1:3, 3:5]
##                410908       177792      4294607
## Feces659 0.0002961208 0.0293159609 0.0002961208
## Feces309 0.0003447087 0.0003447087 0.0003447087
## Mouth599 0.0004083299 0.0002041650 0.0002041650

# Now apply our CLR normalisation to the TSS data
normalised.1 <- normalise.clr(Koren.16S$data.TSS)
normalised.1[1:3, 3:5]
##              410908     177792    4294607
## Feces659 -0.3861923  4.2089276 -0.3861923
## Feces309 -0.3387886 -0.3387886 -0.3387886
## Mouth599  0.5184764 -0.1746708 -0.1746708

# Re-apply the CLR normalisation with a small offset
normalised.2 <- normalise.clr(Koren.16S$data.TSS, offset = 0.000001)
normalised.2[1:3, 3:5]
##              410908     177792    4294607
## Feces659 -0.3856632  4.2061195 -0.3856632
## Feces309 -0.3383694 -0.3383694 -0.3383694
## Mouth599  0.5164012 -0.1743060 -0.1743060
```

Where the intention is to apply the centered log-ratio to non-OTU data, the
function above should be avoided, as it applies an inter-sample normalisation.
For this purpose OmicsFold also provides a related function,
`normalise.clr.within.features()`, which ensures that the means used to center
the log-ratio are calculated within a feature instead. This can be more
appropriate for non-OTU data.
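
A minimal usage sketch, assuming it accepts the same samples-by-features input
as `normalise.clr()` (`non.otu.data` here is a placeholder for your own table):

```R
# Center the log-ratios within each feature rather than within each sample
# (assumed signature, mirroring normalise.clr())
normalised <- normalise.clr.within.features(non.otu.data)
```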