---
title: "Data Generator"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Generator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, warning=FALSE, message=FALSE}
if (!reticulate::py_module_available("tensorflow")) {
  knitr::opts_chunk$set(eval = FALSE)
} else {
  knitr::opts_chunk$set(eval = TRUE)
}
```

```{r, message=FALSE}
library(deepG)
library(magrittr)
library(keras)
```

```{r, echo=FALSE, warning=FALSE, message=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

```{css, echo=FALSE}
mark.in {
  background-color: CornflowerBlue;
}

mark.out {
  background-color: IndianRed;
}
```

## Introduction

The most common use case for the deepG data generator is to extract samples from a collection of
fasta (or fastq) files.
The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$.
We can differentiate between two approaches:

+ **Language model**: Part of a sequence is the input and another part the target.
  + Example: Predict the next nucleotide given the previous 100 nucleotides.
+ **Label classification**: Assign a label to a sequence.
  + Example: Assign a label "virus" or "bacteria" to a sequence of length 100.

Suppose we are given 2 fasta files called "a.fasta" and "b.fasta" that look as follows:

<div style="float: left;margin-right:10px">
<table>
<tr>
<td>
**a.fasta** <br>
<tt>
>header_a1 <br>
AACCAAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td>
</tr>
</table>
</div>
<div style="float: left">
<table>
<tr>
<td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td>
</tr>
</table>
</div>
<br><br><br><br><br><br><br><br><br>

If we want to extract sequences of length 4 from these files, there would be 17 possible samples
(5 from <tt>AACCAAGG</tt>, 3 from <tt>TTTGGG</tt>, ...).
A naive approach would be to extract the samples in a sequential manner:

*1. sample*:

<div style="float: left;margin-right:10px">
<table>
<tr>
<td>
**a.fasta** <br>
<tt>
>header_a1 <br>
<mark class="in">AACC</mark>AAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td>
</tr>
</table>
</div>
<div style="float: left">
<table>
<tr>
<td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td>
</tr>
</table>
</div>
<br><br><br><br><br><br><br><br><br>

*2. sample*:

<div style="float: left;margin-right:10px">
<table>
<tr>
<td>
**a.fasta** <br>
<tt>
>header_a1 <br>
A<mark class="in">ACCA</mark>AGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td>
</tr>
</table>
</div>
<div style="float: left">
<table>
<tr>
<td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td>
</tr>
</table>
</div>
<br><br><br><br><br><br><br><br><br>

...

<br>

*17. sample*:

<div style="float: left;margin-right:10px">
<table>
<tr>
<td>
**a.fasta** <br>
<tt>
>header_a1 <br>
AACCAAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td>
</tr>
</table>
</div>
<div style="float: left">
<table>
<tr>
<td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
<mark class="in">AAGG</mark> <br>
</tt>
</td>
</tr>
</table>
</div>
<br><br><br><br><br><br><br><br><br>

*18. sample*:

<div style="float: left;margin-right:10px">
<table>
<tr>
<td>
**a.fasta** <br>
<tt>
>header_a1 <br>
<mark class="in">AACC</mark>AAGG <br>
>header_a2 <br>
TTTGGG <br>
>header_a3 <br>
ACGTACGT <br>
</tt>
</td>
</tr>
</table>
</div>
<div style="float: left">
<table>
<tr>
<td>
**b.fasta** <br>
<tt>
>header_b1 <br>
GTGTGT <br>
>header_b2 <br>
AAGG <br>
</tt>
</td>
</tr>
</table>
</div>
<br><br><br><br><br><br><br><br><br>

...
<br><br>

For longer sequences this is not a desirable strategy since the data is very redundant (often just one nucleotide difference) and
the model would often see long stretches of data from the same source.
Choosing the samples completely at random can also be problematic since we would constantly have to open new files.
The deepG generator offers several options to adjust the data sampling strategy and achieve a good balance between the two approaches.

## Data generator options

In the following code examples, we will mostly use the sequence <tt> **abcdefghiiii** </tt> to demonstrate some of the
deepG data generator options. (In a real-world application, you would usually have sequences from the <tt>ACGT</tt> vocabulary.)

```{r warning = FALSE, message = FALSE}
sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
```

We may store this sequence in a fasta file:

```{r warning = FALSE, message = FALSE}
temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)
```

Since neural networks can only work with numeric data, we have to encode sequences of characters as numeric data.
Usually this is achieved by one-hot encoding; some other approaches are implemented as well: see the `use_coverage`, `use_quality_score`
and `ambiguous_nuc` sections.

```{r warning = FALSE, message = FALSE}
# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
s_as_int_seq <- vector("integer", length(s))
for (i in seq_along(s)) {
  s_as_int_seq[i] <- which(s[i] == vocabulary) - 1
}
one_hot_sample <- keras::to_categorical(s_as_int_seq)
colnames(one_hot_sample) <- vocabulary
one_hot_sample
```

### maxlen

The length of the input sequence.
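
As a minimal sketch (using `get_generator`, which the following sections explain in detail), `maxlen` sets the second dimension of the one-hot-encoded input tensor:

```{r warning = FALSE, message = FALSE}
# sketch: the input of one batch has shape (batch_size, maxlen, length(vocabulary))
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")
dim(gen()[[1]]) # (1, 6, 9)
```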

### vocabulary

The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with
the `ambiguous_nuc` argument.

### train_type

The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$.
The `train_type` argument determines how $X$ and $Y$ get extracted.
Possible arguments for <u> *language models* </u> are:

+ **"lm"** or **"lm_rds"**: Given some sequence $s$, we take some subset of that sequence as input and the rest as target. How to split $s$
can be specified in the `output_format` argument.

Besides the language model approach, we can use <u> *label classification* </u>. This means we map some label to a sequence. For example, the target for some
nucleotide sequence could be one of the labels "bacteria" or "virus". We have to specify how to extract a label corresponding to a sequence.
Possible arguments are:

+ **"label_header"**: get the label from the fasta headers.
+ **"label_folder"**: get the label from the folder, i.e. all files in one folder must belong to the same class.
+ **"label_csv"**: get the label from a csv file. The csv file should have one column named "file". The targets then correspond to the entries in that row (except the "file"
column). Example: if we are currently working with a file called "a.fasta", there should be a row in our csv file with some target information for that file <br>

| file | label_1 | label_2 |
| --- | --- | --- |
| "a.fasta" | 1 | 0 |

+ **"label_rds"**: the rds file contains a preprocessed list of input and target tensors.

Another option is **"dummy_gen"**: the generator creates random data once and repeatedly returns them.

Extract target from fasta header (fasta header is "label_1" in example file):

```{r warning = FALSE, message = FALSE}
# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
                     train_type = "label_header",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
```

Extract target from fasta folder:

```{r warning = FALSE, message = FALSE}
# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
                     train_type = "label_folder",
                     batch_size = 8,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]]
y <- z[[2]]
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label
y # 4 samples from each class
```

Extract target from csv file:

```{r warning = FALSE, message = FALSE}
# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label_", 1:4)
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)

gen <- get_generator(path = file_path,
                     train_type = "label_csv",
                     batch_size = 1,
                     maxlen = 6,
                     target_from_csv = csv_file,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
```

Examples for language models follow in the next section.

### output_format

The `output_format` determines the shape of the output for a language model, i.e. part of a sequence is the input $X$ and another part the
target $Y$. Assume a sequence <tt>abcdefg</tt> and `maxlen = 6`. The outputs correspond as follows:

**"target_right"**: $X=$ <tt>abcdef</tt>, $Y=$ <tt>g</tt>

**"target_middle_lstm"**: $X =$ ($X_1 =$ <tt>abc</tt>, $X_2 =$ <tt>gfe</tt>), $Y=$ <tt>d</tt> (note the reversed order of $X_2$)

**"target_middle_cnn"**: $X =$ <tt>abcefg</tt>, $Y =$ <tt>d</tt>

**"wavenet"**: $X =$ <tt>abcdef</tt>, $Y =$ <tt>bcdefg</tt>

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g
```

```{r warning = FALSE, message = FALSE}
# target_middle_lstm
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_lstm")

z <- gen()
x_1 <- z[[1]][[1]][1,,]
x_2 <- z[[1]][[2]][1,,]
y <- z[[2]]
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d
```

```{r warning = FALSE, message = FALSE}
# target_middle_cnn
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_cnn")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d
```

```{r warning = FALSE, message = FALSE}
# wavenet
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "wavenet")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]][1,,]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg
```

### batch_size

Number of samples in one batch.

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 7,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
dim(y)
```

### step

We may determine how frequently we want to take a sample. If `step = 1`, we take a sample at every possible position.
Let's assume we want to predict the next character, i.e. part of the sequence is the <mark class="in">input</mark> and the next character the
<mark class="out">target</mark>. If `maxlen = 3, step = 1`:

1. sample: <tt><mark class="in">abc</mark><mark class="out">d</mark>efghiiii</tt>

2. sample: <tt>a<mark class="in">bcd</mark><mark class="out">e</mark>fghiiii</tt>

3. sample: <tt>ab<mark class="in">cde</mark><mark class="out">f</mark>ghiiii</tt>

If `step = 3`:

1. sample: <tt><mark class="in">abc</mark><mark class="out">d</mark>efghiiii</tt>

2. sample: <tt>abc<mark class="in">def</mark><mark class="out">g</mark>hiiii</tt>

3. sample: <tt>abcdef<mark class="in">ghi</mark><mark class="out">i</mark>ii</tt>

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     step = 3,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] # encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1,,] # encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
```

### padding

If the sequence is too short to create a single sample, we can pad the sequence with zero-vectors. If `padding = FALSE`, the generator will go to the next file/fasta entry until it finds a sequence long enough for a sample.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 15, # maxlen is longer than sequence
                     vocabulary = vocabulary,
                     step = 3,
                     padding = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
```

### ambiguous_nuc

A sequence might contain a character that does not lie inside our vocabulary. For example, let's assume we discard <tt>e</tt> from our vocabulary.
We have 4 options to handle this situation:

(1) encode as zero vector

```{r warning = FALSE, message = FALSE}
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary

# zero
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is zero vector
```

(2) equal probability

```{r warning = FALSE, message = FALSE}
# equal
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "equal")

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry
```

(3) use distribution of current file

```{r warning = FALSE, message = FALSE}
# empirical
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "empirical")

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is the character distribution of the file
```

(4) discard

```{r warning = FALSE, message = FALSE}
# discard
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "discard")

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # first sample with only characters from vocabulary is fghiii|i
```

### proportion_per_seq

The `proportion_per_seq` argument gives the option to use a random subset instead of the full sequence.

```{r warning = FALSE, message = FALSE}
cat("sequence is ", nchar(sequence), "characters long \n")
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     seed = 1,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     # take random subsequence using 50% of sequence
                     proportion_per_seq = 0.5)

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # defgh
y # i
```

### file_limit

Integer or `NULL`. If an integer, use only the specified number of randomly sampled files for training.
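
A minimal sketch (here the directory contains just our single dummy file, so the random choice is trivial):

```{r warning = FALSE, message = FALSE}
# sketch: restrict training to 1 randomly sampled file from the directory
gen <- get_generator(path = dir_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     file_limit = 1)
```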

### delete_used_files

If `TRUE`, delete a file once it has been used. Only applies to rds files.

```{r warning = FALSE, message = FALSE}
x <- array(0, dim = c(1, 5, 4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)

gen <- get_generator(path = rds_path,
                     delete_used_files = TRUE,
                     train_type = "label_rds",
                     batch_size = 1,
                     maxlen = 5)

z <- gen()
file.exists(rds_path)
# z <- gen()
# When calling the generator again, it will wait until it finds a new file among the files listed in
# the initial `path` argument. This can be useful if other processes create rds files.
```

### max_samples

Only use a fixed number of samples per file and randomly choose which samples to use. (If `random_sampling = FALSE`, the samples are consecutive.)

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     step = 1,
                     seed = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg
```

### random_sampling

If you use `max_samples`, the generator will randomly choose a subset of all possible samples, but those samples are consecutive. With `random_sampling = TRUE`,
samples are chosen completely at random.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     seed = 66,
                     random_sampling = TRUE,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
```

### target_len

Target length for a language model.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 3,
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
y3 <- z[[2]][ , 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h
```

### n_gram / n_gram_stride

Instead of encoding a language model's input and target character by character, combine n characters into one token. `n_gram_stride` determines the step size between
consecutive n-grams.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 6,
                     n_gram = 3,
                     n_gram_stride = 3,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]

dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
```

### add_noise

Add noise to the input. Must be a list that specifies the noise distribution, or NULL (no noise).
The list contains the argument `noise_type`: either `"normal"` or `"uniform"`.
Optional arguments are `sd` or `mean` if `noise_type` is `"normal"` (default is `sd = 1` and `mean = 0`), or `min`, `max` if `noise_type` is `"uniform"`
(default is `min = 0`, `max = 1`).

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f
```

### proportion_entries

If a fasta file has multiple entries, you can randomly choose a subset.
For example, if the file has 6 entries and `proportion_entries = 0.5`,
the generator will randomly choose only 3 of the entries.
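
A quick sketch (which entries get chosen is random, so results vary):

```{r warning = FALSE, message = FALSE}
# sketch: fasta file with 6 entries; use a random half of them
df <- data.frame(Sequence = rep("ACGTACGT", 6), Header = paste0("header_", 1:6))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right",
                     proportion_entries = 0.5)
z <- gen()
```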

### shuffle_file_order

Shuffle the file order before iterating through the files. The order gets reshuffled after every iteration.

### shuffle_input

Whether to shuffle fasta entries if a fasta file has multiple entries.

### reverse_complement

If `TRUE`, randomly decide for every batch whether to use the original sequence or its reverse complement.
Only implemented for <tt>ACGT</tt> vocabulary.
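
A sketch to make the effect visible (which of the two variants you get is random): for an input of only <tt>T</tt>'s, the reverse complement contains only <tt>A</tt>'s.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "TTTTTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = c("A", "C", "G", "T"),
                     reverse_complement = TRUE,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- c("A", "C", "G", "T")
x # either TTTTTT (original) or AAAAAA (reverse complement)
```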

### sample_by_file_size

Randomly choose a new file by sampling according to file size (bigger files are more likely to be chosen).

### concat_seq

Character string or `NULL`. If not `NULL`, all entries from a file get concatenated into one sequence, with the `concat_seq` string between them.
Use `concat_seq = ""` if you don't want to add a new token.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 9,
                     vocabulary = c("A", "C", "G", "T", "Z"),
                     concat_seq = "ZZ",
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T", "Z")
colnames(y) <- c("A", "C", "G", "T", "Z")
x # ACZZAGZZA
y # T
```

### set_learning

Use this when you want to assign one label to a set of samples. Only implemented for `train_type = "label_folder"`.
Input is a list with the following parameters:

+ `samples_per_target`: how many samples to use for one target
+ `maxlen`: length of one sample
+ `reshape_mode`: `"time_dist"`, `"multi_input"` or `"concat"`.
  + If `reshape_mode = "multi_input"`, the generator will produce `samples_per_target` separate inputs, each of length `maxlen`.
  + If `reshape_mode = "time_dist"`, the generator will produce a 4D input array. The dimensions correspond to
    `(batch_size, samples_per_target, maxlen, length(vocabulary))` (see the sketch after the examples below).
  + If `reshape_mode` is `"concat"`, the generator will concatenate `samples_per_target` sequences
    of length `maxlen` to one long sequence.
    + If `reshape_mode = "concat"`, there is an additional `buffer_len` argument: add a new token between
      concatenated samples.
    + If `buffer_len` is an integer, the sub-sequences are interspaced with `buffer_len` rows. The input length is
      (`maxlen` \* `samples_per_target`) + `buffer_len` \* (`samples_per_target` - 1)

```{r warning = FALSE, message = FALSE}
# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# multi_input
set_learning <- list(reshape_mode = "multi_input",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1,
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef

x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac

colnames(y) <- c("label_1", "label_2")
y
```

```{r warning = FALSE, message = FALSE}
# concat
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2,
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa

colnames(y) <- c("label_1", "label_2")
y
```
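
The `"time_dist"` mode has no walkthrough above; as a minimal sketch (reusing the files from the previous examples), it returns a single 4D array instead of a list of inputs:

```{r warning = FALSE, message = FALSE}
# sketch: time_dist stacks the samples along an extra dimension
set_learning <- list(reshape_mode = "time_dist",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2),
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1,
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
dim(z[[1]]) # (batch_size, samples_per_target, maxlen, length(vocabulary)) = (2, 3, 4, 9)
```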

### use_quality_score

If `TRUE`, instead of one-hot encoding, use the quality scores of a fastq file.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1", Quality = "!#*=?I")
fastq_path <- tempfile(fileext = ".fastq")
fastq_file <- microseq::writeFastq(df, fastq_path)
gen <- get_generator(path = fastq_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     format = "fastq",
                     vocabulary = c("A", "C", "G", "T"),
                     use_quality_score = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA
y # T
```

### use_coverage

Integer or `NULL`. If not `NULL`, use coverage as encoding rather than one-hot encoding (the encoded value is the coverage divided by `use_coverage`, as the comment below shows).
Coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where
n is some integer.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = c("A", "C", "G", "T"),
                     use_coverage = 25,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA; 0.32 = 8/25
y # T
```

### added_label_path

It is possible to feed a network additional information associated with a sequence. This information needs to be in a csv file. If all sequences in one file share the same label, the csv file should have one column named "file".

We may add some additional input to our dummy data:

```{r warning = FALSE, message = FALSE}
file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
                 label_1 = c(0, 1), label_2 = c(1, 0), label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)
```

If we add the path to the csv file, the generator will map the additional input to the sequences:

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = dir_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     output_format = "target_right",
                     vocabulary = vocabulary,
                     added_label_path = file.path(dir_path, "add_input.csv"),
                     add_input_as_seq = FALSE) # don't treat added input as sequence

z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]]
y
```

If we want to train a network with additional labels, we have to add an additional input layer.

```{r warning = FALSE, message = FALSE}
model <- create_model_lstm_cnn(
  maxlen = 5,
  layer_lstm = c(8, 8),
  layer_dense = c(4),
  label_input = 3 # additional input vector has length 3
)

# train_model(train_type = "lm",
#             model = model,
#             path = file.path(dir_path, "train_files_1"),
#             path_val = file.path(dir_path, "validation_files_1"),
#             added_label_path = file.path(dir_path, "add_input.csv"),
#             steps_per_epoch = 5,
#             batch_size = 8,
#             epochs = 2)
```

### return_int

Whether to return integer encoding rather than one-hot encoding.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ATCGC", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     return_int = TRUE,
                     padding = TRUE,
                     maxlen = 8,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G")
x
colnames(y) <- "C"
y
```

Can also be combined with n-gram encoding:

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     n_gram = 3,
                     n_gram_stride = 3,
                     return_int = TRUE,
                     maxlen = 6,
                     target_len = 3,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("AAA", "CCC")
x
colnames(y) <- "TTT"
y
```

### reshape_xy

Apply some function to the output of a generator call.

```{r}
df <- data.frame(Sequence = "AAAATTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
fx <- function(x = NULL, y = NULL) {
  return(x - 1)
}
fy <- function(x = NULL, y = NULL) {
  return(exp(y * 5))
}

gen <- get_generator(path = fasta_path,
                     reshape_xy = list(x = fx, y = fy),
                     train_type = "label_folder",
                     batch_size = 1,
                     maxlen = 8)

z <- gen()
x <- z[[1]]
x[1,,]
y <- z[[2]]
y
```

### masked_lm

Masks some parts of the input sequence. Can be used for training BERT-like models.

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.025, # set 2.5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE) # 0,1 matrix showing where masking was applied
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 1,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)
```

Whenever the sw (sample weight) column is 0, the x and y columns are identical. Let's look at rows where sw is 1:

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1)
```

Here 5 is the mask token; this is always the size of the vocabulary + 1.

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (which can randomly be the true value)
```

Can be combined with n-gram encoding and masking of fixed block size:

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.05, # set 5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE, # 0,1 matrix showing where masking was applied
                  block_len = 3) # always mask at least 3 tokens in a row
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 3,
                     seed = 12,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)
```

We can check that sample weights appear only in blocks:

```{r warning = FALSE, message = FALSE}
which(sw == 1)
```

Here 65 is the mask token (4^3 + 1 = size of the vocabulary + 1).

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 65) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 65) # 5% identity part and 5% random part (which can randomly be the true value)
```