---
title: "Data Generator"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Generator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, warning=FALSE, message=FALSE}
if (!reticulate::py_module_available("tensorflow")) {
  knitr::opts_chunk$set(eval = FALSE)
} else {
  knitr::opts_chunk$set(eval = TRUE)
}
```

```{r, message=FALSE}
library(deepG)
library(magrittr)
library(keras)
```

```{r, echo=FALSE, warning=FALSE, message=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

```{css, echo=FALSE}
mark.in {
  background-color: CornflowerBlue;
}

mark.out {
  background-color: IndianRed;
}
```

## Introduction

The most common use case for the deepG data generator is to extract samples from a collection of 
fasta (or fastq) files.
The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$.
We can differentiate between 2 approaches:

+ **Language model**: Part of a sequence is the input and another part the target.
    + Example: Predict the next nucleotide given the previous 100 nucleotides. 
+ **Label classification**: Assign a label to a sequence.  
    + Example: Assign the label "virus" or "bacteria" to a sequence of length 100.

Suppose we are given 2 fasta files called "a.fasta" and "b.fasta" that look as follows:

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        AACCAAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

If we want to extract sequences of length 4 from these files, there would be 17 possible samples 
(5 from <tt>AACCAAGG</tt>, 3 from <tt>TTTGGG</tt>, ...). 
A naive approach would be to extract the samples in a sequential manner:  

*Sample 1*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        <mark class="in">AACC</mark>AAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

*Sample 2*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        A<mark class="in">ACCA</mark>AGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

... 

<br>

*Sample 17*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        AACCAAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              <mark class="in">AAGG</mark> <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

*Sample 18*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        <mark class="in">AACC</mark>AAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

... 
<br><br>

For longer sequences this is not a desirable strategy: the data is highly redundant (consecutive samples often differ by just one nucleotide) and 
the model would often see long stretches of data from the same source.
Choosing the samples completely at random can also be problematic, since we would constantly have to open new files.
The deepG generator offers several options to adjust the data sampling strategy and achieve a good balance between the two approaches.

## Data generator options

In the following code examples, we will mostly use the sequence <tt> **abcdefghiiii** </tt> to demonstrate some of the 
deepG data generator options. (In real-world applications, you would usually have sequences over the <tt>ACGT</tt> vocabulary.) 

```{r warning = FALSE, message = FALSE}
sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")  
```

We may store this sequence in a fasta file:

```{r warning = FALSE, message = FALSE}
temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)
```

Since neural networks can only work with numeric data, we have to encode sequences of characters as numeric data.
Usually this is achieved by one-hot encoding; some other approaches are implemented as well: see the `use_coverage`, `use_quality_score` 
and `ambiguous_nuc` sections.

```{r warning = FALSE, message = FALSE}
# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
s_as_int_seq <- vector("integer", length(s))
for (i in seq_along(s)) {
  s_as_int_seq[i] <- which(s[i] == vocabulary) - 1
}
one_hot_sample <- keras::to_categorical(s_as_int_seq)
colnames(one_hot_sample) <- vocabulary
one_hot_sample
```

### maxlen 

The length of the input sequence.

### vocabulary 

The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with
the `ambiguous_nuc` argument.

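As a minimal sketch of how `maxlen` and `vocabulary` determine the input shape, using the dummy file from above and the `get_generator` function that the following sections introduce in detail:

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
# one batch sample, maxlen time steps, one column per vocabulary character:
# (batch_size, maxlen, length(vocabulary)) = (1, 6, 9)
dim(z[[1]])
```
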
### train_type

The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$.
The `train_type` argument determines how $X$ and $Y$ get extracted. 
Possible arguments for <u> *language models* </u> are:

+ **"lm"** or **"lm_rds"**: Given some sequence $s$, we take some subset of that sequence as input and the rest as target. How to split $s$ 
  can be specified in the `output_format` argument.

Besides the language model approach, we can use <u> *label classification* </u>. This means we map some label to a sequence. For example, the target for some 
nucleotide sequence could be one of the labels "bacteria" or "virus". We have to specify how to extract the label corresponding to a sequence. 
Possible arguments are:

+ **"label_header"**: get the label from the fasta header.
+ **"label_folder"**: get the label from the folder, i.e. all files in one folder must belong to the same class.
+ **"label_csv"**: get the label from a csv file. The csv file should have one column named "file". The targets then correspond to the entries in that row (except the "file"
column). Example: if we are currently working with a file called "a.fasta", there should be a row in our csv file with some target information for that file <br>

  |  file       | label_1 | label_2 | 
  |   ---       |   ---   |  ---    |   
  | "a.fasta"   |    1    |    0    |

+ **"label_rds"**: the rds file contains a preprocessed list of input and target tensors.

Another option is **"dummy_gen"**: the generator creates random data once and repeatedly returns it, as in the sketch below.

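A minimal sketch of the `"dummy_gen"` option (assuming it accepts the same basic arguments as the other modes; in this mode the content of the files under `path` should not matter):

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "dummy_gen",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = paste0("label_", 1:5))

z1 <- gen()
z2 <- gen()
# the random batch is created once, so repeated calls should return the same data
identical(z1, z2)
```
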
Extract target from fasta header (the fasta header is "label_1" in our example file):

```{r warning = FALSE, message = FALSE}
# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
                     train_type = "label_header",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label 
x # abcdef
y # label_1 
```

Extract target from fasta folder:

```{r warning = FALSE, message = FALSE}
# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
                     train_type = "label_folder",
                     batch_size = 8,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]]
y <- z[[2]] 
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label 
y # 4 samples from each class  
```

Extract target from csv file:

```{r warning = FALSE, message = FALSE}
# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label_", 1:4)
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)

gen <- get_generator(path = file_path,
                     train_type = "label_csv",
                     batch_size = 1,
                     maxlen = 6,
                     target_from_csv = csv_file,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label 
x # abcdef
y # label_1 
```

Examples for language models follow in the next section.

### output_format

The `output_format` determines the shape of the output for a language model, i.e. part of a sequence is the input $X$ and another part the
target $Y$. Assume a sequence <tt>abcdefg</tt> and `maxlen = 6`. The outputs then correspond as follows:

**"target_right"**: $X=$  <tt>abcdef</tt>, $Y=$  <tt>g</tt> 

**"target_middle_lstm"**: $X =$ ($X_1 =$ <tt>abc</tt>, $X_2 =$ <tt>gfe</tt>), $Y=$ <tt>d</tt> (note the reversed order of $X_2$)

**"target_middle_cnn"**: $X =$ <tt>abcefg</tt>, $Y =$ <tt>d</tt> 

**"wavenet"**: $X =$ <tt>abcdef</tt>, $Y =$ <tt>bcdefg</tt>

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g 
```

```{r warning = FALSE, message = FALSE}
# target_middle_lstm
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_lstm")

z <- gen()
x_1 <- z[[1]][[1]][1,,] 
x_2 <- z[[1]][[2]][1,,] 
y <- z[[2]] 
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d 
```

```{r warning = FALSE, message = FALSE}
# target_middle_cnn
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_cnn")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d
```

```{r warning = FALSE, message = FALSE}
# wavenet
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "wavenet")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]][1,,]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg
```

### batch_size 

Number of samples in one batch.

```{r warning = FALSE, message = FALSE}
# a batch of 7 samples
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 7,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]] 
dim(x)
dim(y)
```

### step

We may determine how frequently we want to take a sample. If `step = 1`, we take a sample at every possible position. 
Let's assume we want to predict the next character, i.e. part of the sequence is the <mark class="in">input</mark> and the next character the 
<mark class="out">target</mark>. If `maxlen = 3, step = 1`: 

Sample 1: <tt><mark class="in">abc</mark><mark class="out">d</mark>efghiiii</tt> 

Sample 2: <tt>a<mark class="in">bcd</mark><mark class="out">e</mark>fghiiii</tt> 

Sample 3: <tt>ab<mark class="in">cde</mark><mark class="out">f</mark>ghiiii</tt> 

If `step = 3`: 

Sample 1: <tt><mark class="in">abc</mark><mark class="out">d</mark>efghiiii</tt> 

Sample 2: <tt>abc<mark class="in">def</mark><mark class="out">g</mark>hiiii</tt> 

Sample 3: <tt>abcdef<mark class="in">ghi</mark><mark class="out">i</mark>ii</tt> 

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     step = 3, 
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] # encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1,,] # encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
```

### padding

If the sequence is too short to create a single sample, we can pad it with zero-vectors. If `padding = FALSE`, the generator will move on to the next file or fasta entry until it finds a sequence long enough to create a sample.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 15, # maxlen is longer than sequence
                     vocabulary = vocabulary,
                     step = 3,
                     padding = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
```

### ambiguous_nuc

A sequence might contain a character that does not lie inside our vocabulary. For example, let's assume we discard <tt>e</tt> from our vocabulary.
We have 4 options to handle this situation:

(1) encode as a zero vector

```{r warning = FALSE, message = FALSE}
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary

# zeros
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1,,] 
colnames(x) <- vocabulary_2
x # fifth row is a zero vector 
```

(2) equal probability

```{r warning = FALSE, message = FALSE}
# equal
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "equal") 

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry 
```

(3) use the character distribution of the current file

```{r warning = FALSE, message = FALSE}
# empirical
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "empirical") 

z <- gen()
x <- z[[1]][1,,] 
colnames(x) <- vocabulary_2
x # fifth row is the character distribution of the file
```

(4) discard 

```{r warning = FALSE, message = FALSE}
# discard
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "discard") 

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # first sample with only characters from the vocabulary: fghiii|i (input fghiii, target i)
```

### proportion_per_seq

The `proportion_per_seq` argument gives the option to use a random subsequence instead of the full sequence. 

```{r warning = FALSE, message = FALSE}
cat("sequence is", nchar(sequence), "characters long \n")
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     seed = 1,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     # take a random subsequence using 50% of the sequence 
                     proportion_per_seq = 0.5)

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # defgh
y # i
```

### file_limit

Integer or `NULL`. If an integer, use only the specified number of randomly sampled files for training. 

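A small sketch of how this option is passed (our dummy directory only contains one file, so `file_limit = 1` simply keeps that file; with more files, the subset would be sampled at random):

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = dir_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     file_limit = 1,
                     output_format = "target_right")

z <- gen()
dim(z[[1]]) # (1, 6, 9), as without file_limit
```
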
### delete_used_files

If `TRUE`, delete files once they have been used. Only applies to rds files.

```{r warning = FALSE, message = FALSE}
x <- array(0, dim = c(1,5,4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)

gen <- get_generator(path = rds_path,
                     delete_used_files = TRUE,
                     train_type = "label_rds",
                     batch_size = 1,
                     maxlen = 5)

z <- gen()
file.exists(rds_path)
# z <- gen()
# When calling the generator again, it will wait until it finds a new file among the files listed in 
# the initial `path` argument. This can be used if other processes create rds files.
```

### max_samples

Only use a fixed number of samples per file and randomly choose which samples to use. (If `random_sampling = FALSE`, the chosen samples are consecutive.)

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     step = 1,
                     seed = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg
```

### random_sampling

If you use `max_samples`, the generator will randomly choose a subset from all possible samples, but those samples are consecutive. With `random_sampling = TRUE`,
the samples are completely random.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     seed = 66,
                     random_sampling = TRUE,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
```

### target_len 

The target length for a language model.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 3, 
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
y3 <- z[[2]][ , 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h
```

### n_gram / n_gram_stride                       

Instead of encoding the target character-wise, combine n characters into one target. `n_gram_stride` determines the step size between 
consecutive n-grams. 

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 6, 
                     n_gram = 3,
                     n_gram_stride = 3,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]

dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
```

### add_noise

Add noise to the input. Must be `NULL` (no noise) or a list that specifies the noise distribution.
The list contains the argument `noise_type`: either `"normal"` or `"uniform"`.
Optional arguments are `sd` and `mean` if `noise_type` is `"normal"` (default is `sd=1` and `mean=0`), or `min` and `max` if `noise_type` is `"uniform"`
(default is `min=0`, `max=1`).  

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f
```

### proportion_entries

If a fasta file has multiple entries, you can randomly choose a subset.
For example, if the file has 6 entries and `proportion_entries = 0.5`, 
the generator will randomly choose only 3 of the entries.

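A small sketch with a multi-entry file (which entries get drawn is random):

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = c("AAAAAA", "CCCCCC", "GGGGGG", "TTTTTT"),
                 Header = paste0("entry_", 1:4))
multi_entry_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, multi_entry_path)
gen <- get_generator(path = multi_entry_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = c("A", "C", "G", "T"),
                     proportion_entries = 0.5, # use about half of the 4 entries
                     output_format = "target_right")

z <- gen()
z[[1]][1, , ] # sample comes from one of the randomly chosen entries
```
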
### shuffle_file_order 

Shuffle the file order before iterating through the files. The order gets reshuffled after every iteration.

### shuffle_input 

Whether to shuffle the fasta entries if a fasta file has multiple entries.

### reverse_complement

If `TRUE`, randomly decide for every batch whether to use the original sequence or its reverse complement.
Only implemented for the <tt>ACGT</tt> vocabulary.

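A small sketch (whether you see the original or the reverse-complement encoding is random):

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAACCC", Header = "header_1")
rc_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, rc_path)
gen <- get_generator(path = rc_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = c("A", "C", "G", "T"),
                     reverse_complement = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- c("A", "C", "G", "T")
x # encodes AAACC (original) or GGGTT (reverse complement)
```
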
### sample_by_file_size 

Randomly choose a new file by sampling according to file size (bigger files are more likely to be picked).

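A sketch with two files of different size; `shuffle_file_order` and `shuffle_input` from the sections above are plain flags that can be passed in the same way:

```{r warning = FALSE, message = FALSE}
size_dir <- tempfile()
dir.create(size_dir)
df_small <- data.frame(Sequence = "ACGTACGT", Header = "small")
df_big <- data.frame(Sequence = paste(rep("ACGT", 50), collapse = ""), Header = "big")
fasta_file <- microseq::writeFasta(df_small, file.path(size_dir, "small.fasta"))
fasta_file <- microseq::writeFasta(df_big, file.path(size_dir, "big.fasta"))

gen <- get_generator(path = size_dir,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = c("A", "C", "G", "T"),
                     sample_by_file_size = TRUE, # big.fasta will be picked more often
                     output_format = "target_right")
```
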
### concat_seq                     

Character string or `NULL`. If not `NULL`, all entries from a file get concatenated to one sequence, with the `concat_seq` string placed between them. 
Use `concat_seq = ""` if you don't want to add a new token.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 9,
                     vocabulary = c("A", "C", "G", "T", "Z"),
                     concat_seq = "ZZ",
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T", "Z")
colnames(y) <- c("A", "C", "G", "T", "Z")
x # ACZZAGZZA
y # T
```

### set_learning

When you want to assign one label to a set of samples. Only implemented for `train_type = "label_folder"`.
The input is a list with the following parameters:

+  `samples_per_target`: how many samples to use for one target.
+  `maxlen`: length of one sample.
+  `reshape_mode`: `"time_dist"`, `"multi_input"` or `"concat"`. 
     + If `reshape_mode = "multi_input"`, the generator will produce `samples_per_target` separate inputs, each of length `maxlen`. 
     + If `reshape_mode = "time_dist"`, the generator will produce a 4D input array. The dimensions correspond to
       `(batch_size, samples_per_target, maxlen, length(vocabulary))` (see the sketch after the multi_input example below).   
     + If `reshape_mode` is `"concat"`, the generator will concatenate `samples_per_target` sequences
       of length `maxlen` to one long sequence.
+  If `reshape_mode = "concat"`, there is an additional `buffer_len` argument that adds a new token between 
   concatenated samples (see the sketch after the concat example below):
    + If `buffer_len` is an integer, the sub-sequences are interspaced with `buffer_len` rows. The input length is then
      (`maxlen` \* `samples_per_target`) + `buffer_len` \* (`samples_per_target` - 1).   

```{r warning = FALSE, message = FALSE}
# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# multi_input 
set_learning <- list(reshape_mode = "multi_input",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef

x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac

colnames(y) <- c("label_1", "label_2")
y 
```

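The `"time_dist"` mode can be sketched the same way; based on the description above, the input should then be a single 4D array:

```{r warning = FALSE, message = FALSE}
# time_dist 
set_learning <- list(reshape_mode = "time_dist",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
# expected: (batch_size, samples_per_target, maxlen, length(vocabulary)) = (2, 3, 4, 9)
dim(z[[1]])
```
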
```{r warning = FALSE, message = FALSE}
# concat 
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x) 
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa

colnames(y) <- c("label_1", "label_2")
y 
```

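And a sketch with `buffer_len` set, following the length formula from the list above (the buffer token may add an extra column to the encoding, so we only check the input length here):

```{r warning = FALSE, message = FALSE}
# concat with a buffer token between the sub-sequences
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3,
                     buffer_len = 1)

gen <- get_generator(path = c(file_path, file_path_2),
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
# expected input length: 4 * 3 + 1 * (3 - 1) = 14
dim(z[[1]])[2]
```
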
### use_quality_score

If `TRUE`, use the quality scores of the fastq file instead of one-hot encoding.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1", Quality = "!#*=?I")
fastq_path <- tempfile(fileext = ".fastq")
fastq_file <- microseq::writeFastq(df, fastq_path)
gen <- get_generator(path = fastq_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     format = "fastq",
                     vocabulary = c("A", "C", "G", "T"),
                     use_quality_score = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA
y # T
```

### use_coverage 

Integer or `NULL`. If not `NULL`, use coverage as encoding rather than one-hot encoding.
The coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where 
n is some integer.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = c("A", "C", "G", "T"),
                     use_coverage = 25,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA; 0.32 = 8/25
y # T
```

### added_label_path

It is possible to feed a network additional information associated with a sequence. This information needs to be in a csv file. If all sequences in one file share the same label, the csv file should have one column named "file". 

We may add some additional input to our dummy data:

```{r warning = FALSE, message = FALSE}
file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
                 label_1 = c(0, 1), label_2 = c(1, 0), label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)
```

If we add the path to the csv file, the generator will map the additional input to the sequences: 

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = dir_path,
                     train_type = "lm", 
                     batch_size = 1,
                     maxlen = 5,
                     output_format = "target_right",
                     vocabulary = vocabulary,
                     added_label_path = file.path(dir_path, "add_input.csv"),
                     add_input_as_seq = FALSE) # don't treat added input as sequence

z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]] 
y
```

If we want to train a network with additional labels, we have to add an additional input layer:

```{r warning = FALSE, message = FALSE}
model <- create_model_lstm_cnn(
  maxlen = 5,
  layer_lstm = c(8, 8),
  layer_dense = c(4),
  label_input = 3 # additional input vector has length 3
)

# train_model(train_type = "lm", 
#             model = model,
#             path = file.path(dir_path, "train_files_1"),
#             path_val = file.path(dir_path, "validation_files_1"),
#             added_label_path = file.path(dir_path, "add_input.csv"),
#             steps_per_epoch = 5,
#             batch_size = 8,
#             epochs = 2)
```

### return_int

Whether to return an integer encoding rather than one-hot encoding.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ATCGC", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     return_int = TRUE,
                     padding = TRUE,
                     maxlen = 8,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G")
x
colnames(y) <- "C"
y
```

This can also be combined with n-gram encoding:

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     n_gram = 3,
                     n_gram_stride = 3,
                     return_int = TRUE,
                     maxlen = 6,
                     target_len = 3,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("AAA", "CCC")
x
colnames(y) <- "TTT"
y
```

### reshape_xy

Apply a function to the output of a generator call.

```{r}
df <- data.frame(Sequence = "AAAATTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
fx <- function(x = NULL, y = NULL) {
  return(x - 1)
}
fy <- function(x = NULL, y = NULL) {
  return(exp(y * 5))
}

gen <- get_generator(path = fasta_path,
                     reshape_xy = list(x = fx, y = fy),
                     train_type = "label_folder",
                     batch_size = 1,
                     maxlen = 8)

z <- gen()
x <- z[[1]]
x[1,,]
y <- z[[2]]
y
```

### masked_lm

Masks some parts of the input sequence. Can be used for training BERT-like models.

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.025, # set 2.5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE) # 0,1 matrix showing where masking was applied
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 1,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)
```

Whenever the sw (sample weight) column is 0, the x and y columns are identical. Let's look at the rows where sw is 1:

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1)
```

Here 5 is the mask token; it is always the size of the vocabulary + 1.

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (which can randomly be the true value)
```

This can be combined with n-gram encoding and masking of a fixed block size:

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.05, # set 5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE, # 0,1 matrix showing where masking was applied
                  block_len = 3) # always mask at least 3 tokens in a row 
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 3,
                     seed = 12,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)
```

We can check that the sample weights appear only in blocks:

```{r warning = FALSE, message = FALSE}
which(sw == 1)
```

Here 65 is the mask token (4^3 + 1, i.e. the size of the n-gram vocabulary + 1).

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 65) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 65) # 5% identity part and 5% random part (which can randomly be the true value)
```