---
title: "Data Generator"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Generator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, warning=FALSE, message=FALSE}

if (!reticulate::py_module_available("tensorflow")) {
  knitr::opts_chunk$set(eval = FALSE)
} else {
  knitr::opts_chunk$set(eval = TRUE)
}
```

```{r, message=FALSE}
library(deepG)
library(magrittr)
library(keras)
```


```{r, echo=FALSE, warning=FALSE, message=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

```{css, echo=FALSE}
mark.in {
  background-color: CornflowerBlue;
}

mark.out {
  background-color: IndianRed;
}

```

## Introduction

The most common use case for the deepG data generator is to extract samples from a collection of 
fasta (or fastq) files.
The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$.
We can differentiate between two approaches:

+ **Language model**: Part of a sequence is the input and another part is the target.
    + Example: Predict the next nucleotide given the previous 100 nucleotides. 
+ **Label classification**: Assign a label to a sequence.  
    + Example: Assign the label "virus" or "bacteria" to a sequence of length 100.

Suppose we are given 2 fasta files called "a.fasta" and "b.fasta" that look as follows:

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        AACCAAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

If we want to extract sequences of length 4 from these files, there would be 17 possible samples 
(5 from <tt>AACCAAGG</tt>, 3 from <tt>TTTGGG</tt>, ...). 
A naive approach would be to extract the samples in a sequential manner:  

*1. sample*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        <mark class="in">AACC</mark>AAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

*2. sample*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        A<mark class="in">ACCA</mark>AGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

... 

<br>

*17. sample*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        AACCAAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              <mark class="in">AAGG</mark> <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

*18. sample*: 

<div style="float: left;margin-right:10px">
  <table>
    <tr>
      <td>
      **a.fasta** <br>
      <tt> 
        >header_a1 <br>
        <mark class="in">AACC</mark>AAGG <br>
        >header_a2 <br>
        TTTGGG <br>
        >header_a3 <br>
        ACGTACGT <br>
      </tt>
      </td> 
    </tr>
  </table>
</div>
<div style="float: left">
  <table>
    <tr>
      <td>
            **b.fasta** <br>
            <tt> 
              >header_b1 <br>
              GTGTGT <br>
              >header_b2 <br>
              AAGG <br>
              </tt>
      </td>
    </tr>
  </table>
</div>
<br><br><br><br><br><br><br><br><br>

... 
<br><br>

For longer sequences this is not a desirable strategy: the data is highly redundant (consecutive samples often differ by just one nucleotide) and 
the model would see long stretches of data from the same source.
Choosing the samples completely at random can also be problematic, since we would constantly have to open new files.
The deepG generator offers several options to navigate the data sampling strategy and achieve a good balance between the two approaches.   

## Data generator options

In the following code examples, we will mostly use the sequence <tt> **abcdefghiiii** </tt> to demonstrate some of the 
deepG data generator options. (In a real-world application you would usually have sequences over the <tt>ACGT</tt> vocabulary.) 

```{r warning = FALSE, message = FALSE}
sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")  
```

We may store this sequence in a fasta file

```{r warning = FALSE, message = FALSE}
temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)
```

Since neural networks can only work with numeric data, we have to encode sequences of characters as numeric data.
Usually this is achieved by one-hot encoding; some other approaches are implemented as well (see the `use_coverage`, `use_quality_score` 
and `ambiguous_nuc` sections).

```{r warning = FALSE, message = FALSE}
# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
# map each character to its (zero-based) index in the vocabulary
s_as_int_seq <- match(s, vocabulary) - 1
one_hot_sample <- keras::to_categorical(s_as_int_seq, num_classes = length(vocabulary))
colnames(one_hot_sample) <- vocabulary
one_hot_sample
```

### maxlen 

The length of the input sequence. Together with the vocabulary size, it determines the dimensions of the one-hot encoded input (see the sketch at the end of the next section).

### vocabulary 

The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with
the `ambiguous_nuc` argument.
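
Both arguments determine the shape of the encoded input. A minimal sketch, reusing the dummy file from above: for a language model with `maxlen = 4`, each sample is a matrix with 4 rows (one per position) and one column per vocabulary entry.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 4,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
dim(z[[1]]) # (batch_size, maxlen, vocabulary size) = (1, 4, 9)
```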

### train_type

The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$.
The `train_type` argument determines how $X$ and $Y$ get extracted. 
Possible arguments for <u> *language models* </u> are:

+ **"lm"** or **"lm_rds"**: Given some sequence $s$, we take some subset of that sequence as input and the rest as target. How to split $s$ 
  can be specified in `output_format` argument.

Besides the language model approach, we can use <u> *label classification* </u>. This means we map some label to a sequence. For example, the target for some 
nucleotide sequence could be one of the labels "bacteria" or "virus". We have to specify how to extract the label corresponding to a sequence. 
Possible arguments are:

+ **"label_header"**: get the label from the fasta headers.
+ **"label_folder"**: get the label from the folder, i.e. all files in one folder must belong to the same class.
+ **"label_csv"**: get the label from a csv file. The csv file should have one column named "file"; the targets correspond to the remaining entries in that row. Example: if we are currently working with a file called "a.fasta", there should be a row in our csv file with the target information for that file <br>


  |  file       | label_1 | label_2 | 
  |   ---       |   ---   |  ---    |   
  | "a.fasta"   |    1    |    0    |


+ **"label_rds"**: rds file contains preprocessed list of input and target tensors.

Another option is **"dummy_gen"**: the generator creates random data once and returns it repeatedly.
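
A minimal sketch, assuming `dummy_gen` infers the input and target shapes from the usual size arguments (`maxlen`, `vocabulary` and `vocabulary_label`):

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "dummy_gen",
                     batch_size = 2,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = c("label_1", "label_2"))

z1 <- gen()
z2 <- gen()
# the same randomly created batch is returned on every call
```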

Extract the target from the fasta header (the header is "label_1" in our example file):

```{r warning = FALSE, message = FALSE}
# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
                     train_type = "label_header",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label 
x # abcdef
y # label_1 
```

Extract target from fasta folder:

```{r warning = FALSE, message = FALSE}
# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
                     train_type = "label_folder",
                     batch_size = 8,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]]
y <- z[[2]] 
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label 
y # 4 samples from each class  
```

Extract target from csv file:

```{r warning = FALSE, message = FALSE}
# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label_", 1:4)
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)

gen <- get_generator(path = file_path,
                     train_type = "label_csv",
                     batch_size = 1,
                     maxlen = 6,
                     target_from_csv = csv_file,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label 
x # abcdef
y # label_1 
```

Examples for language models follow in the next section.

### output_format

The `output_format` argument determines the shape of the output for a language model, i.e. one part of a sequence is the input $X$ and another part the
target $Y$. Assume a sequence <tt>abcdefg</tt> and `maxlen = 6`. The outputs correspond as follows:

**"target_right"**: $X=$  <tt>abcdef</tt>, $Y=$  <tt>g</tt> 

**"target_middle_lstm"**: $X =$ ($X_1 =$ <tt>abc</tt>, $X_2 =$ <tt>gfe</tt>), $Y=$ <tt>d</tt> (note reversed order of $X_2$)

**"target_middle_cnn"**: $X =$ <tt>abcefg</tt>, $Y =$ <tt>d</tt> 

**"wavenet"**: $X =$ <tt>abcdef</tt>, $Y =$ <tt>bcdefg</tt>

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g 
```

```{r warning = FALSE, message = FALSE}
# target_middle_lstm
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_lstm")

z <- gen()
x_1 <- z[[1]][[1]][1,,] 
x_2 <- z[[1]][[2]][1,,] 
y <- z[[2]] 
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d 
```

```{r warning = FALSE, message = FALSE}
# target_middle_cnn
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_cnn")

z <- gen()
x <- z[[1]][1,,]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d
```

```{r warning = FALSE, message = FALSE}
# wavenet
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "wavenet")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]][1,,]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg
```

### batch_size 

Number of samples in one batch.

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 7,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]] 
dim(x)
dim(y)
```

### step

The `step` argument determines how frequently we take a sample. If `step = 1`, we take a sample at every possible position. 
Let's assume we want to predict the next character, i.e. part of the sequence is the <mark class="in">input</mark> and the next character the 
<mark class="out">target</mark>. If `maxlen = 3, step = 1`: 

1. sample: <tt><mark class="in">abc</mark><mark class="out">d</mark>efghiiii</tt> 

2. sample: <tt>a<mark class="in">bcd</mark><mark class="out">e</mark>fghiiii</tt> 

3. sample: <tt>ab<mark class="in">cde</mark><mark class="out">f</mark>ghiiii</tt> 

If `step = 3`:  

1. sample: <tt><mark class="in">abc</mark><mark class="out">d</mark>efghiiii</tt> 

2. sample: <tt>abc<mark class="in">def</mark><mark class="out">g</mark>hiiii</tt> 

3. sample: <tt>abcdef<mark class="in">ghi</mark><mark class="out">i</mark>ii</tt> 

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     step = 3, 
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] # encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1,,] # encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
```


### padding

If the sequence is too short to create a single sample, we can pad the sequence with zero-vectors. If `padding = FALSE`, the generator will go to the next file or fasta entry until it finds a sequence long enough for a sample.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 15, # maxlen is longer than sequence
                     vocabulary = vocabulary,
                     step = 3,
                     padding = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1,,] 
y <- z[[2]] 
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
```

### ambiguous_nuc

A sequence might contain a character that does not lie inside our vocabulary. For example, let's assume we discard <tt>e</tt> from our vocabulary.
We have 4 options to handle this situation:

(1) encode as zero vector

```{r warning = FALSE, message = FALSE}
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary

# zero
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1,,] 
colnames(x) <- vocabulary_2
x # fifth row is zero vector 
```

(2) equal probability

```{r warning = FALSE, message = FALSE}
# equal
gen <- get_generator(path = file_path,
                    train_type = "lm",
                    batch_size = 1,
                    maxlen = 6,
                    vocabulary = vocabulary_2,
                    output_format = "target_right",
                    ambiguous_nuc = "equal") 

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry 
```

(3) use distribution of current file

```{r warning = FALSE, message = FALSE}
# empirical
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "empirical") 

z <- gen()
x <- z[[1]][1,,] 
colnames(x) <- vocabulary_2
x # fifth row is the character distribution of the file
```

(4) discard 

```{r warning = FALSE, message = FALSE}
# discard
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "discard") 

z <- gen()
x <- z[[1]][1,,]
colnames(x) <- vocabulary_2
x # first sample where all characters are in the vocabulary: fghiii|i
```

### proportion_per_seq

The `proportion_per_seq` argument gives the option to use a random subsequence instead of the full sequence. 

```{r warning = FALSE, message = FALSE}
cat("sequence is ", nchar(sequence), "characters long \n")
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     seed = 1,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     # take random subsequence using 50% of sequence 
                     proportion_per_seq = 0.5)

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # defgh
y # i
```

### file_limit

Integer or `NULL`. If an integer, use only the specified number of randomly sampled files for training. 
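
A minimal sketch, using a fresh directory with two dummy files (the directory and file names are made up for illustration):

```{r warning = FALSE, message = FALSE}
dir_path_2 <- tempfile()
dir.create(dir_path_2)
df_1 <- data.frame(Sequence = "abcdefghiiii", Header = "seq_1")
df_2 <- data.frame(Sequence = "abcabcabcabc", Header = "seq_2")
microseq::writeFasta(df_1, file.path(dir_path_2, "file_1.fasta"))
microseq::writeFasta(df_2, file.path(dir_path_2, "file_2.fasta"))

# use only 1 of the 2 files, chosen at random
gen <- get_generator(path = dir_path_2,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     file_limit = 1)
```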

### delete_used_files

If `TRUE`, delete files once they have been used. Only applies to rds files.

```{r warning = FALSE, message = FALSE}

x <- array(0, dim = c(1,5,4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)

gen <- get_generator(path = rds_path,
                     delete_used_files = TRUE,
                     train_type = "label_rds",
                     batch_size = 1,
                     maxlen = 5)

z <- gen()
file.exists(rds_path)
# z <- gen()
# When calling the generator again, it will wait until it finds a new file among those listed in 
# the initial `path` argument. This can be used if other processes create rds files.
```

### max_samples

Only use a fixed number of samples per file and randomly choose which samples to use. (If `random_sampling = FALSE`, the chosen samples are consecutive.)

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     step = 1,
                     seed = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg
```

### random_sampling

If you use `max_samples`, the generator will randomly choose a subset of all possible samples, but those samples are consecutive. With `random_sampling = TRUE`,
samples are completely random.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     seed = 66,
                     random_sampling = TRUE,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)

z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
```

### target_len 

The length of the target in a language model, i.e. how many characters to predict.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 3, 
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]
y3 <- z[[2]][ , 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h
```

### n_gram / n_gram_stride                       

Combine n characters into one n-gram token instead of encoding the sequence character-wise. `n_gram_stride` determines the step size between 
consecutive n-grams. 

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 6, 
                     n_gram = 3,
                     n_gram_stride = 3,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]]
y1 <- z[[2]][ , 1, ]
y2 <- z[[2]][ , 2, ]

dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
```

### add_noise

Add noise to the input. Must be `NULL` (no noise) or a list that specifies the noise distribution.
The list must contain a `noise_type` argument: either `"normal"` or `"uniform"`.
Optional arguments are `sd` and `mean` if `noise_type` is `"normal"` (default is `sd = 1` and `mean = 0`), or `min` and `max` if `noise_type` is `"uniform"`
(default is `min = 0`, `max = 1`).  

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f
```

### proportion_entries

If a fasta file has multiple entries, you can randomly choose a subset.
For example, if the file has 6 entries and `proportion_entries = 0.5`, 
the generator will randomly choose only 3 of the entries.
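
A minimal sketch, assuming a file with four entries of which roughly half get used:

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = c("abcdef", "bcdefg", "cdefgh", "defghi"),
                 Header = paste0("entry_", 1:4))
multi_entry_path <- tempfile(fileext = ".fasta")
microseq::writeFasta(df, multi_entry_path)

gen <- get_generator(path = multi_entry_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     proportion_entries = 0.5)
z <- gen()
```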

### shuffle_file_order 

Shuffle the file order before iterating through the files. The order gets reshuffled after every iteration (a sketch combining this with `shuffle_input` follows after the next section).

### shuffle_input 

Whether to shuffle the fasta entries if a fasta file has multiple entries.
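
A minimal sketch combining both shuffle options, reusing the two-file directory from the `file_limit` section:

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = dir_path_2,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     shuffle_file_order = TRUE,
                     shuffle_input = TRUE)
```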

### reverse_complement

If `TRUE`, randomly decide for every batch whether to use the original sequence or its reverse complement.
Only implemented for the <tt>ACGT</tt> vocabulary.
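
A minimal sketch with <tt>ACGT</tt> data (the sequence is made up for illustration):

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAACCCGGG", Header = "seq_1")
rc_path <- tempfile(fileext = ".fasta")
microseq::writeFasta(df, rc_path)

gen <- get_generator(path = rc_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right",
                     reverse_complement = TRUE)

z <- gen()
# input encodes the start of AAACCCGGG or of its reverse complement CCCGGGTTT
```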

### sample_by_file_size 

Randomly choose the next file by sampling according to file size (bigger files are more likely).
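
A minimal sketch, again reusing the two-file directory from the `file_limit` section:

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = dir_path_2,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     sample_by_file_size = TRUE)
```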

### concat_seq                     

Character string or `NULL`. If not `NULL`, all entries of a file get concatenated into one sequence with the `concat_seq` string between them. 
Use `concat_seq = ""` if you don't want to add a new token.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 9,
                     vocabulary = c("A", "C", "G", "T", "Z"),
                     concat_seq = "ZZ",
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T", "Z")
colnames(y) <- c("A", "C", "G", "T", "Z")
x # ACZZAGZZA
y # T
```

### set_learning

Use this option when you want to assign one label to a set of samples. Only implemented for `train_type = "label_folder"`.
The input is a list with the following parameters:

+  `samples_per_target`: how many samples to use for one target.
+  `maxlen`: length of one sample.
+  `reshape_mode`: `"time_dist"`, `"multi_input"` or `"concat"`. 
     + If `reshape_mode = "multi_input"`, the generator will produce `samples_per_target` separate inputs, each of length `maxlen`. 
     + If `reshape_mode = "time_dist"`, the generator will produce a 4D input array. The dimensions correspond to
       `(batch_size, samples_per_target, maxlen, length(vocabulary))` (see the sketch at the end of this section).   
     + If `reshape_mode = "concat"`, the generator will concatenate `samples_per_target` sequences
       of length `maxlen` into one long sequence.
+  If `reshape_mode = "concat"`, there is an additional `buffer_len` argument that adds a new token between 
   concatenated samples.
     + If `buffer_len` is an integer, the subsequences are interspaced with `buffer_len` rows. The input length is then
       (`maxlen` \* `samples_per_target`) + `buffer_len` \* (`samples_per_target` - 1)   

```{r warning = FALSE, message = FALSE}
# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# multi_input 
set_learning <- list(reshape_mode = "multi_input",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef

x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac

colnames(y) <- c("label_1", "label_2")
y 
```

```{r warning = FALSE, message = FALSE}
# concat 
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3)

gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2, 
                     vocabulary = vocabulary,
                     set_learning = set_learning)

z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x) 
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa

colnames(y) <- c("label_1", "label_2")
y 
```


### use_quality_score

If `TRUE`, use the quality scores of a fastq file for the encoding instead of plain one-hot encoding.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1", Quality = "!#*=?I")
fastq_path <- tempfile(fileext = ".fastq")
fastq_file <- microseq::writeFastq(df, fastq_path)
gen <- get_generator(path = fastq_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     format = "fastq",
                     vocabulary = c("A", "C", "G", "T"),
                     use_quality_score = TRUE,
                     output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA
y # T
```


### use_coverage 

Integer or `NULL`. If not `NULL`, use coverage for the encoding rather than one-hot encoding.
The coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where 
n is some integer. Entries get encoded as the coverage value divided by `use_coverage` (in the example below, 8/25 = 0.32).

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <-  get_generator(path = fasta_path,
                      train_type = "lm",
                      batch_size = 1,
                      maxlen = 5,
                      vocabulary = c("A", "C", "G", "T"),
                      use_coverage = 25,
                      output_format = "target_right")

z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]

colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA; 0.32 = 8/25
y # T
```

### added_label_path

It is possible to feed a network additional information associated with a sequence. This information needs to be in a csv file. If all sequences in one file share the same label, the csv file should have one column named "file". 

We may add some additional input to our dummy data

```{r warning = FALSE, message = FALSE}
file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
                 label_1 = c(0, 1), label_2 = c(1, 0), label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)
```

If we add the path to the csv file, the generator will map additional input to sequences: 

```{r warning = FALSE, message = FALSE}
gen <-  get_generator(path = dir_path,
                      train_type = "lm", 
                      batch_size = 1,
                      maxlen = 5,
                      output_format = "target_right",
                      vocabulary = vocabulary,
                      added_label_path = file.path(dir_path, "add_input.csv"),
                      add_input_as_seq = FALSE)  # don't treat added input as sequence
                      
z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]] 
y
```

If we want to train a network with additional labels, we have to add an additional input layer.

```{r warning = FALSE, message = FALSE}
model <- create_model_lstm_cnn(
  maxlen = 5,
  layer_lstm = c(8, 8),
  layer_dense = c(4),
  label_input = 3 # additional input vector has length 3
)

# train_model(train_type = "lm", 
#             model = model,
#             path = file.path(dir_path, "train_files_1"),
#             path_val = file.path(dir_path, "validation_files_1"),
#             added_label_path = file.path(dir_path, "add_input.csv"),
#             steps_per_epoch = 5,
#             batch_size = 8,
#             epochs = 2)
```


### return_int

Whether to return an integer encoding rather than a one-hot encoding.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ATCGC", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <-  get_generator(path = fasta_path,
                      train_type = "lm",
                      batch_size = 1,
                      return_int = TRUE,
                      padding = TRUE,
                      maxlen = 8,
                      vocabulary = c("A", "C", "G", "T"),
                      output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G")
x
colnames(y) <- "C"
y
```

Can also be combined with n-gram encoding:

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
gen <-  get_generator(path = fasta_path,
                      train_type = "lm",
                      batch_size = 1,
                      n_gram = 3,
                      n_gram_stride = 3,
                      return_int = TRUE,
                      maxlen = 6,
                      target_len = 3,
                      vocabulary = c("A", "C", "G", "T"),
                      output_format = "target_right")

z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("AAA", "CCC")
x
colnames(y) <- "TTT"
y
```

### reshape_xy

Apply a function to the output of a generator call, e.g. to transform the input and/or the target. `reshape_xy` must be a list of functions with names `x` and/or `y`.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAAATTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
fx <- function(x = NULL, y = NULL) {
  return(x - 1)
}
fy <- function(x = NULL, y = NULL) {
  return(exp(y * 5))
}

gen <-  get_generator(path = fasta_path,
                      reshape_xy = list(x = fx, y = fy),
                      train_type = "label_folder",
                      batch_size = 1,
                      maxlen = 8)

z <- gen()
x <- z[[1]]
x[1,,]
y <- z[[2]]
y
```


### masked_lm

Masks some parts of the input sequence. Can be used for training BERT-like models.

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.025, # set 2.5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE) # 0,1 matrix showing where masking was applied
gen <-  get_generator(path = fasta_path,
                      train_type = "masked_lm",
                      masked_lm = masked_lm,
                      batch_size = 1,
                      n_gram = 1,
                      n_gram_stride = 1,
                      return_int = TRUE,
                      maxlen = 100,
                      vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)
```

Whenever the sw (sample weight) column is 0, the x and y columns are identical. Let's look at rows where sw is 1:

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1)
```

Here 5 is the mask token; the mask token is always the size of the vocabulary + 1.

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (can randomly be the true value)
```

This can be combined with n-gram encoding and masking of a fixed block size:

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)
masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.05, # set 5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE, # 0,1 matrix showing where masking was applied
                  block_len = 3) # always mask at least 3 tokens in a row 
gen <-  get_generator(path = fasta_path,
                      train_type = "masked_lm",
                      masked_lm = masked_lm,
                      batch_size = 1,
                      n_gram = 3,
                      seed = 12,
                      n_gram_stride = 1,
                      return_int = TRUE,
                      maxlen = 100,
                      vocabulary = c("A", "C", "G", "T"))

z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)
```

We can check that sample weights appear only in blocks.

```{r warning = FALSE, message = FALSE}
which(sw == 1)
```

Here 65 is the mask token (the 3-gram vocabulary has size 4^3 = 64, and the mask token is always the vocabulary size + 1).

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 65) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 65) # 5% identity part and 5% random part (can randomly be the true value)
```