---
title: "Data Generator"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Generator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, warning=FALSE, message=FALSE}
if (!reticulate::py_module_available("tensorflow")) {
  knitr::opts_chunk$set(eval = FALSE)
} else {
  knitr::opts_chunk$set(eval = TRUE)
}
```

```{r, message=FALSE}
library(deepG)
library(magrittr)
library(keras)
```

```{r, echo=FALSE, warning=FALSE, message=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

```{css, echo=FALSE}
mark.in {
  background-color: CornflowerBlue;
}
mark.out {
  background-color: IndianRed;
}
```

## Introduction

The most common use case for the deepG data generator is to extract samples from a collection of fasta (or fastq) files. The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$. We can differentiate between two approaches:

+ **Language model**: Part of a sequence is the input and another part is the target.
    + Example: Predict the next nucleotide given the previous 100 nucleotides.
+ **Label classification**: Assign a label to a sequence.
    + Example: Assign the label "virus" or "bacteria" to a sequence of length 100.

Suppose we are given two fasta files called "a.fasta" and "b.fasta" that look as follows:
**a.fasta**
>header_a1
AACCAAGG
>header_a2
TTTGGG
>header_a3
ACGTACGT
**b.fasta**
>header_b1
GTGTGT
>header_b2
AAGG

If we want to extract sequences of length 4 from these files, there would be 17 possible samples (5 from AACCAAGG, 3 from TTTGGG, ...). A naive approach would be to extract the samples in a sequential manner:

*1. sample*:
**a.fasta**
>header_a1
<mark class="in">AACC</mark>AAGG
>header_a2
TTTGGG
>header_a3
ACGTACGT

**b.fasta**
>header_b1
GTGTGT
>header_b2
AAGG

*2. sample*:
**a.fasta**
>header_a1
A<mark class="in">ACCA</mark>AGG
>header_a2
TTTGGG
>header_a3
ACGTACGT

**b.fasta**
>header_b1
GTGTGT
>header_b2
AAGG

...
*17. sample*:
**a.fasta**
>header_a1
AACCAAGG
>header_a2
TTTGGG
>header_a3
ACGTACGT

**b.fasta**
>header_b1
GTGTGT
>header_b2
<mark class="in">AAGG</mark>

*18. sample*:
**a.fasta**
>header_a1
<mark class="in">AACC</mark>AAGG
>header_a2
TTTGGG
>header_a3
ACGTACGT

**b.fasta**
>header_b1
GTGTGT
>header_b2
AAGG

...

For longer sequences this is not a desirable strategy, since the data is very redundant (consecutive samples often differ by just one nucleotide) and the model would often see long stretches of data from the same source. Choosing the samples completely at random can also be problematic, since we would constantly have to open new files. The deepG generator offers several options to navigate the data sampling strategy and achieve a good balance between the two approaches.

## Data generator options

In the following code examples, we will mostly use the sequence **abcdefghiiii** to demonstrate some of the deepG data generator options. (In real-world applications you would usually have sequences from the ACGT vocabulary.)

```{r warning = FALSE, message = FALSE}
sequence <- paste0("a", "b", "c", "d", "e", "f", "g", "h", "i", "i", "i", "i")
vocabulary <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")
```

We may store this sequence in a fasta file:

```{r warning = FALSE, message = FALSE}
temp_dir <- tempfile()
dir.create(temp_dir)
dir_path <- paste0(temp_dir, "/dummy_data")
dir.create(dir_path)
df <- data.frame(Sequence = sequence, Header = "label_1", stringsAsFactors = FALSE)
file_path <- file.path(dir_path, "a.fasta")
# write sequence as fasta file
microseq::writeFasta(fdta = dplyr::as_tibble(df), out.file = file_path)
```

Since neural networks can only work with numeric data, we have to encode sequences of characters numerically. Usually this is achieved by one-hot encoding; some other approaches are implemented as well: see the `use_coverage`, `use_quality_score` and `ambiguous_nuc` sections.

```{r warning = FALSE, message = FALSE}
# one-hot encoding example
s <- c("a", "c", "a", "f", "i", "b")
s_as_int_seq <- vector("integer", length(s))
for (i in seq_along(s)) {
  s_as_int_seq[i] <- which(s[i] == vocabulary) - 1
}
one_hot_sample <- keras::to_categorical(s_as_int_seq)
colnames(one_hot_sample) <- vocabulary
one_hot_sample
```

### maxlen

The length of the input sequence.
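To illustrate, the chunk below (a minimal sketch reusing the `file_path` and `vocabulary` objects created above) shows that `maxlen` determines the second dimension of the one-hot encoded input tensor:

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 8,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
dim(z[[1]]) # (batch_size, maxlen, vocabulary size)
```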
### vocabulary

The set of allowed characters in a sequence. What happens to characters outside the vocabulary can be controlled with the `ambiguous_nuc` argument.

### train_type

The generator will always return a list of length 2. The first element is the input $X$ and the second the target $Y$. The `train_type` argument determines how $X$ and $Y$ get extracted. Possible arguments for *language models* are:

+ **"lm"** or **"lm_rds"**: Given some sequence $s$, we take some subset of that sequence as input and the rest as target. How to split $s$ can be specified in the `output_format` argument.

Besides the language model approach, we can use *label classification*. This means we map some label to a sequence. For example, the target for some nucleotide sequence could be one of the labels "bacteria" or "virus". We have to specify how to extract the label corresponding to a sequence. Possible arguments are:

+ **"label_header"**: get label from fasta headers.
+ **"label_folder"**: get label from folder, i.e. all files in one folder must belong to the same class.
+ **"label_csv"**: get label from a csv file. The csv file should have one column named "file"; the targets then correspond to the other entries in a file's row. Example: if we are currently working with a file called "a.fasta", there should be a row in our csv file with target information for that file
| file      | label_1 | label_2 |
| ---       | ---     | ---     |
| "a.fasta" | 1       | 0       |

+ **"label_rds"**: rds file contains a preprocessed list of input and target tensors.

Another option is **"dummy_gen"**: the generator creates random data once and returns them repeatedly.

Extract target from fasta header (the fasta header is "label_1" in our example file):

```{r warning = FALSE, message = FALSE}
# get target from header
vocabulary_label <- paste0("label_", 1:5)
gen <- get_generator(path = file_path,
                     train_type = "label_header",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
```

Extract target from folder:

```{r warning = FALSE, message = FALSE}
# create data for second class
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# get target from folder
vocabulary_label <- paste0("label_", 1:2)
gen <- get_generator(path = c(file_path, file_path_2), # one entry for each class
                     train_type = "label_folder",
                     batch_size = 8,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]]
y <- z[[2]]
x_1_1 <- x[1, , ]
colnames(x_1_1) <- vocabulary
x_1_1 # first sample from first class
x_2_1 <- x[5, , ]
colnames(x_2_1) <- vocabulary
x_2_1 # first sample from second class
colnames(y) <- vocabulary_label
y # 4 samples from each class
```

Extract target from csv file:

```{r warning = FALSE, message = FALSE}
# get target from csv
file <- c(basename(file_path), "xyz.fasta", "abc.fasta", "x_123.fasta")
vocabulary_label <- paste0("label", 1:4)
label_1 <- c(1, 0, 0, 0)
label_2 <- c(0, 1, 0, 0)
label_3 <- c(0, 0, 1, 0)
label_4 <- c(0, 0, 0, 1)
df <- data.frame(file, label_1, label_2, label_3, label_4)
df
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
gen <- get_generator(path = file_path,
                     train_type = "label_csv",
                     batch_size = 1,
                     maxlen = 6,
                     target_from_csv = csv_file,
                     vocabulary = vocabulary,
                     vocabulary_label = vocabulary_label)
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary_label
x # abcdef
y # label_1
```

Examples for language models follow in the next section.

### output_format

The `output_format` argument determines the shape of the output for a language model, i.e. which part of a sequence is the input $X$ and which the target $Y$. Assume a sequence **abcdefg** and `maxlen = 6`. The outputs correspond as follows:

+ **"target_right"**: $X=$ abcdef, $Y=$ g
+ **"target_middle_lstm"**: $X =$ ($X_1 =$ abc, $X_2 =$ gfe), $Y=$ d (note the reversed order of $X_2$)
+ **"target_middle_cnn"**: $X =$ abcefg, $Y =$ d
+ **"wavenet"**: $X =$ abcdef, $Y =$ bcdefg

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # g
```

```{r warning = FALSE, message = FALSE}
# target_middle_lstm
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_lstm")
z <- gen()
x_1 <- z[[1]][[1]][1, , ]
x_2 <- z[[1]][[2]][1, , ]
y <- z[[2]]
colnames(x_1) <- vocabulary
colnames(x_2) <- vocabulary
colnames(y) <- vocabulary
x_1 # abc
x_2 # gfe
y # d
```

```{r warning = FALSE, message = FALSE}
# target_middle_cnn
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_middle_cnn")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcefg
y # d
```

```{r warning = FALSE, message = FALSE}
# wavenet
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "wavenet")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]][1, , ]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # abcdef
y # bcdefg
```

### batch_size

The number of samples in one batch.

```{r warning = FALSE, message = FALSE}
# target_right
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 7,
                     maxlen = 6,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
dim(y)
```

### step

We may determine how frequently we want to take a sample. If `step = 1`, we take a sample at every possible position. Let's assume we want to predict the next character, i.e. part of the sequence is the input and the next character the target. If `maxlen = 3, step = 1`:

1. sample: <mark class="in">abc</mark><mark class="out">d</mark>efghiiii
2. sample: a<mark class="in">bcd</mark><mark class="out">e</mark>fghiiii
3. sample: ab<mark class="in">cde</mark><mark class="out">f</mark>ghiiii

If `step = 3`:

1. sample: <mark class="in">abc</mark><mark class="out">d</mark>efghiiii
2. sample: abc<mark class="in">def</mark><mark class="out">g</mark>hiiii
3. sample: abcdef<mark class="in">ghi</mark><mark class="out">i</mark>ii

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     step = 3,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ] # encodes abc
y <- z[[2]] # encodes d
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
# go 3 steps forward
z <- gen()
x <- z[[1]][1, , ] # encodes def
y <- z[[2]] # encodes g
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x
y
```

### padding

If the sequence is too short to create a single sample, we can pad the sequence with zero-vectors. If `padding = FALSE`, the generator will go to the next file or fasta entry until it finds a sequence long enough for a sample.
```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 15, # maxlen is longer than the sequence
                     vocabulary = vocabulary,
                     step = 3,
                     padding = TRUE,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # first 4 entries are zero-vectors
y
```

### ambiguous_nuc

A sequence might contain characters that do not lie inside our vocabulary. For example, let's assume we discard e from our vocabulary. We have 4 options to handle this situation:

(1) encode as zero-vector

```{r warning = FALSE, message = FALSE}
vocabulary_2 <- c("a", "b", "c", "d", "f", "g", "h", "i") # exclude "e" from vocabulary

# zeros
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "zeros")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # fifth row is a zero-vector
```

(2) equal probability

```{r warning = FALSE, message = FALSE}
# equal
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "equal")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # fifth row is 1/8 for every entry
```

(3) use distribution of current file

```{r warning = FALSE, message = FALSE}
# empirical
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "empirical")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # fifth row is the character distribution of the file
```

(4) discard

```{r warning = FALSE, message = FALSE}
# discard
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 6,
                     vocabulary = vocabulary_2,
                     output_format = "target_right",
                     ambiguous_nuc = "discard")
z <- gen()
x <- z[[1]][1, , ]
colnames(x) <- vocabulary_2
x # first sample with only characters from vocabulary is fghiii | i
```

### proportion_per_seq

The `proportion_per_seq` argument gives the option to use a random subsequence instead of the full sequence.

```{r warning = FALSE, message = FALSE}
cat("sequence is", nchar(sequence), "characters long \n")
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     seed = 1,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     # take a random subsequence using 50% of the sequence
                     proportion_per_seq = 0.5)
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
x # defgh
y # i
```

### file_limit

Integer or `NULL`. If an integer, use only the specified number of randomly sampled files for training.

### delete_used_files

If `TRUE`, delete a file once it has been used. Only applies to rds files.

```{r warning = FALSE, message = FALSE}
x <- array(0, dim = c(1, 5, 4))
y <- matrix(0, ncol = 1)
rds_path <- tempfile(fileext = ".rds")
saveRDS(list(x, y), rds_path)
gen <- get_generator(path = rds_path,
                     delete_used_files = TRUE,
                     train_type = "label_rds",
                     batch_size = 1,
                     maxlen = 5)
z <- gen()
file.exists(rds_path)
# z <- gen()
# When calling the generator again, it will wait until it finds a file again from the files listed in
# the initial `path` argument. Can be used if other processes create rds files.
```

### max_samples

Only use a fixed number of samples per file and randomly choose which samples to use. (If `random_sampling = FALSE`, the samples are consecutive.)

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     step = 1,
                     seed = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)
z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # bcdef
x2 # cdefg
```

### random_sampling

If you use `max_samples`, the generator will randomly choose a subset from all possible samples, but those samples are consecutive.
With `random_sampling = TRUE`, samples are completely random.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 2,
                     maxlen = 5,
                     seed = 66,
                     random_sampling = TRUE,
                     vocabulary = vocabulary,
                     output_format = "target_right",
                     max_samples = 2)
z <- gen()
x1 <- z[[1]][1, , ]
x2 <- z[[1]][2, , ]
colnames(x1) <- vocabulary
colnames(x2) <- vocabulary
x1 # efghi
x2 # defgh
```

### target_len

The target length for a language model.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 3,
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y1 <- z[[2]][, 1, ]
y2 <- z[[2]][, 2, ]
y3 <- z[[2]][, 3, ]
colnames(x) <- vocabulary
names(y1) <- vocabulary
names(y2) <- vocabulary
names(y3) <- vocabulary
x # abcde
y1 # f
y2 # g
y3 # h
```

### n_gram / n_gram_stride

Encode the target of a language model not character-wise but by combining n characters into one target. `n_gram_stride` determines the step size of the n-gram encoding.

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     target_len = 6,
                     n_gram = 3,
                     n_gram_stride = 3,
                     maxlen = 3,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]]
y1 <- z[[2]][, 1, ]
y2 <- z[[2]][, 2, ]
dim(x)[3] == length(vocabulary)^3
# x = abc as 3-gram
# y1 = def as 3-gram
# y2 = ghi as 3-gram
```

### add_noise

Add noise to the input. Must be a list that specifies the noise distribution, or `NULL` (no noise). The list contains the argument `noise_type`: either `"normal"` or `"uniform"`. Optional arguments are `sd` and `mean` if `noise_type` is `"normal"` (default is `sd = 1` and `mean = 0`), or `min` and `max` if `noise_type` is `"uniform"` (default is `min = 0`, `max = 1`).
```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = file_path,
                     train_type = "lm",
                     batch_size = 1,
                     add_noise = list(noise_type = "normal", mean = 0, sd = 0.01),
                     maxlen = 5,
                     vocabulary = vocabulary,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- vocabulary
colnames(y) <- vocabulary
round(x, 3) # abcde + noise
y # f
```

### proportion_entries

If a fasta file has multiple entries, you can randomly choose a subset. For example, if the file has 6 entries and `proportion_entries = 0.5`, the generator will randomly choose only 3 of the entries.

### shuffle_file_order

Shuffle the file order before iterating through the files. The order gets reshuffled after every iteration.

### shuffle_input

Whether to shuffle fasta entries if a fasta file has multiple entries.

### reverse_complement

If `TRUE`, randomly decide for every batch whether to use the original sequence or its reverse complement. Only implemented for the ACGT vocabulary.

### sample_by_file_size

Randomly choose a new file by sampling according to file size (bigger files are more likely).

### concat_seq

Character string or `NULL`. If not `NULL`, all entries from a file get concatenated to one sequence with the `concat_seq` string between them. Use `concat_seq = ""` if you don't want to add a new token.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = c("AC", "AG", "AT"), Header = paste0("header", 1:3))
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 9,
                     vocabulary = c("A", "C", "G", "T", "Z"),
                     concat_seq = "ZZ",
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- c("A", "C", "G", "T", "Z")
colnames(y) <- c("A", "C", "G", "T", "Z")
x # ACZZAGZZA
y # T
```

### set_learning

Use when you want to assign one label to a set of samples. Only implemented for `train_type = "label_folder"`.
The input is a list with the following parameters:

+ `samples_per_target`: how many samples to use for one target.
+ `maxlen`: length of one sample.
+ `reshape_mode`: `"time_dist"`, `"multi_input"` or `"concat"`.
    + If `reshape_mode = "multi_input"`, the generator will produce `samples_per_target` separate inputs, each of length `maxlen`.
    + If `reshape_mode = "time_dist"`, the generator will produce a 4D input array. The dimensions correspond to `(batch_size, samples_per_target, maxlen, length(vocabulary))`.
    + If `reshape_mode = "concat"`, the generator will concatenate `samples_per_target` sequences of length `maxlen` to one long sequence.
+ If `reshape_mode = "concat"`, there is an additional `buffer_len` argument to add a new token between concatenated samples.
    + If `buffer_len` is an integer, the sub-sequences are interspaced with `buffer_len` rows. The input length is (`maxlen` \* `samples_per_target`) + `buffer_len` \* (`samples_per_target` - 1).

```{r warning = FALSE, message = FALSE}
# create data for second label
df <- data.frame(Sequence = "AABAACAADAAE", Header = "header_1")
file_path_2 <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, file_path_2)

# multi_input
set_learning <- list(reshape_mode = "multi_input",
                     maxlen = 4,
                     samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 1,
                     vocabulary = vocabulary,
                     set_learning = set_learning)
z <- gen()
x <- z[[1]]
y <- z[[2]]
length(x) # 3 samples per target
x_1_1 <- x[[1]][1, , ]
x_1_1 # abcd
x_1_2 <- x[[2]][1, , ]
x_1_2 # bcde
x_1_3 <- x[[3]][1, , ]
x_1_3 # cdef
x_2_1 <- x[[1]][2, , ]
x_2_1 # aaba
x_2_2 <- x[[2]][2, , ]
x_2_2 # abaa
x_2_3 <- x[[3]][2, , ]
x_2_3 # baac
colnames(y) <- c("label_1", "label_2")
y
```

```{r warning = FALSE, message = FALSE}
# concat
set_learning <- list(reshape_mode = "concat",
                     maxlen = 4,
                     samples_per_target = 3)
gen <- get_generator(path = c(file_path, file_path_2), # path has length 2 => 2 classes
                     train_type = "label_folder",
                     batch_size = 2,
                     maxlen = 4,
                     step = 2,
                     vocabulary = vocabulary,
                     set_learning = set_learning)
z <- gen()
x <- z[[1]]
y <- z[[2]]
dim(x)
x_1 <- x[1, , ]
colnames(x_1) <- vocabulary
x_1 # abcd | cdef | efgh
x_2 <- x[2, , ]
colnames(x_2) <- vocabulary
x_2 # aaba | baac | acaa
colnames(y) <- c("label_1", "label_2")
y
```

### use_quality_score

If `TRUE`, instead of one-hot encoding, use the quality scores of a fastq file.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT",
                 Header = "header_1",
                 Quality = "!#*=?I")
fastq_path <- tempfile(fileext = ".fastq")
fastq_file <- microseq::writeFastq(df, fastq_path)

gen <- get_generator(path = fastq_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     format = "fastq",
                     vocabulary = c("A", "C", "G", "T"),
                     use_quality_score = TRUE,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA
y # T
```

### use_coverage

Integer or `NULL`. If not `NULL`, use coverage as encoding rather than one-hot encoding. The coverage information must be contained in the fasta header: there must be a string "cov_n" in the header, where n is some integer.

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ACAGAT", Header = "header_1_cov_8")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     vocabulary = c("A", "C", "G", "T"),
                     use_coverage = 25,
                     output_format = "target_right")
z <- gen()
x <- z[[1]][1, , ]
y <- z[[2]]
colnames(x) <- c("A", "C", "G", "T")
colnames(y) <- c("A", "C", "G", "T")
x # ACAGA; 0.32 = 8/25
y # T
```

### added_label_path

It is possible to feed a network additional information associated with a sequence. This information needs to be in a csv file.
If all sequences in one file share the same label, the csv file should have one column named "file". We may add some additional input to our dummy data:

```{r warning = FALSE, message = FALSE}
file <- c(basename(file_path), "some_file_name.fasta")
df <- data.frame(file = file,
                 label_1 = c(0, 1),
                 label_2 = c(1, 0),
                 label_3 = c(1, 0))
df
write.csv(x = df, file = file.path(dir_path, "add_input.csv"), row.names = FALSE)
```

If we add the path to the csv file, the generator will map the additional input to the sequences:

```{r warning = FALSE, message = FALSE}
gen <- get_generator(path = dir_path,
                     train_type = "lm",
                     batch_size = 1,
                     maxlen = 5,
                     output_format = "target_right",
                     vocabulary = vocabulary,
                     added_label_path = file.path(dir_path, "add_input.csv"),
                     add_input_as_seq = FALSE) # don't treat added input as sequence
z <- gen()
added_label_input <- z[[1]][[1]]
added_label_input
x <- z[[1]][[2]]
x[1, , ]
y <- z[[2]]
y
```

If we want to train a network with additional labels, we have to add an additional input layer.

```{r warning = FALSE, message = FALSE}
model <- create_model_lstm_cnn(
  maxlen = 5,
  layer_lstm = c(8, 8),
  layer_dense = c(4),
  label_input = 3 # additional input vector has length 3
)

# train_model(train_type = "lm",
#             model = model,
#             path = file.path(dir_path, "train_files_1"),
#             path_val = file.path(dir_path, "validation_files_1"),
#             added_label_path = file.path(dir_path, "add_input.csv"),
#             steps_per_epoch = 5,
#             batch_size = 8,
#             epochs = 2)
```

### return_int

Whether to return integer encoding rather than one-hot encoding.
```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "ATCGC", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     return_int = TRUE,
                     padding = TRUE,
                     maxlen = 8,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("pad", "pad", "pad", "pad", "A", "T", "C", "G")
x
colnames(y) <- "C"
y
```

Can also be combined with n-gram encoding:

```{r warning = FALSE, message = FALSE}
df <- data.frame(Sequence = "AAACCCTTT", Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

gen <- get_generator(path = fasta_path,
                     train_type = "lm",
                     batch_size = 1,
                     n_gram = 3,
                     n_gram_stride = 3,
                     return_int = TRUE,
                     maxlen = 6,
                     target_len = 3,
                     vocabulary = c("A", "C", "G", "T"),
                     output_format = "target_right")
z <- gen()
x <- z[[1]]
y <- z[[2]]
colnames(x) <- c("AAA", "CCC")
x
colnames(y) <- "TTT"
y
```

### reshape_xy

Apply some function to the output of a generator call.

```{r}
df <- data.frame(Sequence = "AAAATTTT", Header = "header_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

fx <- function(x = NULL, y = NULL) {
  return(x - 1)
}
fy <- function(x = NULL, y = NULL) {
  return(exp(y * 5))
}

gen <- get_generator(path = fasta_path,
                     reshape_xy = list(x = fx, y = fy),
                     train_type = "label_folder",
                     batch_size = 1,
                     maxlen = 8)
z <- gen()
x <- z[[1]]
x[1, , ]
y <- z[[2]]
y
```

### masked_lm

Masks some parts of the input sequence. Can be used for training BERT-like models.
```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.025, # set 2.5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE) # 0,1 matrix showing where masking was applied
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 1,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))
z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ])
head(df)
```

Whenever the sw (sample weight) column is 0, the x and y columns are identical. Let's look at rows where sw is 1:

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1)
```

Here 5 is the mask token; this is always the size of the vocabulary + 1.
```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 5) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 5) # 5% identity part and 2.5% random part (can randomly be the true value)
```

Can be combined with n-gram encoding and masking of a fixed block size:

```{r warning = FALSE, message = FALSE}
nt_seq <- rep(c("A", "C", "G", "T"), each = 25) %>% paste(collapse = "")
df <- data.frame(Sequence = nt_seq, Header = "seq_1")
fasta_path <- tempfile(fileext = ".fasta")
fasta_file <- microseq::writeFasta(df, fasta_path)

masked_lm <- list(mask_rate = 0.10, # replace 10% of input with special mask token
                  random_rate = 0.05, # set 5% of input to random value
                  identity_rate = 0.05, # leave 5% unchanged
                  include_sw = TRUE, # 0,1 matrix showing where masking was applied
                  block_len = 3) # always mask at least 3 tokens in a row
gen <- get_generator(path = fasta_path,
                     train_type = "masked_lm",
                     masked_lm = masked_lm,
                     batch_size = 1,
                     n_gram = 3,
                     seed = 12,
                     n_gram_stride = 1,
                     return_int = TRUE,
                     maxlen = 100,
                     vocabulary = c("A", "C", "G", "T"))
z <- gen()
x <- z[[1]]
y <- z[[2]]
sw <- z[[3]]
df <- data.frame(x = x[1, ], y = y[1, ], sw = sw[1, ], position = 1:ncol(x))
head(df)
tail(df)
```

We can check that sample weights appear only in blocks:

```{r warning = FALSE, message = FALSE}
which(sw == 1)
```

Here 65 is the mask token (4^3 + 1 = size of the vocabulary + 1).

```{r warning = FALSE, message = FALSE}
df %>% dplyr::filter(sw == 1 & x == 65) # 10% masked part
df %>% dplyr::filter(sw == 1 & x != 65) # 5% identity part and 5% random part (can randomly be the true value)
```