In this tutorial, we explain how to make your data compatible with the move-dl commands.
For this tutorial, we will work with a dataset taken from Walters et al. (2018) [1]. In their work, they report soil microbiome census data along with environmental data (e.g., temperature and precipitation) for different cultivars of maize.
We will start by downloading the files corresponding to their OTU table and metadata.
The move-dl pipeline requires continuous omics input to be formatted as a TSV file with one column per feature and one row per sample.
If we load the microbiome OTU table from the maize rhizosphere dataset, it will look something like this:
otuids | 11116.C02A66.1194587 | 11116.C06A63.1195666 | 11116.C08A61.1197689 |
---|---|---|---|
4479944 | 70 | 8 | 18 |
513055 | 2 | 16 | 1 |
519510 | 22 | 15 | 12 |
810959 | 5 | 0 | 3 |
849092 | 5 | 2 | 1 |
Here the columns correspond to samples and the rows to features (OTUs), so we need to transpose this table for MOVE. The transposed table looks like this:
sampleids | 4479944 | 513055 | 519510 | 810959 | 849092 |
---|---|---|---|---|---|
11116.C02A66.1194587 | 70 | 2 | 22 | 5 | 5 |
11116.C06A63.1195666 | 8 | 16 | 15 | 0 | 2 |
11116.C08A61.1197689 | 18 | 1 | 12 | 3 | 1 |
Now, we can save our table as a TSV and we are ready to go. No need to do any further processing.
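The transposition step can be sketched in a few lines of Python. This is only a minimal illustration that uses the standard `csv` module and the five-OTU excerpt shown above; the helper names `transpose_tsv` and `to_tsv` are our own and are not part of move-dl.

```python
import csv
import io

# Excerpt of the OTU table shown above: features in rows, samples in columns.
raw_otu_table = (
    "otuids\t11116.C02A66.1194587\t11116.C06A63.1195666\t11116.C08A61.1197689\n"
    "4479944\t70\t8\t18\n"
    "513055\t2\t16\t1\n"
    "519510\t22\t15\t12\n"
    "810959\t5\t0\t3\n"
    "849092\t5\t2\t1\n"
)


def transpose_tsv(text, index_name="sampleids"):
    """Swap rows and columns of a tab-separated table and rename the corner cell."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    transposed = [list(column) for column in zip(*rows)]
    transposed[0][0] = index_name
    return transposed


def to_tsv(rows):
    """Serialize a list of rows as tab-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


otu_by_sample = transpose_tsv(raw_otu_table)
transposed_text = to_tsv(otu_by_sample)
print(transposed_text)
```

To keep the result, write `transposed_text` to a file such as `maize_microbiome.tsv`. A data-frame library would do the same job with a single transpose call; the standard library is used here only to keep the sketch free of dependencies.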
Other non-omics continuous data is formatted in a similar way.
For this tutorial, we are going to extract some continuous data from the maize metadata table. Let us load the table and take a peek:
X.SampleID | Precipitation3Days | INBREDS | Maize_Line | Description1 |
---|---|---|---|---|
11116.C02A66.1194587 | 0.14 | Oh7B | Non_Stiff_Stalk | rhizosphere |
11116.C06A63.1195666 | 0.14 | P39 | Sweet_Corn | rhizosphere |
11116.C08A61.1197689 | 0.14 | CML333 | Tropical | rhizosphere |
11116.C08A63.1196825 | 0.14 | CML333 | Tropical | rhizosphere |
11116.C12A64.1197667 | 0.14 | Il14H | Sweet_Corn | rhizosphere |
The original metadata table contains both categorical (e.g., Maize_Line) and continuous data (e.g., Precipitation3Days). We need to separate these into different files.
In this example, we select three columns: age, Precipitation3Days, and Temperature.
X.SampleID | age | Temperature | Precipitation3Days |
---|---|---|---|
11116.C02A66.1194587 | 12 | 76 | 0.14 |
11116.C06A63.1195666 | 12 | 76 | 0.14 |
11116.C08A61.1197689 | 12 | 76 | 0.14 |
11116.C08A63.1196825 | 12 | 76 | 0.14 |
11116.C12A64.1197667 | 12 | 76 | 0.14 |
Once again, we can save this table as a TSV, and we are ready to continue.
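As a rough sketch, the column selection could be done as follows. The excerpt combines values from the two tables above, and the helper `select_continuous` is a hypothetical name of our own. The `float` call is an extra sanity check that we added, not a requirement stated by MOVE.

```python
import csv
import io

# Excerpt of the metadata table, combining columns shown in the tables above.
metadata_excerpt = (
    "X.SampleID\tage\tTemperature\tPrecipitation3Days\tMaize_Line\n"
    "11116.C02A66.1194587\t12\t76\t0.14\tNon_Stiff_Stalk\n"
    "11116.C06A63.1195666\t12\t76\t0.14\tSweet_Corn\n"
    "11116.C08A61.1197689\t12\t76\t0.14\tTropical\n"
    "11116.C08A63.1196825\t12\t76\t0.14\tTropical\n"
    "11116.C12A64.1197667\t12\t76\t0.14\tSweet_Corn\n"
)


def select_continuous(text, id_column, columns):
    """Keep the ID column plus the requested columns, checking they are numeric."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    selected = [[id_column, *columns]]
    for record in reader:
        values = [record[name] for name in columns]
        for value in values:
            float(value)  # raises ValueError if a value is not numeric
        selected.append([record[id_column], *values])
    return selected


continuous_rows = select_continuous(
    metadata_excerpt, "X.SampleID", ["age", "Temperature", "Precipitation3Days"]
)
continuous_text = "".join("\t".join(row) + "\n" for row in continuous_rows)
print(continuous_text)
```

Writing `continuous_text` to a file such as `maize_metadata.tsv` gives the continuous metadata table.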
Categorical data, such as binary variables (e.g., with/without treatment) or discrete categories, needs to be saved in individual files, one file per variable.
The metadata table contains several discrete variables that can be useful for classification, such as maize line, cultivar, and type of soil. For each one of these, we need to create a separate TSV file that will look something like:
X.SampleID | Maize_Line |
---|---|
11116.C02A66.1194587 | Non_Stiff_Stalk |
11116.C06A63.1195666 | Sweet_Corn |
11116.C08A61.1197689 | Tropical |
11116.C08A63.1196825 | Tropical |
11116.C12A64.1197667 | Sweet_Corn |
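One way to produce these per-variable files is sketched below. The helper `split_categorical` and the rule of naming each file after the lower-cased column are assumptions made for illustration; any file names work as long as the data config file points to them. The sketch writes into a temporary directory so that it can be run without touching the workspace.

```python
import csv
import io
import tempfile
from pathlib import Path

# Excerpt of the metadata table shown earlier in this tutorial.
metadata_excerpt = (
    "X.SampleID\tPrecipitation3Days\tINBREDS\tMaize_Line\tDescription1\n"
    "11116.C02A66.1194587\t0.14\tOh7B\tNon_Stiff_Stalk\trhizosphere\n"
    "11116.C06A63.1195666\t0.14\tP39\tSweet_Corn\trhizosphere\n"
    "11116.C08A61.1197689\t0.14\tCML333\tTropical\trhizosphere\n"
    "11116.C08A63.1196825\t0.14\tCML333\tTropical\trhizosphere\n"
    "11116.C12A64.1197667\t0.14\tIl14H\tSweet_Corn\trhizosphere\n"
)


def split_categorical(text, id_column, columns, output_dir):
    """Write one two-column TSV (sample ID, category) per categorical column."""
    records = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    written = []
    for column in columns:
        # Hypothetical naming rule: file name is the lower-cased column name.
        path = Path(output_dir) / f"{column.lower()}.tsv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow([id_column, column])
            for record in records:
                writer.writerow([record[id_column], record[column]])
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as workdir:
    paths = split_categorical(
        metadata_excerpt, "X.SampleID", ["Maize_Line", "INBREDS"], workdir
    )
    file_names = [path.name for path in paths]
    maize_line_text = paths[0].read_text()

print(file_names)
print(maize_line_text)
```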
We are missing two components to make our data compatible with move-dl. First, we need to create an additional text file with all the sample IDs (one ID per line, see example below). This file tells MOVE which samples to use, so the IDs in this file must be present in all the other input files.
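Because every ID in this file must appear in all the other input files, one cautious way to build it is to keep only the IDs shared by every table. The sketch below illustrates this with excerpts of the tables above; `read_ids` and `shared_sample_ids` are hypothetical helpers of our own, and the sketch assumes the sample ID is the first column of each table.

```python
import csv
import io

# Excerpts of two input tables; the sample ID is assumed to be the first column.
microbiome_table = (
    "sampleids\t4479944\t513055\n"
    "11116.C02A66.1194587\t70\t2\n"
    "11116.C06A63.1195666\t8\t16\n"
    "11116.C08A61.1197689\t18\t1\n"
)
maize_line_table = (
    "X.SampleID\tMaize_Line\n"
    "11116.C02A66.1194587\tNon_Stiff_Stalk\n"
    "11116.C06A63.1195666\tSweet_Corn\n"
    "11116.C08A61.1197689\tTropical\n"
    "11116.C08A63.1196825\tTropical\n"
    "11116.C12A64.1197667\tSweet_Corn\n"
)


def read_ids(text):
    """Return the sample IDs found in the first column of a TSV table."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    return [row[0] for row in reader if row]


def shared_sample_ids(*tables):
    """IDs present in every table, in the order they appear in the first one."""
    id_lists = [read_ids(table) for table in tables]
    common = set(id_lists[0]).intersection(*id_lists[1:])
    return [sample_id for sample_id in id_lists[0] if sample_id in common]


sample_ids = shared_sample_ids(microbiome_table, maize_line_table)
ids_text = "\n".join(sample_ids) + "\n"  # one ID per line
print(ids_text)
```

Writing `ids_text` to `maize_ids.txt` yields a file with one sample ID per line. In this excerpt, the two samples that appear only in the metadata are left out because they have no row in the OTU table.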
Finally, we need to create a data config YAML file. The purpose of this file is to tell MOVE which files to load, where to find them, and where to save any output files.
The data config file for this tutorial would look like this:
Here we break down the fields of this file:
The data config file can have any name, but it must be saved in the config/data directory. The final workspace structure should look like this::

    tutorial/
    │
    ├── maize/
    │   └── data/
    │       ├── maize_field.tsv       <- Type of soil data
    │       ├── maize_ids.txt         <- Sample IDs
    │       ├── maize_line.tsv        <- Maize line data
    │       ├── maize_metadata.tsv    <- Age, temperature, precipitation data
    │       ├── maize_microbiome.tsv  <- OTU table
    │       └── maize_variety.tsv     <- Maize variety data
    │
    └── config/
        └── data/
            └── maize.yaml            <- Data configuration file

With your data formatted and ready, you can go on to run MOVE and explore the associations between the different variables in your datasets. Have a look at our :doc:`introductory tutorial </tutorial/introduction>` for more information on this.
[1] Walters WA, Jin Z, Youngblut N, Wallace JG, Sutter J, Zhang W, et al. Large-scale replicated field study of maize rhizosphere identifies heritable microbes. Proc Natl Acad Sci U S A. 2018; 115: 7368–7373. doi:10.1073/pnas.1800918115