# Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS
This guidance creates a scalable environment in AWS to prepare genomic, clinical, mutation, expression and imaging data for large-scale analysis and perform interactive queries against a data lake. This solution demonstrates how to 1) build, package, and deploy libraries used for genomics data conversion, 2) provision serverless data ingestion pipelines for multi-modal data preparation and cataloging, 3) visualize and explore clinical data through an interactive interface, and 4) run interactive analytic queries against a multi-modal data lake. This solution also demonstrates how to use AWS Omics to create and work with a Sequence Store, Reference Store and Variant Store in a multi-modal context.

# Setup
You can set up the solution in your account by clicking the "Deploy sample code on Console" button on the [solution home page](https://aws.amazon.com/solutions/guidance/guidance-for-multi-omics-and-multi-modal-data-integration-and-analysis/).

# Customization

## Running unit tests for customization
* Clone the repository, then make the desired code changes.
* Next, run the unit tests to make sure your customizations pass.

```
cd ./deployment
chmod +x ./run-unit-tests.sh
./run-unit-tests.sh
```

## Prerequisites

1. Create a distribution bucket, e.g., `my-bucket-name`.
2. Create a region-based distribution bucket, e.g., `my-bucket-name-us-west-2`.
3. Create a Cloud9 environment.
4. Clone this repo into that environment.
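
The first two prerequisites can also be handled from the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the bucket name and Region are the placeholders from the list above, and the helper names `regional_bucket` and `print_bucket_commands` are hypothetical. The script only prints the `aws s3 mb` commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: derive the two distribution bucket names and print the AWS CLI
# commands that would create them. Nothing is created by this script.

# Hypothetical helper: append the Region to the base bucket name.
regional_bucket() {
  printf '%s-%s\n' "$1" "$2"
}

# Hypothetical helper: print one "aws s3 mb" command per bucket.
#   $1 = base bucket name, $2 = Region
print_bucket_commands() {
  printf 'aws s3 mb s3://%s --region %s\n' "$1" "$2"
  printf 'aws s3 mb s3://%s --region %s\n' "$(regional_bucket "$1" "$2")" "$2"
}

# Placeholder values taken from the prerequisites list.
print_bucket_commands my-bucket-name us-west-2
```

Bucket names are globally unique, so `my-bucket-name` will need to be replaced with a name of your own before the printed commands are run.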

## Building and deploying distributable for customization
Configure the bucket name and Region of your target Amazon S3 distribution bucket, then run the following statements.

_Note:_ You must create an S3 bucket named `my-bucket-name-<aws_region>`, where `<aws_region>` is the Region in which you are testing the customized solution.

```
# Bucket where the customized code will reside (omit the -<region> suffix; the -<region> suffix will be added)
export DIST_OUTPUT_BUCKET=my-bucket-name

# Default Region where resources will be created
# Use "us-east-1" to get publicly available data from the AWS solution bucket
export REGION=my-region

# Default name of the solution (use this name to get publicly available test datasets from the AWS S3 bucket)
export SOLUTION_NAME=genomics-tertiary-analysis-and-data-lakes-using-aws-glue-and-amazon-athena

# Version number for the customized code (use this version to get publicly available test datasets from the AWS S3 bucket)
export VERSION=latest
```
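
Because every later command depends on these four variables, it can help to confirm they are set before building. The following is a minimal sketch of such a check; the helper name `check_build_env` is hypothetical and is not part of the solution's scripts.

```shell
#!/bin/sh
# Sketch: verify that the variables exported above are set before building.

# Hypothetical helper: report each required variable that is unset or empty
# on stderr, and return non-zero if any is missing.
check_build_env() {
  missing=0
  for name in DIST_OUTPUT_BUCKET REGION SOLUTION_NAME VERSION; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_build_env; then
  # Same destination used by the "aws s3 cp" commands in the deploy steps.
  echo "Upload target: s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION/"
else
  echo "Set the variables listed above, then re-run this check." >&2
fi
```

Running the check in the same shell session as the `export` statements keeps the values visible to the build and deploy commands that follow.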

#### Change to deployment directory.
```
cd deployment
```

#### Build the distributable.
```
chmod +x ./build-s3-dist.sh
./build-s3-dist.sh "$DIST_OUTPUT_BUCKET" "$SOLUTION_NAME" "$VERSION"
```

#### Deploy the distributable to an Amazon S3 bucket in your account. _Note:_ you must have the AWS Command Line Interface installed.
```
aws s3 cp "./$SOLUTION_NAME.template" "s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION/"
```

#### Deploy the global assets.

```
aws s3 cp ./global-s3-assets/ "s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION" --recursive
```

#### Deploy the regional assets.

```
aws s3 cp ./regional-s3-assets/ "s3://$DIST_OUTPUT_BUCKET-$REGION/$SOLUTION_NAME/$VERSION" --recursive
```

#### Copy the static assets.

```
# The AWS profile argument is optional
./copy-static-files.sh [AWSProfile]
```

#### Go to the `DIST_OUTPUT_BUCKET` bucket and copy the Object URL for `latest/guidance-for-multi-omics-and-multi-modal-data-integration-and-analysis-on-aws.template`.

#### Go to the AWS CloudFormation Console and create a new stack using the template URL copied.
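
The last two steps can also be approximated from the command line. The following is a minimal sketch, assuming the template was uploaded to the regional bucket and that the AWS CLI is configured. The helper `s3_object_url`, the object key, and the stack name are hypothetical placeholders, and the `--capabilities` value is an assumption; the template may require different capabilities or parameters. The script only prints the `aws cloudformation create-stack` command so it can be reviewed first.

```shell
#!/bin/sh
# Sketch: build the template's Object URL and print a create-stack command.

# Hypothetical helper: virtual-hosted-style S3 URL for an object.
#   $1 = bucket, $2 = Region, $3 = object key
s3_object_url() {
  printf 'https://%s.s3.%s.amazonaws.com/%s\n' "$1" "$2" "$3"
}

# Placeholder values; substitute the ones used during deployment.
bucket="my-bucket-name-us-west-2"
region="us-west-2"
key="my-solution/latest/my-solution.template"   # hypothetical object key
stack_name="my-multi-omics-stack"               # hypothetical stack name

template_url=$(s3_object_url "$bucket" "$region" "$key")

# Print the command for review instead of running it.
# CAPABILITY_NAMED_IAM is an assumption about what the template needs.
printf 'aws cloudformation create-stack --stack-name %s --template-url %s --capabilities CAPABILITY_NAMED_IAM --region %s\n' \
  "$stack_name" "$template_url" "$region"
```

The Object URL shown in the S3 console for the uploaded template should match the value computed by `s3_object_url` for the same bucket, Region, and key.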

# File Structure
The overall file structure for the application.

```
.
├── ATTRIBUTION.txt
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.txt
├── NOTICE.txt
├── README.md
├── buildspec.yml
├── deploy.sh
├── deployment
│   ├── build-s3-dist.sh
├── source
│   ├── GenomicsAnalysisCode
│   │   ├── TCIA_etl.yaml
│   │   ├── code_cfn.yml
│   │   ├── copyresources_buildspec.yml
│   │   ├── omics_cfn.yml
│   │   ├── omicsresources_buildspec.yml
│   │   ├── quicksight_cfn.yml
│   │   ├── resources
│   │   │   ├── notebooks
│   │   │   │   ├── cohort-building.ipynb
│   │   │   │   ├── runbook.ipynb
│   │   │   │   └── summarize-tcga-datasets.ipynb
│   │   │   ├── omics
│   │   │   │   ├── create_annotation_store_lambda.py
│   │   │   │   ├── create_reference_store_lambda.py
│   │   │   │   ├── create_variant_store_lambda.py
│   │   │   │   ├── import_annotation_lambda.py
│   │   │   │   ├── import_reference_lambda.py
│   │   │   │   └── import_variant_lambda.py
│   │   │   └── scripts
│   │   │       ├── create_tcga_summary.py
│   │   │       ├── image_api_glue.py
│   │   │       ├── run_tests.py
│   │   │       ├── tcga_etl_common_job.py
│   │   │       └── transfer_tcia_images_glue.py
│   │   ├── run_crawlers.sh
│   │   └── setup
│   │       ├── lambda.py
│   │       └── requirements.txt
│   ├── GenomicsAnalysisPipe
│   │   └── pipe_cfn.yml
│   ├── GenomicsAnalysisZone
│   │   └── zone_cfn.yml
│   ├── TCIA_etl.yaml
│   ├── setup.sh
│   ├── setup_cfn.yml
│   └── teardown.sh
└── template_cfn.yml
```

***

This solution collects anonymous operational metrics to help AWS improve the
quality of features of the solution. For more information, including how to disable
this capability, please see the [implementation guide](https://docs.aws.amazon.com/solutions/latest/guidance-for-multi-omics-and-multi-modal-data-integration-and-analysis-on-aws/appendix-i.html).

---

Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at

    http://www.apache.org/licenses/

or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and limitations under the License.