# Introduction

PopStrat is a simple example of population stratification analysis on genomics data using "deep learning" (neural networks).
That is, it aims to predict which population group an individual belongs to based on their genome.

For a more detailed explanation see [this blog post on bdgenomics.org](http://bdgenomics.org/blog/2015/07/10/genomic-analysis-using-adam/).

The following technologies are used:

 * [ADAM](https://github.com/bigdatagenomics/adam): a genomics analysis platform and associated file formats
 * [Apache Spark](https://spark.apache.org/): a fast engine for large-scale data processing
 * [H2O](http://0xdata.com/product/): an open source predictive analytics platform
 * [Sparkling Water](http://0xdata.com/product/sparkling-water/): integration of H2O with Apache Spark

The example consists of a single Scala class: `PopStrat`.

# Prerequisites

Before building and running PopStrat, ensure you have version 7 or later of the
[Java Development Kit (JDK)](http://www.oracle.com/technetwork/java/javase/downloads/index.html) installed.

# Building

To build from source, first [download and install Maven](http://maven.apache.org/download.cgi).
Then at the command line type:

```
mvn clean package
```

This will build a JAR (`target/uber-popstrat-0.1-SNAPSHOT.jar`) containing the `PopStrat` class,
as well as all of its dependencies.

# Running

First [download Spark version 1.2.0](http://spark.apache.org/downloads.html) and unpack it on your machine.

Next you'll need to get some genomics data. Go to your
[nearest mirror of the 1000 Genomes FTP site](http://www.1000genomes.org/data#DataAccess).
From the `release/20130502/` directory download
the `ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz` file and
the `integrated_call_samples_v3.20130502.ALL.panel` file. The first file is the genotype data for chromosome 22,
and the second file is the panel file, which describes the population group for each sample in the genotype data.

Unzip the genotype data before continuing. This will require around 10GB of disk space.

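As a rough illustration of what the panel file provides, the following Scala sketch turns panel-style lines into a map from sample ID to population code, keeping only the populations of interest. It is not taken from the PopStrat source: the `PanelSketch` object and `parsePanel` function are hypothetical names, the sample IDs are made up, and the layout (a header row followed by tab-separated lines with the sample ID first and the population code second) is an assumption you should check against your downloaded file.

```scala
object PanelSketch {
  // Population codes that appear in the example output of this README.
  val targetPopulations: Set[String] = Set("ASW", "CHB", "GBR")

  /** Parse panel-style lines into sample ID -> population code,
    * keeping only samples whose population is in `populations`.
    * Assumes the first line is a header and columns are tab-separated.
    */
  def parsePanel(lines: Seq[String], populations: Set[String]): Map[String, String] =
    lines
      .drop(1) // skip the (assumed) header row
      .map(_.split("\t"))
      .collect {
        case cols if cols.length >= 2 && populations.contains(cols(1)) =>
          cols(0) -> cols(1)
      }
      .toMap

  def main(args: Array[String]): Unit = {
    // Made-up lines in the assumed layout; real files come from 1000 Genomes.
    val lines = Seq(
      "sample\tpop\tsuper_pop\tgender",
      "SAMPLE_A\tGBR\tEUR\tmale",
      "SAMPLE_B\tXYZ\tEAS\tfemale",
      "SAMPLE_C\tCHB\tEAS\tfemale"
    )
    // SAMPLE_B is dropped because XYZ is not a target population.
    println(parsePanel(lines, targetPopulations))
  }
}
```
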
To speed up execution and save disk space you can optionally convert the genotype VCF file to
[ADAM](https://github.com/bigdatagenomics/adam) format using the ADAM `transform` command. However,
this will take some time up-front. Both ADAM and VCF formats are supported.

Next run the following command:

```
YOUR_SPARK_HOME/bin/spark-submit --class "com.neilferguson.PopStrat" --master local[6] --driver-memory 6G target/uber-popstrat-0.1-SNAPSHOT.jar <genotypesfile> <panelfile>
```

Replace `<genotypesfile>` with the path to your genotype data file (ADAM or VCF) and `<panelfile>` with the path to the panel file
from 1000 Genomes.

This runs PopStrat using a local (in-process) Spark master with 6 worker threads (`local[6]`) and 6GB of driver memory (`--driver-memory 6G`). You can run against a different
Spark cluster by modifying the options in the above command line. See the
[Spark documentation](https://spark.apache.org/docs/1.2.0/submitting-applications.html) for further details.

Using the above data, PopStrat may take 2-3 hours to run, depending on hardware. When it is finished you should
see output that looks something like the following:

```
Confusion Matrix (vertical: actual; across: predicted):
       ASW CHB GBR  Error      Rate
   ASW  60   1   0 0.0164 =  1 / 61
   CHB   0 103   0 0.0000 = 0 / 103
   GBR   0   1  90 0.0110 =  1 / 91
Totals  60 105  90 0.0078 = 2 / 255
```

This is a [confusion matrix](http://en.wikipedia.org/wiki/Confusion_matrix) that shows the predicted versus the actual
populations. All being well, you should see an overall accuracy of more than 99%
(only one or two predictions should be incorrect).

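To make the figures in the example output concrete, here is a minimal Scala sketch that recomputes the per-population error rates and the overall accuracy from the counts shown above. The `ConfusionMatrixSketch` object and its function names are hypothetical and are not part of PopStrat or H2O; only the counts come from the example output.

```scala
object ConfusionMatrixSketch {
  // Rows are actual populations, columns are predicted populations,
  // both in the order ASW, CHB, GBR (counts copied from the example output).
  val labels: Vector[String] = Vector("ASW", "CHB", "GBR")
  val counts: Vector[Vector[Int]] = Vector(
    Vector(60, 1, 0),
    Vector(0, 103, 0),
    Vector(0, 1, 90)
  )

  /** Error rate for one actual population: misclassified / all samples in that row. */
  def rowErrorRate(row: Vector[Int], correctIndex: Int): Double =
    (row.sum - row(correctIndex)).toDouble / row.sum

  /** Overall accuracy: sum of the diagonal / total number of samples. */
  def accuracy(matrix: Vector[Vector[Int]]): Double = {
    val correct = matrix.indices.map(i => matrix(i)(i)).sum
    correct.toDouble / matrix.map(_.sum).sum
  }

  def main(args: Array[String]): Unit = {
    labels.zip(counts).zipWithIndex.foreach { case ((label, row), i) =>
      println(f"$label error rate: ${rowErrorRate(row, i)}%.4f")
    }
    println(f"Overall accuracy: ${accuracy(counts)}%.4f")
  }
}
```

With these counts, 253 of the 255 samples lie on the diagonal, so the overall accuracy is 253 / 255, or roughly 99.2%, which matches the total error rate of 2 / 255 in the last row of the output.
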
# Code

A single Scala class at `src/main/scala/com/neilferguson/PopStrat.scala` contains all of the code for PopStrat.

See [this blog post on bdgenomics.org](http://bdgenomics.org/blog/2015/07/10/genomic-analysis-using-adam/) for
a deep dive into the code.

The code is fairly straightforward and follows this high-level flow:

 1. Load the genotype and panel data from the specified files
 2. Filter out samples that aren't in the populations we are trying to predict
 3. Filter out variants that are missing from some samples
 4. Reduce the number of dimensions in the data by filtering to a (fairly arbitrary) subset of variants
 5. Create a Spark `SchemaRDD` with each column representing a variant and each row representing a sample
 6. Convert the `SchemaRDD` to an H2O data frame
 7. Split the data frame into 50% training data and 50% test data
 8. Set the parameters for the deep learning model (we use two hidden layers each with 100 neurons) and train the model
 9. Score the entire data set (training and test data) against the model

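Steps 2, 3, 5 and 7 of the flow above can be sketched with plain Scala collections. This is only an illustration of the idea, not the PopStrat implementation: it uses no Spark, ADAM or H2O APIs, and the `Call` record, its `altAlleleCount` field, the function names and the seeded shuffle are all assumptions made for the example.

```scala
import scala.util.Random

object PipelineSketch {
  // Hypothetical, simplified record: one variant call for one sample.
  final case class Call(sampleId: String, variantId: String, altAlleleCount: Int)

  /** Step 2: keep only calls from samples listed in the (already filtered) panel. */
  def filterSamples(calls: Seq[Call], panel: Map[String, String]): Seq[Call] =
    calls.filter(c => panel.contains(c.sampleId))

  /** Step 3: keep only variants that have a call for every remaining sample. */
  def dropIncompleteVariants(calls: Seq[Call]): Seq[Call] = {
    val sampleCount = calls.map(_.sampleId).distinct.size
    val complete = calls.groupBy(_.variantId).collect {
      case (variantId, group) if group.map(_.sampleId).distinct.size == sampleCount =>
        variantId
    }.toSet
    calls.filter(c => complete.contains(c.variantId))
  }

  /** Step 5 (layout only): one row per sample, one column per variant,
    * with columns in sorted variant order. Missing calls default to 0,
    * although none should remain after step 3.
    */
  def toRows(calls: Seq[Call]): Map[String, Vector[Int]] = {
    val variantOrder = calls.map(_.variantId).distinct.sorted.toVector
    calls.groupBy(_.sampleId).map { case (sampleId, group) =>
      val byVariant = group.map(c => c.variantId -> c.altAlleleCount).toMap
      sampleId -> variantOrder.map(v => byVariant.getOrElse(v, 0))
    }
  }

  /** Step 7: shuffle sample IDs with a fixed seed and split them 50/50. */
  def splitSamples(sampleIds: Seq[String], seed: Long): (Seq[String], Seq[String]) = {
    val shuffled = new Random(seed).shuffle(sampleIds)
    shuffled.splitAt(shuffled.size / 2)
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: s3 is outside the panel, v2 is missing for s2.
    val panel = Map("s1" -> "GBR", "s2" -> "CHB")
    val calls = Seq(
      Call("s1", "v1", 0), Call("s1", "v2", 1),
      Call("s2", "v1", 2), Call("s3", "v1", 1)
    )
    val rows = toRows(dropIncompleteVariants(filterSamples(calls, panel)))
    println(rows)
    println(splitSamples(rows.keys.toSeq.sorted, seed = 42L))
  }
}
```

The real code works on distributed data rather than in-memory sequences, so treat the functions above as a description of what each step keeps and discards rather than how it is executed.
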
# Credits

Thanks to the folks at [Big Data Genomics](http://bdgenomics.org) for the
[original blog post](http://bdgenomics.org/blog/2015/02/02/scalable-genomes-clustering-with-adam-and-spark/)
that inspired this.