PopStrat is a simple example of population stratification analysis on genomics data using "deep learning" (neural networks).
That is, it aims to predict which population group an individual belongs to based on their genome.
For a more detailed explanation see this blog post on bdgenomics.org.
The following technologies are used:
The example consists of a single Scala class: PopStrat
.
Before building and running PopStrat ensure you have version 7 or later of the
Java JDK installed.
To build from source first download and install Maven.
Then at the command line type:
mvn clean package
This will build a JAR (target/uber-popstrat-0.1-SNAPSHOT.jar) containing the PopStrat
class,
as well as all of its dependencies.
First download Spark version 1.2.0 and unpack it on your machine.
Next you'll need to get some genomics data. Go to your
nearest mirror of the 1000 genomes FTP site.
From the release/20130502/
directory download
the ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
file and
the integrated_call_samples_v3.20130502.ALL.panel
file. The first file file is the genotype data for chromosome 22,
and the second file is the panel file, which describes the population group for each sample in the genotype data.
Unzip the genotype data before continuing. This will require around 10GB of disk space.
To speed up execution and save disk space you can convert the genotype VCF file to ADAM
format (using the ADAM transform
command) if you wish. However
this will take some time up-front. Both ADAM and VCF formats are supported.
Next run the following command:
YOUR_SPARK_HOME/bin/spark-submit --class "com.neilferguson.PopStrat" --master local[6] --driver-memory 6G target/uber-popstrat-0.1-SNAPSHOT.jar <genotypesfile> <panelfile>
Replacing <genotypesfile> with the path to your genotype data file (ADAM or VCF), and <panelfile> with the panel file
from 1000 genomes.
This runs PopStrat using a local (in-process) Spark master with 6 cores and 6GB of RAM. You can run against a different
Spark cluster by modifying the options in the above command line. See the
Spark documentation for further details.
Using the above data PopStrat may take up to 2-3 hours to run, depending on hardware. When it is finished you should
see output that looks something like the following:
Confusion Matrix (vertical: actual; across: predicted):
ASW CHB GBR Error Rate
ASW 60 1 0 0.0164 = 1 / 61
CHB 0 103 0 0.0000 = 0 / 103
GBR 0 1 90 0.0110 = 1 / 91
Totals 60 105 90 0.0078 = 2 / 255
This is a confusion matrix which shows the predicted versus the actual
populations. All being well, you should see an overall accuracy of more than 99%
(only one or two predictions should be incorrect).
A single Scala class at src/main/scala/com/neilferguson/PopStrat.scala
contains all of the code for PopStrat.
See this blog post on bdgenomics.org for
a deep dive into the code.
The code is fairly straightforward and follows the following high level flow:
SchemaRDD
with each column representing a variant and each row representing a sampleSchemaRDD
to an H2O data frame.Thanks to the folks at Big Data Genomics for the
original blog post
that inspired this.