ExP Heatmap
Welcome to the ExP Heatmap python
package and command-line tool. Our software is focused on displaying multidimensional data, expecially the so called cross-population data - differences/similarities/p-values/or any other parameters of your choice between several groups/populations. Our method allows the user to quickly and efficiently evaluate millions of p-values or test statistics in one figure.
This tool is being developed in the Laboratory of Genomics and Bioinformatics, Institute of Molecular Genetics of the Academy of Sciences of the Czech Republic, v. v. i.
The ExP Heatmap manual is divided into following sections:
-
Requirements and install
-
Simple example
-
Workflow
-
Command-line tool
-
Python package usage and examples
-
Licence and final remarks
ExP heatmap example - LCT gene
This is the ExP heatmap of human lactose (LCT) gene on chromosome 2 and its surrounding genomic region displaying population differences between 26 populations of 1000 Genomes Project, phase 3. Displayed values are the adjusted rank p-values for cross-population extended haplotype homozygosity (XPEHH) selection test.
1. Requirements and install
Requirements
python
>= 3.8vcftools
for genomic data preparation (not needed if you want just plot your data) (repository)- space on disk (.vcf files are usually quite large)
Install
PyPI repository link (exp_heatmap)
pip install exp_heatmap
Install the latest version directly from this GitHub
pip install git+https://github.com/bioinfocz/exp_heatmap.git
2. Simple example
After installing the package, try to construct ExP heatmap in three simple steps:
- Download the prepared results of the extended haplotype homozygosity (XPEHH) selection test for the part of human chromosome 2, 1000 Genomes Project data: (example results)
- Unpack the zipped folder
chr2.xpehh.example/
in your working directory: unzip chr2.xpehh.example.zip
- Run the following command:
exp_heatmap plot chr2.xpehh.example/ --begin 135287850 --end 136287850 --output LCT_xpehh
The exp_heatmap
package will read the files from chr2.xpehh.example/
folder and create the ExP heatmap and save it as LCT_xpehh.png
file.
3. Workflow
As an workflow example we present an analysis of 1000 Genomes Project, phase 3 data of chromosome 22, chosen especially for its small size and thus reasonable fast computations. It is focused on ADM2 gene (link), which is active especially in reproductive system, and angiogenesis and cardiovascular system in general.
wget "ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr22_GRCh38.genotypes.20170504.vcf.gz" -O chr22.genotypes.vcf.gz
wget "ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel" -O genotypes.panel
wget "https://ddbj.nig.ac.jp/public/mirror_database/1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" -O chr22.genotypes.vcf.gz
wget "https://ddbj.nig.ac.jp/public/mirror_database/1000genomes/release/20130502/integrated_call_samples_v3.20130502.ALL.panel" -O genotypes.panel
vcftools --gzvcf chr22.genotypes.vcf.gz \
--remove-indels \
--recode \
--recode-INFO-all \
--out chr22.genotypes
exp_heatmap prepare chr22.genotypes.recode.vcf chr22.genotypes.recode.zarr
exp_heatmap compute chr22.genotypes.recode.zarr genotypes.panel chr22.genotypes.output
exp_heatmap plot chr22.genotypes.output --begin 50481556 --end 50486440 --title ADM2 --output adm2_GRCh38
exp_heatmap plot chr22.genotypes.output --begin 50910000 --end 50950000 --title ADM2 --output adm2_GRCh37
4. Command-line tool
After installing the exp_heatmap
using pip
as described above, you can use its basic functionality directly from the command line interface.
Get the data
Prepare the data
Extract only SNP
You can use an .vcf or .vcf.gz file
DATA="ALL.chr22_GRCh38.genotypes.20170504"
vcftools --gzvcf $DATA.vcf.gz --remove-indels --recode --recode-INFO-all --out $DATA
vcftools --vcf $DATA.vcf --remove-indels --recode --recode-INFO-all --out $DATA
Prepare data for computing
exp_heatmap prepare DATA.recode.vcf DATA.zarr
Compute pairwise values
exp_heatmap compute DATA.zarr genotypes.panel DATA.output
Besides the default cross-population extended haplotype homozygosity (XPEHH) test, you can use this exp_heatmap compute
with optional parameter -t
and one of the keywords:
xpehh
- computes cross-population extended haplotype homozygosity (XPEHH) test (default),xpnsl
- computes cross-population number of segregating sites by length (NSL) test,delta_tajima_d
- computes delta Tajima's D,hudson_fst
- computes pairwise genetic distance Fst (using the method od Hudson (1992)).
exp_heatmap compute DATA.zarr genotypes.panel DATA.output -t xpnsl
Display ExP heatmap
--begin
, --end
(required)
--title
(optional)
--cmap
(optional)
--output
(optional)
exp_heatmap plot DATA.output --begin BEING --end END --title TITLE --output NAME
5. Python package
Besides using ExP Heatmap as standalone command-line tool, more options and user-defined parameters' changes are available when ExP Heatmap is imported directly into your Python script.
Test files used in these examples (p-values, test results, VCF files etc.) can be downloaded HERE. They are based on results of cross-population selection tests of the lactase (LCT) gene area (
chr2:135,787,850-135,837,184).
Here we outline a solution to 3 possible and most common scenarios where the ExP is being used.
Possible model scenarios:
- a) you have values ready to display
- b) you have some kind of parameters/test results, need to compute the p-values and display them
- c) you only have the input data (VCF), need to compute the parameters/tests, turn them into p-values and display them as ExP heatmap
a) you have values ready to display
Your data are in a *.tsv file, tab-delimited text file (table), where the results or p-values are stored in columns, first column is 'CHROM', second column 'POS', followed by X columns of pairwise parameters (i.e. rank p-values). For 1000 Genomes data, that would mean 325 columns of pair-wise p-values for 26 populations.
from exp_heatmap.plot import plot_exp_heatmap
import pandas as pd
data_to_plot = pd.read_csv("LCT_xpnsl_pvalues.csv", sep="\t")
plot_exp_heatmap(data_to_plot,
begin=135287850,
end=136287850,
title="XP-NSL test on LCT gene in 1000 Genomes Project (phase 3) populations",
cmap="Blues",
output=False,
populations="1000Genomes",
xlabel="LCT gene, 1 Mbp window, 2:135,287,850-136,287,850, GRCh38")
### b) you have some kind of parameters/test results, need to compute the p-values and display them
Here, you will need to compute the p-values using a prepared function in `exp_heatmap` python package.
from exp_heatmap.plot import plot_exp_heatmap, create_plot_input, superpopulations, prepare_cbar_params
results_directory = "chr2_xpnsl_1000Genomes.test/"
data_to_plot = create_plot_input("chr2_xpnsl_1000Genomes.test/", begin=135287850, end=136287850, populations="1000Genomes")
plot_exp_heatmap(data_to_plot,
begin=135287850,
end=136287850,
title="XP-NSL test on LCT gene in 1000 Genomes Project (phase 3) populations",
cmap="Blues",
output=False,
populations="1000Genomes",
xlabel="LCT gene, 1 Mbp window, 2:135,287,850-136,287,850, GRCh38")
cmin, cmax, cbar_ticks = prepare_cbar_params(data_to_plot, n_cbar_ticks=4)
plot_exp_heatmap(data_to_plot,
begin=135000000,
end=137000000,
title="XP-NSL test results in African populations",
cmap="expheatmap",
output="xpnsl_Africa",
vertical_line=([135851073, "rs41525747"], [135851081, "rs41380347"], [135851176, "rs145946881"]),
populations=superpopulations["AFR"],
xlabel="LCT gene region, 2:135,000,000-137,000,000, GRCh38")
### c) you only have the input data (vcf)...
...and need to compute the parameters/tests, turn them into p-values and display them as ExP heatmap.
Here the process will differ depending on what kind test you want to run. Below we give different examples using common tools (`VCFtools`)
and pythonic library `scikit-allel`.
XX VCF to zarr
XX Compute test (different!)
XX Display!
XX
XX
6. Licence and final remarks
The ExP Heatmap package is available under the MIT License. (link)
If you would be interested in using this method in your commercial software under another licence, please, contact us at edvard.ehler@img.cas.cz.
Contributors
Acknowledgement