Security News
PyPI’s New Archival Feature Closes a Major Security Gap
PyPI now allows maintainers to archive projects, improving security and helping users make informed decisions about their dependencies.
RFMix-reader is a Python package designed to efficiently read and process output files generated by RFMix, a popular tool for estimating local ancestry in admixed populations. The package employs a lazy loading approach, which minimizes memory consumption by reading only the loci that are accessed by the user, rather than loading the entire dataset into memory at once.
RFMix-reader
is a Python package designed to efficiently read and process output
files generated by RFMix
, a popular tool
for estimating local ancestry in admixed populations. The package employs a lazy
loading approach, which minimizes memory consumption by reading only the loci that
are accessed by the user, rather than loading the entire dataset into memory at
once. Additionally, we leverage GPU acceleration to improve computational speed.
rfmix-reader
can be installed using pip:
pip install rfmix-reader
GPU Acceleration:
rfmix-reader
leverages GPU acceleration for improved performance. To use this
functionality, you will need to install the following libraries for your specific
CUDA version:
RAPIDS
: Refer to official installation guide herePyTorch
: Installation instructions can be found hereAdditional Notes:
Docker
or Conda
environemnts. Compatibility
may vary.rfmix-reader
.
This is still much faster than processing the files with stardard scripting.Lazy Loading
Efficient Data Access
Seamless Integration
RFMix
output data.Loci Imputation
Whether you are working with large-scale genomic datasets or have limited
computational resources, RFMix-reader
offers an efficient and memory-conscious
solution for reading and processing RFMix
output files. Its lazy loading approach
ensures optimal resource utilization, making it a valuable tool for researchers
and bioinformaticians working with admixed population data.
Simulation data is available for testing two and three population admixture on Synapse: syn61691659.
This works similarly to pandas-plink
:
This is a two-part process.
To reduce computational time and memory, we leverage binary files.
While RFMix
does not generate these directly, we provide a function
for their creation: create_binaries
. This function can also be invoked
via the command line:
create-binaries [-h] [--version] [--binary_dir BINARY_DIR] file_prefix
.
from rfmix_reader import create_binaries
# Generate binary files
file_path = "examples/two_popuations/out/"
binary_dir = "./binary_files"
create_binaries(file_path, binary_dir=binary_dir)
You can also do this on the fly.
from rfmix_reader import read_rfmix
file_path = "examples/two_popuations/out/"
binary_dir = "./binary_files"
loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
generate_binary=True)
We do not have this turned on by default, as it is the
rate limiting step. It can take upwards of 20 to 25 minutes
to run depending on *fb.tsv
file size.
Once binary files are generated, you can the main function to process the RFMix results. With GPU this takes less than 5 minutes.
from rfmix_reader import read_rfmix
file_path = "examples/two_popuations/out/"
loci, rf_q, admix = read_rfmix(file_path)
Note: ./binary_files
is the default for binary_dir
,
so this is an optional parameter.
RFMix-reader
is adaptable for as many population admixtures as
needed.
from rfmix_reader import read_rfmix
file_path = "examples/three_popuations/out/"
binary_dir = "./binary_files"
loci, rf_q, admix = read_rfmix(file_path, binary_dir=binary_dir,
generate_binary=True)
Imputing local ancestry loci information to genotype variant locations improves
integration of the local ancestry information with genotype data. As such, we provide
the interpolate_array
function to efficiently interpolate missing values when local
ancestry loci information is converted to more variable genotype variant locations.
It leverages the power of Zarr
arrays, making it suitable for handling substantial datasets while managing memory
usage effectively.
NumPy
.tqdm
progress bar, providing
real-time feedback during execution._interpolate_col
function to perform
interpolation along each column of the dataset.import pandas as pd
import dask.array as da
# Outer merged dataframe of loci and variant locations
# "i" is from the loci information; "chrom" and "pos" from both dataframes
variant_loci_df = pd.DataFrame({'chrom': ['1', '1', '1', '1'],
'pos': [100, 200, 300, 400],
'i': [1, NA, NA, 2]})
# Dask array of admixture data, which will have few rows than variant_loci_df
admix = da.random.random((2, 3)) # Random data here
# This expands the Dask array (admix) and interpolates missing data
# Default chunk_size = 50000 assuming variant_loci_df 6-9M rows.
# Adjust this based on variant_loci_df size.
z = interpolate_array(variant_loci_df, admix, '/path/to/output', chunk_size=1)
# Check the shape of the resulting Zarr array, which should have the same
# row numbers as variant_loci_df
print(z.shape) # Output: (4, 3)
The helper functions _load_genotypes
and _load_admix
are designed to facilitate
the loading of loci and genotype data for constructing the variant_loci_df
DataFrame.
_load_genotypes(plink_prefix_path)
: This function uses the tensorqtl
library to read genotype data from PLINK files (PGEN
). It returns both the
loaded genotype data and a DataFrame containing variant information, which
includes chromosome and position details. The chromosome identifiers are
formatted to include the "chr" prefix for consistency._load_admix(prefix_path, binary_dir)
: This function employs the
rfmix_reader
library to load local ancestry data from specified paths. It
reads the ancestry information into a suitable format for further processing,
enabling integration with genotype data.These functions ensure accurate loading and formatting of variant and local ancestry data, streamlining subsequent analyses.
def _load_genotypes(plink_prefix_path):
from tensorqtl import pgen
pgr = pgen.PgenReader(plink_prefix_path)
variant_df = pgr.variant_df
variant_df.loc[:, "chrom"] = "chr" + variant_df.chrom
return pgr.load_genotypes(), variant_df
def _load_admix(prefix_path, binary_dir):
from rfmix_reader import read_rfmix
return read_rfmix(prefix_path, binary_dir=binary_dir)
def __testing__():
basename = "/projects/b1213/large_projects/brain_coloc_app/input"
# Local ancestry
prefix_path = f"{basename}/local_ancestry_rfmix/_m/"
binary_dir = f"{basename}/local_ancestry_rfmix/_m/binary_files/"
loci, _, admix = _load_admix(prefix_path, binary_dir)
loci.rename(columns={"chromosome": "chrom",
"physical_position": "pos"},
inplace=True)
# Variant data
plink_prefix = f"{basename}/genotypes/TOPMed_LIBD"
_, variant_df = _load_genotypes(plink_prefix)
variant_df = variant_df.drop_duplicates(subset=["chrom", "pos"],
keep='first')
# Keep all locations for more accurate imputation
variant_loci_df = variant_df.merge(loci.to_pandas(), on=["chrom", "pos"],
how="outer", indicator=True)\
.loc[:, ["chrom", "pos", "i", "_merge"]]
data_path = f"{basename}/local_ancestry_rfmix/_m"
z = interpolate_array(variant_loci_df, admix, data_path)
# Match variant data genomic positions
arr_geno = arr_mod.array(variant_loci_df[~(variant_loci_df["_merge"] == "right_only")].index)
new_admix = z[arr_geno.get(), :]
Note: Following imputation, variant_df
will include genomic positions for
both local ancestry and genotype data.
If you use this software in your work, please cite it.
Benjamin, K. J. M. (2024). RFMix-reader (Version v0.1.15) [Computer software]. https://github.com/heart-gen/rfmix_reader
Kynon JM Benjamin. "RFMix-reader: Accelerated reading and processing for local ancestry studies." bioRxiv. 2024. DOI: 10.1101/2024.07.13.603370.
This work was supported by grants from the National Institutes of Health, National Institute on Minority Health and Health Disparities (NIMHD) K99MD016964 / R00MD016964.
FAQs
RFMix-reader is a Python package designed to efficiently read and process output files generated by RFMix, a popular tool for estimating local ancestry in admixed populations. The package employs a lazy loading approach, which minimizes memory consumption by reading only the loci that are accessed by the user, rather than loading the entire dataset into memory at once.
We found that rfmix-reader demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
PyPI now allows maintainers to archive projects, improving security and helping users make informed decisions about their dependencies.
Research
Security News
Malicious npm package postcss-optimizer delivers BeaverTail malware, targeting developer systems; similarities to past campaigns suggest a North Korean connection.
Security News
CISA's KEV data is now on GitHub, offering easier access, API integration, commit history tracking, and automated updates for security teams and researchers.