Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
This is a package for reading and writing bgen files.
This package uses cython to wrap c++ code for parsing bgen files. It's fairly quick, it can parse genotypes from 500,000 individuals at ~300 variants per second within a single python process (~450 million probabilities per second with a 3GHz CPU). Decompressing the genotype probabilities can be the slow step, zlib decompression takes 80% of the total time, using zstd compressed genotypes is ~2X faster.
This has been optimized for UKBiobank bgen files (i.e. bgen version 1.2 with zlib compressed 8-bit genotype probabilities, but the other bgen versions and zstd compression have also been tested using example bgen files).
pip install bgen
from bgen import BgenReader, BgenWriter
bfile = BgenReader(BGEN_PATH)
rsids = bfile.rsids()
# select a variant by indexing
var = bfile[1000]
# pull out genotype probabilities
probs = var.probabilities # returns 2D numpy array
dosage = var.minor_allele_dosage # returns 1D numpy array for biallelic variant
# iterate through every variant in the file
with BgenReader(BGEN_PATH, delay_parsing=True) as bfile:
for var in bfile:
dosage = var.minor_allele_dosage
# get all variants in a genomic region
variants = bfile.fetch('21', 10000, 5000000)
# or for writing bgen files
import numpy as np
from bgen import BgenWriter
geno = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]).astype(np.float64)
with BgenWriter(BGEN_PATH, n_samples=3) as bfile:
bfile.add_variant(varid='var1', rsid='rs1', chrom='chr1', pos=1,
alleles=['A', 'G'], genotypes=geno)
class BgenReader(path, sample_path='', delay_parsing=False)
# opens a bgen file. If a bgenix index exists for the file, the index file
# will be opened automatically for quicker access of specific variants.
Arguments:
path: path to bgen file
sample_path: optional path to sample file. Samples will be given integer IDs
if sample file is not given and sample IDs not found in the bgen file
delay_parsing: True/False option to allow for not loading all variants into
memory when the BgenFile is opened. This can save time when iterating
across variants in the file
Attributes:
samples: list of sample IDs
header: BgenHeader with info about the bgen version and compression.
Methods:
slicing: BgenVars can be accessed by slicing the BgenFile e.g. bfile[1000]
iteration: variants in a BgenFile can be looped over e.g. for x in bfile: print(x)
fetch(chrom, start=None, stop=None): get all variants within a genomic region
drop_variants(list[int]): drops variants by index from being used in analyses
with_rsid(rsid): returns list of BgenVars with given rsid
at_position(pos): returns list of BgenVars at a given position
varids(): returns list of varids for variants in the bgen file.
rsids(): returns list of rsids for variants in the bgen file.
chroms(): returns list of chromosomes for variants in the bgen file.
positions(): returns list of positions for variants in the bgen file.
class BgenVar(handle, offset, layout, compression, n_samples):
# Note: this isn't called directly, but instead returned from BgenFile methods
Attributes:
varid: ID for variant
rsid: reference SNP ID for variant
chrom: chromosome variant is on
pos: nucleotide position variant is at
alleles: list of alleles for variant
is_phased: True/False for whether variant has phased genotype data
ploidy: list of ploidy for each sample. Samples are ordered as per BgenFile.samples
minor_allele: the least common allele (for biallelic variants)
minor_allele_dosage: 1D numpy array of minor allele dosages for each sample
alt_dosage: 1D numpy array of alt allele dosages for each sample
probabilities: 2D numpy array of genotype probabilities, one sample per row
These are most likely for biallelic diploid variants. In that scenario
unphased probabilities have three columns, for homozygous first allele
(AA), heterozygous (Aa), homozygous second allele (aa).
In contrast, phased probabilities (for a biallelic diploid variant) would
have four columns, first two for haplotype 1 (hap1-allele1, hap1-allele2),
last two for haplotype 2 (hap2-allele1, hap2-allele2).
BgenVars can be pickled e.g. pickle.dumps(var)
class BgenWriter(path, n_samples, samples=[], compression='zstd' layout=2, metadata=None)
# opens a bgen file to write variants to. Automatically makes a bgenix index file
Arguments:
path: path to write data to
n_samples: number of samples that you have data for
samples: list of sample IDs (same length as n_samples)
compression: compression type: None, 'zstd', or 'zlib' (default='zstd')
layout: bgen layout format (default=2)
metadata: any additional metadata you want o include in the file (as str)
Methods:
add_variant_direct(variant)
Arguments:
variant: BgenVar, to be directly copied from one begn file to
another. This can be done when the new bgen file is for the same
set of samples as the one being read from. This is much faster
due to not having to decode and re-encode the genotype data.
add_variant(varid, rsid, chrom, pos, alleles, genotypes, ploidy=2,
phased=False, bit_depth=8)
Arguments:
varid: variant ID e.g. 'var1'
rsid: reference SNP ID e.g. 'rs1'
chrom: chromosome the variant is on e.g 'chr1'
pos: nucleotide position of the variant e.g. 100
alleles: list of allele strings e.g. ['A', 'C']
genotypes: numpy array of genotype probabilities, ordered as per the
bgen samples e.g. np.array([[0, 0, 1], [0.5, 0.5, 0]])
ploidy: ploidy state, either as integer to indicate constant ploidy
(e.g. 2), or numpy array of ploidy values per sample, e.g. np.array([1, 2, 2])
phased: whether the genotypes are for phased data or not (default=False)
bit_depth: how many bits to store each genotype as (1-32, default=8)
FAQs
Package for loading data from bgen files
We found that bgen demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.