Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
This code assesses whether de novo single-nucleotide variants are closer together within the coding sequence of a gene than expected by chance, or whether the amino acids for those SNVs are closer together in the protein structure than expected by chance. We use local-sequence based mutation rates to account for differential mutability of regions. The default rates are per-trinucleotide based see Nature Genetics 46:944–950, but you can use your own rates, or even longer sequence contexts, such as 5-mers or 7-mers.
pip install denovonear
Analyse de novo mutation clustering in the coding sequence with the CLI tool:
denovonear cluster \
--in data/example.grch38.dnms.txt \
--gencode data/example.grch38.gtf \
--fasta data/example.grch38.fa \
--out output.txt
Or test clustering within protein structures:
denovonear cluster-structure \
--in data/example.grch38.dnms.txt \
--structures PATH_TO_STRUCTURES.tar \
--gencode data/example.grch38.gtf \
--fasta data/example.grch38.fa \
--out output.structure.txt
explanation of options:
--in
: path to tab-separated table of de novo mutations. See example table below for columns, or example.grch38.dnms.txt
in data folder.--gencode
: path to GENCODE annotations in
GTF format for
transcripts and exons e.g.
example release. Can be gzipped, or uncompressed.--fasta
: path to genome fasta, matching genome build of gencode file--structures
: path to tar file containing PDB structures for all protein coding
genes. This has only been tested with AlphaFold human proteome tar files
e.g. https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000005640_9606_HUMAN_v4.tar
The code identifies the approriate structure pdb by first searching for uniprot
IDs via ensembl (starting with the transcript ID). This only permits variants
in the canonical transcript, and returns nan if a) no structure PDB is found,
b) multiple structures exists for the uniprot IDs, c) multiple chains are
in the structure file, d) the structure has missing residues or e) the
number of residues in the structure file does not match what is expected from
the CDS length. This uses the carbon atom coordinates for each residue to
place the amino acid position, and computes Euclidean distances between
these coordinates.If the --gencode or --fasta options are skipped (e.g. denovonear cluster --in INFILE --out OUTFILE
), gene annotations will be retrieved via an ensembl web
service. For that, you might need to specify --genome-build grch38
to ensure
the gene coordinates match your de novo mutation coordinates.
--rates PATHS_TO_RATES_FILES
--rates-format context OR genome
--cache-folder PATH_TO_CACHE_DIR
--genome-build "grch37" or "grch38" (default=grch37)
The rates option operates in two ways. The first (which requires --rates-format to be "context") is to pass in one path to a tab separated file with three columns: 'from', 'to', and 'mu_snp'. The 'from' column contains DNA sequence (where the length is an odd number) with the base to change at the central nucleotide. The 'to' column contains the sequence with the central base modified. The 'mu_snp' column contains the probability of the change (as per site per generation).
The second way to use the rates option is to pass in multiple paths to VCFs containing mutation rates for every genome position. This requires the --rates-format to be "genome". Currently the only supported rates files are ones from Roulette (https://www.biorxiv.org/content/10.1101/2022.08.20.504670v1), which can be found here: http://genetics.bwh.harvard.edu/downloads/Vova/Roulette/. This needs both the VCFs and their index files.
The cache folder defaults to making a folder named "cache" within the working directory. The genome build indicates which genome build the coordinates of the de novo variants are based on, and defaults to GRCh37.
gene_name | chr | pos | consequence | snp_or_indel |
---|---|---|---|---|
OR4F5 | chr1 | 69500 | missense_variant | DENOVO-SNP |
OR4F5 | chr1 | 69450 | missense_variant | DENOVO-SNP |
from denovonear.gencode import Gencode
from denovonear.cluster_test import cluster_de_novos
from denovonear.mutation_rates import load_mutation_rates
gencode = Gencode('./data/example.grch38.gtf', './data/example.grch38.fa')
symbol = 'OR4F5'
de_novos = {'missense': [69500, 69450, 69400], 'nonsense': []}
rates = load_mutation_rates()
p_values = cluster_de_novos(de_novos, gencode[symbol], rates, iterations=1000000)
Pull out site-specific rates by creating Transcript objects, then get the rates by consequence at each site
from denovonear.rate_limiter import RateLimiter
from denovonear.load_mutation_rates import load_mutation_rates
from denovonear.load_gene import construct_gene_object
from denovonear.site_specific_rates import SiteRates
# extract transcript coordinates and sequence from Ensembl
async with RateLimiter(per_second=15) as ensembl:
transcript = await construct_gene_object(ensembl, 'ENST00000346085')
mut_rates = load_mutation_rates()
rates = SiteRates(transcript, mut_rates)
# rates are stored by consequence, but you can iterate through to find all
# possible sites in and around the CDS:
for cq in ['missense', 'nonsense', 'splice_lof', 'synonymous']:
for site in rates[cq]:
site['pos'] = transcript.get_position_on_chrom(site['pos'], site['offset'])
# or if you just want the summed rate
rates['missense'].get_summed_rate()
You can identify transcripts containing de novos events with the
identify_transcripts.py
script. This either identifies all transcripts for a
gene with one or more de novo events, or identifies the minimal set of
transcripts to contain all de novos (where transcripts are prioritised on the
basis of number of de novo events, and length of coding sequence). Transcripts
can be identified with:
denovonear transcripts \
--de-novos data/example_de_novos.txt \
--out output.txt \
--all-transcripts
Other options are:
--minimise-transcripts
in place of --all-transcripts
, to find the minimal
set of transcripts--genome-build "grch37" or "grch38" (default=grch37)
You can generate mutation rates for either the union of alternative transcripts
for a gene, or for a specific Ensembl transcript ID with the
construct_mutation_rates.py
script. Lof and missense mutation rates can be
generated with:
denovonear rates \
--genes data/example_gene_ids.txt \
--out output.txt
The tab-separated output file will contain one row per gene/transcript, with each line containing a transcript ID or gene symbol, a log10 transformed missense mutation rate, a log10 transformed nonsense mutation rate, and a log10 transformed synonymous mutation rate.
FAQs
Package to examine de novo clustering
We found that denovonear demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.