Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
.. image:: https://forthebadge.com/images/badges/built-with-science.svg :alt: Made with science :target: https://cgrb.oregonstate.edu .. image:: https://img.shields.io/github/v/release/davised/automlsa2 :alt: GitHub version :target: https://github.com/davised/automlsa2/releases/latest .. image:: https://img.shields.io/pypi/v/automlsa2 :alt: PyPI package version :target: https://pypi.org/project/automlsa2 .. image:: https://img.shields.io/github/release-date/davised/automlsa2 :alt: GitHub Release Date :target: https://github.com/davised/automlsa2/releases .. image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg :alt: Maintained? Yes :target: https://github.com/davised/automlsa2/graphs/commit-activity .. image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg :alt: Made with python :target: https://python.org .. image:: https://img.shields.io/lgtm/grade/python/g/davised/automlsa2.svg?logo=lgtm&logoWidth=18 :alt: Language grade: Python :target: https://lgtm.com/projects/g/davised/automlsa2/context:python .. image:: https://img.shields.io/github/last-commit/davised/automlsa2 :alt: Latest commit :target: https://github.com/davised/automlsa2/commits .. image:: https://img.shields.io/github/commits-since/davised/automlsa2/latest :alt: GitHub commits since latest release (by date) :target: https://github.com/davised/automlsa2/commits/main .. image:: https://img.shields.io/pypi/dm/automlsa2 :alt: Downloads per month :target: https://pypistats.org/packages/automlsa2 .. image:: https://img.shields.io/tokei/lines/github/davised/automlsa2 :alt: Lines of code :target: https://github.com/davised/automlsa2.git .. image:: https://img.shields.io/badge/code%20style-black-000000.svg :alt: Black format :target: https://github.com/psf/black .. image:: https://img.shields.io/badge/mypy-checked-blue :alt: Checked with mypy :target: http://mypy-lang.org/ .. image:: https://img.shields.io/github/issues/davised/automlsa2 :alt: GitHub issues :target: https://github.com/davised/automlsa2/issues .. image:: https://img.shields.io/github/forks/davised/automlsa2 :alt: GitHub forks :target: https://github.com/davised/automlsa2/network .. image:: https://img.shields.io/github/stars/davised/automlsa2 :alt: GitHub stars :target: https://github.com/davised/automlsa2/stargazers
.. contents:: Table of Contents :backlinks: none
The intended audience is scientific researchers, computational biologists, and bioinformaticians who are interested in exploring phylogenetic and phylogenomic relationships between genes and organisms. The general idea is to allow the input of sequence data along with marker genes and output a robust phylogenetic tree. I've implemented commands to help with the installation of external dependencies, and I hope the software is easy to use.
If you have feature requests or otherwise have ideas about how to make this software better, please submit an issue with your ideas!
PhD/Masters students and undergraduates are especially encouraged to submit issues if they are having trouble using this software.
Windows users, please install WSL to make use of this software. Using a Linux distribution will make your life as a computational researcher significantly easier.
automlsa2 is distributed on PyPI <https://pypi.org/project/automlsa2/>
_ as
a universal wheel and is available on Linux/macOS and Windows (untested) and
supports Python 3.7+ and PyPy.
.. code-block:: bash
$ python3 -m pip install -U automlsa2
While I will do my best to keep the git version usable, stick to a release and/or pypi install for the most stable experience.
git version install:
.. code-block:: bash
$ git clone https://github.com/davised/automlsa2.git
$ cd automlsa2
$ python3 -m pip install -r requirements.txt
$ python3 -m pip install -U .
for developers, clone as above, then:
.. code-block:: bash
$ python3 -m pip install -e . --no-use-pep517
Python modules:
See requirements.txt
for more info.
External programs:
NCBI BLAST+ >= 2.10.0 <https://blast.ncbi.nlm.nih.gov>
_mafft >= 7.471 <https://mafft.cbrc.jp/alignment/software/>
_IQ-TREE COVID-19 release >= 2.1.1 <http://www.iqtree.org>
_You can install external programs using the automlsa2 --install_deps
command. These will be installed to ${HOME}/.local/external
unless
otherwise specified.
.. code-block:: bash
$ automlsa2 --files Genus_species_1.fna Genus_species_2.fna ... \
Genus_species_N.fna --query queries.fasta -t THREADS -- runID
Alternatively:
.. code-block:: bash
$ automlsa2 --dir path/to/genomes --query queries.fasta -t THREADS \
-- runID
automlsa2 is a re-imagination of autoMLSA.pl <https://github.com/osuchanglab/autoMLSA>
_
The entire codebase has been re-written in python. While the general algorithm produces similar output, and several steps are shared, there are many updates and differences between the two programs, which will be covered later.
The general overview can be summarized here:
BLAST searches are threaded, or, optionally, written to a file to be submitted to a compute cluster. mafft alignment commands can also be written to a file for submission to a compute cluster.
Input query files and genome directories are scanned for updates - if sequences are added, removed, or changed, the analysis is re-done.
Multiple queries targeting the same gene sequence can be used to improve coverage of disparate gene sequences, e.g. attempting to cover an entire phylum with multiple reference genomes being used.
.. code-block:: bash
$ automlsa2 -h
usage: automlsa2 [-h] [--query QUERY [QUERY ...]] [--files FILES [FILES ...]]
[--dir DIR [DIR ...]] [-e EVALUE] [-c COVERAGE] [-i IDENTITY]
[-p {blastn,tblastn}] [--config CONFIG] [--missing_check]
[-t THREADS] [--dups] [--allow_missing ALLOW_MISSING]
[--outgroup OUTGROUP] [--protect]
[--checkpoint {validate,preblast,filtering,prealign,postalign,nexus,none}]
[--install_deps [INSTALL_DEPS]] [--external EXTERNAL]
[--mafft MAFFT] [--iqtree IQTREE] [--debug] [--version]
[--quiet]
runid
This is a rewrite of autoMLSA.pl. Generates automated multi-locus sequence analyses.
positional arguments:
runid Name of the run directory.
optional arguments:
-h, --help show this help message and exit
--query <QUERY [QUERY ...]> Path to file with input seq(s).
--files <FILES [FILES ...]> Path to the target genome FASTA files.
--dir <DIR [DIR ...]> Path to the target genome directory with FASTA files.
-e EVALUE, --evalue EVALUE E-value cutoff for BLAST searches. [1e-5]
-c COVERAGE, --coverage COVERAGE Sets the coverage cut-off threshold. [50]
-i IDENTITY, --identity IDENTITY Sets the identity cut-off threshold. [30]
-p PROGRAM, --program PROGRAM Which BLAST program to run. [tblastn] {tblastn, blastn}
--config CONFIG Path to configuration json file to copy.
--missing_check Use this to confirm that settings have been
checked when genes are missing.
-t THREADS, --threads THREADS Number of threads to use. [1]
--dups Allow for duplicate query names for more sequence
coverage across disparate organisms.
--allow_missing ALLOW_MISSING Allow for N missing genes per genome. [0]
--outgroup OUTGROUP Name of outgroup file or strain to root on.
--protect Save files from getting overwritten. By default, as input
files update, older alignments and trees are deleted.
--checkpoint CHECKPOINT Name of stage to stop computing on. [none]
{validate,preblast,filtering,prealign,postalign,nexus,none}
--install_deps <[INSTALL_DEPS]> Install dependencies into given directory. [/.local/external]
--external EXTERNAL Path to installed external programs. [/.local/external]
--mafft MAFFT mafft settings [--localpair --maxiterate 1000 --reorder]
--iqtree IQTREE iqtree2 settings [-m MFP -B 1000 -alrt 1000 --msub
nuclear --merge rclusterf]
--debug Turn on debugging messages.
--version show program's version number and exit
--quiet Turn off progress messages.
One or more input target genome FASTA files is required, either using
--files
or --dir
. Additionally, one or more query FASTA files
containing one or more query gene sequences is necessary for analysis.
By default, protein queries are expected, and nucleotide FASTA sequence is
required for the target genomes. tblastn
is used to target the genome
sequences using the amino acid queries. blastn
is also available, targeting
the genome sequences using nucleotide queries.
Threads will speed things up significantly. BLAST searches are threaded in python; submitting multiple threads to the blast executable often does not result in much speed up, so each BLAST search is run with one CPU given.
Query marker genes often come from a well-studied representative of, at most, the same genus. Intergenera phylogenies should have a representative sequence from each genus. This can be accomplished by giving all examples of a particular gene the same name in the reference FASTA file. e.g.
.. code-block:: bash
>Gene1 Refgenus3 refspecies GHIGene1 Refgenus1 refspecies ABC
Gene1 Refgenus2 refspecies DEF
This ^ FASTA ^ file would have three representatives of Gene1 in the analysis. The resulting alignments would have one copy of the gene, with the best hits from each target genome included.
Target genome files will be named based on the filename in the final output. Generally, one will want to have Genus_species_strain.fasta or G_species_strain.fasta as the filenames prior to analysis.
Genomes can be downloaded using my get_assemblies
program, here:
https://pypi.org/project/get-assemblies/. Locally produced genomes can be
renamed as required.
The test data are included in a separate repository - https://github.com/davised/automlsa2-examples
To download this test data, you can run git submodule update --init
.
Alternatively, you can run the run_tests.sh
script and it will download
the git submodule for you and run the test command.
The tests should take about 2-3 minutes with 4 CPUs.
☐ Write detailed list of intermediate files.
☐ Compare functionality of this version to prior autoMLSA.pl version.
☑ Check for version numbers for external programs.
Bug reports are encouraged! Submit a github issue and I'll be happy to take a look. Also, feel free to clone and submit merge requests.
Ed Davis <mailto:ed@cgrb.oregonstate.edu>
_
Special thanks for helping me test the software and get the python code packaged:
Alex Weisberg <https://github.com/alexweisberg>
_Shawn O'Neil <https://github.com/oneilsh>
_Also, thanks to these groups for supporting me through my scientific career:
OSU Chang Lab <https://github.com/osuchanglab>
_Center for Genome Research and Biocomputing @ OSU <https://cgrb.oregonstate.edu>
_automlsa2 is distributed under the terms listed in the LICENSE
file. The
software is free for non-commercial use.
Copyright (c) 2020 Oregon State University
All Rights Reserved.
FAQs
Automated Multi-Locus Sequence Analysis tool
We found that automlsa2 demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.