🐍🗡️ PySWRD
Cython bindings and Python interface to SWORD (Smith Waterman On Reduced Database), a method for fast database search.
🗺️ Overview
Searching a sequence inside a database of target sequences involves aligning
the sequence to all the targets to find the highest scoring ones, which has
a high computational cost. Several methods have been proposed over the years
that use a pre-filter to select. In BLAST[1], k-mers are extracted
from the query, and only targets containing high-scoring k-mers, with respect to
the scoring matrix, are actually aligned.
SWORD[2] proposes a pre-filter built
on perfect hashing of short mismatching k-mers. The k-mers generated from the
query sequence also include k-mers with mismatches to improve sensitivity.
When a k-mer is found in a target sequence, SWORD computes the diagonal where it
is located, similarly to FASTA[3]. Target sequences are then selected based on the
number of hits they have on the same diagonal. The pairwise alignment
is then handled by the platform-accelerated Opal[4]
library.
PySWRD is a Python module that provides bindings to
the heuristic filter part of SWORD
using Cython. It implements a user-friendly, Pythonic
interface to build a heuristic filter, process a database in chunks, and
produce the indices of targets passing the filter. The resulting indices
can be used to filter a PyOpal database,
using Opal for pairwise alignment like
the original C++ implementation.
- no binary dependency: PySWRD is distributed as a Python package, so
you can add it as a dependency to your project, and stop worrying about the
SWORD binary being present on the end-user machine.
- no intermediate files: Everything happens in memory, in a Python object
you control, so you don't have to invoke the SWORD CLI using a sub-process
and temporary files.
- better portability: Using only the heuristic filter of SWORD allows
the code to be independent of the local CPU features, unlike SWORD and
Opal which require SIMD. PySWRD delegates the SIMD compilation and
dynamic dispatch to PyOpal to make the package easier to install. It
also benefits from the wider platform support of PyOpal compared to
the original Opal, featuring support for Windows and for Aarch64 CPUs.
🔧 Installing
PySWRD is available for all modern Python versions (3.6+).
It can be installed directly from PyPI,
which hosts some pre-built x86-64 wheels for Linux, MacOS, and Windows,
as well as the code required to compile from source with Cython:
$ pip install pyswrd
💡 Example
PySWRD does not provide I/O, so the sequences to be used have to be loaded through
another library, such as Biopython. PySWRD only requires
the sequences to be available as Python strings:
targets = [
'MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPK',
'MSIIGATRLQNDKSDTYSAGPCYAGGCSAFTPRGTCGKDWDLGEQTCASG',
'MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPGSIVPWKMNRE',
'MYQAINPCPQSWYGSPQLEREIVCKMSGAPHYPNYYPVHPNALGGAWFDT',
'MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSN'
]
queries = [
'MASNTVSAQGGSNRPVRDFSNIQDVAQFLLFDPIWNEQPG',
'MSFKVYDPIAELIATQFPTSNPDLQIINNDVLVVSPHKIT',
'MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGK'
]
Use the high-level search
function, which wraps the internal classes in a single
function to quickly run many-to-many searches in the event all your sequences are in
memory. It expects the sequences as iterable of Python strings, and yields hits
passing E-value and alignment thresholds:
import pyswrd
for hit in pyswrd.search(queries, targets):
print(hit.query_index, hit.target_index, hit.score, hit.evalue)
Different parameters can be passed to pyswrd.search
and are passed to the
SWORD filter and Opal alignment. For instance, to run SWORD in fast mode
instead of the default sensitive mode, and using the PAM70 matrix instead
of BLOSUM62, use:
for hit in pyswrd.search(queries, targets, scorer_name="PAM70", score_threshold=0, kmer_length=5):
print(hit.query_index, hit.target_index, hit.score, hit.evalue)
By default multithreading is supported, using one thread per CPU on the local
machine as reported by os.cpu_count
, but it can be changed with the threads
argument:
for hit in pyswrd.search(queries, targets, threads=1):
print(hit.query_index, hit.target_index, hit.score, hit.evalue)
You can also use the pyswrd.HeuristicFilter
class directly if you wish to
manage the data yourself, or if you want to use a different aligner.
⏱️ Benchmarks
The table below shows the time for running pyswrd.search
using 196 proteins
as queries (uniprot_sprot196.fasta
) against a database of 12,701 proteins
(uniprot_sprot12071.fasta
) pre-loaded into memory:
| threads=1 | threads=2 | threads=4 | threads=8 | threads=12 |
---|
max_candidates=10 | 0.87s | 0.83s | 0.83s | 0.80s | 0.76s |
max_candidates=50 | 0.98s | 0.91s | 0.98s | 0.97s | 1.04s |
max_candidates=100 | 1.24s | 1.33s | 1.44s | 1.63s | 1.67s |
max_candidates=500 | 1.86s | 1.83s | 1.95s | 2.09s | 2.15s |
max_candidates=1000 | 2.87s | 2.64s | 2.83s | 2.82s | 2.90s |
max_candidates=5000 | 9.33s | 8.11s | 7.59s | 6.60s | 6.06s |
max_candidates=15000 | 21.50s | 15.85s | 14.74s | 11.83s | 11.34s |
max_candidates=30000 | 23.44s | 16.13s | 14.61s | 12.47s | 11.08s |
no filter (Opal) | 31.38s | 23.60s | 19.57s | 15.43s | 14.60s |
BLAST+ (blastp ) | 7.46s | 4.97s | 4.01s | 3.63s | 3.66s |
The max_candidates
parameter controls the strictness of the SWORD heuristic filter, and reduces
the total number of alignments made by Opal, at the cost of a lowered sensivity
(see SWORD Supplementary Figs. S1 and S2.).
SWORD uses 15,000 candidates in fast mode and 30,000 candidates in sensitive mode by default.
This was benchmarked against the NCBI NR
database, which contains more than 54M sequences; it is likely a smaller max_candidates
value can
be selected for smaller databases and/or databases with less redundant sequences without loss of sensitivity.
💭 Feedback
⚠️ Issue Tracker
Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker
if you need to report or ask something. If you are filing in on a bug,
please include as much information as you can about the issue, and try to
recreate the same bug in a simple, easily reproducible situation.
🏗️ Contributing
Contributions are more than welcome! See
CONTRIBUTING.md
for more details.
📋 Changelog
This project adheres to Semantic Versioning
and provides a changelog
in the Keep a Changelog format.
⚖️ License
This library is provided under the GNU General Public License v3.0.
SWORD was written by Robert Vaser and is distributed under the terms of the
GPLv3 as well. See vendor/sword/LICENSE
for more information. SWORD redistributes additional
libraries under the terms of the MIT License.
This project is in no way not affiliated, sponsored, or otherwise endorsed
by the SWORD authors. It was developed
by Martin Larralde during his PhD project
at the European Molecular Biology Laboratory in
the Zeller team.
📚 References
- [1] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, David J. Lipman. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi:10.1016/S0022-2836(05)80360-2. PMID:2231712.
- [2] Robert Vaser, Dario Pavlović, Mile Šikić. SWORD—a highly efficient protein database search. Bioinformatics, Volume 32, Issue 17, September 2016, Pages i680–i684, doi:10.1093/bioinformatics/btw445.
- [3] David J. Lipman, William R. Pearson. Rapid and sensitive protein similarity searches. Science. 1985 Mar 22;227(4693):1435-41. doi:10.1126/science.2983426. PMID:2983426.
- [4] Korpar Matija, Martin Šošić, Dino Blažeka, Mile Šikić. SW#db: ‘GPU-Accelerated Exact Sequence Similarity Database Search’. PLoS One. 2015 Dec 31;10(12):e0145857. doi:10.1371/journal.pone.0145857. PMID:26719890. PMC4699916.