🔪🧅 Diced 
A Rust re-implementation of the MinCED algorithm to Detect Instances of CRISPRs in Environmental Data.

🗺️ Overview
MinCED is a method developed by Connor T. Skennerton
to identify Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs)
in isolate and metagenomic-assembled genomes. It was derived from the CRISPR
Recognition Tool [1]. It uses a fast scanning algorithm to identify
candidate repeats, combined with an extension step to find maximally spanning
regions of the genome that feature a CRISPR repeat.
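The scanning idea can be illustrated with a toy sketch: take the k-mer at each position and look for the same k-mer again within the window where the next repeat could plausibly start. This is a simplified illustration only, not the MinCED or Diced implementation; the function name `find_candidate_repeats` and the parameters `k`, `min_gap` and `max_gap` are hypothetical, and the extension step is left out.

```python
def find_candidate_repeats(sequence, k=8, min_gap=20, max_gap=60):
    """Return (first, second) index pairs where the k-mer starting at
    `first` occurs again between `min_gap` and `max_gap` positions downstream.

    Hypothetical helper for illustration; not part of the diced API.
    """
    candidates = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Only search the window where the next repeat could start.
        lo = i + min_gap
        hi = min(i + max_gap + k, len(sequence))
        j = sequence.find(kmer, lo, hi)
        if j != -1:
            candidates.append((i, j))
    return candidates


# Toy input: three copies of an 8-nt repeat separated by 22-nt spacers.
repeat = "GTTTCAGA"
toy = repeat + "AC" * 11 + repeat + "TG" * 11 + repeat
pairs = find_candidate_repeats(toy)
print(pairs)
```

A real detector would then extend each candidate pair left and right to recover the full repeat and chain consecutive repeats into one array, which is the extension step described above.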
Diced is a Rust reimplementation of the MinCED method, using the original
Java code as a reference. It produces exactly the same results as MinCED,
corrects some bugs, and is much faster. The Diced implementation is available
as a Rust library for convenience.
This is the Python version; a Rust crate is available as well.
📋 Features
- library interface: The Rust implementation is written as a library to facilitate
reuse in other projects. It is used to implement a Python library, using
PyO3 to generate a native extension.
- single dependency: Diced is distributed as a Python package, so you
can add it as a dependency to your project, and stop worrying about the
Java Virtual Machine being present on the end-user machine.
- zero-copy: The Scanner, which iterates over candidate CRISPRs, is zero-copy if
provided with a simple &str reference, but it also supports data behind smart
pointers such as Rc<str> or Arc<str>. The original Python string and its
substrings are never copied.
- fast string matching: The Java implementation uses a handwritten implementation
of the Boyer-Moore algorithm [2], while the Rust implementation uses the
str::find method of the standard library, which implements the Two-way
algorithm [3]. In addition, the memchr crate can be used as a fast SIMD-capable
implementation of the memmem function.
💡 Example
Diced supports any sequence in string format.
import Bio.SeqIO
import diced
record = Bio.SeqIO.read("diced/tests/data/Aquifex_aeolicus_VF5.fna", "fasta")
sequence = str(record.seq)
for crispr in diced.scan(sequence):
    print(
        crispr.start,
        crispr.end,
        len(crispr.repeats),
        crispr.repeats[0],
    )
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue
tracker if you need to report
or ask something. If you are filing a bug report, please include as much
information as you can about the issue, and try to recreate the same bug
in a simple, easily reproducible situation.
📋 Changelog
This project adheres to Semantic Versioning
and provides a changelog
in the Keep a Changelog format.
⚖️ License
This library is provided under the open-source
GPLv3 license, or later.
The code for this implementation was derived from the
MinCED source code, which is
available under the GPLv3 as well.
This project is in no way affiliated with, sponsored, or otherwise endorsed
by the original MinCED authors. It was developed
by Martin Larralde during his PhD project at
the Leiden University Medical Center in the
Zeller team.
📚 References
- [1] Bland, C., Ramsey, T. L., Sabree, F., Lowe, M., Brown, K., Kyrpides, N. C., & Hugenholtz, P. (2007). 'CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats'. BMC Bioinformatics, 8, 209. PMID:17577412 doi:10.1186/1471-2105-8-209
- [2] Boyer, R. S. & Moore, J. S. (1977). 'A fast string searching algorithm'. Commun. ACM 20, 10, 762–772. doi:10.1145/359842.359859
- [3] Crochemore, M. & Perrin, D. (1991). 'Two-way string-matching'. J. ACM 38, 3, 650–674. doi:10.1145/116825.116845