
BlitzText is a high-performance library for efficient keyword extraction and replacement in strings. It is based on the FlashText and Aho-Corasick algorithms, and is available for both Rust and Python. The main difference from Aho-Corasick is that BlitzText matches greedily and returns only the longest pattern.
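To make the "longest pattern, greedy" behaviour concrete, here is a minimal pure-Python sketch of longest-match scanning over a trie. It is an illustration of the idea only, not BlitzText's implementation: the names `build_trie`, `longest_matches`, and `END` are hypothetical, and the sketch ignores word boundaries, case handling, and fuzzy matching.

```python
END = "<end>"  # multi-character sentinel key, so it can never equal a single text character


def build_trie(keywords):
    """Build a nested-dict trie; the END key stores the keyword that finishes at a node."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def longest_matches(text, trie):
    """Scan left to right and keep only the longest keyword starting at each position."""
    found = []
    i = 0
    while i < len(text):
        node, best, j = trie, None, i
        # Walk as far as the trie allows, remembering the last complete keyword seen.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if END in node:
                best = (node[END], i, j)
        if best:
            found.append(best)
            i = best[2]  # resume after the match, so shorter patterns inside it are skipped
        else:
            i += 1
    return found


trie = build_trie(["new", "york", "new york"])
print(longest_matches("i love new york", trie))
# [('new york', 7, 15)]
```

A plain Aho-Corasick search would typically report "new", "new york", and "york" for this input; the greedy scan reports only the longest one.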
Add this to your Cargo.toml:
[dependencies]
blitztext = "0.1.0"
or
cargo add blitztext
Install the library using pip:
pip install blitztext
use blitztext::KeywordProcessor;

fn main() {
    let mut processor = KeywordProcessor::new();
    processor.add_keyword("rust", Some("Rust Lang"));
    processor.add_keyword("programming", Some("Coding"));

    let text = "I love rust programming";
    let matches = processor.extract_keywords(text, None);
    for m in matches {
        println!("Found '{}' at [{}, {}]", m.keyword, m.start, m.end);
    }

    let replaced = processor.replace_keywords(text, None);
    println!("Replaced text: {}", replaced);
    // Output: "I love Rust Lang Coding"
}
from blitztext import KeywordProcessor

processor = KeywordProcessor()
processor.add_keyword("rust", "Rust Lang")
processor.add_keyword("programming", "Coding")

text = "I love rust programming"
matches = processor.extract_keywords(text)
for m in matches:
    print(f"Found '{m.keyword}' at [{m.start}, {m.end}]")

replaced = processor.replace_keywords(text)
print(f"Replaced text: {replaced}")
# Output: "I love Rust Lang Coding"
For processing multiple texts in parallel:
// Rust
let texts = vec!["Text 1", "Text 2", "Text 3"];
let results = processor.parallel_extract_keywords_from_texts(&texts, None);
# Python
texts = ["Text 1", "Text 2", "Text 3"]
results = processor.parallel_extract_keywords_from_texts(texts)
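The batch call above takes a list of texts and returns one result per text. The following sketch shows that same shape using only the Python standard library. Everything in it is an assumption made for illustration: `extract` is a hypothetical stand-in for a per-text extraction call, `KEYWORDS` is sample data, and the thread pool is not a description of how BlitzText parallelises work internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative keyword table (hypothetical data, mirrors the earlier examples).
KEYWORDS = {"rust": "Rust Lang", "programming": "Coding"}


def extract(text):
    """Hypothetical stand-in for a single-text keyword extraction call."""
    return [KEYWORDS[word] for word in text.lower().split() if word in KEYWORDS]


def extract_all(texts, max_workers=4):
    """Run extract() over every text; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the input iterable.
        return list(pool.map(extract, texts))


results = extract_all(["I love rust", "programming in rust", "nothing here"])
print(results)
# [['Rust Lang'], ['Coding', 'Rust Lang'], []]
```

In CPython, threads running pure-Python code do not execute in parallel on multiple cores, so this sketch demonstrates only the interface, not a speed-up.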
Both Rust and Python implementations support fuzzy matching:
// Rust
let matches = processor.extract_keywords(text, Some(0.8));
# Python
matches = processor.extract_keywords(text, threshold=0.8)
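The text above does not say which similarity metric the threshold applies to. As a rough intuition for what a value such as 0.8 can mean, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a swapped-in metric. This is an assumption, not BlitzText's algorithm, and `fuzzy_find` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def fuzzy_find(text, keyword, threshold=0.8):
    """Return (token, score) pairs for whitespace-separated tokens that are
    similar enough to the keyword. Scores are between 0.0 and 1.0; higher is stricter."""
    hits = []
    for token in text.split():
        score = SequenceMatcher(None, token.lower(), keyword.lower()).ratio()
        if score >= threshold:
            hits.append((token, round(score, 2)))
    return hits


# "programing" (one 'm') is close enough to "programming" to pass at 0.8.
print(fuzzy_find("I love programing", "programming"))
# An unrelated sentence produces no hits.
print(fuzzy_find("I love rust", "programming"))
```

Raising the threshold toward 1.0 in this sketch admits only near-exact tokens, while lowering it admits more distant spellings.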
You can enable case-sensitive matching:
// Rust
let mut processor = KeywordProcessor::with_options(true, false);
processor.add_keyword("Rust", Some("Rust Lang"));
let matches = processor.extract_keywords("I love Rust and rust", None);
// Only "Rust" will be matched, not "rust"
# Python
processor = KeywordProcessor(case_sensitive=True)
processor.add_keyword("Rust", "Rust Lang")
matches = processor.extract_keywords("I love Rust and rust")
# Only "Rust" will be matched, not "rust"
Enable overlapping matches:
// Rust
let mut processor = KeywordProcessor::with_options(false, true);
processor.add_keyword("word", None);
processor.add_keyword("sword", None);
let matches = processor.extract_keywords("I have a sword", None);
// "word" will be matched
# Python
processor = KeywordProcessor(allow_overlaps=True)
processor.add_keyword("word")
matches = processor.extract_keywords("I have a sword")
# "word" will be matched
This library uses the concept of non-word boundaries to determine where words begin and end. By default, alphanumeric characters and underscores are considered part of a word. You can customize this behavior to fit your specific needs.
// Rust
let mut processor = KeywordProcessor::new();
processor.add_keyword("rust", None);
processor.add_keyword("programming", Some("coding"));
let text = "I-love-rust-programming-and-1coding2";
// Default behavior: '-' is a word separator
let matches = processor.extract_keywords(text, None);
assert_eq!(matches.len(), 2);
// Matches: "rust" and "coding"
// Add '-' as a non-word boundary
processor.add_non_word_boundary('-');
// Now '-' is considered part of words
let matches = processor.extract_keywords(text, None);
assert_eq!(matches.len(), 0);
// No matches, because "rust" and "programming" are now part of larger "words"
# Python
processor = KeywordProcessor()
processor.add_keyword("rust")
processor.add_keyword("programming", "coding")
text = "I-love-rust-programming-and-1coding2"
# Default behavior: '-' is a word separator
matches = processor.extract_keywords(text)
assert len(matches) == 2
# Matches: "rust" and "coding"
# Add '-' as a non-word boundary
processor.add_non_word_boundary('-')
# Now '-' is considered part of words
matches = processor.extract_keywords(text)
assert len(matches) == 0
# No matches, because "rust" and "programming" are now part of larger "words"
You can also set the full collection of non-word boundary characters in one call:
// Rust
processor.set_non_word_boundaries(&['-', '_', '@']);
# Python
processor.set_non_word_boundaries(['-', '_', '@'])
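The boundary rule described above can be sketched in a few lines of plain Python: a keyword counts as a match only when the characters immediately before and after it are not word characters. This is an illustration under stated assumptions, not the library's code. `find_whole_word` and `DEFAULT_WORD_CHARS` are hypothetical names, and the default set here covers ASCII letters, digits, and the underscore only.

```python
import string

# Assumed default: ASCII letters, digits, and underscore are "part of a word".
DEFAULT_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def find_whole_word(text, keyword, word_chars=DEFAULT_WORD_CHARS):
    """Return start offsets where keyword occurs without touching a word character."""
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before_ok = start == 0 or text[start - 1] not in word_chars
        after_ok = end == len(text) or text[end] not in word_chars
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(keyword, start + 1)
    return hits


text = "I-love-rust-programming-and-1coding2"

# '-' is not a word character by default, so "rust" stands alone.
print(find_whole_word(text, "rust"))    # [7]
# Digits are word characters, so "coding" inside "1coding2" is rejected.
print(find_whole_word(text, "coding"))  # []
# Treat '-' as part of words: "rust" is now glued to its neighbours.
print(find_whole_word(text, "rust", DEFAULT_WORD_CHARS | {"-"}))  # []
```

Adding a character to the set makes matching stricter, because more neighbouring characters now count as part of the same word.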
BlitzText is designed for high performance, making it suitable for processing large volumes of text. Benchmark details here.
Multi-threaded performance:
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any problems, please file an issue along with a detailed description.
This project is licensed under the MIT License.