
abbreviation-extractor
Abbreviation Extractor is a high-performance Rust library with Python bindings for extracting abbreviation-definition pairs from text, with a particular focus on biomedical text. It implements an improved version of the Schwartz-Hearst algorithm, offering enhanced accuracy and speed. It is based on the original Python implementation.
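To give a feel for the underlying technique, here is a minimal pure-Python sketch of the classic Schwartz-Hearst matching step. It is not this library's code and it omits the improvements mentioned above. The function names `is_valid_short_form`, `best_long_form`, and `extract_pairs` are hypothetical, chosen only for illustration.

```python
import re


def is_valid_short_form(short_form: str) -> bool:
    """Rough short-form filter: 2-10 chars, at most two words, contains a letter."""
    return (
        2 <= len(short_form) <= 10
        and len(short_form.split()) <= 2
        and short_form[0].isalnum()
        and any(ch.isalpha() for ch in short_form)
    )


def best_long_form(short_form: str, candidate: str):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest trailing span of `candidate` that contains the
    short-form characters in order, or None if no match exists.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Walk left until the character matches. The first short-form
        # character must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(text: str):
    """Find 'long form (SF)' patterns and return (short, long) tuples."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", text):
        short_form = match.group(1).strip()
        if not is_valid_short_form(short_form):
            continue
        # Search window: at most min(|SF| + 5, 2 * |SF|) words before "(".
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        words = text[: match.start()].split()
        candidate = " ".join(words[-max_words:])
        long_form = best_long_form(short_form, candidate)
        if long_form and len(long_form) > len(short_form):
            pairs.append((short_form, long_form))
    return pairs


print(extract_pairs("The World Health Organization (WHO) is a specialized agency."))
# [('WHO', 'World Health Organization')]
```

The sketch only handles the "long form (short form)" pattern and does no sentence splitting, so it should be read as an illustration of the matching idea rather than a replacement for the library.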
[Figure: Speed comparison with other abbreviation extraction libraries]
[Figure: Extraction accuracy comparison with other abbreviation extraction libraries]
For Rust, add this to your Cargo.toml:
[dependencies]
abbreviation-extractor = "0.1.3"
For Python, install the bindings with pip:
pip install abbreviation-extractor-rs
use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};
let text = "The World Health Organization (WHO) is a specialized agency.";
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs(text, options);
for pair in result {
    println!("Abbreviation: {}, Definition: {}", pair.abbreviation, pair.definition);
}
from abbreviation_extractor import extract_abbreviation_definition_pairs
text = "The World Health Organization (WHO) is a specialized agency."
result = extract_abbreviation_definition_pairs(text)
for pair in result:
    print(f"Abbreviation: {pair.abbreviation}, Definition: {pair.definition}")
from abbreviation_extractor import extract_abbreviation_definition_pairs
text = "The World Health Organization (WHO) is a specialized agency. The World Heritage Organization (WHO) is different."
# Get only the most common definition for each abbreviation
result = extract_abbreviation_definition_pairs(text, most_common_definition=True)
# Get only the first definition for each abbreviation
result = extract_abbreviation_definition_pairs(text, first_definition=True)
# Disable tokenization (if the input is already tokenized)
result = extract_abbreviation_definition_pairs(text, tokenize=False)
# Combine options
result = extract_abbreviation_definition_pairs(text, most_common_definition=True, tokenize=True)
for pair in result:
    print(f"Abbreviation: {pair.abbreviation}, Definition: {pair.definition}")
use abbreviation_extractor::{extract_abbreviation_definition_pairs, AbbreviationOptions};
let text = "The World Health Organization (WHO) is a specialized agency. The World Heritage Organization (WHO) is different.";
// Get only the most common definition for each abbreviation
let options = AbbreviationOptions::new(true, false, true);
let result = extract_abbreviation_definition_pairs(text, options);
// Get only the first definition for each abbreviation
let options = AbbreviationOptions::new(false, true, true);
let result = extract_abbreviation_definition_pairs(text, options);
// Disable tokenization (if the input is already tokenized)
let options = AbbreviationOptions::new(false, false, false);
let result = extract_abbreviation_definition_pairs(text, options);
for pair in result {
    println!("Abbreviation: {}, Definition: {}", pair.abbreviation, pair.definition);
}
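The difference between the "first definition" and "most common definition" options can be illustrated with a small pure-Python sketch. It works on plain `(abbreviation, definition)` tuples and is only an assumption about what these options mean; the helper names `first_definitions` and `most_common_definitions` are hypothetical, and the library's own tie-breaking behavior is not documented here.

```python
from collections import Counter


def first_definitions(pairs):
    """Keep the first definition seen for each abbreviation."""
    first = {}
    for abbreviation, definition in pairs:
        first.setdefault(abbreviation, definition)
    return first


def most_common_definitions(pairs):
    """Keep the most frequently seen definition for each abbreviation."""
    counts = {}
    for abbreviation, definition in pairs:
        counts.setdefault(abbreviation, Counter())[definition] += 1
    return {
        abbreviation: counter.most_common(1)[0][0]
        for abbreviation, counter in counts.items()
    }


# Hypothetical extraction results in document order.
sample_pairs = [
    ("WHO", "World Health Organization"),
    ("WHO", "World Heritage Organization"),
    ("WHO", "World Heritage Organization"),
]

print(first_definitions(sample_pairs))        # {'WHO': 'World Health Organization'}
print(most_common_definitions(sample_pairs))  # {'WHO': 'World Heritage Organization'}
```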
For processing multiple texts in parallel, you can use the extract_abbreviation_definition_pairs_parallel function:
use abbreviation_extractor::{extract_abbreviation_definition_pairs_parallel, AbbreviationOptions};
let texts = vec![
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
];
let options = AbbreviationOptions::default();
let result = extract_abbreviation_definition_pairs_parallel(texts, options);
for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}
from abbreviation_extractor import extract_abbreviation_definition_pairs_parallel
texts = [
    "The World Health Organization (WHO) is a specialized agency.",
    "The United Nations (UN) works closely with WHO.",
    "The European Union (EU) is a political and economic union.",
]
result = extract_abbreviation_definition_pairs_parallel(texts)
for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
For extracting abbreviations from large files, you can use the extract_abbreviations_from_file function:
use abbreviation_extractor::{extract_abbreviations_from_file, AbbreviationOptions, FileExtractionOptions};
let file_path = "path/to/your/large/file.txt";
let abbreviation_options = AbbreviationOptions::default();
let file_options = FileExtractionOptions::default();
let result = extract_abbreviations_from_file(file_path, abbreviation_options, file_options);
for extraction in result.extractions {
    println!("Abbreviation: {}, Definition: {}", extraction.abbreviation, extraction.definition);
}
from abbreviation_extractor import extract_abbreviations_from_file
file_path = "path/to/your/large/file.txt"
result = extract_abbreviations_from_file(file_path)
for extraction in result.extractions:
    print(f"Abbreviation: {extraction.abbreviation}, Definition: {extraction.definition}")
You can customize the file extraction process by specifying additional parameters:
result = extract_abbreviations_from_file(
    file_path,
    most_common_definition=True,
    first_definition=False,
    tokenize=True,
    num_threads=4,
    show_progress=True,
    chunk_size=2048 * 1024,  # 2 MB chunks
)
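To make the `chunk_size` and `num_threads` parameters more concrete, here is a rough pure-Python sketch of one way chunked, multi-threaded file processing could work. This is an assumption for illustration, not a description of how the library is implemented. `read_chunks`, `process_file`, and `find_parenthesized` are hypothetical names, and the worker is a trivial stand-in for a real extractor.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def read_chunks(path, chunk_size):
    """Yield text chunks of roughly chunk_size bytes, each ending on a line boundary."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Extend to the end of the current line so no line is split.
            chunk += f.readline()
            yield chunk.decode("utf-8")


def find_parenthesized(chunk):
    """Stand-in worker: return short alphabetic tokens found in parentheses."""
    return re.findall(r"\(([A-Za-z]{2,10})\)", chunk)


def process_file(path, worker, num_threads=4, chunk_size=2048 * 1024):
    """Run `worker` over each chunk in a thread pool and flatten the results in order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        per_chunk = pool.map(worker, read_chunks(path, chunk_size))
        return [item for items in per_chunk for item in items]
```

Because of Python's global interpreter lock, threads in this sketch do not give true CPU parallelism; it only shows how a file can be split on line boundaries and the per-chunk results merged in order.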
The comparison below shows how abbreviation-extractor performs against other libraries, namely Schwartz-Hearst and ScispaCy, in terms of accuracy and speed.
| Abbreviation | Ground Truth | abbreviation-extractor (This Library) | abbreviation-extraction | ScispaCy |
|---|---|---|---|---|
| '3-meAde' | '3-methyl-adenine' | '3-methyl-adenine' | '3-methyl-adenine' | 'N/A' |
| '5'UTR' | '5' untranslated region' | '5' untranslated region' | 'N/A' | 'N/A' |
| '5LO' | '5-lipoxygenase' | '5-lipoxygenase' | '5-lipoxygenase' | 'N/A' |
| 'AAV' | 'adeno-associated virus' | 'adeno-associated virus' | 'associated virus' | 'adeno-associated virus' |
| 'ACP' | 'Enoyl-acyl carrier protein' | 'Enoyl-acyl carrier protein' | 'acyl carrier protein' | 'Enoyl-acyl carrier protein' |
| 'ADIOL' | '5-androstene-3beta, 17beta-diol' | '5-androstene-3beta, 17beta-diol' | 'androstene-3beta, 17beta-diol' | '5-androstene-3beta, 17beta-diol' |
| 'cAMP' | 'cyclic AMP' | 'cyclic AMP' | 'N/A' | |
| 'ALAD' | '5-aminolaevulinic acid dehydratase' | '5-aminolaevulinic acid dehydratase' | 'N/A' | '5-aminolaevulinic acid dehydratase' |
| 'AMPK' | 'AMP-activated protein kinase' | 'AMP-activated protein kinase' | 'N/A' | 'AMP-activated protein kinase' |
| 'AP' | 'apurinic/apyrimidinic site' | 'apurinic/apyrimidinic site' | 'apyrimidinic site' | 'apurinic/apyrimidinic site' |
| 'AcCoA' | 'acetyl coenzyme A' | 'acetyl coenzyme A' | 'N/A' | 'acetyl coenzyme A' |
| 'Ahr' | 'aryl hydrocarbon receptor' | 'aryl hydrocarbon receptor' | 'N/A' | 'aryl hydrocarbon receptor' |
| 'BD' | 'binding domain' | 'binding domain' | 'N/A' | 'binding domain' |
| '8-OxoG' | '7,8-dihydro-8-oxoguanine' | '7,8-dihydro-8-oxoguanine' | '8-oxoguanine' | 'N/A' |
| 'dsRNA' | 'double-stranded RNA' | 'double-stranded RNA' | 'double-stranded RNA' | 'N/A' |
| 'BERI' | 'Biomolecular Engineering Research Institute' | 'Biomolecular Engineering Research Institute' | 'N/A' | 'Biomolecular Engineering Research Institute' |
| 'CTLs' | 'cytotoxic T lymphocytes' | 'cytotoxic T lymphocytes' | 'N/A' | 'N/A' |
| 'C-RBD' | 'C-terminal RNA binding domain' | 'C-terminal RNA binding domain' | 'N/A' | 'C-terminal RNA binding domain' |
| 'CAP' | 'cyclase-associated protein' | 'cyclase-associated protein' | 'N/A' | 'cyclase-associated protein' |
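One simple way to turn a table like this into a score is exact-match accuracy against the ground truth. The sketch below applies it to four rows copied from the table above. The scoring rule and the helper name `exact_match_accuracy` are assumptions for illustration; the README does not state how accuracy was measured, and four rows say nothing about overall performance.

```python
def exact_match_accuracy(ground_truth, predicted):
    """Fraction of abbreviations whose predicted definition equals the ground truth."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for abbr, definition in ground_truth.items()
        if predicted.get(abbr) == definition
    )
    return correct / len(ground_truth)


# Four rows taken from the table above; None stands for 'N/A'.
ground_truth = {
    "AAV": "adeno-associated virus",
    "AP": "apurinic/apyrimidinic site",
    "AMPK": "AMP-activated protein kinase",
    "8-OxoG": "7,8-dihydro-8-oxoguanine",
}
predictions = {
    "abbreviation-extractor": dict(ground_truth),
    "abbreviation-extraction": {
        "AAV": "associated virus",
        "AP": "apyrimidinic site",
        "AMPK": None,
        "8-OxoG": "8-oxoguanine",
    },
    "ScispaCy": {
        "AAV": "adeno-associated virus",
        "AP": "apurinic/apyrimidinic site",
        "AMPK": "AMP-activated protein kinase",
        "8-OxoG": None,
    },
}

for library, predicted in predictions.items():
    print(f"{library}: {exact_match_accuracy(ground_truth, predicted):.2f}")
```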
For detailed API documentation, please refer to the Rust docs or the Python module docstrings.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
This library is based on the Schwartz-Hearst algorithm:
The implementation is inspired by the original Python variant by Phil Gooch: abbreviation-extractor