underthesea-core

PyPI package (pip), version 3.3.0, 1 maintainer

Underthesea Core

Requires Python 3.10+.

Underthesea Core is an extension of the Underthesea natural language processing library. It provides data preprocessing tools and trainable machine learning models, implemented in Rust for speed and exposed through Python bindings so that they can be used from existing Python projects.

Installation

pip install underthesea-core

Version

Current version: 2.0.0

What's New in 2.0.0

  • L-BFGS optimizer with OWL-QN for L1 regularization
  • 10x faster feature lookup with flat data structure
  • 1.24x faster than python-crfsuite for word segmentation
  • Loop unrolling and unsafe bounds-check elimination for performance
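
Some background on the first item: the L1 penalty is not differentiable at zero, so plain L-BFGS cannot be applied to it directly. OWL-QN handles this with a pseudo-gradient that picks a one-sided derivative for each weight. The sketch below is the textbook definition written in plain Python, not the library's Rust code; the function name pseudo_gradient and its arguments are hypothetical.

```python
def pseudo_gradient(weights, grad, l1_penalty):
    """OWL-QN pseudo-gradient of loss(w) + l1_penalty * sum(|w_i|).

    `grad` is the gradient of the smooth part (loss plus any L2 term).
    """
    result = []
    for w, g in zip(weights, grad):
        if w > 0:
            result.append(g + l1_penalty)
        elif w < 0:
            result.append(g - l1_penalty)
        elif g + l1_penalty < 0:
            # Right-hand derivative is negative: increasing w lowers the objective.
            result.append(g + l1_penalty)
        elif g - l1_penalty > 0:
            # Left-hand derivative is positive: decreasing w lowers the objective.
            result.append(g - l1_penalty)
        else:
            # Zero is a minimum along this coordinate, so the weight stays at zero.
            result.append(0.0)
    return result
```

A weight at zero whose smooth gradient is smaller in magnitude than the penalty gets a zero pseudo-gradient, which is why L1 regularization tends to produce sparse models.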

Usage

CRFTrainer

Train a CRF model with L-BFGS optimization:

from underthesea_core import CRFTrainer

# Prepare training data
# X: list of sequences, each sequence is a list of feature lists (one per token)
# y: list of label sequences
X_train = [
    [["word=Tôi", "is_upper=False"], ["word=yêu", "is_upper=False"], ["word=Việt", "is_upper=True"], ["word=Nam", "is_upper=True"]],
    [["word=Hà", "is_upper=True"], ["word=Nội", "is_upper=True"], ["word=đẹp", "is_upper=False"]],
]
y_train = [
    ["O", "O", "B-LOC", "I-LOC"],
    ["B-LOC", "I-LOC", "O"],
]

# Create trainer with L-BFGS optimizer
trainer = CRFTrainer(
    loss_function="lbfgs",  # L-BFGS with OWL-QN (recommended)
    l1_penalty=1.0,         # L1 regularization
    l2_penalty=0.001,       # L2 regularization
    max_iterations=100,
    verbose=1
)

# Train and get model
model = trainer.train(X_train, y_train)
print(f"Labels: {model.get_labels()}")
print(f"Features: {model.num_state_features()}")

# Save model
model.save("ner_model.bin")
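
Writing feature lists by hand does not scale past a toy example. One possible way to build X from tokenized sentences is sketched below in plain Python. The feature names (word, is_cap, prev, next) and the helper functions are hypothetical choices made for illustration, not a feature set defined by the library.

```python
def token_features(tokens, i):
    """Return string features for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    features = [
        f"word={word}",
        # True when the first character is uppercase.
        f"is_cap={word[:1].isupper()}",
    ]
    # Neighbouring words give the model some context.
    features.append(f"prev={tokens[i - 1]}" if i > 0 else "BOS")
    features.append(f"next={tokens[i + 1]}" if i < len(tokens) - 1 else "EOS")
    return features


def sentence_features(tokens):
    """Build one feature list per token, matching the shape of one entry of X."""
    return [token_features(tokens, i) for i in range(len(tokens))]


sentences = [["Tôi", "yêu", "Việt", "Nam"], ["Hà", "Nội", "đẹp"]]
X_train = [sentence_features(tokens) for tokens in sentences]
```

The resulting X_train has the same nesting as the hand-written example: a list of sentences, each a list of per-token feature lists.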

CRFTagger

Load a trained model and make predictions:

from underthesea_core import CRFTagger, CRFModel

# Load model and create tagger
model = CRFModel.load("ner_model.bin")
tagger = CRFTagger.from_model(model)

# Or load directly
tagger = CRFTagger()
tagger.load("ner_model.bin")

# Predict labels for a sequence
features = [
    ["word=Tôi", "is_upper=False"],
    ["word=sống", "is_upper=False"],
    ["word=ở", "is_upper=False"],
    ["word=Hà", "is_upper=True"],
    ["word=Nội", "is_upper=True"],
]
labels = tagger.tag(features)
print(labels)  # ['O', 'O', 'O', 'B-LOC', 'I-LOC']

# Get labels with score
labels, score = tagger.tag_with_score(features)
print(f"Labels: {labels}, Score: {score}")

# Get marginal probabilities
marginals = tagger.marginals(features)
print(f"Marginals shape: {len(marginals)}x{len(marginals[0])}")
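
The tagger returns one BIO label per token. To get entity strings out of that, the labels have to be grouped into spans. A minimal sketch in plain Python follows; bio_to_spans is a hypothetical helper, not part of underthesea-core.

```python
def bio_to_spans(tokens, labels):
    """Group BIO labels into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a new entity.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Tôi", "sống", "ở", "Hà", "Nội"]
labels = ["O", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, labels))  # [('LOC', 'Hà Nội')]
```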

CRFFeaturizer

Extract features from tokenized sentences:

from underthesea_core import CRFFeaturizer

features = ["T[-1]", "T[0]", "T[1]"]
dictionary = {"sinh viên"}
featurizer = CRFFeaturizer(features, dictionary)
sentences = [[["sinh", "X"], ["viên", "X"], ["đi", "X"], ["học", "X"]]]
featurizer.process(sentences)
# [[['T[-1]=BOS', 'T[0]=sinh', 'T[1]=viên'],
#   ['T[-1]=sinh', 'T[0]=viên', 'T[1]=đi'],
#   ['T[-1]=viên', 'T[0]=đi', 'T[1]=học'],
#   ['T[-1]=đi', 'T[0]=học', 'T[1]=EOS']]]
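
To make the template notation concrete, the sketch below reproduces the output above in plain Python. It is an illustrative re-implementation under narrow assumptions: it handles only templates of the form T[offset], always reads the first column of each token, and ignores the dictionary argument. The name expand_templates is hypothetical.

```python
import re

TEMPLATE = re.compile(r"T\[(-?\d+)\]")


def expand_templates(templates, sentence):
    """Expand single-offset templates such as "T[-1]" for every token."""
    words = [token[0] for token in sentence]
    rows = []
    for i in range(len(words)):
        row = []
        for template in templates:
            match = TEMPLATE.fullmatch(template)
            if match is None:
                raise ValueError(f"unsupported template: {template}")
            j = i + int(match.group(1))
            if j < 0:
                value = "BOS"  # before the first token
            elif j >= len(words):
                value = "EOS"  # after the last token
            else:
                value = words[j]
            row.append(f"{template}={value}")
        rows.append(row)
    return rows


sentence = [["sinh", "X"], ["viên", "X"], ["đi", "X"], ["học", "X"]]
rows = expand_templates(["T[-1]", "T[0]", "T[1]"], sentence)
```

Each template becomes one feature string per token, with BOS and EOS standing in for positions outside the sentence.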

API Reference

CRFTrainer

| Parameter | Type | Default | Description |
|---|---|---|---|
| loss_function | str | "lbfgs" | "lbfgs" (recommended) or "perceptron" |
| l1_penalty | float | 0.0 | L1 regularization coefficient |
| l2_penalty | float | 0.01 | L2 regularization coefficient |
| max_iterations | int | 100 | Maximum training iterations |
| learning_rate | float | 0.1 | Learning rate (perceptron only) |
| averaging | bool | True | Use averaged perceptron |
| verbose | int | 1 | Verbosity (0=quiet, 1=progress, 2=detailed) |
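
As an illustration of how these parameters group by optimizer, the sketch below defines one configuration per loss function. It assumes every parameter in the table above can be passed as a keyword argument; the usage example only demonstrates this for five of them, so learning_rate and averaging as keywords are an assumption. make_trainer is a hypothetical helper.

```python
# Keyword arguments taken from the parameter table above.
LBFGS_CONFIG = {
    "loss_function": "lbfgs",
    "l1_penalty": 1.0,
    "l2_penalty": 0.01,
    "max_iterations": 100,
    "verbose": 0,
}
PERCEPTRON_CONFIG = {
    "loss_function": "perceptron",
    "learning_rate": 0.1,   # assumed keyword; documented as perceptron only
    "averaging": True,      # assumed keyword; averaged perceptron
    "max_iterations": 100,
    "verbose": 0,
}


def make_trainer(config):
    """Create a CRFTrainer from a config dict (imports the library lazily)."""
    from underthesea_core import CRFTrainer

    return CRFTrainer(**config)
```

Keeping the configurations as plain dictionaries makes it easy to log or compare them between training runs.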

CRFTagger

| Method | Description |
|---|---|
| tag(features) | Predict labels for a sequence |
| tag_with_score(features) | Predict labels with sequence score |
| marginals(features) | Get marginal probabilities |
| labels() | Get all label names |
| num_labels() | Get number of labels |
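
Marginal probabilities are useful when a confidence value per token is needed, not just the best sequence. The sketch below picks the most probable label for each token. It assumes that marginals(features) returns one row per token with one probability per label, and that the column order matches labels(); neither detail is confirmed by this page. most_likely_labels is a hypothetical helper.

```python
def most_likely_labels(marginals, label_names):
    """Pick the highest-probability label for each token.

    Assumes marginals[t][k] is the probability of label_names[k] at token t.
    """
    results = []
    for row in marginals:
        best = max(range(len(row)), key=row.__getitem__)
        results.append((label_names[best], row[best]))
    return results


# Made-up probabilities for two tokens and three labels.
example = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(most_likely_labels(example, ["O", "B-LOC", "I-LOC"]))
# [('O', 0.7), ('B-LOC', 0.8)]
```

With a real tagger the call would look like most_likely_labels(tagger.marginals(features), tagger.labels()), subject to the ordering assumption above.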

CRFModel

| Method | Description |
|---|---|
| save(path) | Save model to file |
| load(path) | Load model from file |
| get_labels() | Get all label names |
| num_state_features() | Get number of state features |
| num_transition_features() | Get number of transition features |
