
instant-segment
Instant Segment is a fast Apache-2.0 library for English word segmentation. It is based on the Python wordsegment project written by Grant Jenks, which is in turn based on code from Peter Norvig's chapter Natural Language Corpus Data from the book Beautiful Data (Segaran and Hammerbacher, 2009).
For the microbenchmark included in this repository, Instant Segment is ~500x faster than the Python implementation. The API was carefully constructed so that multiple segmentations can share the underlying state to allow parallel usage.
Instant Segment splits a string into words by selecting the splits with the highest probability, given a corpus of words and their occurrence counts.
For instance, provided that choose and spain occur more frequently than
chooses and pain, and that the pair choose spain occurs more frequently
than chooses pain, Instant Segment can help identify the domain
choosespain.com as ChooseSpain.com, which more likely matches user intent.
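The selection described above can be illustrated with a small, self-contained Python sketch. It does not use the Instant Segment API and is not how the library is implemented. It is a toy version of the general unigram/bigram approach, and the counts, the unknown-word penalty, and the names `UNIGRAMS`, `BIGRAMS`, `word_score`, and `segment` are all assumptions made up for this example.

```python
import math
from functools import lru_cache

# Made-up counts for illustration only; a real corpus is far larger.
UNIGRAMS = {"choose": 500.0, "spain": 300.0, "chooses": 50.0, "pain": 200.0}
BIGRAMS = {("choose", "spain"): 20.0, ("chooses", "pain"): 1.0}
TOTAL = sum(UNIGRAMS.values())
MAX_WORD_LEN = max(len(w) for w in UNIGRAMS)


def word_score(word, prev=None):
    """Log-probability of `word`, conditioned on `prev` when the pair is known."""
    if prev is not None and prev in UNIGRAMS and (prev, word) in BIGRAMS:
        return math.log(BIGRAMS[(prev, word)] / UNIGRAMS[prev])
    if word in UNIGRAMS:
        return math.log(UNIGRAMS[word] / TOTAL)
    # Hypothetical penalty: unknown words get less likely as they get longer.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Return the highest-scoring split of `text` under the toy counts."""

    @lru_cache(maxsize=None)
    def best(start, prev):
        if start == len(text):
            return 0.0, ()
        candidates = []
        last_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            rest_score, rest_words = best(end, word)
            score = word_score(word, prev) + rest_score
            candidates.append((score, (word,) + rest_words))
        return max(candidates)

    return list(best(0, None)[1])


print(segment("choosespain"))  # prints ['choose', 'spain']
```

With these counts, "choose spain" scores about -3.96 in log-probability and "chooses pain" about -6.96, so the first split wins. Changing the counts so that the second pair is more frequent would flip the result.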
Read about how we built and improved Instant Segment for use in production at Instant Domain Search to help our users find relevant domains they can register.
Python:
pip install instant-segment
Rust (Cargo.toml):
[dependencies]
instant-segment = "0.8.1"
The following examples expect unigrams and bigrams to exist. See the examples (Rust, Python) for how to construct these objects.
import instant_segment
segmenter = instant_segment.Segmenter(unigrams, bigrams)
search = instant_segment.Search()
segmenter.segment("instantdomainsearch", search)
print([word for word in search])
--> ['instant', 'domain', 'search']
use instant_segment::{Search, Segmenter};
use std::collections::HashMap;
let segmenter = Segmenter::new(unigrams, bigrams);
let mut search = Search::default();
let words = segmenter
.segment("instantdomainsearch", &mut search)
.unwrap();
println!("{:?}", words.collect::<Vec<&str>>())
--> ["instant", "domain", "search"]
Check out the tests for more thorough examples: Rust, Python
To run the tests, run:
cargo t -p instant-segment --all-features
You can also test the Python bindings with:
make test-python