
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
Hint: If you're interested in state-of-the-art voice solutions you might also want to have a look at Linguflex, the original project from which stream2sentence is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.
```shell
pip install stream2sentence
```
Pass a generator of characters or text chunks to generate_sentences() to get a generator of sentences in return. Here's a basic example:
```python
from stream2sentence import generate_sentences

# Dummy generator for demonstration
def dummy_generator():
    yield "This is a sentence. And here's another! Yet, "
    yield "there's more. This ends now."

for sentence in generate_sentences(dummy_generator()):
    print(sentence)
```
This will output:
This is a sentence.
And here's another!
Yet, there's more.
This ends now.
One main use case of this library is to enable fast text-to-speech synthesis on character feeds generated by large language models: it provides the fastest possible access to a complete sentence or sentence fragment (via the quick_yield_single_sentence_fragment flag), which can then be synthesized in real time. This usage is demonstrated in the test_stream_from_llm.py file in the tests directory.
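To illustrate why fragment-level yielding matters for TTS latency, here is a minimal, self-contained sketch of the quick-yield idea (this is an illustrative stand-in, not stream2sentence's actual implementation): the very first fragment is emitted as soon as any fragment delimiter appears, so speech synthesis can start before the first full sentence has even finished streaming.

```python
# Sketch of the quick-yield idea (NOT stream2sentence's implementation):
# emit the first fragment at the first fragment delimiter, then emit the
# buffered text whenever a full-sentence delimiter arrives.

FRAGMENT_DELIMITERS = set(".?!;:,\n")   # subset of the library's defaults
SENTENCE_DELIMITERS = set(".?!\n")

def quick_yield_sketch(chunks):
    buffer = ""
    first_fragment_sent = False
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if not first_fragment_sent and ch in FRAGMENT_DELIMITERS:
                # Yield the first fragment immediately so TTS can start early.
                yield buffer.strip()
                buffer = ""
                first_fragment_sent = True
            elif first_fragment_sent and ch in SENTENCE_DELIMITERS:
                yield buffer.strip()
                buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at stream end

def llm_chunks():
    # Stand-in for an LLM token/chunk stream
    yield "Hello there, this is the first sentence. "
    yield "And here is the second."

print(list(quick_yield_sketch(llm_chunks())))
# The first item is the short fragment "Hello there," — available before
# the first sentence is complete.
```

The point of the sketch is the latency shape, not the splitting quality: the real library applies context windows, minimum lengths, and a proper tokenizer on top of this basic buffering loop.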
The generate_sentences() function offers various parameters to fine-tune its behavior:
- `generator: Iterator[str]` — the source of text chunks or characters.
- `context_size: int = 12` — number of characters used to establish context for sentence boundary detection.
- `context_size_look_overhead: int = 12` — additional characters to look ahead beyond `context_size` for sentence splitting.
- `minimum_sentence_length: int = 10` — minimum number of characters a sentence must contain.
- `minimum_first_fragment_length: int = 10` — minimum number of characters required for the first sentence fragment.

These parameters control how quickly and frequently the generator yields sentence fragments:

- `quick_yield_single_sentence_fragment: bool = False` — yield the first fragment of the first sentence as quickly as possible.
- `quick_yield_for_all_sentences: bool = False` — yield the first fragment of every sentence; setting this also sets `quick_yield_single_sentence_fragment` to True.
- `quick_yield_every_fragment: bool = False` — yield every fragment; setting this also sets `quick_yield_for_all_sentences` and `quick_yield_single_sentence_fragment` to True.

Further options:

- `cleanup_text_links: bool = False` — remove hyperlinks from the stream.
- `cleanup_text_emojis: bool = False` — remove emojis from the stream.
- `tokenize_sentences: Callable = None` — custom sentence tokenization function; overrides `tokenizer`.
- `tokenizer: str = "nltk"` — sentence tokenizer to use.
- `language: str = "en"` — language used for sentence tokenization.
- `log_characters: bool = False` — log the processed characters to the console.
- `sentence_fragment_delimiters: str = ".?!;:,\n…)]}。-"` — characters treated as sentence fragment delimiters.
- `full_sentence_delimiters: str = ".?!\n…。"` — characters treated as full sentence delimiters.
- `force_first_fragment_after_words: int = 15` — force a yield of the first fragment after this many words.
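As a worked illustration of how a minimum-length rule like `minimum_sentence_length` behaves, the sketch below (an illustrative stand-in, not the library's internals) merges any sentence shorter than the threshold into the sentence that follows it, so the consumer never receives an awkwardly short snippet:

```python
# Illustrative sketch (not the library's code) of a minimum-length rule:
# sentences shorter than the threshold are merged with the sentence that
# follows them instead of being yielded on their own.

def enforce_minimum_length(sentences, minimum_sentence_length=10):
    pending = ""
    for sentence in sentences:
        candidate = (pending + " " + sentence).strip() if pending else sentence
        if len(candidate) < minimum_sentence_length:
            pending = candidate        # too short: hold and merge with the next
        else:
            yield candidate
            pending = ""
    if pending:                        # flush whatever is left at stream end
        yield pending

print(list(enforce_minimum_length(["Hi.", "Nice to meet you.", "Bye."])))
# "Hi." is merged forward; the trailing "Bye." is flushed at the end.
```

For TTS this matters because very short snippets tend to synthesize with unnatural prosody; merging them into a neighbor gives the synthesizer more context.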
Instead of a purely text-based (delimiter-driven) strategy, a time-based strategy is also available. You specify a target tokens-per-second (TPS) rate, and generate_sentences will yield the best available output (full sentence, longest fragment, or any available buffer, in that order) whenever it approaches a "deadline" at which the text already yielded would fall behind the target TPS. If the LLM is more than two full sentences ahead of the target, a sentence is yielded even ahead of the deadline.
```python
from stream2sentence.stream2sentence_time_based import generate_sentences
```
- `generator: Iterator[str]` — the source of text chunks or characters.
- `target_tps: float = 4` — target output rate in tokens per second.
- `lead_time: float = 1` — seconds of head start before output is expected to begin.
- `max_wait_for_fragments = [3, 2]`
- `min_output_lengths: int[] = [2, 3, 3, 4]`
- `preferred_sentence_fragment_delimiters: str[] = ['. ', '? ', '! ', '\n']`
- `sentence_fragment_delimiters: str[] = ['; ', ': ', ', ', '* ', '**', '– ']`
- `delimiter_ignore_prefixes: str[]`
- `wait_for_if_non_fragment: str[]`
- `deadline_offsets_static: float[] = [1]`
- `deadline_offsets_dynamic: float[] = [0]`
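The "deadline" idea behind the time-based strategy can be sketched as follows. Assuming a consumer reads the already-yielded text at `target_tps` tokens per second after an initial `lead_time`, the text runs out at a predictable moment, and the generator should flush its best available output before then. The function names and the safety margin below are illustrative assumptions, not the library's internals:

```python
# Illustrative sketch (not the library's code) of the time-based "deadline":
# if N tokens have already been emitted, a consumer reading at target_tps
# tokens/second is occupied until roughly
#   start_time + lead_time + N / target_tps,
# so more output must be flushed before that moment to avoid a gap.

def output_deadline(tokens_emitted, target_tps=4.0, lead_time=1.0, start_time=0.0):
    """Time (seconds) at which the already-emitted text runs out at target_tps."""
    return start_time + lead_time + tokens_emitted / target_tps

def should_flush(now, tokens_emitted, target_tps=4.0, lead_time=1.0,
                 safety_margin=0.25):
    """Flush the best available output once the deadline is near (margin is an assumption)."""
    return now >= output_deadline(tokens_emitted, target_tps, lead_time) - safety_margin

# With 8 tokens emitted at 4 tps and a 1 s lead time, output lasts until t = 3 s.
print(output_deadline(8))                         # 3.0
print(should_flush(now=2.9, tokens_emitted=8))    # True: deadline is close
print(should_flush(now=1.0, tokens_emitted=8))    # False: plenty is buffered
```

This also shows why a fast LLM can be throttled: when generation runs far ahead of the deadline, the library can afford to wait for a full, well-formed sentence instead of flushing a fragment.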
Any contributions you make are welcome and greatly appreciated.

1. Fork the project
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a pull request

This project is licensed under the MIT License. For more details, see the LICENSE file.
Project created and maintained by Kolja Beigel.