
text-prep-lite
Lightweight text preprocessing utilities for Natural Language Processing (NLP) written in TypeScript.
text-prep-lite provides two core helpers:
- normalizeText – clean & normalise raw text into a predictable representation.
- tokenize – break text into lowercase word tokens.

The library is intentionally dependency-free and suitable for browsers, Node.js, and serverless environments.
Natural-language data is messy. Before tokenisation or feeding text into an NLP model you often need to expand contractions, strip punctuation, and remove emojis.
text-prep-lite does those common steps with zero runtime dependencies.
npm install text-prep-lite
# or
yarn add text-prep-lite
import { normalizeText, tokenize } from "text-prep-lite";
const raw = " I can't believe it's not butter! 🧈 ";
const cleaned = normalizeText(raw, {
expandContractions: true,
removePunctuation: true,
removeEmojis: true,
});
// → "i cannot believe it is not butter"
const tokens = tokenize(raw);
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]
normalizeText(input: string, options?: NormalizeOptions): string

Returns a cleaned version of input.
NormalizeOptions:
| Option | Default | Description |
|---|---|---|
| expandContractions | false | Expand contractions for the selected locale. |
| removePunctuation | false | Strip punctuation characters. |
| removeEmojis | false | Remove Unicode emoji characters. |
| locale | 'en' | BCP-47 language tag for locale-specific rules (currently: en, sq, fr, de, he). |
Supported locales
- en – English (default)
- sq – Albanian
- fr – French
- de – German
- he – Hebrew
- es – Spanish
- zh – Chinese (Mandarin)
- yue – Chinese (Cantonese)

// French example
normalizeText("C'est incroyable!", { expandContractions: true, locale: "fr" });
// → "ce est incroyable!" (punctuation kept in this call)
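To make the option semantics concrete, here is a minimal, self-contained sketch of how a normaliser with these options could be structured. It is not the library's implementation: `sketchNormalize`, `SketchOptions`, and the tiny `SKETCH_CONTRACTIONS` table are hypothetical names introduced for illustration. The lowercasing and whitespace trimming are assumptions inferred from the example outputs above, and falling back to an empty table for unknown locales is also an assumption.

```ts
// Hypothetical option shape mirroring the documented NormalizeOptions.
interface SketchOptions {
  expandContractions?: boolean;
  removePunctuation?: boolean;
  removeEmojis?: boolean;
  locale?: string;
}

// Illustrative per-locale contraction tables (deliberately tiny, not the library's data).
const SKETCH_CONTRACTIONS: Record<string, Record<string, string>> = {
  en: { "can't": "cannot", "it's": "it is" },
  fr: { "c'est": "ce est" },
};

function sketchNormalize(input: string, options: SketchOptions = {}): string {
  // Assumption: output is always lowercased, as the examples suggest.
  let text = input.toLowerCase();

  if (options.expandContractions) {
    // Assumption: an unknown locale simply expands nothing.
    const table = SKETCH_CONTRACTIONS[options.locale || "en"] || {};
    for (const short of Object.keys(table)) {
      // Naive substring replacement; a real implementation would respect word boundaries.
      text = text.split(short).join(table[short]);
    }
  }

  if (options.removeEmojis) {
    // Matches pictographic code points only; joiners and variation selectors are not handled.
    text = text.replace(new RegExp("\\p{Extended_Pictographic}", "gu"), "");
  }

  if (options.removePunctuation) {
    text = text.replace(new RegExp("\\p{P}", "gu"), "");
  }

  // Collapse runs of whitespace and trim the ends.
  return text.replace(/\s+/g, " ").trim();
}

sketchNormalize(" I can't believe it's not butter! 🧈 ", {
  expandContractions: true,
  removePunctuation: true,
  removeEmojis: true,
});
// → "i cannot believe it is not butter"
```

Contractions are expanded before punctuation is stripped in this sketch, because removing the apostrophe first would leave "cant" and "its", which no longer match the table.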
tokenize(input: string): string[]

Returns an array of tokens.
tokenize has no options – it always lowercases, strips punctuation & emojis, and splits on whitespace.
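The behaviour described here (lowercase, strip punctuation and emojis, split on whitespace) can be approximated in a few lines. The sketch below is an assumption about one way to get that result, not the library's source; `sketchTokenize` is a hypothetical name. It replaces every character that is not a letter, combining mark, digit, or whitespace with a space, which is why an apostrophe splits "can't" into "can" and "t" as in the usage example earlier.

```ts
function sketchTokenize(input: string): string[] {
  // Anything that is not a letter, combining mark, number, or whitespace
  // (punctuation, symbols, emojis) is replaced by a space.
  const nonWord = new RegExp("[^\\p{L}\\p{M}\\p{N}\\s]+", "gu");
  return input
    .toLowerCase()
    .replace(nonWord, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0); // drop empty strings from leading/trailing spaces
}

sketchTokenize(" I can't believe it's not butter! 🧈 ");
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]
```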
👉 Need word embeddings for semantic analysis?
Check out wink-embeddings-small-en-50d
👉 Need a simple and robust PDF text extraction utility with a quality interface?
Check out [pdf-worker-package](https://www.npmjs.com/package/pdf-worker-package)
# run tests
npm test
# build library
npm run build
MIT © Cavani21/thegreatbey
- npm i
- npm test – run lint & unit tests

Please add tests for any new feature or bug-fix.