smoothtext

A Python library for readability and textual metrics analysis, supporting multiple languages.

0.4.0

Maintainers: 1

SmoothText

Introduction

SmoothText is a Python library that calculates readability scores and text statistics in multiple languages.

The library is designed with accuracy as its primary goal.

Requirements

Python Version

Python 3.10 or higher.

External Dependencies

Library	Version	License	Notes
NLTK	`>=3.9.1`	`Apache 2.0`	Conditionally optional.
Stanza	`>=1.10.1`	`Apache 2.0`	Conditionally optional.
CMUdict	`>=1.0.32`	`GPLv3+`	Required if `Stanza` is the selected backend.
Unidecode	`>=1.3.8`	`GNU GPLv2`	Required.
Pyphen	`>=0.17.0`	`GPL 2.0+/LGPL 2.1+/MPL 1.1`	Required.
emoji	`>=2.14.1`	`BSD`	Required.

At least one of NLTK or Stanza must be installed to serve as the backend for SmoothText.

Features

Readability Analysis

SmoothText can calculate readability scores of text in the following languages, using the following formulas.

Method	Description
`compute_readability`	Computes the readability score of a text using a specified formula.

English

Method	Formula	Authors	Notes
`automated_readability_index`	Automated Readability Index	Smith & Senter, 1967	-
`flesch_reading_ease`	Flesch Reading Ease	Flesch, 1948	-
`flesch_kincaid_grade`	Flesch-Kincaid Grade	Kincaid et al., 1975	-
`flesch_kincaid_grade_simplified`	Flesch-Kincaid Grade Simplified	Kincaid et al., 1975	Essentially the same as `Flesch-Kincaid Grade`, but computed with rounded constants, so the scores differ slightly.
`gunning_fog_index`	Gunning Fog Index	Gunning, 1952	-

Notes:

Although SmoothText supports both US English and GB English, formulas work best with US English.
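The English formulas listed above are all simple functions of character, word, sentence, and syllable counts. As a rough sketch using the commonly published constants, they can be written as below. The function names and count-based signatures are illustrative assumptions, not SmoothText's API, and the library's own implementation may differ in detail.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade: approximate US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index: uses characters per word instead of syllables."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def gunning_fog_index(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: complex_words are words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for a 100-word, 5-sentence passage.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
```

Because every formula depends only on counts, the accuracy of the final score is bounded by the accuracy of the sentence, word, and syllable counting that precedes it.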

German

Method	Formula	Authors	Notes
`amstad`	Flesch Reading Ease	Amstad, 1978	German adaptation of `Flesch Reading Ease`.
`wiener_sachtextformel`	Wiener Sachtextformel	Bamberger & Vanecek, 1984	German adaptation of `Flesch-Kincaid Grade`. All versions (1 through 4) are supported.
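For illustration, the Amstad adaptation keeps the structure of Flesch Reading Ease but changes the constants to suit German. The sketch below uses the commonly published form of the formula; the function name and count-based signature are assumptions and do not reflect SmoothText's API.

```python
def amstad_score(words: int, sentences: int, syllables: int) -> float:
    """Amstad's German adaptation of Flesch Reading Ease (published constants)."""
    average_sentence_length = words / sentences
    average_syllables_per_word = syllables / words
    return 180.0 - average_sentence_length - 58.5 * average_syllables_per_word


if __name__ == "__main__":
    # Hypothetical counts for a short German passage.
    print(amstad_score(words=100, sentences=5, syllables=180))
```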

Russian

Method	Formula	Authors	Notes
`matskovskiy`	Matskovskiy	Matskovskiy, 1976	Russian adaptation of `Flesch Reading Ease`.

Turkish

Method	Formula	Authors	Notes
`atesman`	Ateşman	Ateşman, 1997	Turkish adaptation of `Flesch Reading Ease`.
`bezirci_yilmaz`	Bezirci-Yılmaz	Bezirci & Yılmaz, 2010	Turkish adaptation of `Flesch-Kincaid Grade`.
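As an illustration of how the Turkish adaptation reweights the same two ratios used by Flesch Reading Ease, here is a sketch of the Ateşman formula with its commonly published constants. The function name and count-based signature are assumptions, not SmoothText's API.

```python
def atesman_score(words: int, sentences: int, syllables: int) -> float:
    """Ateşman readability score for Turkish (published constants)."""
    syllables_per_word = syllables / words
    words_per_sentence = words / sentences
    return 198.825 - 40.175 * syllables_per_word - 2.610 * words_per_sentence


if __name__ == "__main__":
    # Hypothetical counts; Turkish words tend to have more syllables than English ones.
    print(atesman_score(words=100, sentences=10, syllables=260))
```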

Sentencizing, Tokenization, and Syllabification

SmoothText can extract sentences, words, or syllables from texts.

Method	Description
Sentence Level
`sentencize`	Splits text into sentences using language-aware rules
`count_sentences`	Returns the number of sentences found in the text
Word Level
`tokenize`	Extracts word tokens from text; can group by sentences with the split_sentences flag
`count_words`	Counts the number of alphanumeric words in a text
`word_frequencies`	Returns a dictionary of word frequencies with optional lemmatization
Syllable Level
`syllabify`	Splits words into syllables; can be applied to words, tokens, or sentences
`count_syllables`	Counts syllables in words or text using language-specific rules
`syllable_frequencies`	Returns a dictionary mapping syllable counts to frequency in the analyzed text
Character Level
`count_consonants`	Counts the number of consonant characters in text
`count_vowels`	Counts the number of vowel characters in text
Emoji Handling
`demojize`	Converts emoji characters to their text descriptions with custom delimiters
`remove_emojis`	Removes all emoji characters from text

Notes

count_syllables is likely to produce more accurate results in comparison to the syllabify method.
At the moment, lemmatization is only supported for English with Stanza as the backend. Other languages and backends ignore the lemmatization flag.

Language	Sentencizing	Tokenization	Syllabification
English	✔ (`NLTK`, `Stanza`)	✔ (`NLTK`, `Stanza`)	✔ (`CMU Dictionary`, `Pyphen`)
German	✔ (`NLTK`, `Stanza`)	✔ (`NLTK`, `Stanza`)	✔ (`Pyphen`)
Russian	✔ (`NLTK`, `Stanza`)	✔ (`NLTK`, `Stanza`)	✔ (`Pyphen`)
Turkish	✔ (`NLTK`, `Stanza`)	✔ (`NLTK`, `Stanza`)	✔ (Custom formula)

Pyphen is a hyphenation library, so its output does not always match true syllable boundaries. Thus, whenever possible, custom syllabification formulas or dictionaries are preferred.
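The custom formula used for Turkish is not described here. One well-known rule that such a formula can rely on is that every Turkish syllable contains exactly one vowel, so counting vowels yields the syllable count. The sketch below illustrates that rule only; the names are hypothetical and this is not SmoothText's implementation.

```python
# Turkish vowels, including the circumflexed forms found in some loanwords.
TURKISH_VOWELS = frozenset("aeıioöuüâîû")


def count_syllables_tr(word: str) -> int:
    """Count syllables in a Turkish word by counting its vowels."""
    # str.lower() maps "I" to "i" rather than "ı", but both are vowels,
    # so the count is unaffected.
    return sum(1 for character in word.lower() if character in TURKISH_VOWELS)


if __name__ == "__main__":
    for word in ("merhaba", "kitap", "öğretmen"):
        print(word, count_syllables_tr(word))
```

A rule like this needs no dictionary lookup, which is why it can be more dependable than a general-purpose hyphenator for a language with regular syllable structure.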

Reading Time

SmoothText can calculate how long a text would take to read. The reading time is calculated based on the average reading speed of an adult.

Method	Description
`reading_aloud_time`	Calculates the time needed to read a text aloud.
`reading_time`	Calculates the reading time of a text.
`silent_reading_time`	Calculates the time needed to read a text silently.
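Conceptually, a reading-time estimate divides the word count by a reading speed. The sketch below shows that arithmetic; the function name and the two speed constants are illustrative assumptions and are not taken from SmoothText.

```python
# Assumed average adult reading speeds in words per minute (illustrative only).
ASSUMED_SILENT_WPM = 240.0
ASSUMED_ALOUD_WPM = 180.0


def reading_time_seconds(word_count: int, words_per_minute: float) -> float:
    """Estimate the reading time in seconds for a given word count and speed."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute * 60.0


if __name__ == "__main__":
    word_count = 480  # hypothetical text length
    print(reading_time_seconds(word_count, ASSUMED_SILENT_WPM))
    print(reading_time_seconds(word_count, ASSUMED_ALOUD_WPM))
```

Reading aloud is slower than silent reading, so the same text yields a longer estimate when the aloud speed is used.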

Installation

You can install SmoothText via pip.

pip install smoothtext

Usage

Importing and Initializing the Library

SmoothText comes with four submodules: Backend, Language, ReadabilityFormula and SmoothText.

from smoothtext import Backend, Language, ReadabilityFormula, SmoothText

Instancing

Apart from the static SmoothText.prepare() helper, SmoothText exposes its functionality through instance methods. Thus, an instance must be created to access them.

When creating an instance, the language and the backend to be used with it can be specified.

The following will create a new SmoothText instance configured to be used with the English language (by default, the United States variant) using NLTK as the backend.

st = SmoothText('en', 'nltk')

Once an instance is created, its backend cannot be changed, but its working language can be changed at any time.

st.language = 'tr'  # Now configured to work with Turkish.
st.language = 'en-gb'  # Switching back to English, but to the United Kingdom variant.

Readying the Backends

When an instance is created, the instance will first attempt to import and download the required backend/language data. To avoid this, and to prepare the required packages in advance, we can use the static SmoothText.prepare() method.

SmoothText.prepare('nltk', 'en,tr')  # Preparing NLTK to be used with English and Turkish

Computing Readability Scores

Each language has its own set of readability formulas. When computing the readability score of a text in a language, one of the formulas supported for that language must be used. Using SmoothText, there are three ways to perform this calculation.

text: str = 'Forrest Gump is a 1994 American comedy-drama film directed by Robert Zemeckis.'  # https://en.wikipedia.org/wiki/Forrest_Gump

# Generic computation method
st.compute_readability(text, ReadabilityFormula.Flesch_Reading_Ease)

# Using instance as a callable for generic computation
st(text, ReadabilityFormula.Flesch_Reading_Ease)

# Specific formula method
st.flesch_reading_ease(text)
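The three call styles above can all route to the same formula-specific method. The sketch below shows one way such a dispatch could be arranged, using hypothetical stand-in classes (ToyFormula and ToyScorer) and deliberately naive counting. It illustrates the pattern only and is not SmoothText's implementation.

```python
import re
from enum import Enum


class ToyFormula(Enum):
    """Hypothetical stand-in for ReadabilityFormula."""

    FLESCH_READING_EASE = "flesch_reading_ease"


class ToyScorer:
    """Hypothetical stand-in for SmoothText; only the dispatch pattern matters."""

    def flesch_reading_ease(self, text: str) -> float:
        # Naive counts: words are runs of ASCII letters, sentences end at . ! ?,
        # and each run of vowels in a word counts as one syllable.
        words = re.findall(r"[A-Za-z]+", text)
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        syllables = sum(
            max(1, len(re.findall(r"[aeiouy]+", word.lower()))) for word in words
        )
        word_count = max(1, len(words))
        return (
            206.835
            - 1.015 * (word_count / sentences)
            - 84.6 * (syllables / word_count)
        )

    def compute_readability(self, text: str, formula: ToyFormula) -> float:
        # Look up the formula-specific method by the enum member's value.
        return getattr(self, formula.value)(text)

    def __call__(self, text: str, formula: ToyFormula) -> float:
        # Calling the instance is shorthand for compute_readability.
        return self.compute_readability(text, formula)


if __name__ == "__main__":
    scorer = ToyScorer()
    sample = "The cat sat."
    print(scorer.flesch_reading_ease(sample))
    print(scorer(sample, ToyFormula.FLESCH_READING_EASE))
```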

Tokenizing and Calculating Text Statistics

SmoothText is designed to work with sentences, words/tokens, and syllables.
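To make these three levels concrete, the following standard-library sketch shows what naive sentence splitting, word tokenization, and word-frequency counting look like. The helper names are hypothetical and the regular expressions are far cruder than the NLTK or Stanza backends; the sketch is meant only to show the kind of output that methods such as sentencize, tokenize, and word_frequencies are described as producing.

```python
import re
from collections import Counter


def naive_sentences(text: str) -> list[str]:
    """Split text after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def naive_words(text: str) -> list[str]:
    """Extract lower-cased alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def naive_word_frequencies(text: str) -> dict[str, int]:
    """Map each word to the number of times it occurs."""
    return dict(Counter(naive_words(text)))


if __name__ == "__main__":
    sample = "The cat sat. The dog ran!"
    print(naive_sentences(sample))
    print(naive_words(sample))
    print(naive_word_frequencies(sample))
```

Rules this simple break on abbreviations, decimal numbers, and quoted speech, which is the kind of case language-aware backends are meant to handle.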

Other Features

Refer to the documentation for a complete list of available methods.

Inconsistencies

Backend Related Inconsistencies

NLTK and Stanza have different tokenization rules. This may cause differences in the number of tokens/sentences between the two backends.

Language Related Inconsistencies

The syllabification of a word may differ between variants of the same language. For example, Pyphen hyphenates "hello" into two parts with its American English dictionary but leaves it unsplit with its British English one, so the word is counted as two syllables in one variant and as one in the other. See the code snippet below.
- To avoid this as much as possible, CMUdict is the default syllabification method for English. When it is not available, Pyphen is used as a fallback.

from pyphen import Pyphen

us = Pyphen(lang="en_US")
print(us.inserted("hello"))
# Output: 'hel-lo'

gb = Pyphen(lang="en_GB")
print(gb.inserted("hello"))
# Output: 'hello'

Documentation

See here for API documentation.

License

SmoothText has an MIT license. See LICENSE.
