smoothtext

A Python library for readability and textual metrics analysis, supporting multiple languages.

0.4.0

SmoothText

Introduction

SmoothText is a Python library for calculating readability scores and text statistics in multiple languages.

The design principle of this library is to ensure high accuracy.

Requirements

Python Version

Python 3.10 or higher.

External Dependencies

| Library | Version | License | Notes |
| --- | --- | --- | --- |
| NLTK | >=3.9.1 | Apache 2.0 | Conditionally optional. |
| Stanza | >=1.10.1 | Apache 2.0 | Conditionally optional. |
| CMUdict | >=1.0.32 | GPLv3+ | Required if Stanza is the selected backend. |
| Unidecode | >=1.3.8 | GNU GPLv2 | Required. |
| Pyphen | >=0.17.0 | GPL 2.0+/LGPL 2.1+/MPL 1.1 | Required. |
| emoji | >=2.14.1 | BSD | Required. |

Either NLTK or Stanza must be installed and used with the SmoothText library.
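
Because at least one backend has to be present, it can help to check which ones are importable before creating an instance. The following is a minimal sketch using only the standard library. The helper name `available_backends` is hypothetical and not part of SmoothText.

```python
from importlib.util import find_spec


def available_backends() -> list[str]:
    """Return the backend packages that can be imported in this environment."""
    # "nltk" and "stanza" are the import names of the two backend packages.
    # available_backends is a hypothetical helper, not a SmoothText function.
    return [name for name in ("nltk", "stanza") if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_backends()
    if found:
        print("Available backends:", ", ".join(found))
    else:
        print("Install NLTK or Stanza before creating a SmoothText instance.")
```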

Features

Readability Analysis

SmoothText can calculate readability scores of text in the following languages, using the following formulas.

| Method | Description |
| --- | --- |
| compute_readability | Computes the readability score of a text using a specified formula. |

English

| Method | Formula | Authors | Notes |
| --- | --- | --- | --- |
| automated_readability_index | Automated Readability Index | Smith & Senter, 1967 | - |
| flesch_reading_ease | Flesch Reading Ease | Flesch, 1948 | - |
| flesch_kincaid_grade | Flesch-Kincaid Grade | Kincaid et al., 1975 | - |
| flesch_kincaid_grade_simplified | Flesch-Kincaid Grade Simplified | Kincaid et al., 1975 | Essentially the same as Flesch-Kincaid Grade, but the formula uses rounded constants, so the output may differ slightly. |
| gunning_fog_index | Gunning Fog Index | Gunning, 1952 | - |

Notes:

  • Although SmoothText supports both US English and GB English, formulas work best with US English.
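
The English formulas above reduce to arithmetic over a few counts. As a minimal sketch, assuming the word, sentence, syllable, and character counts are already known, the commonly published forms can be written as follows. The function names mirror the method names in the table, but these are standalone illustrations. SmoothText's own implementation may count and round differently.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade: approximate US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index: based on characters instead of syllables."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def gunning_fog_index(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog Index: complex_words are words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Illustrative counts only, not taken from a real text.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
```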

German

| Method | Formula | Authors | Notes |
| --- | --- | --- | --- |
| amstad | Flesch Reading Ease | Amstad, 1978 | German adaptation of Flesch Reading Ease. |
| wiener_sachtextformel | Wiener Sachtextformel | Bamberger & Vanecek, 1984 | German adaptation of Flesch-Kincaid Grade. All versions (1 through 4) are supported. |
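
For orientation, Amstad's adaptation keeps the shape of Flesch Reading Ease and changes the constants. A minimal sketch of the commonly published form, assuming precomputed counts, is shown below. It is an illustration and not SmoothText's code.

```python
def amstad(words: int, sentences: int, syllables: int) -> float:
    """Amstad's German adaptation of Flesch Reading Ease."""
    average_sentence_length = words / sentences
    average_syllables_per_word = syllables / words
    return 180.0 - average_sentence_length - 58.5 * average_syllables_per_word
```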

Russian

| Method | Formula | Authors | Notes |
| --- | --- | --- | --- |
| matskovskiy | Matskovskiy | Matskovskiy, 1976 | Russian adaptation of Flesch Reading Ease. |

Turkish

| Method | Formula | Authors | Notes |
| --- | --- | --- | --- |
| atesman | Ateşman | Ateşman, 1997 | Turkish adaptation of Flesch Reading Ease. |
| bezirci_yilmaz | Bezirci-Yılmaz | Bezirci & Yılmaz, 2010 | Turkish adaptation of Flesch-Kincaid Grade. |
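
The Ateşman formula follows the same pattern with constants fitted for Turkish. A minimal sketch of the commonly published form, assuming precomputed counts, is shown below. It is an illustration and not SmoothText's code.

```python
def atesman(words: int, sentences: int, syllables: int) -> float:
    """Ateşman readability score for Turkish: higher scores mean easier text."""
    average_syllables_per_word = syllables / words
    average_words_per_sentence = words / sentences
    return 198.825 - 40.175 * average_syllables_per_word - 2.610 * average_words_per_sentence
```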

Sentencizing, Tokenization, and Syllabification

SmoothText can extract sentences, words, or syllables from texts.

| Level | Method | Description |
| --- | --- | --- |
| Sentence | sentencize | Splits text into sentences using language-aware rules |
| Sentence | count_sentences | Returns the number of sentences found in the text |
| Word | tokenize | Extracts word tokens from text; can group by sentences with the split_sentences flag |
| Word | count_words | Counts the number of alphanumeric words in a text |
| Word | word_frequencies | Returns a dictionary of word frequencies with optional lemmatization |
| Syllable | syllabify | Splits words into syllables; can be applied to words, tokens, or sentences |
| Syllable | count_syllables | Counts syllables in words or text using language-specific rules |
| Syllable | syllable_frequencies | Returns a dictionary mapping syllable counts to frequency in the analyzed text |
| Character | count_consonants | Counts the number of consonant characters in text |
| Character | count_vowels | Counts the number of vowel characters in text |
| Emoji handling | demojize | Converts emoji characters to their text descriptions with custom delimiters |
| Emoji handling | remove_emojis | Removes all emoji characters from text |

Notes

  • count_syllables is likely to produce more accurate results than the syllabify method.
  • At the moment, lemmatization is only supported for English with Stanza as the backend. Other languages and backends ignore the lemmatization flag.

| Language | Sentencizing | Tokenization | Syllabification |
| --- | --- | --- | --- |
| English | NLTK, Stanza | NLTK, Stanza | CMU Dictionary, Pyphen |
| German | NLTK, Stanza | NLTK, Stanza | Pyphen |
| Russian | NLTK, Stanza | NLTK, Stanza | Pyphen |
| Turkish | NLTK, Stanza | NLTK, Stanza | Custom formula |

Pyphen does not always produce accurate results. Thus, whenever possible, custom syllabification formulas or dictionaries are preferred.
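
The source does not show the custom formula used for Turkish. One well-known rule that such a formula could build on is that every Turkish syllable contains exactly one vowel, so counting vowels gives the syllable count. The sketch below illustrates that rule only. The names `TURKISH_VOWELS` and `count_turkish_syllables` are hypothetical, and SmoothText's actual approach may differ.

```python
# Turkish vowels, including the circumflexed forms found in some loanwords.
TURKISH_VOWELS = frozenset("aeıioöuüâîû")


def count_turkish_syllables(word: str) -> int:
    """Count syllables in a Turkish word by counting its vowels."""
    # str.lower() is not Turkish-aware, so map the dotted and dotless
    # capital letters explicitly before lowercasing the rest.
    lowered = word.replace("İ", "i").replace("I", "ı").lower()
    return sum(1 for character in lowered if character in TURKISH_VOWELS)


if __name__ == "__main__":
    for word in ("merhaba", "okul", "öğretmen"):
        print(word, count_turkish_syllables(word))
```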

Reading Time

SmoothText can calculate how long a text would take to read. The reading time is calculated based on the average reading speed of an adult.

| Method | Description |
| --- | --- |
| reading_aloud_time | Calculates the time needed to read a text aloud. |
| reading_time | Calculates the reading time of a text. |
| silent_reading_time | Calculates the silent reading time. |
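
Conceptually, a reading-time estimate divides the word count by a reading speed. The sketch below shows that arithmetic, assuming a word count is already available. The function name and the words-per-minute values are illustrative placeholders, not SmoothText's defaults.

```python
def estimate_reading_time(word_count: int, words_per_minute: float) -> float:
    """Return the estimated reading time in seconds."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute * 60.0


if __name__ == "__main__":
    # Hypothetical speeds: silent reading is usually faster than reading aloud.
    print(estimate_reading_time(480, words_per_minute=240))  # silent
    print(estimate_reading_time(480, words_per_minute=180))  # aloud
```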

Installation

You can install SmoothText via pip.

pip install smoothtext

Usage

Importing and Initializing the Library

SmoothText exposes four main components, all importable from the top-level package: Backend, Language, ReadabilityFormula, and SmoothText.

from smoothtext import Backend, Language, ReadabilityFormula, SmoothText

Instancing

SmoothText was not designed to be used with static methods. Thus, an instance must be created to access its methods.

When creating an instance, the language and the backend to be used with it can be specified.

The following will create a new SmoothText instance configured to be used with the English language (by default, the United States variant) using NLTK as the backend.

st = SmoothText('en', 'nltk')

Once an instance is created, its backend cannot be changed, but its working language can be changed at any time.

st.language = 'tr'  # Now configured to work with Turkish.
st.language = 'en-gb'  # Switching back to English, but to the United Kingdom variant.

Readying the Backends

When an instance is created, the instance will first attempt to import and download the required backend/language data. To avoid this, and to prepare the required packages in advance, we can use the static SmoothText.prepare() method.

SmoothText.prepare('nltk', 'en,tr')  # Preparing NLTK to be used with English and Turkish

Computing Readability Scores

Each language has its own set of readability formulas. When computing the readability score of a text in a language, one of the supported formulas must be used. Using SmoothText, there are three ways to perform this calculation.

text: str = 'Forrest Gump is a 1994 American comedy-drama film directed by Robert Zemeckis.'  # https://en.wikipedia.org/wiki/Forrest_Gump

# Generic computation method
st.compute_readability(text, ReadabilityFormula.Flesch_Reading_Ease)

# Using instance as a callable for generic computation
st(text, ReadabilityFormula.Flesch_Reading_Ease)

# Specific formula method
st.flesch_reading_ease(text)

Tokenizing and Calculating Text Statistics

SmoothText is designed to work with sentences, words/tokens, and syllables.
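
As a hedged sketch of how the counting methods listed under Features might be combined, the helper below gathers basic statistics from any object that provides count_sentences, count_words, and count_syllables. It assumes each method accepts the text as its only required argument, which the source does not confirm. The helper name `text_statistics` is hypothetical.

```python
def text_statistics(st, text: str) -> dict[str, float]:
    """Collect basic counts and averages for a text.

    `st` is expected to be a SmoothText instance. The single-argument
    calls below are an assumption about the method signatures.
    """
    sentences = st.count_sentences(text)
    words = st.count_words(text)
    syllables = st.count_syllables(text)
    return {
        "sentences": sentences,
        "words": words,
        "syllables": syllables,
        # Guard against empty input to avoid division by zero.
        "words_per_sentence": words / sentences if sentences else 0.0,
        "syllables_per_word": syllables / words if words else 0.0,
    }
```

With an instance created as shown under Instancing, this would be called as text_statistics(st, text).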

Other Features

Refer to the documentation for a complete list of available methods.

Inconsistencies

  • NLTK and Stanza have different tokenization rules. This may cause differences in the number of tokens/sentences between the two backends.
  • The syllabification of a word may differ between variants of the same language. For example, Pyphen splits "hello" into two syllables with its American English dictionary but leaves it whole with its British English dictionary. See the code snippet below.
    • To reduce such differences, CMUdict is the default syllabification method for English. When CMUdict is not available, Pyphen is used as a fallback.
from pyphen import Pyphen

us = Pyphen(lang="en_US")
print(us.inserted("hello"))
# Output: 'hel-lo'

gb = Pyphen(lang="en_GB")
print(gb.inserted("hello"))
# Output: 'hello'
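
A dictionary-based count avoids this variant dependence. CMUdict stores pronunciations as ARPAbet phonemes, and vowel phonemes carry a trailing stress digit, so counting those digits gives the syllable count. The sketch below uses a pronunciation of "hello" written inline so that it runs without the cmudict package. The function name is hypothetical, and this is not necessarily how SmoothText uses CMUdict.

```python
def count_syllables_from_phonemes(phonemes: list[str]) -> int:
    """Count syllables in an ARPAbet pronunciation.

    Vowel phonemes end with a stress digit (0, 1, or 2), and each
    vowel phoneme corresponds to one syllable.
    """
    return sum(1 for phoneme in phonemes if phoneme and phoneme[-1].isdigit())


# One CMUdict pronunciation of "hello", written inline for illustration.
HELLO = ["HH", "AH0", "L", "OW1"]

if __name__ == "__main__":
    print(count_syllables_from_phonemes(HELLO))
```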

Documentation

See here for API documentation.

License

SmoothText has an MIT license. See LICENSE.

Keywords

readability
