TextDescriptives


A Python library for calculating a large variety of metrics from text(s) using spaCy v.3 pipeline components and extensions.

🔧 Installation

pip install textdescriptives

📰 News

  • We now have a TextDescriptives-powered web app so you can extract and download metrics without writing a single line of code! Check it out here
  • Version 2.0 is out with a new API, a new component, updated documentation, and tutorials! Components are now added as "textdescriptives/{metric_name}". A new coherence component calculates the semantic coherence between sentences; a minimal sketch of the new naming scheme follows this list. See the documentation for tutorials and more information!
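
A minimal sketch of the new naming scheme (assuming a model with word vectors such as en_core_web_lg is installed; the doc._.coherence extension name is an assumption here):

import spacy
import textdescriptives as td  # importing textdescriptives registers the component factories

nlp = spacy.load("en_core_web_lg")  # a model with word vectors, so sentence similarity is meaningful
# v2.0: components are added under the "textdescriptives/{metric_name}" naming scheme
nlp.add_pipe("textdescriptives/coherence")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth.")
doc._.coherence  # assumed extension holding first- and second-order coherence scores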

⚡ Quick Start

Use extract_metrics to quickly extract your desired metrics. To see the available metrics, simply run:

import textdescriptives as td
td.get_valid_metrics()
# {'quality', 'readability', 'all', 'descriptive_stats', 'dependency_distance', 'pos_proportions', 'information_theory', 'coherence'}

Set the spacy_model parameter to specify which spaCy model to use; otherwise, TextDescriptives will auto-download an appropriate one based on lang. If lang is set, spacy_model is not required, and vice versa.

Specify which metrics to extract in the metrics argument. None extracts all metrics.

import textdescriptives as td

text = "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it."
# will automatically download the relevant model (`en_core_web_lg`) and extract all metrics
df = td.extract_metrics(text=text, lang="en", metrics=None)

# specify spaCy model and which metrics to extract
df = td.extract_metrics(text=text, spacy_model="en_core_web_lg", metrics=["readability", "coherence"])
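
Continuing from the df above: the result is a regular pandas DataFrame, so it can be inspected and filtered like any other DataFrame. A small sketch (the column names below are assumptions based on the extract_df output shown further down; the exact columns depend on the metrics you request):

# one column per extracted metric
print(df.columns)
print(df[["flesch_reading_ease", "first_order_coherence"]])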

Usage with spaCy

To integrate with other spaCy pipelines, import the library and add the component(s) to your pipeline using the standard spaCy syntax. Available components are descriptive_stats, readability, dependency_distance, pos_proportions, coherence, quality, and information_theory, each prefixed with textdescriptives/.

If you want to add all components you can use the shorthand textdescriptives/all.

import spacy
import textdescriptives as td
# load your favourite spacy model (remember to install it first using e.g. `python -m spacy download en_core_web_sm`)
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all") 
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length
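
If you only need a subset of metrics, the same prefix syntax lets you add individual components instead of textdescriptives/all. A minimal sketch (the doc._.dependency_distance extension name is an assumption):

import spacy
import textdescriptives as td

nlp = spacy.load("en_core_web_sm")
# add only the components you need, each prefixed with "textdescriptives/"
nlp.add_pipe("textdescriptives/readability")
nlp.add_pipe("textdescriptives/dependency_distance")

doc = nlp("The world is changed. I feel it in the water.")
doc._.readability           # readability metrics for the Doc
doc._.dependency_distance   # assumed extension set by the dependency_distance component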

TextDescriptives includes convenience functions for extracting metrics from a Doc to a Pandas DataFrame or a dictionary.

td.extract_dict(doc)
td.extract_df(doc)

|   | text | first_order_coherence | second_order_coherence | pos_prop_DET | pos_prop_NOUN | pos_prop_AUX | pos_prop_VERB | pos_prop_PUNCT | pos_prop_PRON | pos_prop_ADP | pos_prop_ADV | pos_prop_SCONJ | flesch_reading_ease | flesch_kincaid_grade | smog | gunning_fog | automated_readability_index | coleman_liau_index | lix | rix | n_stop_words | alpha_ratio | mean_word_length | doc_length | proportion_ellipsis | proportion_bullet_points | duplicate_line_chr_fraction | duplicate_paragraph_chr_fraction | duplicate_5-gram_chr_fraction | duplicate_6-gram_chr_fraction | duplicate_7-gram_chr_fraction | duplicate_8-gram_chr_fraction | duplicate_9-gram_chr_fraction | duplicate_10-gram_chr_fraction | top_2-gram_chr_fraction | top_3-gram_chr_fraction | top_4-gram_chr_fraction | symbol_#_to_word_ratio | contains_lorem ipsum | passed_quality_check | dependency_distance_mean | dependency_distance_std | prop_adjacent_dependency_relation_mean | prop_adjacent_dependency_relation_std | token_length_mean | token_length_median | token_length_std | sentence_length_mean | sentence_length_median | sentence_length_std | syllables_per_token_mean | syllables_per_token_median | syllables_per_token_std | n_tokens | n_unique_tokens | proportion_unique_tokens | n_characters | n_sentences |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | The world is changed(...) | 0.633002 | 0.573323 | 0.097561 | 0.121951 | 0.0731707 | 0.170732 | 0.146341 | 0.195122 | 0.0731707 | 0.0731707 | 0.0487805 | 107.879 | -0.0485714 | 5.68392 | 3.94286 | -2.45429 | -0.708571 | 12.7143 | 0.4 | 24 | 0.853659 | 2.95122 | 41 | 0 | 0 | 0 | 0 | 0.232258 | 0.232258 | 0 | 0 | 0 | 0 | 0.0580645 | 0.174194 | 0 | 0 | False | False | 1.77524 | 0.553188 | 0.457143 | 0.0722806 | 3.28571 | 3 | 1.54127 | 7 | 6 | 3.09839 | 1.08571 | 1 | 0.368117 | 35 | 23 | 0.657143 | 121 | 5 |
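
For multiple documents, one option is to stream texts through nlp.pipe and concatenate the per-Doc DataFrames. A sketch using only extract_df as shown above (pandas is assumed to be available):

import pandas as pd
import spacy
import textdescriptives as td

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/all")

texts = [
    "The world is changed. I feel it in the water.",
    "Much that once was is lost, for none now live who remember it.",
]
# one row of metrics per document, collected into a single DataFrame
df = pd.concat([td.extract_df(doc) for doc in nlp.pipe(texts)], ignore_index=True)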

📖 Documentation

TextDescriptives has detailed documentation as well as a series of Jupyter notebook tutorials. All the tutorials are located in the docs/tutorials folder and can also be found on the documentation website.

Documentation
  • 📚 Getting started: Guides and instructions on how to use TextDescriptives and its features.
  • 👩‍💻 Demo: A live demo of TextDescriptives.
  • 😎 Tutorials: Detailed tutorials on how to make the most of TextDescriptives.
  • 📰 News and changelog: New additions, changes, and version history.
  • 🎛 API References: The detailed reference for the TextDescriptives API, including function documentation.
  • 📄 Paper: The preprint of the TextDescriptives paper.
