[![PyPI version](https://img.shields.io/pypi/v/TagStats.svg)](https://pypi.python.org/pypi/TagStats/)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/TagStats.svg)](https://pypi.python.org/pypi/TagStats/)
[![PyPI license](https://img.shields.io/pypi/l/TagStats.svg)](https://pypi.python.org/pypi/TagStats/)

A concise yet efficient implementation for computing the statistics of each tag's set of key phrases over input lines in Python 3.
One major application is [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis), where each tag is a sentiment and its key phrases describe that sentiment.

# How it Works

A [trie](https://en.wikipedia.org/wiki/Trie) structure is constructed to index all the key phrases. Each line is then matched against the index to compute the respective statistics.
The time complexity is $O(m^2 \cdot n)$, where $m$ is the maximum number of words in a line and $n$ is the number of lines.
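
The following is a minimal, self-contained sketch of the idea, not the package's actual implementation. The names `build_trie` and `count_matches` are hypothetical, words are assumed to be separated by whitespace, and every occurrence of a phrase is assumed to count once. Trying each of the up to $m$ start positions and walking at most $m$ words from it gives the $O(m^2)$ cost per line.

``` python
def build_trie(tag_dict):
    """Index every key phrase word by word; a node lists the tags whose phrase ends there."""
    root = {"children": {}, "tags": []}
    for tag, phrases in tag_dict.items():
        for phrase in phrases:
            node = root
            for word in phrase.split():
                node = node["children"].setdefault(word, {"children": {}, "tags": []})
            node["tags"].append(tag)
    return root


def count_matches(lines, tag_dict):
    """Return, for each tag, a list with the number of phrase matches in each line."""
    root = build_trie(tag_dict)
    counts = {tag: [0] * len(lines) for tag in tag_dict}
    for i, line in enumerate(lines):
        words = line.split()
        # Try every start position, then walk the trie as far as the words allow.
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node["children"].get(word)
                if node is None:
                    break
                for tag in node["tags"]:
                    counts[tag][i] += 1
    return counts


result = count_matches(
    ["a b c", "b c", "a c e"],
    {"+": ["a b", "a c"], "-": ["b c"]},
)
```

Under these assumptions, `result` matches the output shown in the Usage section for the same input.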

# Installation

This package is available on PyPI. Just use `pip3 install -U TagStats` to install it.

# Usage

You can simply call `compute(content, tagDict)`, where `content` is a list of lines and `tagDict` is a dictionary with each tag name as key and the respective set of key phrases as value.

``` python
from tagstats import compute

print(compute(
    [
        "a b c",
        "b c",
        "a c e"
    ],
    {
        "+": ["a b", "a c"],
        "-": ["b c"]
    }
))
```

The output is a dictionary with each tag name as key and, as value, a list holding the total frequency of that tag's key phrases in each input line, in the same order as the lines.

``` python
{'+': [1, 0, 1], '-': [1, 1, 0]}
```

You can change the default tokenizer by specifying a compiled regex, used as the separator, for `tagstats.tagstats.tokenizer`. Specifying `None` disables the tokenizer, which allows matching across word boundaries.
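
To illustrate what a compiled regex separator does, here is a small sketch that uses only Python's standard `re` module and not TagStats itself. The pattern and the helper `tokenize` are hypothetical choices made for illustration; the package's default separator is not stated here.

``` python
import re

# Hypothetical separator: runs of whitespace and/or commas.
separator = re.compile(r"[\s,]+")


def tokenize(line, sep=separator):
    """Split a line on the separator and drop empty strings at the edges."""
    return [word for word in sep.split(line) if word]


words = tokenize("a, b  c")
```

With this separator, a phrase such as `a b` could also match a line written as `a, b`, since both produce the same word sequence.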

You can pre-build the index by calling `index(tagDict)` and then reuse it across multiple calls by passing it as the optional third parameter of `compute(content, tagDict, index)`.

# Tip

I strongly encourage using PyPy instead of CPython to run the script for best performance.


chuancong
