text-phash

Compute and compare perceptual hashes for text strings to check similarity.

latest
Source
npmnpm
Version
1.0.8
Version published
Maintainers
1
Created
Source

TextPHash

Perceptual Hash for text strings.

  • Source repository: GitHub: mlefkon/text-phash
  • NPM Package: NPM: text-phash

What it does

  • Computes a perceptual hash for a text string.
  • Compares perceptual hashes to give a percent similarity between two text strings.

Usage

// CommonJS:
const TextPHash = require('text-phash')
// OR, in an ES module (use one form, not both):
// import TextPHash from 'text-phash'

const hashA = TextPHash.computePHash("The quick brown fox jumped over the black fence.")
const hashB = TextPHash.computePHash("Over the black fence, the quick brown fox jumped.")
const pctMatch = TextPHash.percentMatch(hashA, hashB)
console.log(hashA)    // 00500000000000000000000500000000000F0050005000000000000000500000
console.log(hashB)    // 00500005000000000000000500000000000F0000005000000000000000500000
console.log(pctMatch) // 77.77777777777779

Methodology

  • Supply text (can be one word or a lengthy book)
  • Tokenize text into neighboring word-groups. Number of words in each group is set in options:NGRAM_WORDS.
  • Initialize a [hashHits] array with zeros, one 'counter' for each possible hash value. Number of hash values is set in options:WORD_HASH_BITS.
  • Hash each word-group.
  • For each hash encountered, increment its 'counter' in the [hashHits] array.
  • Normalize all [hashHits] counters to a range from 0 (no hits) to a fixed maximum value (determined by options:HIT_VALUE_BITS).
  • Convert [hashHits] array into a hexadecimal string.
  • Compare two hashes by converting hex back into [hashHits] array and comparing the difference in hits.
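The steps above can be sketched in a few lines of JavaScript. This is an illustrative reconstruction from the description only, not the package's source: the function names (tokenize, wordGroups, djb2, sketchPHash), the tokenization rule, the exact DJB variant, and the rounding are all assumptions, so its output is not expected to match TextPHash.computePHash().

```javascript
// Illustrative sketch of the methodology; NOT the text-phash implementation.
// All function names here are hypothetical.

// Assumed tokenization: lowercase, keep runs of letters, digits and apostrophes.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Build overlapping groups of n neighboring words (NGRAM_WORDS).
function wordGroups(words, n) {
  const groups = [];
  for (let i = 0; i + n <= words.length; i++) {
    groups.push(words.slice(i, i + n).join(' '));
  }
  return groups;
}

// DJB2-style string hash reduced to `bits` bits (assumed variant).
function djb2(str, bits) {
  let h = 5381;
  for (let i = 0; i < str.length; i++) {
    h = ((h * 33) ^ str.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h % (2 ** bits);
}

function sketchPHash(text, { ngramWords = 2, wordHashBits = 6, hitValueBits = 4 } = {}) {
  // One counter per possible hash value.
  const hashHits = new Array(2 ** wordHashBits).fill(0);
  for (const group of wordGroups(tokenize(text), ngramWords)) {
    hashHits[djb2(group, wordHashBits)] += 1;
  }
  // Scale counters so the busiest bucket maps to the maximum value (assumed rounding).
  const maxHits = Math.max(...hashHits);
  const maxValue = 2 ** hitValueBits - 1;
  // Simplification: each counter is padded to whole hex digits.
  const digitsPerCounter = Math.ceil(hitValueBits / 4);
  return hashHits
    .map((hits) => (maxHits === 0 ? 0 : Math.round((hits / maxHits) * maxValue)))
    .map((value) => value.toString(16).toUpperCase().padStart(digitsPerCounter, '0'))
    .join('');
}

console.log(sketchPHash('The quick brown fox jumped over the black fence.'));
```

With the default values in this sketch there are 64 counters of one hex digit each, so the result is a 64-character string in which the busiest bucket appears as F.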

Functions

For the optional options parameter (an object), supply one or more of the properties listed under 'Default Options' below.

computePHash()

TextPHash.computePHash(text)
TextPHash.computePHash(text, options)
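  • Returns a hexadecimal string representing a binary string (2 ^ WORD_HASH_BITS x HIT_VALUE_BITS) bits long: one counter of HIT_VALUE_BITS bits for each of the 2 ^ WORD_HASH_BITS possible hash values. Using the default options, this is 64 x 4 = 256 bits, or a 64 digit hexadecimal string.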
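The 64-digit figure can be checked with a little arithmetic. The helper below is hypothetical (it is not part of the package) and assumes each of the 2 ^ WORD_HASH_BITS counters occupies HIT_VALUE_BITS bits, which is the reading consistent with the 64-digit hashes shown under Usage.

```javascript
// Hypothetical helper: expected pHash length in hex digits for given options.
function pHashHexLength(wordHashBits, hitValueBits) {
  const counters = 2 ** wordHashBits;        // one counter per possible hash value
  const totalBits = counters * hitValueBits; // each counter is hitValueBits wide
  return totalBits / 4;                      // 4 bits per hexadecimal digit
}

console.log(pHashHexLength(6, 4)); // 64 (the defaults)
console.log(pHashHexLength(8, 4)); // 256
```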

percentMatch()

TextPHash.percentMatch(pHashA, pHashB)
TextPHash.percentMatch(pHashA, pHashB, options)
  • If options are supplied, they must be the same as those used to create the hashes.
  • Returns a number between zero and 100.
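One way to turn two hashes into a percentage is to decode each back into counters and divide the sum of the per-bucket minimums by the sum of the per-bucket maximums. The sketch below does that; the names decodeCounters and sketchPercentMatch are hypothetical, it assumes one hex digit per counter (HIT_VALUE_BITS = 4), and the formula itself is an assumption rather than the package's documented method.

```javascript
// Illustrative comparison; NOT the text-phash implementation.

// Decode a hex pHash into counters, assuming one hex digit per counter.
function decodeCounters(hex) {
  return Array.from(hex, (ch) => parseInt(ch, 16));
}

// Similarity as sum(min) / sum(max) over all buckets, expressed as a percentage.
function sketchPercentMatch(hexA, hexB) {
  if (hexA.length !== hexB.length) {
    throw new Error('Hashes must be created with the same options');
  }
  const a = decodeCounters(hexA);
  const b = decodeCounters(hexB);
  let shared = 0;
  let total = 0;
  for (let i = 0; i < a.length; i++) {
    shared += Math.min(a[i], b[i]);
    total += Math.max(a[i], b[i]);
  }
  // Assumed convention: two all-zero hashes count as identical.
  return total === 0 ? 100 : (shared / total) * 100;
}

const exampleA = '00500000000000000000000500000000000F0050005000000000000000500000';
const exampleB = '00500005000000000000000500000000000F0000005000000000000000500000';
console.log(sketchPercentMatch(exampleA, exampleB).toFixed(2));
```

For the two hashes from the Usage section this works out to 35/45, roughly 77.78, which agrees with the value printed there. That agreement suggests, but does not confirm, that the package uses a similar measure.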

Default Options

Available on the static class object TextPHash.DefaultOptions:

  • NGRAM_WORDS: default = 2

    Number of 'neighbor' words that will be hashed together.

    For example, a value of 1: ABCDE=>[A,B,C,D,E], 2: ABCDE => [AB, BC, CD, DE], 3: ABCDE => [ABC,BCD,CDE]

  • WORD_HASH_FUNCTION: default = TextPHash.WordHashDJB

    A function that does a non-unique hash on each word-group/ngram.

    Select any of the TextPHash.WordHash... functions in the TextPHash class (DJB, FNV1a, Murmur3), or provide your own with the signature: (strText, intHashBitSize) => intHash

  • WORD_HASH_BITS: default = 6

    The size, in bits, of the hash produced by WORD_HASH_FUNCTION.

    Hashes are not meant to be unique, so this can be a low number. The hashes build a histogram of melded word frequencies. This is the 'x value' in the word-group-hash histogram. So if this is '6', there will be 2^6 possible hashes, or 64 'x values'.

  • HIT_VALUE_BITS: default = 4

    The size, in bits, of the hit counter for a single hash. Actual hit counts are scaled down to these discrete values.

    So if this is '4' and hash counters range from 0 to a max of 140 hits, the 140 value will be adjusted to (2^4)-1, or a max value of 15. A hash counter with a lower value, say 70 hits, would get an adjusted value of 8 (70/140 x 15 = 7.5, rounded up). This is the 'y value' in the word-group-hash histogram.
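Two of the options above lend themselves to a short illustration: a user-supplied WORD_HASH_FUNCTION and the HIT_VALUE_BITS scaling. Both functions below are hypothetical sketches. customWordHash follows the documented signature (strText, intHashBitSize) => intHash using a 32-bit FNV-1a over UTF-16 code units, and scaleHits assumes ordinary rounding, which matches the 70-hits example above.

```javascript
// Hypothetical custom word hash with the documented signature
// (strText, intHashBitSize) => intHash. Uses 32-bit FNV-1a over UTF-16
// code units (a simplification; FNV-1a is normally defined over bytes).
function customWordHash(strText, intHashBitSize) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < strText.length; i++) {
    hash ^= strText.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep uint32
  }
  return hash % (2 ** intHashBitSize); // reduce to the requested number of bits
}

// Hypothetical scaling of a raw hit count into HIT_VALUE_BITS bits.
function scaleHits(hits, maxHits, hitValueBits) {
  const maxValue = 2 ** hitValueBits - 1;
  return maxHits === 0 ? 0 : Math.round((hits / maxHits) * maxValue);
}

console.log(scaleHits(140, 140, 4)); // 15
console.log(scaleHits(70, 140, 4));  // 8
console.log(customWordHash('quick brown', 6)); // an integer from 0 to 63
```

Going by the option descriptions, such a function would be passed as { WORD_HASH_FUNCTION: customWordHash } in the options argument, and the same options would then need to be supplied to percentMatch().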

Keywords

phash

Package last updated on 04 Apr 2025
