vembed


Package providing methods to create Vector Embeddings from Strings, calculate similarities between lists of Strings, and Generate Visualizations such as Heatmaps from simple Lists.

0.242
PyPI
Maintainers
1

vembed



Library for Generating Vector Embeddings, performing Similarity searches, and creating Visualizations from Data.




pip3 install vembed


Strings to Embeddings


  • Convert a String to a Vector Embedding.
from vembed import string_to_embedding

input_string = "This is a test sentence."

embedding = string_to_embedding(input_string)

#  [0.337, 0.143, 0.714 ...]

  • Use Batching to Convert Lists of Strings to their Vector Float Representations.
from vembed import lists_to_embeddings

embeddings = lists_to_embeddings(["Convert to a List[Float]", "Another String", "More Strings!"])

# print(embeddings) -> [[0.123, 0.456, ...], [0.789, 0.012, ...], ...]
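As a rough illustration of what batching means here, the sketch below splits a list of strings into fixed-size chunks using only the standard library. It does not call vembed, and the `batched` helper and the batch size are hypothetical; how `lists_to_embeddings` batches internally is not documented in this README.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["a", "b", "c", "d", "e"]

# Each chunk could then be embedded in a single model call.
batches = list(batched(texts, 2))
```

The last chunk may be shorter than the others, so any per-batch code should not assume a fixed length.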

Serialization


Functions for Embedding Serialization for Network Transfer.


  • Protobuf Serialization for usage with gRPC Services

  • JSON Serialization for usage with REST APIs


from vembed import lists_to_embeddings, embeddings_to_proto_format, embeddings_to_json_format

embeddings = lists_to_embeddings(["CSV,Row,1,with,some,data" , "CSV,Row,2,with,other,cols"])

# Convert to a Protobuf Serializable Format to send over a gRPC Service
proto_embedding = embeddings_to_proto_format(embeddings)

# Convert to a JSON String for usage with REST APIs
json_embedding = embeddings_to_json_format(embeddings)
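To show what a JSON round trip of embeddings involves, here is a minimal sketch using only the standard library. It does not call vembed: the helper names and the payload layout (a top-level "embeddings" key) are assumptions for illustration, not necessarily the format that `embeddings_to_json_format` produces.

```python
import json
from typing import List


def to_json(embeddings: List[List[float]]) -> str:
    # Hypothetical layout: one JSON object holding a list of float lists.
    return json.dumps({"embeddings": embeddings})


def from_json(payload: str) -> List[List[float]]:
    # Reverse of to_json: parse the string and pull the vectors back out.
    return json.loads(payload)["embeddings"]


vectors = [[0.123, 0.456], [0.789, 0.012]]
payload = to_json(vectors)    # str, ready to send as a REST request body
restored = from_json(payload)
```

Plain nested lists of floats survive the round trip unchanged, which is why JSON is a convenient (if verbose) transport format for embeddings.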

Similarity


Semantic Similarity Between Entities


Extract Insights such as Patterns or Relevancy from your Data.


  • Calculating Similarity for Entities.

from vembed import calculate_similarities, plot_similarities

customer_feedback = ["Loved the recent update","The app is user-friendly",
                    "Facing issues after the update","The new interface is great"]

themes            = ["positive feedback","negative feedback","app interface","app functionality"]

cos_df, dot_df = calculate_similarities(customer_feedback, themes, print_results=True)

# Prints and Returns Results

# Results

Cosine Similarities:

Query: 'Loved the recent update'
  Data: 'positive feedback' => Similarity Score: 0.45
  Data: 'app functionality' => Similarity Score: 0.22
  Data: 'app interface'     => Similarity Score: 0.19
  Data: 'negative feedback' => Similarity Score: 0.11

Query: 'Facing issues after the update'
  Data: 'negative feedback' => Similarity Score: 0.31
  Data: 'positive feedback' => Similarity Score: 0.27
  Data: 'app interface'     => Similarity Score: 0.24
  Data: 'app functionality' => Similarity Score: 0.21

Dot Product Similarities:

Query: 'Loved the recent update'
  Data: 'positive feedback' => Similarity Score: 4.51
  Data: 'app functionality' => Similarity Score: 2.06
  Data: 'negative feedback' => Similarity Score: 1.91
  Data: 'app interface'     => Similarity Score: 1.80

Query: 'Facing issues after the update'
  Data: 'negative feedback' => Similarity Score: 2.92
  Data: 'positive feedback' => Similarity Score: 2.51
  Data: 'app interface'     => Similarity Score: 2.07
  Data: 'app functionality' => Similarity Score: 1.82

  • Generating Clean, Beautiful Visualizations from Data.

from vembed import plot_similarities

# .... cos_df, dot_df = calculate_similarities(queries, data)

# Create Heatmap for Visualizing Relationships

plot_similarities(cos_df, dot_df, save_path="heatmaps/customer_feedback_similarity.png")

# View and access the Heatmap at heatmaps/customer_feedback_similarity.png

Cosine and Dot Product Vector Similarity Measures

@Coefficient Legend

Negative [ - ] - Opposing Direction, Dissimilar
Zero     [ 0 ] - Orthogonal, No Commonality
Positive [ + ] - Similar Direction, Stronger Similarity as the Value grows

Cosine Similarity


  • Ranges between -1 and 1

  • Recommended when Direction (context, topic) is important and Magnitude (frequency, length) is not


  • Use Case for Cosine Similarity

    • In the following example, Direction (thematic orientation: climate change, agriculture) is what matters.

    • Cosine Similarity is useful here because we want to find documents discussing similar topics (Direction), irrespective of their length or the frequency of specific words (Magnitude).


@Usage

from vembed import calculate_similarities

queries = ["Climate change effects on agriculture"]

data =    [
           "Effects of climate change on wheat production",
           "Agriculture in developing countries",
           "Climate change and its impact on global food security",
           "Advances in agricultural technology"
          ]

# Calculate cosine similarities
cos_df, _ = calculate_similarities(queries, data, sorted=True, print_results=True)
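The claim that Cosine Similarity ignores Magnitude can be checked with a small standalone sketch. It uses toy 2-D vectors rather than real embeddings and does not call vembed; the `cosine_similarity` helper is written here for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [2.0, 1.0]
short_doc = [1.0, 2.0]
long_doc = [10.0, 20.0]  # same direction as short_doc, 10x the magnitude

score_short = cosine_similarity(query, short_doc)
score_long = cosine_similarity(query, long_doc)

# A vector pointing the opposite way scores at the bottom of the range.
score_opposite = cosine_similarity(query, [-2.0, -1.0])
```

`score_short` and `score_long` agree up to floating-point rounding, because scaling a vector changes its length but not its direction.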

Dot Product Similarity


  • Unbounded: can be any Real Number

  • Recommended when both the Magnitude (Frequency) and the Direction (Relevancy) of the vectors are important, and the vectors are on a similar scale.


  • Use Case for Dot Product

    • Direction (Types of Articles) and Magnitude (Frequency of Reading Habits) are both relevant.

@Usage

from vembed import calculate_similarities

user_reading_profile = [
                        "Read many articles on machine learning", 
                        "Occasionally reads about space exploration" 
                       ]

article_options      = [
                        "Latest trends in machine learning",
                        "Beginner's guide to space travel",
                        "In-depth analysis of neural networks",
                        "Recent discoveries in astronomy"
                       ]

# Calculate dot product similarities

_, dot_df = calculate_similarities(user_reading_profile, article_options, sorted=True, print_results=True)
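The sketch below shows why the Dot Product rewards Magnitude: two reader profiles pointing in the same direction receive different scores when one is larger. The vectors are toy values chosen for illustration, not real embeddings, and the `dot_product` helper does not come from vembed.

```python
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


article = [1.0, 1.0]
occasional_reader = [1.0, 0.5]
frequent_reader = [4.0, 2.0]  # same direction, 4x the magnitude

dot_occasional = dot_product(occasional_reader, article)
dot_frequent = dot_product(frequent_reader, article)
```

Here `dot_frequent` is four times `dot_occasional`, whereas the Cosine Similarity of both profiles with the article would be identical.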

  • Calculating Similarity
@Usage

from vembed import calculate_similarities

queries =  [ 
            "What is the capital of France?", 
            "How is the weather today?"
           ]

data    =  [
             "Paris is the capital of France.",
             "The weather is sunny.",
             "Berlin is the capital of Germany.",
             "It is raining in Berlin."
           ]

# Calculate similarities and Print Results

cos_df, dot_df = calculate_similarities(queries, data, sorted=True, print_results=True)


# Cosine Similarities:
# ...

# Dot Product Similarities:
# ...
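For intuition about the sorted, per-query results, here is a standalone sketch that scores every query against every data item and orders the matches from highest to lowest. The 2-D vectors and names are toy placeholders, and the nested-dict layout is an assumption for illustration; it is not the structure of the DataFrames that `calculate_similarities` returns.

```python
from typing import Dict, List, Tuple

# Toy vectors standing in for embeddings (hypothetical values).
query_vectors: Dict[str, List[float]] = {
    "query_a": [1.0, 0.0],
    "query_b": [0.0, 1.0],
}
data_vectors: Dict[str, List[float]] = {
    "doc_1": [0.9, 0.1],
    "doc_2": [0.2, 0.8],
    "doc_3": [0.5, 0.5],
}


def rank(query: List[float]) -> List[Tuple[str, float]]:
    """Score each data vector by dot product and sort descending."""
    scores = [
        (name, sum(q * d for q, d in zip(query, vec)))
        for name, vec in data_vectors.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


rankings = {name: rank(vec) for name, vec in query_vectors.items()}

for name, matches in rankings.items():
    print(f"Query: {name!r}")
    for doc, score in matches:
        print(f"  Data: {doc!r} => Similarity Score: {score:.2f}")
```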

Visualizations


Generate Visualizations from Embeddings such as HeatMap Distributions


  • Create a Visualization to display the Entity Similarities using a Heatmap.

@Usage

from vembed import calculate_similarities, plot_similarities

customer_feedback = [
                      "Loved the recent update",
                      "The app is user-friendly",
                      "Facing issues after the update",
                      "The new interface is great"
                    ]

themes            = [
                      "positive feedback",
                      "negative feedback",
                      "app interface",
                      "app functionality"
                    ]

# Heatmap of Both Cosine and Dot Product
cos_df, dot_df = calculate_similarities(customer_feedback, themes, sorted=True)
plot_similarities(cos_df, dot_df, save_path="customer_feedback_similarity.png")


# Heatmap of Only Cosine Similarity
cos_df, _ = calculate_similarities(customer_feedback, themes, sorted=True)
plot_similarities(cos_df, None, save_path="customer_feedback_similarity.png")


# Heatmap of Only Dot Product Similarity
_, dot_df = calculate_similarities(customer_feedback, themes, sorted=True)
plot_similarities(None, dot_df, save_path="customer_feedback_similarity.png")


# View customer_feedback_similarity.png to see the Heatmap

Test Suite


Latest Test Run


  • gRPC Tests
  • JSON Tests
  • Numpy Tests
  • Serialization Tests
  • Embedding Generation Tests
  • Batched Embedding Generation Tests
  • Custom Model Tests
  • Caching Tests
  • Transformer Metadata Tests
  • Module Resolution Tests


Build and Run Locally from Source


git clone git@github.com:kuro337/vembed.git
cd vembed

# Create Isolated Virtual Env
python3 -m venv venv
source venv/bin/activate

# Install Deps
pip install -e .

# Run Tests
chmod +x RUN_TESTS.sh
./RUN_TESTS.sh

# Create Dist 
pip3 install build && python3 -m build

# Use Built Dist in any project
pip3 install ./vembed/dist/vembed-0.24-py3-none-any.whl

Dependencies

  • sentence_transformers
  • torch
  • transformers
  • pandas
  • matplotlib
  • seaborn


Note: vembed can use the GPU through Nvidia CUDA and PyTorch if a supported Nvidia Graphics Card is available.


# Nvidia CUDA and PyTorch Test
python3 tests/test_cuda.py

# Run all Tests
python3 -m unittest discover -s tests -v

Checking Virtual or System Environment Deps and Cache Size


# Check Disk Allocation for Packages 
du -h venv | sort -hr | head -n 10

2.8G    venv/lib/python3.11/site-packages/nvidia
1.4G    venv/lib/python3.11/site-packages/torch
1.3G    venv/lib/python3.11/site-packages/torch/lib
1.2G    venv/lib/python3.11/site-packages/nvidia/cudnn/lib
1.2G    venv/lib/python3.11/site-packages/nvidia/cudnn
596M    venv/lib/python3.11/site-packages/nvidia/cublas

# Checking System Cache

# Show pip cache location
pip cache dir # /home/user/.cache/pip

# Getting Top Folders from Cache by Size
du -h /home/user/.cache/pip | sort -hr | head -n 10

# Remove Cached Files
pip cache purge 

# Cached Files
pip cache list

# Installing Packages without Cache
pip install --no-cache-dir <package_name>

Author: kuro337
