
A state-of-the-art NLP model that generates insights from a piece of text
requirements.txt
from insigen import insigen
model = insigen()
topic_distribution = model.get_distribution(document)
insigen
use_pretrained_embeds
: Setting this parameter to False lets you train your own embeddings. Additional parameters must then be specified for training.
embed_file
: This parameter should be used when you've trained your own embeddings. Specify the path to your sentence embeddings.
dataset_file
: This parameter should be used when you've trained your own embeddings. Specify the path to your own dataset.
embedding_model
: (Default = all-mpnet-base-v2) Insigen uses Sentence-BERT models to train its embeddings. Valid models are:
all-distilroberta-v1
all-mpnet-base-v2
all-MiniLM-L12-v2
all-MiniLM-L6-v2
Important parameters for get_distribution
document
: The text for which the topic distribution is to be generated.
metric
: Defines how topics are selected. 'max' (the default) returns the top n topics; 'threshold' returns every topic whose similarity is above a threshold.
max_count
: Used with the 'max' metric. Specifies how many of the top topics are fetched. Defaults to 1.
threshold
: Used with the 'threshold' metric. Specifies the similarity above which all topics are fetched. Defaults to 0.5.
frequency = model.get_keyword_frequency(document, min_len=2, max_len=3)
# Generate a wordcloud using the frequency
cloud = model.generate_wordcloud(frequency)
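The 'max' and 'threshold' metrics of get_distribution can be illustrated with a small, self-contained sketch. This is not insigen's implementation: `select_topics` is a hypothetical helper, the similarity scores are made-up numbers, and treating the threshold as a strict "greater than" comparison is an assumption.

```python
def select_topics(similarities, metric="max", max_count=1, threshold=0.5):
    """Pick topics from a {topic: similarity} mapping.

    Hypothetical helper for illustration only; not part of insigen.
    """
    # Rank topics from most to least similar.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if metric == "max":
        # Keep only the top `max_count` topics.
        return ranked[:max_count]
    if metric == "threshold":
        # Keep every topic whose similarity exceeds the threshold (assumed strict).
        return [(topic, score) for topic, score in ranked if score > threshold]
    raise ValueError(f"unknown metric: {metric!r}")


# Made-up similarity scores between one document and three topics.
scores = {"sports": 0.72, "politics": 0.55, "science": 0.31}

top = select_topics(scores)  # default: metric="max", max_count=1
above = select_topics(scores, metric="threshold", threshold=0.5)
```

With 'max' the number of returned topics is fixed in advance, while with 'threshold' it depends on how many topics are similar enough to the document.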
get_keyword_frequency
document
: The text for which the keyword frequency is to be generated.
frequency_threshold
: Minimum frequency an n-gram must have to be counted among the keywords. (min_len and max_len adjust the length of the n-grams extracted from the text.)
summary = model.generate_summary(article, topic_match=relevant_topic)
# To get a list of topics, use this
#print(model.unique_topics)
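To make the roles of min_len, max_len and frequency_threshold in get_keyword_frequency concrete, here is a minimal n-gram counter. It is a sketch under assumptions, not the library's code: `ngram_frequency` is a hypothetical function, the tokenizer is a plain regular expression, the default threshold of 1 is invented for the example, and n-grams are allowed to span sentence boundaries for simplicity.

```python
import re
from collections import Counter


def ngram_frequency(text, min_len=2, max_len=3, frequency_threshold=1):
    """Count word n-grams of length min_len..max_len (hypothetical helper)."""
    # Simplistic tokenizer: lowercase words, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n words across the text.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    # Drop n-grams that occur fewer than frequency_threshold times.
    return {gram: c for gram, c in counts.items() if c >= frequency_threshold}


text = "deep learning models need data. deep learning models need compute."
freq = ngram_frequency(text, min_len=2, max_len=3, frequency_threshold=2)
```

A mapping of this shape (phrase to count) is the kind of input a word cloud generator typically expects.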
generate_summary
document
: The text for which the summary is to be generated.
topic_match
: A topic that can match the text. Sentences more closely related to this topic are given additional weight. Use model.unique_topics to get the list of topics that can be matched. Defaults to None, in which case no extra weight is given to topic-related sentences.
topic_weight
: Weight applied to the topic similarity score. Increasing it produces a more topic-oriented summary. Defaults to 1.
similarity_weight
: Weight applied to the sentence similarity score. Increasing it extracts more closely correlated sentences. Defaults to 1.
position_weight
: Weight applied to sentence position. Increasing it produces a more position-oriented summary, i.e. text that appears early in the document is given more weight. Defaults to 10.
num_sentences
: The number of sentences to include in the summary. Defaults to 10.
embeddings = model.train_embeds(dataset)
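One way to picture how the generate_summary weights could interact is a weighted sum per sentence. The sketch below is an assumption, not insigen's actual scoring: `rank_sentences` is a hypothetical function, the linear position score (1 for the first sentence, decreasing toward 0) is invented for illustration, and the similarity values are made-up numbers.

```python
def rank_sentences(sentence_sims, topic_sims=None, topic_weight=1,
                   similarity_weight=1, position_weight=10, num_sentences=10):
    """Return indices of the selected sentences, in document order.

    Hypothetical scoring scheme for illustration only.
    """
    n = len(sentence_sims)
    if topic_sims is None:
        # No topic_match given: topic similarity contributes nothing.
        topic_sims = [0.0] * n
    scores = []
    for i in range(n):
        # Assumed position score: earlier sentences score higher.
        position_score = 1 - i / n
        scores.append(topic_weight * topic_sims[i]
                      + similarity_weight * sentence_sims[i]
                      + position_weight * position_score)
    # Take the highest-scoring sentences, then restore document order.
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return sorted(best)


# Made-up similarity scores for a four-sentence document.
topic_sims = [0.2, 0.9, 0.1, 0.7]
sentence_sims = [0.5, 0.6, 0.4, 0.7]

by_content = rank_sentences(sentence_sims, topic_sims, position_weight=0, num_sentences=2)
by_position = rank_sentences(sentence_sims, topic_sims, num_sentences=2)
```

In this toy setup, setting position_weight to 0 selects the sentences most related to the topic, while the default of 10 lets the opening sentences dominate.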
train_embeds
dataset
: A pandas DataFrame containing the dataset to train on.
batch_size
: Batches to divide the dataset into. Defaults to 32.
Create embedded vectors of labelled training articles
Find the mean embedding of each topic in the corpus to create topic vectors and clusters of articles
Use KNN to place new articles in the topic vector clusters
Chunk each article and find the relevant topics from the topic vectors
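The steps above can be sketched in plain Python. This is a simplified illustration rather than insigen's code: the 2-D vectors stand in for real sentence embeddings, all function names are hypothetical, and the KNN step is replaced here by a nearest-centroid lookup using cosine similarity against each topic's mean vector.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_topic_vectors(embeddings, labels):
    """Average the embeddings of all articles that share a topic label."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_topic(vec, topic_vectors):
    """Nearest-centroid stand-in for the KNN step (an assumption)."""
    return max(topic_vectors, key=lambda t: cosine_similarity(vec, topic_vectors[t]))


# Toy 2-D "embeddings" of four labelled training articles.
embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = ["sports", "sports", "science", "science"]

topic_vectors = build_topic_vectors(embeddings, labels)
predicted = nearest_topic([0.8, 0.2], topic_vectors)
```

The same lookup could be applied to each chunk of a long article to find which topics individual passages relate to.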
FAQs
Generates Insights from text pieces such as Documents or Articles
We found that insigen demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.