TextRank is an unsupervised keyword extraction algorithm based on PageRank. Other keyword extraction strategies generally rely either on statistics (such as term frequency and inverse document frequency), which ignore context, or on machine learning, which requires a corpus of training data that is unlikely to suit every application. TextRank has been found to produce superior results in many situations at minimal computational cost.
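To make the idea concrete, here is a minimal pure-Ruby sketch of a TextRank-style ranking: words that occur near each other are linked in a graph, and a PageRank-style iteration scores each word. It is not the gem's implementation. The method name `textrank_sketch`, the `window` parameter, the regex tokenizer, and the tiny stopword list are all assumptions made for illustration.

```ruby
require 'set'

# Hypothetical stopword list, for illustration only.
SKETCH_STOPWORDS = Set.new(%w[a an and are by for in is of on or that the to which with]).freeze

# Rank words by running a PageRank-style iteration over a co-occurrence graph.
def textrank_sketch(text, window: 2, damping: 0.85, tolerance: 0.0001, max_iterations: 100)
  tokens = text.downcase.scan(/[a-z]+/).reject { |t| SKETCH_STOPWORDS.include?(t) }

  # Undirected graph: two words are linked if they appear within `window` positions.
  neighbors = Hash.new { |hash, key| hash[key] = Set.new }
  tokens.each_with_index do |word, i|
    tokens[(i + 1)..(i + window)].to_a.each do |other|
      next if other == word

      neighbors[word] << other
      neighbors[other] << word
    end
  end
  return {} if neighbors.empty?

  n = neighbors.size
  ranks = neighbors.keys.to_h { |word| [word, 1.0 / n] }

  max_iterations.times do
    updated = ranks.keys.to_h do |word|
      # Each neighbor shares its rank equally among its own neighbors.
      incoming = neighbors[word].sum { |other| ranks[other] / neighbors[other].size }
      [word, (1 - damping) / n + damping * incoming]
    end
    delta = updated.sum { |word, rank| (rank - ranks[word]).abs }
    ranks = updated
    break if delta < tolerance
  end

  # Highest-ranked words first.
  ranks.sort_by { |_, rank| -rank }.to_h
end
```

Calling `textrank_sketch(text)` returns a hash of words to scores, ordered from most to least central. No training data is involved, which is the sense in which the approach is unsupervised.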
gem install text_rank
TextRank::TokenFilter::PartOfSpeech
TextRank::CharFilter::StripHtml
TextRank
require 'text_rank'
text = <<-END
In a castle of Westphalia, belonging to the Baron of Thunder-ten-Tronckh, lived
a youth, whom nature had endowed with the most gentle manners. His countenance
was a true picture of his soul. He combined a true judgment with simplicity of
spirit, which was the reason, I apprehend, of his being called Candide. The old
servants of the family suspected him to have been the son of the Baron's
sister, by a good, honest gentleman of the neighborhood, whom that young lady
would never marry because he had been able to prove only seventy-one
quarterings, the rest of his genealogical tree having been lost through the
injuries of time.
END
# Default, basic keyword extraction. Try this first:
keywords = TextRank.extract_keywords(text)
# Keyword extraction with all of the bells and whistles:
keywords = TextRank.extract_keywords_advanced(text)
# Fully customized extraction:
extractor = TextRank::KeywordExtractor.new(
  strategy: :sparse,      # Specify PageRank strategy (dense or sparse)
  damping: 0.85,          # The probability of following the graph vs. randomly choosing a new node
  tolerance: 0.0001,      # The desired accuracy of the results
  char_filters: [...],    # A list of filters to be applied prior to tokenization
  tokenizers: [...],      # A list of tokenizers to perform tokenization
  token_filters: [...],   # A list of filters to be applied to each token after tokenization
  graph_strategy: ...,    # A class or strategy instance for producing a graph from tokens
  rank_filters: [...],    # A list of filters to be applied to the keyword ranks after keyword extraction
)
# Add another filter to the end of the char_filter chain
extractor.add_char_filter(:AsciiFolding)
# Add a part of speech filter to the token_filter chain BEFORE the Stopwords filter
pos_filter = TextRank::TokenFilter::PartOfSpeech.new(parts_to_keep: %w[nn])
extractor.add_token_filter(pos_filter, before: :Stopwords)
# Perform the extraction with at most 100 iterations
extractor.extract(text, max_iterations: 100)
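The staged design above (character filters, then tokenization, then token filters) can be pictured with a small pure-Ruby sketch. This is only an illustration of how such a chain might compose, not the gem's code: the constants, the lambdas, and the helpers `run_pipeline` and `insert_before` are all hypothetical names.

```ruby
# Hypothetical stand-ins for the three stages; none of these are the gem's classes.
CHAR_FILTERS = [
  ->(text) { text.downcase },
  ->(text) { text.gsub(/[^a-z\s]/, ' ') }
].freeze
TOKENIZER = ->(text) { text.split }
TOKEN_FILTERS = [
  ->(tokens) { tokens.reject { |t| %w[the of a in].include?(t) } }
].freeze

# Apply char filters to the raw text, tokenize, then apply token filters in order.
def run_pipeline(text, char_filters, tokenizer, token_filters)
  filtered = char_filters.reduce(text) { |current, filter| filter.call(current) }
  tokens = tokenizer.call(filtered)
  token_filters.reduce(tokens) { |current, filter| filter.call(current) }
end

# Return a new chain with new_filter placed before target
# (or appended at the end if target is not in the chain).
def insert_before(filters, new_filter, target)
  index = filters.index(target) || filters.length
  filters.dup.insert(index, new_filter)
end
```

Because each stage only sees the output of the previous one, the order of the chain matters. That is presumably why the gem lets a filter be positioned relative to an existing one rather than only appended.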
PageRank
It is also possible to use this gem for PageRank only.
require 'page_rank'
PageRank.calculate(strategy: :sparse, damping: 0.8, tolerance: 0.00001) do
  add('node_a', 'node_b', weight: 3.2)
  add('node_b', 'node_d', weight: 2.1)
  add('node_b', 'node_e', weight: 4.7)
  add('node_e', 'node_a', weight: 1.3)
end
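For readers who want to see what the `damping` and `tolerance` options control, the following is a minimal pure-Ruby sketch of weighted PageRank over the same four edges. It is not the gem's implementation. The name `pagerank_sketch` is hypothetical, and spreading the rank of nodes without outgoing edges evenly across all nodes is an assumption (one common convention), so the numbers may differ from what the gem returns.

```ruby
# Edges copied from the example above: [source, target, weight].
EXAMPLE_EDGES = [
  ['node_a', 'node_b', 3.2],
  ['node_b', 'node_d', 2.1],
  ['node_b', 'node_e', 4.7],
  ['node_e', 'node_a', 1.3]
].freeze

def pagerank_sketch(edges, damping: 0.8, tolerance: 0.00001, max_iterations: 100)
  nodes = edges.flat_map { |from, to, _| [from, to] }.uniq
  n = nodes.size

  # Total outgoing weight per node, used to normalize edge weights.
  out_weight = Hash.new(0.0)
  edges.each { |from, _, weight| out_weight[from] += weight }

  ranks = nodes.to_h { |node| [node, 1.0 / n] }

  max_iterations.times do
    # Assumption: rank held by nodes with no outgoing edges is spread evenly.
    dangling = nodes.reject { |node| out_weight.key?(node) }.sum { |node| ranks[node] }
    updated = nodes.to_h { |node| [node, (1 - damping) / n + damping * dangling / n] }

    # With probability `damping`, follow an edge in proportion to its weight.
    edges.each do |from, to, weight|
      updated[to] += damping * ranks[from] * weight / out_weight[from]
    end

    # Stop once the total change falls below `tolerance`.
    delta = nodes.sum { |node| (updated[node] - ranks[node]).abs }
    ranks = updated
    break if delta < tolerance
  end

  ranks
end
```

`pagerank_sketch(EXAMPLE_EDGES)` returns a hash from node name to score, and the scores sum to 1. Each iteration visits every edge once, so the work per iteration grows with the number of edges.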
There are currently two pure Ruby implementations of PageRank, selected with the strategy option (sparse or dense).
The dense strategy performs up to max_iterations matrix multiplications, or stops earlier once the tolerance is reached. This is more of a canonical implementation and is fine for small or dense graphs, but it is not advised for large, sparse graphs because Ruby is not fast at matrix multiplication. Each iteration is O(N^3), where N is the number of graph nodes.
License
MIT. See the LICENSE file.
R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,” in Proceedings of EMNLP 2004. Association for Computational Linguistics, 2004, pp. 404–411.
S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, vol. 30, 1998, pp. 107–117.