
jekyll-related-posts
A proper related posts plugin for Jekyll: it uses a document correlation matrix built on TF-IDF (optionally with Latent Semantic Indexing).
An example is provided at http://jekyll-related-posts.dev.amadeusz.me; its posts are based on the Reuters-21578 data set.
I am going to try to start blogging again. Anyway, I am studying at the Decision Support Systems Group and have found the document correlation problem rather interesting.
For my own purposes I have created a related posts Jekyll plugin based on well-known algorithms such as TF-IDF and LSI.
Initially you had to install the plugin manually, but the plugin is now a gem. To install it:
1. Install the jekyll-related-posts gem: add gem 'jekyll-related-posts' to your Gemfile and run bundle install, or run gem install jekyll-related-posts.
2. Add gems: ['jekyll-related-posts'] to your _config.yml.
3. Put <related-posts /> somewhere in your _layouts/post.html file.
4. Run jekyll build, and don't forget to blog about the plugin!

You can customize the default related posts template by creating related.html in your layouts directory. Plugin behaviour can be altered by options in _config.yml, under the related: section.
Each document is tokenized and stemmed; every word found (except for some stop words) is treated as a keyword for analysis.
The TF-IDF matrix for the whole site is calculated (including any extra weights provided). Then, if the given accuracy is lower than 1.0, the LSI algorithm is used to compute a new, simplified vector space. The document correlation matrix is created as the dot product of the matrix and its transpose.
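The TF-IDF and correlation steps can be illustrated with a small pure-Ruby sketch. This is not the plugin's code: the method names, the tiny stop word list, the smoothed IDF formula, and the row normalization are all assumptions made for the illustration, and both stemming and the LSI reduction are left out.

```ruby
# Hypothetical sketch; names and the weighting formula are assumptions.
STOP_WORDS = %w[a an and the of to in is].freeze

# Split text into lowercase word tokens and drop stop words (no stemming here).
def tokenize(text)
  text.downcase.scan(/[a-z]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Build one L2-normalized TF-IDF row per document.
def tfidf_rows(documents)
  tokenized = documents.map { |doc| tokenize(doc) }
  keywords = tokenized.flatten.uniq.sort
  doc_count = documents.size.to_f

  # Document frequency: in how many documents each keyword appears.
  doc_freq = {}
  keywords.each { |kw| doc_freq[kw] = tokenized.count { |tokens| tokens.include?(kw) } }

  tokenized.map do |tokens|
    counts = tokens.each_with_object(Hash.new(0)) { |token, acc| acc[token] += 1 }
    row = keywords.map do |kw|
      idf = Math.log(doc_count / doc_freq[kw]) + 1.0 # assumed smoothing
      counts[kw] * idf
    end
    norm = Math.sqrt(row.sum { |v| v * v })
    norm.zero? ? row : row.map { |v| v / norm }
  end
end

# Correlation matrix: dot product of the TF-IDF matrix with its transpose.
def correlation_matrix(rows)
  rows.map { |a| rows.map { |b| a.zip(b).sum { |x, y| x * y } } }
end

posts = [
  "Jekyll plugin for related posts",
  "Related posts plugin using TF-IDF",
  "Benchmarking linear algebra libraries"
]
matrix = correlation_matrix(tfidf_rows(posts))
```

Here matrix[i][j] is the score between post i and post j. Because the rows are normalized in this sketch, the scores behave like cosine similarities, so the first two posts score higher with each other than either does with the third.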
For each post, related documents are inserted into a priority queue (sorted by score from the document correlation matrix), provided the score is greater than the minimal required score. The few best related posts are then retrieved from the queue.
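The selection step might look roughly like the sketch below. The helper name related_indices is hypothetical, and a descending sort stands in for the priority queue; the keyword defaults simply reuse the max_count and min_score values quoted in this document.

```ruby
# Hypothetical helper: pick the best related posts for one row of the
# correlation matrix. A descending sort stands in for a priority queue.
def related_indices(scores, self_index, max_count: 5, min_score: 0.1)
  candidates = scores.each_with_index.reject do |score, index|
    # Skip the post itself and anything not above the minimal score.
    index == self_index || score <= min_score
  end

  candidates.sort_by { |score, _index| -score }
            .first(max_count)
            .map { |_score, index| index }
end

scores = [1.0, 0.42, 0.05, 0.31, 0.77] # scores of post 0 against all posts
related = related_indices(scores, 0, max_count: 2)
```

With these sample scores, the post at index 2 is dropped because 0.05 is not above the minimal score, and the remaining candidates are returned best first.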
The Liquid template for each post is rendered, and <related-posts /> is replaced with the output of the algorithm.
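As a rough illustration of the substitution step only, the sketch below swaps the placeholder for a rendered list. It uses plain string interpolation rather than Liquid, and the inject_related helper, the post hashes, and the list markup are all invented for the example; it also performs no HTML escaping.

```ruby
# Hypothetical sketch: the placeholder tag comes from the text above,
# everything else (helper name, markup, post hashes) is assumed.
PLACEHOLDER = "<related-posts />"

def inject_related(page_html, related_posts)
  items = related_posts.map do |post|
    %(<li><a href="#{post[:url]}">#{post[:title]}</a></li>)
  end
  # Block form of gsub avoids backslash interpretation in the replacement.
  page_html.gsub(PLACEHOLDER) { "<ul>#{items.join}</ul>" }
end

html = inject_related(
  "<article>Body</article><related-posts />",
  [{ url: "/a.html", title: "Post A" }]
)
```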
In your _config.yml file (under related:) you can configure:
- max_count: 5 - maximum number of related posts,
- min_score: 0.1 - minimal required score to treat a post as related,
- accuracy: 0.75 - fraction of keywords used as the basis for the document correlation matrix (if 1.0, no LSI is computed; otherwise LSI is computed and dimensions are reduced to accuracy * |keywords|).

You can configure the weights of words by providing a dictionary of them under weights. For example, a weight of 2 means that the term frequency algorithm treats the word as if it occurred twice as often in the document as it actually does.
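To make the accuracy and weights options more concrete, here is a small sketch of how they could be interpreted. Both helper names are hypothetical, and the rounding of accuracy * |keywords| (floor, with a minimum of 1) is a guess rather than the plugin's documented behaviour.

```ruby
# Assumed interpretation of the accuracy option; the rounding is a guess.
def reduced_dimensions(accuracy, keyword_count)
  return keyword_count if accuracy >= 1.0 # no LSI, keep the full keyword space
  [(accuracy * keyword_count).floor, 1].max
end

# Count terms, counting each occurrence as many times as its weight
# (words without a configured weight count once).
def weighted_term_counts(tokens, weights = {})
  tokens.each_with_object(Hash.new(0)) do |token, counts|
    counts[token] += weights.fetch(token, 1)
  end
end

dims = reduced_dimensions(0.75, 7000)
counts = weighted_term_counts(%w[jekyll plugin jekyll], { "jekyll" => 2 })
```

With roughly 7000 keywords and accuracy 0.75, this sketch keeps 5250 dimensions. A weight of 2 on "jekyll" turns its two real occurrences into an effective count of 4, while "plugin" stays at 1.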
For casual blogs, performance should not be an issue.
I did not benchmark the plugin; however, for the dataset used in the example (containing ~900 documents and ~7000 keywords), rendering time (including Jekyll hoodoo stuff) is more or less 70 seconds (on a Xeon, using 750MB of RAM). Computation related to this plugin takes about 20 seconds. It should be noted that I'm using OpenBLAS and the standard LAPACK distributed with Ubuntu (performance is similar on OS X using the built-in Accelerate framework).
Unfortunately, the plugin is not compatible with the new incremental builds in Jekyll 3.0, even though it requires at least Jekyll 3.0 (for the plugin hooks).