What is lunr-languages?
The lunr-languages npm package extends the functionality of the Lunr.js library to support multiple languages. It provides language-specific stemmers, stop word lists, and other tools to enhance search indexing and querying for non-English languages.
What are lunr-languages's main functionalities?
Language-specific Stemmers
This feature allows you to use language-specific stemmers to improve search accuracy. The code sample demonstrates how to set up a French stemmer and create an index with French content.
const lunr = require('lunr');
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/lunr.fr')(lunr);
const idx = lunr(function () {
this.use(lunr.fr);
this.field('title');
this.field('body');
this.add({
'title': 'Bonjour',
'body': 'Le monde est beau'
});
});
console.log(idx.search('beau'));
Stop Word Lists
This feature allows you to use custom stop word lists to exclude common words from the index. The code sample demonstrates how to set up a custom stop word list for French.
const lunr = require('lunr');
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/lunr.fr')(lunr);
// lunr.generateStopWordFilter ships with Lunr 2.x itself, so no extra module is required here.
lunr.fr.stopWordFilter = lunr.generateStopWordFilter(['et', 'le', 'la']);
const idx = lunr(function () {
this.use(lunr.fr);
this.field('title');
this.field('body');
this.add({
'title': 'Bonjour',
'body': 'Le monde est beau'
});
});
console.log(idx.search('monde'));
Multi-language Support
This feature allows you to create a search index that supports multiple languages simultaneously. The code sample demonstrates how to set up an index that supports both French and German.
const lunr = require('lunr');
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/lunr.multi')(lunr);
require('lunr-languages/lunr.fr')(lunr);
require('lunr-languages/lunr.de')(lunr);
const idx = lunr(function () {
this.use(lunr.multiLanguage('fr', 'de'));
this.field('title');
this.field('body');
this.add({
'title': 'Bonjour',
'body': 'Le monde est beau'
});
this.add({
'title': 'Hallo',
'body': 'Die Welt ist schön'
});
});
console.log(idx.search('schön'));
Other packages similar to lunr-languages
elasticlunr
Elasticlunr is a lightweight full-text search library that is similar to Lunr.js but offers more flexibility and customization options. It supports multiple languages and provides a more modular approach to building search indexes.
search-index
Search-index is a powerful and flexible search library that supports full-text search, faceted search, and more. It is designed to be highly customizable and can handle large datasets efficiently. It also supports multiple languages and offers advanced features like real-time indexing and querying.
flexsearch
Flexsearch is a high-performance full-text search library that offers fast indexing and querying capabilities. It supports multiple languages and provides a range of configuration options to optimize search performance. It is designed to be lightweight and efficient, making it suitable for use in both client-side and server-side applications.
Lunr Languages
Lunr Languages is a Lunr addon that helps you search in documents written in the following languages:
- German
- French
- Spanish
- Italian
- Dutch
- Danish
- Portuguese
- Finnish
- Romanian
- Hungarian
- Russian
- Norwegian
- Swedish
- Turkish
- Japanese
- Thai
- Arabic
- Chinese
- Vietnamese
- Sanskrit
- Kannada
- Telugu
- Hindi
- Tamil
- Korean
- Armenian
- Hebrew
- Greek
- Contribute with a new language
Lunr Languages is compatible with Lunr version 0.6, 0.7, 1.0 and 2.X.
How to use
Lunr Languages works well with script loaders (Webpack, RequireJS) and can be used in the browser and on the server.
In a web browser
The following example is for the German language (de).
Add the following JS files to the page:
<script src="lunr.js"></script>
<script src="lunr.stemmer.support.js"></script>
<script src="lunr.de.js"></script>
Then, use the language when initializing lunr:
var idx = lunr(function () {
this.use(lunr.de);
this.field('title', { boost: 10 });
this.field('body');
});
That's it. Just add the documents and you're done. Searches then use the stemmer and stop word list of the language you selected.
In a web browser, with RequireJS
Add require.js to the page:
<script src="lib/require.js"></script>
Then, use the language when initializing lunr:
require(['lib/lunr.js', '../lunr.stemmer.support.js', '../lunr.de.js'], function(lunr, stemmerSupport, de) {
stemmerSupport(lunr);
de(lunr);
var idx = lunr(function () {
this.use(lunr.de);
this.field('title', { boost: 10 })
this.field('body')
});
});
With node.js
var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.de.js')(lunr);
var idx = lunr(function () {
this.use(lunr.de);
this.field('title', { boost: 10 })
this.field('body')
});
Indexing multi-language content
If your documents are written in more than one language, you can enable multi-language indexing. This ensures every word is properly trimmed and stemmed, every stop word is removed, and no words are lost (indexing in just one language would remove words from every other one).
var lunr = require('./lib/lunr.js');
require('./lunr.stemmer.support.js')(lunr);
require('./lunr.ru.js')(lunr);
require('./lunr.multi.js')(lunr);
var idx = lunr(function () {
this.use(lunr.multiLanguage('en', 'ru'));
});
You can combine any number of supported languages this way. The corresponding lunr language scripts must be loaded (English is built in).
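To see why indexing in a single language can lose words, consider the trimming step alone. The sketch below is a simplified illustration, not the library's code: `makeTrimmer` and the character ranges are assumptions made for the example. A trimmer that only knows Latin word characters strips a Cyrillic token down to an empty string, while a trimmer built from the combined character set keeps it.

```javascript
// Hypothetical helper: build a trimmer that strips everything outside
// the given character class from both ends of a token.
function makeTrimmer(wordChars) {
  const start = new RegExp('^[^' + wordChars + ']+');
  const end = new RegExp('[^' + wordChars + ']+$');
  return (token) => token.replace(start, '').replace(end, '');
}

// Character ranges chosen for illustration only.
const LATIN = 'A-Za-z0-9_';
const CYRILLIC = '\u0400-\u04FF';

const englishOnly = makeTrimmer(LATIN);
const englishAndRussian = makeTrimmer(LATIN + CYRILLIC);

const tokens = ['Hello,', 'мир!'];

// With the Latin-only trimmer the Russian token is trimmed to nothing.
const single = tokens.map(englishOnly);

// With the combined character set both tokens survive, minus punctuation.
const multi = tokens.map(englishAndRussian);

console.log(single, multi);
```

An empty token has nothing left to index, which is how a word silently disappears when the index is configured for the wrong language.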
If you serialize the index and load it in another script, you'll have to initialize the multi-language support in that script, too, like this:
lunr.multiLanguage('en', 'ru');
var idx = lunr.Index.load(serializedIndex);
How to add a new language
Check the Contributing section
How does Lunr Languages work?
Searching inside documents is not as straightforward as using indexOf(), since there are many things to consider in order to get quality search results:
- Tokenization
  - Given a string like "Hope you like using Lunr Languages!", the tokenizer splits it into individual words, producing an array like ['Hope', 'you', 'like', 'using', 'Lunr', 'Languages!']
  - Though it seems a trivial task for Latin characters (just splitting by the space), it gets more complicated for languages like Japanese. Lunr Languages has this included for the Japanese language.
- Trimming
  - After tokenization, trimming ensures that the words contain just what is needed in them. In our example above, the trimmer would convert Languages! into Languages
  - So, the trimmer basically removes special characters that do not add value for the search purpose.
- Stemming
  - What happens if our text contains the word consignment but we want to search for consigned? It should find it, since its meaning is the same, only the form is different.
  - A stemmer extracts the root of words that can have many forms and stores it in the index. Then, any search is also stemmed and searched in the index.
  - Lunr Languages does stemming for all the included languages, so you can capture all the forms of words in your documents.
- Stop words
  - There's no point in adding or searching words like the, it, so, etc. These words are called stop words.
  - Stop words are removed so your index will only contain meaningful words.
  - Lunr Languages includes stop words for all the included languages.
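The four steps above can be sketched as a toy pipeline feeding a tiny inverted index. This is a minimal illustration under stated assumptions, not how Lunr or Lunr Languages is implemented: the stop word list, the suffix-stripping `stem` function, and the helper names `analyze`, `buildIndex`, and `search` are all invented for the example, and real stemmers are far more careful than the regular expression used here.

```javascript
// Toy analysis pipeline: tokenize -> trim -> drop stop words -> stem.
// Every rule below is a simplified assumption for illustration only.
const STOP_WORDS = new Set(['the', 'it', 'so', 'you']);

// Tokenization: split on whitespace (only suits space-delimited scripts).
function tokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

// Trimming: strip leading and trailing non-word characters.
function trim(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '');
}

// Stemming: a crude suffix stripper, not a real stemming algorithm.
function stem(token) {
  return token.replace(/(ment|ed|ing|s)$/, '');
}

function analyze(text) {
  return tokenize(text)
    .map((t) => trim(t).toLowerCase())
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t))
    .map(stem);
}

// Inverted index: stemmed term -> set of document ids containing it.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const term of analyze(doc.body)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc.id);
    }
  }
  return index;
}

// The query goes through the same pipeline as the documents.
function search(index, query) {
  const hits = new Set();
  for (const term of analyze(query)) {
    for (const id of index.get(term) || []) hits.add(id);
  }
  return [...hits];
}

const index = buildIndex([
  { id: 'a', body: 'The consignment arrived.' },
  { id: 'b', body: 'Hope you like using Lunr Languages!' },
]);

// "consigned" and "consignment" reduce to the same stem, so document 'a' matches.
const consignedHits = search(index, 'consigned');
console.log(consignedHits);
```

Running documents and queries through the same steps is what lets different forms of a word meet in the index.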
Technical details & Credits
I've created this project by compiling and wrapping stemmers together with stop words from various sources (including user contributions) so they can be directly used with all the current versions of Lunr.
I am providing code in the repository to you under an open source license. Because this is my personal repository, the license you receive to my code is from me and not my employer (Facebook).