daq-proc
Simple document and query processor for nowsearch.xyz that makes search running in the browser and Node.js a little better. It removes stopwords (smaller index and fewer irrelevant hits), extracts keywords to filter on, and prepares n-grams for auto-complete functionality.
Demo
- document processor. It showcases the document processing side. Just add some words and see what comes out.
- query processor (lacking a leven-match showcase, just hit-highlighter for now).
This library is not creating anything new. It just packages 7 libraries that go well together into one browser distribution file, and shows how they may be useful through tests and the interactive demo.
Libraries that daq-proc depends on
- cheerio - Here specifically used to extract text from all or parts of the HTML.
- eklem-headline-parser - Determines the most relevant keywords in a headline by considering article context.
- hit-highlighter - Highlights hits from a query in a result item.
- leven-match - Calculates Levenshtein matches between words in two arrays within a given distance. Good for fuzzy matching.
- ngraminator - Generates n-grams.
- stopword - Removes stopwords from an array of words. Use it to keep your index small by dropping words that carry no information, and/or to remove stopwords from the query so the search engine works less hard to find relevant results.
- words'n'numbers - Extracts words, and optionally numbers, from a string of text into arrays. These arrays can be fed to stopword, eklem-headline-parser, ngraminator and hit-highlighter.
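To make the hand-off between the word extraction and stopword steps concrete, here is a minimal plain-JavaScript sketch. It is not the words'n'numbers or stopword API: `extractWords`, `dropStopwords` and the tiny `SAMPLE_STOPWORDS` list are hypothetical stand-ins that only illustrate the idea of turning text into a word array and filtering it.

```javascript
// Illustrative stand-ins only: NOT the words'n'numbers or stopword APIs.
// SAMPLE_STOPWORDS is a tiny hypothetical list, far smaller than a real one.
const SAMPLE_STOPWORDS = new Set(['and', 'for', 'the', 'is', 'a', 'to'])

// Split text into lowercase tokens made of letters and digits only.
function extractWords (text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || []
}

// Drop every token that appears in the stopword set.
function dropStopwords (words, stopwords = SAMPLE_STOPWORDS) {
  return words.filter(word => !stopwords.has(word))
}

const words = extractWords('Document and query processing for the browser!')
const stopped = dropStopwords(words)
console.log(words)
console.log(stopped)
```

The filtered array is what would then be passed on to the n-gram and keyword steps.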
Browser
Example - document processing side
<script src="daq-proc.js"></script>
<script>
const {cheerio, ehp, highlight, lvm, ngraminator, sw, wnn} = dqp
const headlineString = 'Document and query processing for the browser!'
const bodyString = 'Yay! The day is here =) We now have document and query processing for the browser. It is mostly packaging 4 modules together in a browser distribution file. The modules are words-n-numbers, stopword, ngraminator and eklem-headline-parser'
// Extract lowercase words and numbers from the text into arrays
let headlineArray = wnn.extract(headlineString, {regex: wnn.wordsAndNumbers, toLowercase: true})
let bodyArray = wnn.extract(bodyString, {regex: wnn.wordsAndNumbers, toLowercase: true})
console.log('Word arrays: ')
console.dir(headlineArray)
console.dir(bodyArray)
// Remove stopwords from the word arrays
let headlineStopped = sw.removeStopwords(headlineArray)
let bodyStopped = sw.removeStopwords(bodyArray)
console.log('Stopword removed arrays: ')
console.dir(headlineStopped)
console.dir(bodyStopped)
// Generate n-grams of 2, 3 and 4 words
let headlineNgrams = ngraminator(headlineStopped, [2,3,4])
let bodyNgrams = ngraminator(bodyStopped, [2,3,4])
console.log('Ngram arrays: ')
console.dir(headlineNgrams)
console.dir(bodyNgrams)
// Find up to 5 keywords from the headline, using the body as context
let keywords = ehp.findKeywords(headlineStopped, bodyStopped, 5)
console.log('Keyword array: ')
console.dir(keywords)
</script>
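For readers unsure what the n-gram step produces, the following plain-JavaScript sketch slides a window of each requested size over a word array. `wordNgrams` is a hypothetical helper written for illustration. It is not ngraminator's implementation, and ngraminator's actual output format may differ.

```javascript
// Hypothetical helper, not ngraminator itself: slide a window of each
// requested size across the word array and collect every window.
function wordNgrams (words, sizes) {
  const ngrams = []
  for (const size of sizes) {
    for (let start = 0; start + size <= words.length; start++) {
      ngrams.push(words.slice(start, start + size))
    }
  }
  return ngrams
}

const sampleWords = ['document', 'query', 'processing', 'browser']
const sampleNgrams = wordNgrams(sampleWords, [2, 3, 4])
console.log(sampleNgrams)
```

With four input words this yields three 2-grams, two 3-grams and one 4-gram, which is the kind of material an auto-complete index can be built from.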
Example - query processing side
<script src="daq-proc.js"></script>
<script>
const {cheerio, ehp, highlight, lvm, ngraminator, sw, wnn} = dqp
// Highlight the query words found in a search result item
const query = ['interesting', 'words']
const searchResult = ['some', 'interesting', 'words', 'to', 'remember']
console.log(highlight(query, searchResult))
// Fuzzy-match (misspelled) query words against index words within a Levenshtein distance of 2
const index = ['return', 'all', 'word', 'matches', 'between', 'two', 'arrays', 'within', 'given', 'levenshtein', 'distance', 'intended', 'use', 'is', 'to', 'words', 'in', 'a', 'query', 'that', 'has', 'an', 'index', 'good', 'for', 'autocomplete', 'type', 'functionality', 'and', 'some', 'cases', 'also', 'searching']
const fuzzyQuery = ['qvery', 'words', 'levensthein']
console.log(lvm.levenMatch(fuzzyQuery, index, {distance: 2}))
</script>
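The fuzzy matching idea behind the query example can be sketched in plain JavaScript. This is a minimal illustration under stated assumptions, not leven-match's implementation: `levenshtein` is a textbook dynamic-programming edit distance, and `fuzzyMatches` is a hypothetical helper that keeps, for each query word, the index words within the given distance.

```javascript
// Textbook dynamic-programming Levenshtein distance using a rolling row.
function levenshtein (a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      // Cheapest of deletion, insertion and substitution
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

// Hypothetical stand-in for fuzzy matching: for each query word, list the
// index words whose edit distance is at most maxDistance.
function fuzzyMatches (queryWords, indexWords, maxDistance) {
  return queryWords.map(q => indexWords.filter(w => levenshtein(q, w) <= maxDistance))
}

const sampleIndex = ['query', 'words', 'word', 'levenshtein', 'index']
const matches = fuzzyMatches(['qvery', 'words', 'levensthein'], sampleIndex, 2)
console.log(matches)
```

A distance of 2 is enough to catch a single-letter typo such as 'qvery' as well as two swapped letters such as 'levensthein', while still rejecting unrelated words.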
Node.js
It's fully possible to use on Node.js too. The tests run on both Node.js and the browser. The library only wraps 7 libraries for ease of use in the browser, but it could also come in handy for e.g. simple crawler scenarios.
Something missing?
Create an issue so we can discuss =).