![Oracle Drags Its Feet in the JavaScript Trademark Dispute](https://cdn.sanity.io/images/cgdhsj6q/production/919c3b22c24f93884c548d60cbb338e819ff2435-1024x1024.webp?w=400&fit=max&auto=format)
Security News
Oracle Drags Its Feet in the JavaScript Trademark Dispute
Oracle seeks to dismiss fraud claims in the JavaScript trademark dispute, delaying the case and avoiding questions about its right to the name.
org.opensextant:solr-text-tagger
Advanced tools
This project implements a "naive" text tagger based on Lucene / Solr, using Lucene FST (Finite State Transducer) technology under the hood for remarkable low-memory properties. It is "naive" because it does simple text word based substring tagging without consideration of any natural language context. It operates on the results of how you configure text analysis in Lucene and so it's quite flexible to match things like phonetics for sounds-like tagging if you wanted to. For more information, see the presentation video/slides referenced below.
The tagger can be used for finding entities/concepts in large text, or for doing likewise in queries to enhance query-understanding.
For a list of changes with version of this tagger, to include Solr & Java version compatibility, see CHANGES.md
Pertaining to Lucene's Finite State Transducers:
See the QUICK_START.md file for a set of instructions to get you going ASAP.
The build requires Java (v8 or v9) and Maven.
To compile and run tests, use:
%> mvn test
To compile, test, and build the jar (placed in target/), use
%> mvn package
A Solr schema.xml needs 2 things
<uniqueKey>
). Setting docValues=true on this field is recommended.If you want to support typical keyword search on the names, not just tagging, then index the names in an additional field with a typical analysis configuration to your preference.
For tagging, the name field's index analyzer needs to end in either shingling for "partial" (i.e. sub name phrase) matching of a name, or more likely using ConcatenateFilter for complete name matching. ConcatenateFilter acts similar to shingling but it concatenates all tokens into one final token with a space separator. The query time analysis should not have Shingling or ConcatenateFilter.
Prior to shingling or the ConcatenateFilter, preceding text analysis should result in consecutive positions (i.e. the position increment of each term must always be 1). As-such, Synonyms and some configurations of WordDelimiterFilter are not supported. On the other hand, if the input text has a position increment greater than one (e.g. stop word) then it is handled properly as if an unknown word was there. Support for synonyms or any other filters producing posInc=0 is a feature that has largely been overcome in the 1.1 version but it has yet to be ported to 2.x; see Issue #20, RE the PhraseBuilder
To make the tagger work as fast as possible, configure the name field with postingsFormat="Memory";. In doing so, all the terms/postings are placed into an efficient FST data structure.
Here is a sample field type config that should work quite well:
<fieldType name="tag" class="solr.TextField" positionIncrementGap="100" postingsFormat="Memory"
omitTermFreqAndPositions="true" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="org.opensextant.solrtexttagger.ConcatenateFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
A Solr solrconfig.xml needs a special request handler, configured like this.
<requestHandler name="/tag" class="org.opensextant.solrtexttagger.TaggerRequestHandler">
<lst name="defaults">
<str name="field">name_tag</str>
<str name="fq">PUT SOME SOLR QUERY HERE; OPTIONAL</str><!-- filter out -->
</lst>
</requestHandler>
field
: The field that represents the corpus to match on, as described above.fq
: (optional) A query that matches a subset of documents for name matching.Also, to enable custom so-called postings formats, ensure that your solrconfig.xml has a codecFactory defined like this:
<codecFactory name="CodecFactory" class="solr.SchemaCodecFactory" />
For tagging, you HTTP POST data to Solr similar to how the ExtractingRequestHandler (Tika) is invoked. A request invoked via the "curl" program could look like this:
curl -XPOST \
'http://localhost:8983/solr/collection1/tag?overlaps=NO_SUB&tagsLimit=5000&fl=*' \
-H 'Content-Type:text/plain' -d @/mypath/myfile.txt
overlaps
: choose the algorithm to determine which overlapping tags should be
retained, versus being pruned away. Options are:ALL
: Emit all tags.NO_SUB
: Don't emit a tag that is completely within another tag (i.e. no subtag).LONGEST_DOMINANT_RIGHT
: Given a cluster of overlapping tags, emit the longest
one (by character length). If there is a tie, pick the right-most. Remove
any tags overlapping with this tag then repeat the algorithm to potentially
find other tags that can be emitted in the cluster.matchText
: A boolean indicating whether to return the matched text in the tag
response. This will trigger the tagger to fully buffer the input before tagging.tagsLimit
: The maximum number of tags to return in the response. Tagging
effectively stops after this point. By default this is 1000.rows
: Solr's standard param to say the maximum number of documents to return,
but defaulting to 10000 for a tag request.skipAltTokens
: A boolean flag used to suppress errors that can occur if, for
example, you enable synonym expansion at query time in the analyzer, which you
normally shouldn't do. Let this default to false unless you know that such
tokens can't be avoided.ignoreStopwords
: A boolean flag that causes stopwords (or any condition causing positions to
skip like >255 char words) to be ignored as if it wasn't there. Otherwise, the behavior is to treat
them as breaks in tagging on the presumption your indexed text-analysis configuration doesn't have
a StopWordFilter. By default the indexed analysis chain is checked for the presence of a
StopWordFilter and if found then ignoreStopWords is true if unspecified. You probably shouldn't
have a StopWordFilter configured and probably won't need to set this param either.xmlOffsetAdjust
: A boolean indicating that the input is XML and furthermore that the offsets of
returned tags should be adjusted as necessary to allow for the client to insert an open and closing
element at the positions. If it isn't possible to do so then the tag will be omitted. You are
expected to configure HTMLStripCharFilter in the schema when using this option.
This will trigger the tagger to fully buffer the input before tagging.htmlOffsetAdjust
: Similar to xmlOffsetAdjust except for HTML content that may have various issues
that would never work with an XML parser. There needn't be a top level element, and some tags
are known to self-close (e.g. BR). The tagger uses the Jericho HTML Parser for this feature
(dual LGPL & EPL licensed).nonTaggableTags
: (only with htmlOffsetAdjust) Omits tags that would enclose one of these HTML
elements. Comma delimited, lower-case. For example 'a' (anchor) would be a likely choice so that
links the application inserts don't overlap other links.fl
: Solr's standard param for listing the fields to return.echoParams
, wt
, indent
, etc.The output is broken down into two parts, first an array of tags, and then Solr documents referenced by those tags. Each tag has the starting character offset, an ending character (+1) offset, and the Solr unique key field value. The Solr documents part of the response is Solr's standard search results format.
FAQs
A text tagger based on Lucene / Solr
We found that org.opensextant:solr-text-tagger demonstrated a not healthy version release cadence and project activity because the last version was released a year ago. It has 0 open source maintainers collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Oracle seeks to dismiss fraud claims in the JavaScript trademark dispute, delaying the case and avoiding questions about its right to the name.
Security News
The Linux Foundation is warning open source developers that compliance with global sanctions is mandatory, highlighting legal risks and restrictions on contributions.
Security News
Maven Central now validates Sigstore signatures, making it easier for developers to verify the provenance of Java packages.