DKPro Similarity is an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. The framework is designed to complement DKPro Core, a collection of software components for natural language processing based on the Apache UIMA framework.
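To illustrate the kind of "text similarity measure behind a standardized interface" the description refers to, here is a minimal, hypothetical sketch in plain Java. The interface and class names below are illustrative only and are not DKPro Similarity's actual API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a similarity measure behind a common interface;
// names are illustrative, not DKPro Similarity's real classes.
interface SimilarityMeasure {
    double getSimilarity(String[] tokens1, String[] tokens2);
}

// Jaccard coefficient over word sets: |intersection| / |union|.
class WordJaccardMeasure implements SimilarityMeasure {
    @Override
    public double getSimilarity(String[] tokens1, String[] tokens2) {
        Set<String> a = new HashSet<>(Arrays.asList(tokens1));
        Set<String> b = new HashSet<>(Arrays.asList(tokens2));
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        a.retainAll(b); // a now holds the intersection
        return union.isEmpty() ? 0.0 : (double) a.size() / union.size();
    }
}

public class SimilarityDemo {
    public static void main(String[] args) {
        SimilarityMeasure measure = new WordJaccardMeasure();
        double score = measure.getSimilarity(
                "the quick brown fox".split(" "),
                "the quick red fox".split(" "));
        System.out.printf("similarity = %.2f%n", score); // prints 0.60
    }
}
```

Because every measure shares the same interface, callers can swap in n-gram, vector-space, or knowledge-based measures without changing the surrounding code, which is the design idea behind the framework's standardized interfaces.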
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
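As a hedged illustration of MALLET's topic-modeling side, the sketch below trains a tiny LDA model over three in-memory strings. It follows MALLET's documented pipe/instance API as I understand it, so exact class and method names should be checked against the release you actually use.

```java
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class MalletTopicSketch {
    public static void main(String[] args) throws Exception {
        // Pipe raw strings into the feature sequences MALLET's topic models expect.
        Pipe pipe = new SerialPipes(new Pipe[] {
                new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")),
                new TokenSequence2FeatureSequence() });

        InstanceList instances = new InstanceList(pipe);
        String[] docs = {
                "cats and dogs are common household pets",
                "the cat chased the dog around the house",
                "stock markets rose after the trade report" };
        for (int i = 0; i < docs.length; i++) {
            instances.addThruPipe(new Instance(docs[i], null, "doc" + i, null));
        }

        // Train a small LDA model: 2 topics, alphaSum = 1.0, beta = 0.01.
        ParallelTopicModel model = new ParallelTopicModel(2, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(1);
        model.setNumIterations(50);
        model.estimate();

        // Show the top words assigned to each topic.
        Object[][] topWords = model.getTopWords(5);
        for (int t = 0; t < topWords.length; t++) {
            System.out.println("topic " + t + ": " + java.util.Arrays.toString(topWords[t]));
        }
    }
}
```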
KefirBB is a Java library for text processing. It was initially developed for BBCode-to-HTML (BB2HTML) translation, but its flexible configuration allows it to be used in other cases, for example parsing Markdown or Textile and filtering HTML.
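For orientation, the typical call sequence looks roughly like the sketch below, based on the project's documented usage; check the class names against the version you depend on. A factory builds a TextProcessor, which then converts BBCode-style markup to HTML.

```java
import org.kefirsf.bb.BBProcessorFactory;
import org.kefirsf.bb.TextProcessor;

public class KefirBbSketch {
    public static void main(String[] args) {
        // Build a processor from the default BBCode-to-HTML configuration.
        TextProcessor processor = BBProcessorFactory.getInstance().create();

        // Translate BBCode markup to HTML.
        String html = processor.process("[b]Hello[/b], [i]world[/i]!");
        System.out.println(html); // e.g. <b>Hello</b>, <i>world</i>!
    }
}
```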
Provides modules that add basic language support for Chinese using the Solr/Lucene smartcn analyzer. This includes (1) a bundle providing the Solr Analyzer; (2) an NLP processing engine that detects sentences and tokenizes Chinese text; and (3) a LabelTokenizer needed to match tokens of the analyzed text with the labels of entities in the matched vocabularies.
Provides modules that bring language support for Japanese using the Solr/Lucene kuromoji analyzer. This includes (1) a bundle providing the Solr Analyzer; (2) an NLP processing engine that tokenizes, detects sentences, POS-tags, extracts named entities, and lemmatizes Japanese text; and (3) a LabelTokenizer needed to match tokens of the analyzed text with the labels of entities in the matched vocabularies.
GroupDocs.Editor for Java is a powerful HTML-based document editing API. The API can be used with any external HTML editor, open source or paid. It loads a document, converts it to HTML, provides the HTML to the external editing UI, and then saves the edited HTML back to the original document format. It can also be used to generate PDF files, Microsoft Word documents (DOC, DOCX), Excel spreadsheets (XLS, XLSX), PowerPoint presentations (PPT, PPTX), and TXT documents. Manipulation using HTML: load the document, edit its content using HTML, edit styles, perform editor operations, and convert back to a supported file format. An HTML editor is a program for editing HTML, the markup of a web page. Although the HTML markup of a web page can be written with any text editor, specialized HTML editors offer convenience and added functionality. For example, many HTML editors handle not only HTML but also related technologies such as CSS, XML, and JavaScript or ECMAScript; in some cases they also manage communication with remote web servers via FTP and WebDAV, and with version control systems such as Subversion or Git. Many word processing, graphic design, and page layout programs that are not dedicated to web design, such as Microsoft Word or QuarkXPress, can also function as HTML editors.
Several components for processing dramatic texts (theatre plays) with Apache UIMA.
Common framework for tools that process large text files.
GroupDocs.Viewer is an online document viewer that lets you read documents in your browser, regardless of whether you have the software they were created in. You can view many types of word processing documents (DOC, DOCX, TXT, RTF, ODT), presentations (PPT, PPTX), spreadsheets (XLS, XLSX), portable files (PDF), and image files (JPG, BMP, GIF, TIFF). For each file you get a high-fidelity rendering, showing the document just as it would appear if you opened it in the software it was created in. Layout and formatting are retained and you see an exact copy of the original. GroupDocs.Viewer lets you really read the document: you can search text documents, copy text, and even embed the document, GroupDocs.Viewer and all, in a web page. You can print or download the file from GroupDocs.Viewer if you need to work with it offline.
Library for text processing.
Text processor: a library for processing texts written in a block-like language.
Text processing utilities.
The BioMedical Information Collection and Understanding System (BioMedICUS) is a system for large-scale text analysis and processing of biomedical and clinical reports.
Software changelogs (Plain text processing)
Utility functions for text processing.
A collection of Scala and Java classes for basic natural language processing (NLP) of the Sanskrit language, contributed by the open source SanskritNLP project and friends. Some notable facilities: * Transliterate text from one script or encoding scheme to another. * Work with Babylon dictionaries. * Use bots to write to wiki projects (Wiktionary, Wikisource, etc.). * Basic metre identification. * Some grammar simulation. Contributions and suggestions are invited at https://github.com/sanskrit-coders/sanskritnlpjava (sister projects there may also be of interest).
Jamal macro library to process text files
Information extraction is the process of identifying specified classes of entities, relations, and events in natural language text – creating structured data from unstructured input. JET, the Java Extraction Toolkit, developed at New York University over the past fifteen years, provides a rich set of tools for research and education in information extraction from English text. These include standard language processing tools such as a tokenizer, sentence segmenter, part-of-speech tagger, name tagger, regular-expression pattern matcher, and dependency parser. Also provided are relation and event extractors based on the specifications of the U.S. Government's ACE [Automatic Content Extraction] program. The program is provided under an Apache 2.0 license.
Rich text processing and markup generation.
ScanditLabelCapture coordinates the process of simultaneously capturing data contained in multiple barcodes and text.
XBIS is an encoding format for XML documents that is fully convertible to and from text, with information set equivalence between the original document text and regenerated document text. It's intended for use in transmitting XML documents between application components, and is therefore designed for processing speed. The current Java language implementation offers several times the performance of SAX2 parsers working from text documents across a wide range of document types and sizes, and across JVMs tested, while also providing a substantial reduction in document size for most types of XML documents.
An in-memory EntityLinking engine that uses Lucene's FST (Finite State Transducer) technology. This engine is based on code provided by the Solr Text Tagger (https://github.com/OpenSextant/SolrTextTagger/) but provides deep integration with Apache Stanbol (DataFileProvider, NLP processing module, and the existing EntityLinking functionality).
GroupDocs.Watermark for Java is a powerful document watermarking API for adding image and text watermarks. Furthermore, the API can search for and remove watermarks that were previously added to documents by other third-party software. The watermarks added by this API are hard to remove with third-party tools. It is straightforward and self-descriptive to integrate into custom applications. The most notable features are: - Add text and image watermarks to documents and images - Search for possible watermarks in documents and remove them - Support various document formats: PDF; MS Office: Word, Excel, PowerPoint, Visio - Support various image formats: PNG, BMP, JPEG, JPEG 2000, GIF, TIFF, WebP (including multiframe GIF and TIFF) - Process documents and images attached to stored email messages (MSG, OFT, EML, and EMLX formats are supported) - Add watermarks to images inside documents of all supported formats - Two ways of adding/removing watermarks are supported: using the generalized approach or working with the specifics of a supported format
RuSH is an efficient, reliable, and easily adaptable rule-based sentence segmentation solution. It is specifically designed to handle the telegraphic written text found in clinical notes. It leverages a nested hash table to execute simultaneous rule processing, which reduces the impact of rule-base growth on execution time and eliminates the effect of rule order on accuracy. If you wish to cite RuSH in a publication, please use: Jianlin Shi; Danielle Mowery; Kristina M. Doing-Harris; John F. Hurdle. RuSH: a Rule-based Segmentation Tool Using Hashing for Extremely Accurate Sentence Segmentation of Clinical Text. AMIA Annu Symp Proc. 2016: 1587. The full text can be found at: https://knowledge.amia.org/amia-63300-1.3360278/t005-1.3362920/f005-1.3362921/2495498-1.3363244/2495498-1.3363247?timeStamp=1479743941616 This version allows defining section scopes for sentence segmentation.
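The nested-hash-table idea can be pictured with the toy sketch below. It is not the RuSH API, just a minimal illustration of storing boundary rules character by character in nested hash maps so that a single left-to-right scan checks all rules at once, keeping scan cost largely independent of how many rules exist and making rule order irrelevant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch (not the RuSH API): boundary rules stored character by character
// in nested hash maps, so one scan of the text checks every rule at once.
public class NestedHashRules {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        String rule; // non-null if some rule ends at this node
    }

    private final Node root = new Node();

    public void addRule(String pattern) {
        Node node = root;
        for (char c : pattern.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.rule = pattern;
    }

    /** Offsets just after each rule match, i.e. candidate sentence boundaries. */
    public List<Integer> findBoundaries(String text) {
        List<Integer> boundaries = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.next.get(text.charAt(i));
                if (node == null) break;
                if (node.rule != null) boundaries.add(i + 1);
            }
        }
        return boundaries;
    }

    public static void main(String[] args) {
        NestedHashRules rules = new NestedHashRules();
        rules.addRule(". ");   // period followed by a space
        rules.addRule(".\n");  // period at the end of a line
        System.out.println(rules.findBoundaries("Pt stable. Afebrile.\nPlan f/u."));
        // prints [11, 21]
    }
}
```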
In some data processing tasks we need huge maps or sets that are bigger than the available JVM heap space or too slow to load into standard Java or Scala collections. We use the TSV format (a text file with tab-separated columns) to persist such maps or sets: some columns serve as the key and the remaining columns as the value. The idea of this library is simple. We prepare these maps once (sorted by key), store them to a file, and then use the file as a memory-mapped file. Searching for a key in a sorted file has log(n) complexity. If multiple processes use the same memory-mapped file, it exists in memory only once (on Linux), and the file can be loaded lazily by the OS.
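A minimal sketch of the underlying technique (not this library's actual API) is shown below: a TSV file, pre-sorted by its first column, is memory-mapped and a key is located by binary search over line starts, so nothing is loaded onto the JVM heap and the OS pages the file in lazily.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch (not this library's API): key lookup by binary search over a
// memory-mapped TSV file that has been pre-sorted by its first (key) column,
// e.g. with a byte-order sort that matches String.compareTo for ASCII keys.
public class MappedTsvLookup {

    private final MappedByteBuffer buf;

    public MappedTsvLookup(Path sortedTsv) throws IOException {
        try (FileChannel ch = FileChannel.open(sortedTsv, StandardOpenOption.READ)) {
            buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Returns the value columns for the given key, or null if the key is absent. */
    public String get(String key) {
        int lo = 0, hi = buf.limit();
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            int lineStart = startOfLine(mid);
            int lineEnd = endOfLine(lineStart);
            String line = readBytes(lineStart, lineEnd);
            int tab = line.indexOf('\t');
            String lineKey = tab < 0 ? line : line.substring(0, tab);
            int cmp = key.compareTo(lineKey);
            if (cmp == 0) return tab < 0 ? "" : line.substring(tab + 1);
            if (cmp < 0) hi = lineStart;   // keep searching in earlier lines
            else lo = lineEnd + 1;         // keep searching after this line
        }
        return null;
    }

    // Scan backwards to the newline that precedes the given position.
    private int startOfLine(int pos) {
        while (pos > 0 && buf.get(pos - 1) != '\n') pos--;
        return pos;
    }

    // Find the next newline (or the end of the file).
    private int endOfLine(int start) {
        int end = start;
        while (end < buf.limit() && buf.get(end) != '\n') end++;
        return end;
    }

    // Read a byte range from the mapping as a UTF-8 string.
    private String readBytes(int from, int to) {
        byte[] bytes = new byte[to - from];
        for (int i = 0; i < bytes.length; i++) bytes[i] = buf.get(from + i);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Usage would be along the lines of `new MappedTsvLookup(Path.of("index.tsv")).get("someKey")`; because the mapping is read-only, several processes on Linux can share the same physical pages.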
The Noble Tools Suite is a set of Natural Language Processing (NLP) tools and Application Programming Interfaces (APIs) written in Java for interfacing with ontologies, auto-coding text, and extracting information from free text. The Noble Tools suite also includes a generic ontology API for interfacing with Web Ontology Language (OWL) files, OBO, and BioPortal ontologies, and a number of support utilities and methods useful for NLP (e.g. string normalization, n-grams, and stemming).
Utilities for processing AIS messages; e.g. tracking, free-text filter expressions, archiving, and more.
Utilities including math, CharSequence-based text processing, sequences, etc.
Java implementations of text processing algorithms, such as stemmers.