Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition). It is written in Rust and accessible through a Python API.
Quickner is blazing fast, simple to use, and easy to configure using a TOML file.
Installation
python3 -m venv env
source env/bin/activate
pip install quickner
Usage
Using the config file
from quickner import Quickner, Config
config = Config(path="config.toml")
quick = Quickner(config=config)
quick.process()
Using Documents
from quickner import Quickner, Document
rust = Document("rust is made by Mozilla")
python = Document("Python was created by Guido van Rossum")
java = Document("Java was created by James Gosling")
documents = [rust, python, java]
quick = Quickner(documents=documents)
>>> quick
Entities: 0 | Documents: 3 | Annotations:
>>> quick.documents
[Document(id="87e03d58b1ba4d72", text=rust is made by Mozilla, label=[]), Document(id="f1da5d23ef88f3dc", text=Python was created by Guido van Rossum, label=[]), Document(id="e4324f9818e7e598", text=Java was created by James Gosling, label=[])]
>>> quick.entities
[]
Using Documents and Entities
from quickner import Quickner, Document, Entity
texts = (
"rust is made by Mozilla",
"Python was created by Guido van Rossum",
"Java was created by James Gosling at Sun Microsystems",
"Swift was created by Chris Lattner and Apple",
)
documents = [Document(text) for text in texts]
entities = (
("Rust", "PL"),
("Python", "PL"),
("Java", "PL"),
("Swift", "PL"),
("Mozilla", "ORG"),
("Apple", "ORG"),
("Sun Microsystems", "ORG"),
("Guido van Rossum", "PERSON"),
("James Gosling", "PERSON"),
("Chris Lattner", "PERSON"),
)
entities = [Entity(*(entity)) for entity in entities]
quick = Quickner(documents=documents, entities=entities)
quick.process()
>>> quick
Entities: 6 | Documents: 3 | Annotations: PERSON: 2, PL: 3, ORG: 1
>>> quick.documents
[Document(id=87e03d58b1ba4d72, text=rust is made by Mozilla, label=[(0, 4, PL), (16, 23, ORG)]), Document(id=f1da5d23ef88f3dc, text=Python was created by Guido van Rossum, label=[(0, 6, PL), (22, 38, PERSON)]), Document(id=e4324f9818e7e598, text=Java was created by James Gosling, label=[(0, 4, PL), (20, 33, PERSON)])]
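The label entries in the output above read as (start, end, label) character offsets, with the end offset exclusive, so slicing the text with them recovers each entity. A minimal pure-Python sketch of that reading follows; it does not import Quickner, and spans_to_strings is a hypothetical helper name used only for illustration.

```python
def spans_to_strings(text, spans):
    """Return (surface, label) pairs for (start, end, label) spans.

    Assumes character offsets with an exclusive end, as the output above suggests.
    """
    return [(text[start:end], label) for start, end, label in spans]


text = "Python was created by Guido van Rossum"
spans = [(0, 6, "PL"), (22, 38, "PERSON")]

pairs = spans_to_strings(text, spans)
print(pairs)
```

Slicing like this is a quick way to spot-check annotations by eye before training on them.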
Find documents by label or entity
When you have annotated your documents, you can use the find_documents_by_label and find_documents_by_entity methods to find documents by label or entity.
Both methods return a list of documents and match case-insensitively.
Example:
>>> quick.find_documents_by_label("PERSON")
[Document(id=f1da5d23ef88f3dc, text=Python was created by Guido van Rossum, label=[(0, 6, PL), (22, 38, PERSON)]), Document(id=e4324f9818e7e598, text=Java was created by James Gosling, label=[(0, 4, PL), (20, 33, PERSON)])]
>>> quick.find_documents_by_entity("Guido van Rossum")
[Document(id=f1da5d23ef88f3dc, text=Python was created by Guido van Rossum, label=[(0, 6, PL), (22, 38, PERSON)])]
>>> quick.find_documents_by_entity("rust")
[Document(id=87e03d58b1ba4d72, text=rust is made by Mozilla, label=[(0, 4, PL), (16, 23, ORG)])]
>>> quick.find_documents_by_entity("Chris Lattner")
[Document(id=3b0b3b5b0b5b0b5b, text=Swift was created by Chris Lattner and Apple, label=[(0, 5, PL), (21, 34, PERSON), (39, 44, ORG)])]
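To make the case-insensitive lookup concrete, here is a rough sketch of the same idea over plain (text, spans) tuples. It is not Quickner's implementation; find_by_label and find_by_entity are hypothetical stand-ins that only mirror the behaviour described above.

```python
# Plain-tuple stand-ins for annotated documents: (text, [(start, end, label), ...])
docs = [
    ("rust is made by Mozilla", [(0, 4, "PL"), (16, 23, "ORG")]),
    ("Python was created by Guido van Rossum", [(0, 6, "PL"), (22, 38, "PERSON")]),
]


def find_by_label(docs, label):
    """Return texts that carry at least one span with the given label (case-insensitive)."""
    wanted = label.lower()
    return [text for text, spans in docs
            if any(span_label.lower() == wanted for _, _, span_label in spans)]


def find_by_entity(docs, name):
    """Return texts where an annotated span's surface string equals name (case-insensitive)."""
    wanted = name.lower()
    return [text for text, spans in docs
            if any(text[start:end].lower() == wanted for start, end, _ in spans)]


print(find_by_label(docs, "person"))
print(find_by_entity(docs, "RUST"))
```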
Get a spaCy Compatible Generator Object
You can use the spacy method to get a spaCy-compatible generator object.
The generator can be used to feed a spaCy model with the annotated data, but you still need to convert the data into DocBin format yourself.
Example:
>>> quick.spacy()
<builtins.SpacyGenerator object at 0x102311440>
>>> chunks = quick.spacy(chunks=2)
>>> for chunk in chunks:
...     print(chunk)
...
[('rust is made by Mozilla', {'entitiy': [(0, 4, 'PL'), (16, 23, 'ORG')]}), ('Python was created by Guido van Rossum', {'entitiy': [(0, 6, 'PL'), (22, 38, 'PERSON')]})]
[('Java was created by James Gosling at Sun Microsystems', {'entitiy': [(0, 4, 'PL'), (20, 33, 'PERSON'), (37, 53, 'ORG')]}), ('Swift was created by Chris Lattner and Apple', {'entitiy': [(0, 5, 'PL'), (21, 34, 'PERSON'), (39, 44, 'ORG')]})]
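Before converting tuples like these to DocBin, it can help to check that every span is in bounds and that no two spans overlap, since spaCy rejects overlapping entity spans when they are assigned to doc.ents. The sketch below is a dependency-free check under that assumption; check_spans is a hypothetical helper, not part of Quickner or spaCy.

```python
def check_spans(text, spans):
    """Return a list of problem descriptions for (start, end, label) spans; empty if none."""
    problems = []
    prev_end = 0
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            problems.append(f"out of bounds: {(start, end, label)}")
        elif start < prev_end:
            problems.append(f"overlaps previous span: {(start, end, label)}")
        # Track the furthest end seen so far to detect later overlaps.
        prev_end = max(prev_end, end)
    return problems


ok = check_spans("rust is made by Mozilla", [(0, 4, "PL"), (16, 23, "ORG")])
overlapping = check_spans("rust", [(0, 4, "PL"), (2, 4, "ORG")])
too_long = check_spans("rust", [(0, 9, "PL")])
print(ok, overlapping, too_long)
```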
Single document annotation
You can also annotate a single document with a list of entities.
This is useful when you want to annotate a document with entities that are not in the Quickner object's entity list.
Example:
from quickner import Document, Entity
# Either call creates a document from a string
rust = Document.from_string("rust is made by Mozilla")
rust = Document("rust is made by Mozilla")
entities = [Entity("Rust", "PL"), Entity("Mozilla", "ORG")]
>>> rust.annotate(entities, case_sensitive=True)
>>> rust
Document(id="87e03d58b1ba4d72", text=rust is made by Mozilla, label=[(16, 23, ORG)])
>>> rust.annotate(entities, case_sensitive=False)
>>> rust
Document(id="87e03d58b1ba4d72", text=rust is made by Mozilla, label=[(16, 23, ORG), (0, 4, PL)])
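The effect of case_sensitive in the example above can be illustrated with a small pure-Python sketch. This is an assumption-laden stand-in, not Quickner's matcher: find_spans is a hypothetical helper that does plain substring matching with the standard re module, and the source does not say whether Quickner also enforces word boundaries.

```python
import re


def find_spans(text, entities, case_sensitive=False):
    """Return sorted (start, end, label) spans for every occurrence of each entity name."""
    flags = 0 if case_sensitive else re.IGNORECASE
    spans = []
    for name, label in entities:
        # re.escape keeps names such as "C++" from being read as regex syntax.
        for match in re.finditer(re.escape(name), text, flags):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "rust is made by Mozilla"
entities = [("Rust", "PL"), ("Mozilla", "ORG")]

strict = find_spans(text, entities, case_sensitive=True)
relaxed = find_spans(text, entities, case_sensitive=False)
print(strict)
print(relaxed)
```

With case-sensitive matching, "Rust" does not match the lowercase "rust" in the text, which is why only the ORG span appears in the first result.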
Load from file
Initialize the Quickner object from a file containing existing annotations.
Quickner.from_jsonl and Quickner.from_spacy are class methods that parse the annotations and entities from a JSONL or spaCy file and return a Quickner object.
from quickner import Quickner
quick = Quickner.from_jsonl("annotations.jsonl")
quick = Quickner.from_spacy("annotations.json")
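For readers unfamiliar with JSONL, the format is one JSON object per line. The sketch below reads such data with the standard json module. The record schema (a "text" key and a "label" key holding [start, end, label] triples) is an assumption modelled on the Document fields shown earlier, not a documented Quickner format, and read_jsonl is a hypothetical helper.

```python
import io
import json

# In-memory stand-in for an annotations file; the keys are assumed, not documented.
sample = io.StringIO(
    '{"text": "rust is made by Mozilla", "label": [[0, 4, "PL"], [16, 23, "ORG"]]}\n'
    '{"text": "Java was created by James Gosling", "label": [[0, 4, "PL"], [20, 33, "PERSON"]]}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


records = read_jsonl(sample)
for record in records:
    print(record["text"], record["label"])
```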
Configuration
The configuration file is a TOML file with the following structure:
[general]
[logging]
level = "debug"
[texts]
[texts.input]
filter = false
path = "texts.csv"
[texts.filters]
accept_special_characters = ".,-"
alphanumeric = false
case_sensitive = false
max_length = 1024
min_length = 0
numbers = false
punctuation = false
special_characters = false
[annotations]
format = "spacy"
[annotations.output]
path = "annotations.jsonl"
[entities]
[entities.input]
filter = true
path = "entities.csv"
save = true
[entities.filters]
accept_special_characters = ".-"
alphanumeric = false
case_sensitive = false
max_length = 20
min_length = 0
numbers = false
punctuation = false
special_characters = true
[entities.excludes]
Features Roadmap and TODO
License
MOZILLA PUBLIC LICENSE Version 2.0
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
Authors