Talisman Document Model
A Python implementation of the Talisman Document Model (TDM).
It provides an easy way to create, modify, serialize and deserialize documents.
Main concepts
Document is a special container for partially structured data representation.
It consists of content nodes, structure and extracted facts.
Basic usage
Document creation
To create a document, use one of the document factories.
For most cases tdm.DefaultDocumentFactory is the best choice.
If you need to create a document with non-default domain validation, use tdm.TalismanDocumentFactory (see the Domain section for details).
The following code creates an empty document:
from tdm import DefaultDocumentFactory
doc = DefaultDocumentFactory.create_document()
doc = DefaultDocumentFactory.create_document(id_='my document')
Document content
Talisman document content can be represented as a directed, partially ordered acyclic graph.
Nodes
There are two principal types of nodes:
- data nodes – store some unstructured data pieces;
- service nodes – represent composite structure (such as lists, tables etc.).
All nodes inherit from tdm.abstract.datamodel.AbstractNode and are Python frozen dataclasses.
Each node contains at least some metadata and markup.
Both metadata and markup are designed to enrich a node with additional information about its form or content.
The main difference is that the set of metadata for each node is fixed in advance and is used mostly for node display settings,
while the possible markup is not fixed and can contain almost any information.
The node metadata type is defined by the node class implementation, while for markup the multipurpose tdm.abstract.datamodel.markup.FrozenMarkup
can be used. Markup must contain only immutable data structures.
To update a node, use the dataclass replace function.
Node implementations are stored in the tdm.datamodel.nodes package.
Here is an example of creating and changing nodes.
from tdm.datamodel.nodes import TextNode, TextNodeMetadata
from tdm.abstract.datamodel.markup import FrozenMarkup
from dataclasses import replace
text_node = TextNode('text only is required, other fields could be omitted')
modified_node = replace(text_node, metadata=TextNodeMetadata(language='ru'), markup=FrozenMarkup.freeze({'markup key': 'value'}))
changed_content = replace(text_node, content="another text")
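The update pattern above relies only on standard frozen dataclasses. The following sketch uses a hypothetical `Note` class (a stand-in, not part of tdm) to show that `replace` builds a new object and leaves the original untouched, while direct attribute assignment is rejected.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Note:
    """Hypothetical stand-in for a node; not a tdm class."""
    content: str
    language: str = "unknown"


original = Note("some text")
updated = replace(original, language="en")  # new object; `original` is unchanged

try:
    original.language = "en"  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Since every change produces a new object, an old node can safely be kept around and compared with its updated version.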
TalismanDocument has two methods to change document nodes: with_nodes and without_nodes (both return a new document, as the document is immutable).
from dataclasses import replace
from tdm import DefaultDocumentFactory
from tdm.abstract.datamodel.markup import FrozenMarkup
from tdm.datamodel.nodes import ImageNode, TextNode, TextNodeMetadata
doc = DefaultDocumentFactory.create_document()
text_node = TextNode("some text", metadata=TextNodeMetadata(language='en'), id='0')
image_node = ImageNode("/path/to/image", id='1')
doc = doc.with_nodes([text_node, image_node])
_ = doc.id2node
_ = doc.nodes
_ = tuple(doc.get_nodes(TextNode, filter_=lambda n: n.metadata.language == 'en'))
modified_node = replace(text_node, markup=FrozenMarkup.freeze({'markup key': 'value'}))
doc = doc.with_nodes([modified_node])
_ = doc.id2node
try:
    doc = doc.with_nodes([replace(text_node, content="another content")])
except ValueError:
    pass
try:
    doc = doc.with_nodes([ImageNode("/some/image", id='0')])
except ValueError:
    pass
modified = doc.without_nodes([image_node])
modified = doc.without_nodes(['1'])
modified = doc.without_nodes(['0', image_node])
modified = doc.without_nodes(['1', '2', '3'])
_ = modified.id2node
translation_node = TextNode('некоторый текст', metadata=TextNodeMetadata(language='ru'))
from tdm.datamodel.node_links import TranslationNodeLink
from tdm.datamodel.mentions import NodeMention
translation_link = TranslationNodeLink(NodeMention(modified_node), NodeMention(translation_node), language='ru')
doc = doc.with_nodes([translation_node]).with_node_links([translation_link])
try:
    doc = doc.without_nodes(['0'])
except ValueError:
    pass
doc = doc.without_nodes(['0'], cascade=True)
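Because the document is immutable, every with_nodes or without_nodes call returns a new object. The sketch below reproduces this pattern with a toy `MiniDocument` class built only on the standard library. It illustrates the idea and is not the tdm implementation; all names in it are hypothetical, and nodes are simplified to plain strings.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Iterable, Mapping


@dataclass(frozen=True)
class MiniDocument:
    """Toy immutable container; it only mimics the with_/without_ naming."""
    id2node: Mapping[str, str] = field(default_factory=lambda: MappingProxyType({}))

    def with_nodes(self, nodes: Mapping[str, str]) -> "MiniDocument":
        # build a new mapping instead of mutating the stored one
        return MiniDocument(MappingProxyType({**self.id2node, **nodes}))

    def without_nodes(self, ids: Iterable[str]) -> "MiniDocument":
        removed = set(ids)  # unknown ids are simply ignored in this toy version
        kept = {key: node for key, node in self.id2node.items() if key not in removed}
        return MiniDocument(MappingProxyType(kept))


empty = MiniDocument()
doc = empty.with_nodes({"0": "some text", "1": "/path/to/image"})
smaller = doc.without_nodes(["1", "2"])
```

Each call leaves the receiver untouched, so `empty`, `doc` and `smaller` all remain valid and independent.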
Edges
Basically, document nodes can be organized into tree-like ordered structures.
Almost every node can have child nodes.
Such connections represent the logical structure of the document and, as a rule, correspond to the document reading order.
Whether a structural link is possible depends on the node types (o – ordered, s – singleton, u – unordered).
Note: automatic validation is not implemented yet
from\to | Text | Key | Image | File | List | JSON | Table | TableRow | TableCell |
---|---|---|---|---|---|---|---|---|---|
Text | o | - | o | o | o | o | o | - | - |
Key | s | - | s | s | s | s | s | - | - |
Image | - | - | - | - | - | - | - | - | - |
File | - | - | - | - | - | - | - | - | - |
List | o | - | o | o | o | o | o | - | - |
JSON | - | u | - | - | - | - | - | - | - |
Table | - | - | - | - | - | - | - | o | - |
TableRow | - | - | - | - | - | - | - | - | o |
TableCell | o | - | o | o | o | o | o | - | - |
To add or modify document structure, TalismanDocument provides several methods:
- with_structure – add several node structure links;
- with_node_parent – add one node structure link;
- with_roots – remove parent links from the specified nodes;
- with_main_root – add or update a node, remove its parent link (if present) and mark the node as the main root.
These methods have an update parameter that makes the nodes be updated before the structure is updated.
With the force flag, structure updates overwrite existing conflicting edges (if possible).
from tdm import DefaultDocumentFactory
from tdm.datamodel.nodes import TextNode, TextNodeMetadata, ImageNode, ListNode
doc = DefaultDocumentFactory.create_document()
header = TextNode("Header", metadata=TextNodeMetadata(header=True, header_level=1))
paragraph = TextNode("First paragraph")
list_node = ListNode()
item1 = TextNode("first item")
item2 = TextNode("second item")
image = ImageNode("/path/to/image")
with_nodes_doc = doc.with_nodes([header, paragraph, list_node, item1, item2, image])
with_structure_doc = with_nodes_doc.with_structure({
    header: [paragraph, list_node, image],
    list_node: [item1, item2]
})
the_same_doc = doc.with_structure({
    header: [paragraph, list_node, image],
    list_node: [item1, item2]
}, update=True)
assert with_structure_doc == the_same_doc
_ = with_structure_doc.roots
_ = with_structure_doc.child_nodes(header)
_ = with_structure_doc.parent(item1)
try:
    _ = with_structure_doc.with_node_parent(item2, header)
except ValueError:
    pass
modified = with_structure_doc.with_node_parent(item2, header, force=True)
_ = modified.child_nodes(header)
_ = modified.child_nodes(list_node)
modified = with_structure_doc.with_roots([list_node])
_ = modified.roots
_ = modified.child_nodes(header)
try:
    _ = with_structure_doc.with_node_parent(header, item1)
except ValueError:
    pass
try:
    _ = with_structure_doc.with_node_parent(header, item1, force=True)
except ValueError:
    pass
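The last two calls in the example above try to make header a child of its own descendant item1. A tree-like structure cannot hold such a link, which is presumably why the call fails with or without force. A minimal sketch of such an ancestor check follows; `would_create_cycle` is a hypothetical helper over plain string ids, not a tdm function.

```python
from typing import Dict, Optional


def would_create_cycle(parents: Dict[str, Optional[str]], child: str, new_parent: str) -> bool:
    """Return True if making `new_parent` the parent of `child` closes a cycle.

    `parents` maps each node id to its parent id (None for roots).
    """
    current: Optional[str] = new_parent
    while current is not None:
        if current == child:  # `child` is `new_parent` itself or one of its ancestors
            return True
        current = parents.get(current)
    return False


# same shape as the example: header -> [paragraph, list_node, image], list_node -> [item1, item2]
parents = {
    "header": None, "paragraph": "header", "list_node": "header", "image": "header",
    "item1": "list_node", "item2": "list_node",
}
header_under_item1 = would_create_cycle(parents, "header", "item1")
item2_under_header = would_create_cycle(parents, "item2", "header")
```

Moving item2 under header only conflicts with an existing parent link, which can be overwritten, whereas putting header under item1 would break acyclicity.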
Example of table construction
from typing import Dict, List
from tdm import DefaultDocumentFactory
from tdm.abstract.datamodel import AbstractNode
from tdm.datamodel.nodes import TableCellNode, TableCellNodeMetadata, TableNode, TableRowNode, TextNode, TextNodeMetadata
TABLE_WIDTH = 3
TABLE_HEIGHT = 4
doc = DefaultDocumentFactory.create_document()
header = TextNode("Some header", metadata=TextNodeMetadata(header=True, header_level=1))
table = TableNode()
table_structure: Dict[AbstractNode, List[AbstractNode]] = {
    header: [table],
    table: []
}
for i in range(TABLE_HEIGHT):
    row = TableRowNode()
    table_structure[table].append(row)
    table_structure[row] = []
    for j in range(TABLE_WIDTH):
        cell = TableCellNode(metadata=TableCellNodeMetadata(header=(i == 0)))
        table_structure[row].append(cell)
        table_structure[cell] = [TextNode(f"{i},{j}")]
doc = doc.with_structure(table_structure, update=True)
Node semantic links
Along with structural links, document nodes can be linked by semantic links.
Semantic links (as opposed to structural links) have identifiers.
There are no restrictions on multiplicity for semantic links, and they may form cycles in the document content graph.
Moreover, a semantic link can refer to a part of a node rather than the whole node.
Semantic links are directed and have some additional semantic meaning.
Currently there are three types of node semantic links:
- TranslationNodeLink – means that the target node is a translation of the source node. This link has an additional parameter – language;
- ReferenceNodeLink – means that the target node is referenced by the source node. This link can be used to represent text footnotes or bibliography references;
- EquivalenceNodeLink – means that the target node contains the same information (maybe in another form) as the source node. This link is usually used to represent texts retrieved by OCR.
The document has methods for adding, modifying and retrieving node semantic links.
These methods are very similar to the methods for node manipulation:
- id2node_link – mapping from identifier to semantic link;
- node_links – mapping from link type to semantic links of that type;
- get_node_link – get a node semantic link by its identifier;
- get_node_links – get node links of the specified type (iterator);
- related_node_links – get node links of the specified type that are related to the given identifiable object (usually a node);
- with_node_links – add (or update) semantic node links;
- without_node_links – remove the specified semantic node links.
from dataclasses import replace
from tdm import DefaultDocumentFactory
from tdm.datamodel.node_links import EquivalenceNodeLink, TranslationNodeLink
from tdm.datamodel.mentions import ImageNodeMention, NodeMention
from tdm.datamodel.nodes import ImageNode, ImageNodeMetadata, TextNode, TextNodeMetadata
doc = DefaultDocumentFactory.create_document()
image_node = ImageNode('/path/to/image', metadata=ImageNodeMetadata(width=200, height=100))
text_node = TextNode('Some text.')
russian_text_node = TextNode('Некоторый текст.')
translation_link = TranslationNodeLink(NodeMention(text_node), NodeMention(russian_text_node), language='ru')
text_on_image = EquivalenceNodeLink(
    source=ImageNodeMention(image_node, top=10, bottom=20, left=0, right=150),
    target=NodeMention(text_node)
)
doc_with_nodes = doc.with_nodes([image_node, text_node, russian_text_node])
doc_with_links = doc_with_nodes.with_node_links([translation_link, text_on_image])
the_same_doc = doc.with_node_links([translation_link, text_on_image], update=True)
assert doc_with_links == the_same_doc
_ = doc_with_links.related_node_links(text_node)
link = next(doc_with_links.related_node_links(image_node, EquivalenceNodeLink))
assert link.target.node.metadata.language is None
updated_node = replace(text_node, metadata=TextNodeMetadata(language='en'))
one_more_link = EquivalenceNodeLink(NodeMention(image_node), NodeMention(updated_node))
modified_doc = doc_with_links.with_node_links([one_more_link])
assert modified_doc.get_node(updated_node.id) == text_node
modified_doc = doc_with_links.with_node_links([one_more_link], update=True)
assert modified_doc.get_node(updated_node.id) == updated_node
doc_with_links = doc_with_links.with_nodes([updated_node])
link = next(doc_with_links.related_node_links(image_node, EquivalenceNodeLink))
assert link.target.node.metadata.language == 'en'
Document facts
Along with semi-structured content, a TalismanDocument can be linked to a knowledge base (via facts).
Facts represent concepts (KB items), relations, and properties of concepts and relations.
There are several fact types, which together can represent almost any extracted knowledge in a graph-like structure.
Each fact has a unique identifier and a status.
Possible statuses are:
- approved – fact is already approved to be correct and already stored in KB;
- declined – fact is already rejected (incorrect fact);
- auto – fact is marked to be approved automatically (correct fact not stored in KB yet);
- hidden – fact is neither approved nor declined, but is not relevant for downstream task;
- new – fact is neither approved nor declined.
Fact classes are located in the tdm.datamodel.facts package.
Concept facts
Concept facts bind a document to some KB object.
Along with an identifier and status, these facts contain additional required fields: type_id and value.
type_id is a KB concept type identifier that restricts the possible concept relations and properties.
value is a single KB concept identifier (with confidence) or a tuple of such identifiers (with their confidences).
If a concept fact is approved, it should contain exactly one value with the approved concept identifier.
If a concept fact is new, it should contain a tuple of values (possibly empty).
from tdm.datamodel.facts import ConceptFact, KBConceptValue, FactStatus
cpt = ConceptFact(FactStatus.NEW, "PERSON", (KBConceptValue("Alice", 0.8), KBConceptValue("Bob", 0.1)))
assert cpt.value == (KBConceptValue("Alice", 0.8), KBConceptValue("Bob", 0.1))
cpt = ConceptFact(FactStatus.APPROVED, "PERSON", KBConceptValue("Alice"), id='1')
cpt2 = ConceptFact(FactStatus.APPROVED, "PERSON", (KBConceptValue("Alice"),), id='1')
assert cpt == cpt2
try:
    _ = ConceptFact(FactStatus.APPROVED, "PERSON", (KBConceptValue("Alice", 0.8), KBConceptValue("Bob", 0.1)))
except ValueError:
    pass
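The status rule above (an approved fact holds exactly one value, a new fact holds a possibly empty tuple) can be sketched with the standard library alone. The `Candidate` class and the `normalize` function below are hypothetical and only mirror the described behaviour; they are not how tdm implements the check.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a KB value with a confidence; not a tdm class."""
    identifier: str
    confidence: float = 1.0


Values = Union[Candidate, Tuple[Candidate, ...]]


def normalize(status: str, value: Values) -> Values:
    """Approved facts keep exactly one value; facts with other statuses keep a tuple."""
    values = (value,) if isinstance(value, Candidate) else tuple(value)
    if status == "approved":
        if len(values) != 1:
            raise ValueError("an approved fact must have exactly one value")
        return values[0]
    return values


single = normalize("approved", Candidate("Alice"))
from_tuple = normalize("approved", (Candidate("Alice"),))
```

Normalizing a single value and a one-element tuple to the same form is one way two approved facts built either way could compare equal, as in the example above.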
Atomic value facts
Atomic value facts (as opposed to composite ones) represent simple (scalar) values for properties.
Like concept facts, they contain type_id and value.
type_id is a KB value type identifier that restricts the possible values.
value is a single normalized property value (with confidence) or a tuple of such values (with confidences).
The same status restrictions apply to atomic value facts.
TDM supports several basic scalar types (located in tdm.datamodel.values): StringValue, IntValue, DoubleValue, TimestampValue, DateTimeValue, GeoPointValue, LinkValue.
Other values can be built as composite values.
from dataclasses import replace
from tdm.datamodel.facts import AtomValueFact, FactStatus
from tdm.datamodel.values import Coordinates, Date, DateTimeValue, GeoPointValue, IntValue
value = AtomValueFact(FactStatus.NEW, "AGE", (IntValue(24, 0.8), IntValue(25, 0.4)))
assert value.value == (IntValue(24, 0.8), IntValue(25, 0.4))
value = AtomValueFact(FactStatus.NEW, "DATE", (DateTimeValue(Date(1999, 1, 2)), GeoPointValue(Coordinates(0, 0))))
try:
    _ = replace(value, status=FactStatus.APPROVED)
except ValueError:
    pass
Value mentions
Each atomic value fact can be associated with a part of the document content via MentionFact.
Mention facts contain two required fields: mention – a node mention (the whole node or a part of it) and value – an atomic value fact.
from tdm.datamodel.facts import AtomValueFact, FactStatus, MentionFact
from tdm.datamodel.mentions import NodeMention, TextNodeMention
from tdm.datamodel.nodes import ImageNode, TextNode
from tdm.datamodel.values import StringValue
text = TextNode("some text")
image = ImageNode("/path/to/image.png")
value = AtomValueFact(FactStatus.NEW, "STR", (StringValue("some"),))
text_mention_fact = MentionFact(FactStatus.NEW, TextNodeMention(text, 0, 4), value)
image_mention = MentionFact(FactStatus.NEW, NodeMention(image), value)
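In the example above the text mention covers offsets 0 and 4 of "some text", which matches the word "some". Assuming the two integers are character offsets of a half-open span (start included, end excluded), they can be computed with plain string operations. `find_span` below is a hypothetical helper, not part of tdm.

```python
from typing import Tuple


def find_span(text: str, fragment: str) -> Tuple[int, int]:
    """Return (start, end) character offsets of the first occurrence of `fragment`."""
    start = text.find(fragment)
    if start < 0:
        raise ValueError(f"{fragment!r} not found")
    return start, start + len(fragment)


text = "some text"
start, end = find_span(text, "some")
covered = text[start:end]  # slicing with the same offsets gives the fragment back
```

Under this assumption the offsets could be passed as the start and end arguments of a text mention.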
Composite value facts
If a property value can't be represented as an atomic value, composite values can be used.
A composite value is a set of named atomic values.
In TDM composite values are represented as a CompositeValueFact with a set of SlotFacts that bind AtomValueFacts to the CompositeValueFact.
The name of each atomic value (as a part of the composite value) is stored in the slot's type_id field.
from tdm.datamodel.facts import AtomValueFact, CompositeValueFact, FactStatus, ComponentFact
from tdm.datamodel.values import IntValue, StringValue
composite_fact = CompositeValueFact(FactStatus.NEW, 'ADDRESS')
country_atomic_fact = AtomValueFact(FactStatus.NEW, 'STR', (StringValue('Russia'),))
country_fact = ComponentFact(FactStatus.NEW, 'country', composite_fact, country_atomic_fact)
city_fact = ComponentFact(FactStatus.NEW, 'city', composite_fact, AtomValueFact(FactStatus.NEW, 'STR', (StringValue('Moscow'),)))
street_fact = ComponentFact(FactStatus.NEW, 'street', composite_fact, AtomValueFact(FactStatus.NEW, 'STR'))
building_fact = ComponentFact(FactStatus.NEW, 'building', composite_fact, AtomValueFact(FactStatus.NEW, 'INT', (IntValue(25),)))
Relation facts
A relation fact represents some relationship between concepts mentioned in the document.
Each relation fact links two concept facts with some predefined relationship type.
If the fact is approved, it can additionally contain a value (the KB identifier of the known relationship between the concepts).
from tdm.datamodel.facts import ConceptFact, FactStatus, KBConceptValue, RelationFact
person = ConceptFact(FactStatus.NEW, "PERSON", (KBConceptValue("Alice", 0.8), KBConceptValue("Bob", 0.1)))
organization = ConceptFact(FactStatus.NEW, "ORGANIZATION", (KBConceptValue("Google", 0.7), KBConceptValue("Amazon", 0.2)))
works_in = RelationFact(FactStatus.NEW, "works in", person, organization)
Property facts
TDM supports two kinds of properties:
- concept property
- relation property
Property fact links concept fact (or relation fact) with some value fact (both atomic and composite values are possible).
from tdm.datamodel.facts import AtomValueFact, ConceptFact, FactStatus, KBConceptValue, PropertyFact, RelationFact, RelationPropertyFact
from tdm.datamodel.values import Date, DateTimeValue, IntValue
person = ConceptFact(FactStatus.NEW, "PERSON", (KBConceptValue("Alice", 0.8), KBConceptValue("Bob", 0.1)))
organization = ConceptFact(FactStatus.NEW, "ORGANIZATION", (KBConceptValue("Google", 0.7), KBConceptValue("Amazon", 0.2)))
works_in = RelationFact(FactStatus.NEW, "works in", person, organization)
age_value = AtomValueFact(FactStatus.NEW, "INT", (IntValue(20),))
age_property = PropertyFact(FactStatus.NEW, "age", person, age_value)
date = AtomValueFact(FactStatus.NEW, "DATE", (DateTimeValue(Date(2000, 9, 1)),))
works_since = RelationPropertyFact(FactStatus.NEW, "works since", works_in, date)
Document methods for facts
TalismanDocument has several methods to process facts.
These methods are very similar to the methods for other identifiable document elements:
- id2fact – mapping from identifier to fact;
- facts – mapping from fact type to facts of that type;
- get_fact – get a fact by its identifier;
- get_facts – get facts of the specified type;
- related_facts – get facts of the specified type that are related to the given identifiable object (no transitive dependencies);
- with_facts – add (or update) facts;
- without_facts – remove the specified facts.
from tdm import DefaultDocumentFactory
from tdm.datamodel.facts import AtomValueFact, ConceptFact, FactStatus, KBConceptValue, MentionFact, PropertyFact, RelationFact, \
    RelationPropertyFact
from tdm.datamodel.mentions import TextNodeMention
from tdm.datamodel.nodes import TextNode
from tdm.datamodel.values import Date, DateTimeValue, StringValue
doc = DefaultDocumentFactory.create_document()
text = TextNode('Alexander Pushkin was born on June 6, 1799')
person = ConceptFact(FactStatus.NEW, 'person')
name = AtomValueFact(FactStatus.NEW, "str", (StringValue('Alexander Pushkin'),))
name_mention = MentionFact(FactStatus.NEW, TextNodeMention(text, 0, 17), name)
name_property = PropertyFact(FactStatus.NEW, "name", person, name)
date = AtomValueFact(FactStatus.NEW, "date", (DateTimeValue(Date(1799, 6, 6)),))
date_mention = MentionFact(FactStatus.NEW, TextNodeMention(text, 30, 42), date)
birthday = PropertyFact(FactStatus.NEW, "birth date", person, date)
try:
    _ = doc.with_facts([name_property])
except ValueError:
    pass
modified = doc.with_nodes([text])
modified = modified.with_facts([person, name, name_mention, name_property, date, date_mention, birthday])
the_same = doc.with_facts([name_mention, name_property, date_mention, birthday], update=True)
assert modified == the_same
brin = ConceptFact(FactStatus.APPROVED, "person", KBConceptValue('Sergey Brin'))
google = ConceptFact(FactStatus.NEW, "organization", (KBConceptValue("Google"),))
date = AtomValueFact(FactStatus.APPROVED, "date", DateTimeValue(Date(1998, 9, 4)))
doc = doc.with_facts([
    RelationPropertyFact(
        status=FactStatus.NEW,
        type_id="event date",
        source=RelationFact(FactStatus.NEW, "found", brin, google, id='1'),
        target=date
    )
], update=True)
assert set(doc.get_facts(ConceptFact)) == {brin, google}
assert set(doc.get_facts(filter_=ConceptFact.status_filter(FactStatus.APPROVED))) == {brin, date}
assert set(doc.related_facts(brin)) == {RelationFact(FactStatus.NEW, "found", brin, google, id='1')}
Domain
A domain can be used to validate the consistency of linked facts.
A domain consists of a set of domain types.
Domain types are identifiable and inherit the base interface tdm.abstract.datamodel.AbstractDomainType.
A knowledge base domain can be represented as a graph with node types and edge types.
TalismanDocument supports the following principal knowledge base entity types (one for each fact type).
All the domain type classes are in the tdm.datamodel.domain.types package.
Node types:
- ConceptType – for ConceptFacts. Represents a set of knowledge base concepts (objects) that share the same set of possible relations and properties. These domain types define no additional restrictions;
- AtomValueType – for AtomValueFacts. Represents a simple value that can be used in properties and as a part of a composite value. These domain types restrict the possible values of the corresponding AtomValueFacts;
- CompositeValueType – for CompositeValueFacts. These domain types define no additional restrictions.
Link types (define restrictions for the corresponding facts):
- SlotType – for SlotFacts. Represents a possible link between a CompositeValueType and an AtomValueType;
- RelationType – for RelationFacts. Represents a possible link between two ConceptTypes;
- PropertyType – for PropertyFacts. Represents a possible link between a ConceptType and a value type (AtomValueType or CompositeValueType);
- RelationPropertyType – for RelationPropertyFacts. Represents a possible link between a RelationType and a value type (AtomValueType or CompositeValueType).
The tdm.datamodel.domain.Domain class organizes domain types as a graph. It supports type retrieval (its methods are very similar to the TalismanDocument ones).
from tdm.datamodel.domain import Domain
from tdm.datamodel.domain.types import ConceptType, RelationType
person = ConceptType('person', id='1')
organization = ConceptType('organization', id='2')
works_in = RelationType('works in', person, organization, id='3')
domain = Domain([person, organization, works_in])
assert domain.get_type('1') == person
assert next(domain.related_types(person)) == works_in
assert domain.id2type == {'1': person, '2': organization, '3': works_in}
Facts validation with domain
A domain can be attached to a document to act as a facts validator.
The tdm.datamodel.domain.set_default_domain function sets the default domain.
All documents created by tdm.DefaultDocumentFactory after the default domain is set are linked with that domain.
This enables fact consistency validation. Moreover, all fact type_ids are automatically replaced with the corresponding domain types.
Once a document is created, it keeps using the domain that was set at creation time.
With tdm.TalismanDocumentFactory a non-default domain can be used for documents.
from tdm import DefaultDocumentFactory
from tdm.datamodel.domain import AtomValueType, ConceptType, Domain, PropertyType, set_default_domain
from tdm.datamodel.facts import AtomValueFact, ConceptFact, FactStatus, PropertyFact
from tdm.datamodel.values import IntValue, StringValue
cpt_type = ConceptType("Персона", id="person")
value_type = AtomValueType("Число", IntValue, id="int")
prp_type = PropertyType("Возраст", cpt_type, value_type, id="age")
domain = Domain([cpt_type, value_type, prp_type])
doc1 = DefaultDocumentFactory.create_document()
set_default_domain(domain)
doc2 = DefaultDocumentFactory.create_document()
set_default_domain(None)
prp = PropertyFact(
    FactStatus.NEW, "age",
    ConceptFact(FactStatus.NEW, "person", id="cpt"),
    AtomValueFact(FactStatus.NEW, "int", (StringValue("23"),), id="value"),
    id="prp"
)
doc1 = doc1.with_facts([prp], update=True)
assert doc1.get_fact("prp") == prp
try:
    _ = doc2.with_facts([prp], update=True)
except ValueError:
    pass
doc2 = doc2.with_facts([PropertyFact(
    FactStatus.NEW, "age",
    ConceptFact(FactStatus.NEW, "person", id="cpt"),
    AtomValueFact(FactStatus.NEW, "int", (IntValue(23),), id="value"),
    id="prp"
)], update=True)
assert doc2.get_fact("cpt").type_id == cpt_type
assert doc2.get_fact("cpt") == ConceptFact(FactStatus.NEW, "person", id="cpt")
Serialization
To serialize and deserialize a Talisman document, tdm.TalismanDocumentModel can be used.
TalismanDocumentModel is a pydantic model, so it can be written to and read from JSON with pydantic methods.
To serialize a document, use the TalismanDocumentModel.serialize method.
It returns a TalismanDocumentModel, which can be converted to JSON.
from tdm import DefaultDocumentFactory, TalismanDocumentModel
document = DefaultDocumentFactory.create_document()
model = TalismanDocumentModel.serialize(document)
json = model.json()
To deserialize a TalismanDocumentModel to a TalismanDocument, use the TalismanDocumentModel.deserialize method.
from tdm import TalismanDocumentModel
obj = ...
model = TalismanDocumentModel.parse_raw(obj)
document = model.deserialize()
Customization
Node markup
Default node markup
Each document node (both data and service) can contain additional markup that stores extra information about the node.
As opposed to node metadata, node markup is not fixed and can be extended with almost any information.
Talisman document node markup should implement the tdm.abstract.datamodel.AbstractMarkup interface.
Markup is assumed to be a mapping. The only other requirement is that node markup is immutable.
There is a default node markup implementation, tdm.abstract.datamodel.FrozenMarkup.
This markup should be instantiated with the FrozenMarkup.freeze method to guarantee object immutability.
from immutabledict import immutabledict
from tdm.abstract.datamodel import FrozenMarkup
markup = FrozenMarkup.freeze({
    'str': 'string',
    'int': 1,
    'float': 1.5,
    'tuple': (1, 2, 3),
    'list': [1, 2, 3],
    'nested': {
        'tuple of lists': (['item'], ['another'])
    }
})
frozen = immutabledict({
    'str': 'string',
    'int': 1,
    'float': 1.5,
    'tuple': (1, 2, 3),
    'list': (1, 2, 3),
    'nested': immutabledict({
        'tuple of lists': (('item',), ('another',))
    })
})
assert markup.markup == frozen
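The example above shows lists turning into tuples and nested dicts into immutable mappings. A recursive conversion of that kind can be sketched with the standard library alone. The `freeze` function below is a hypothetical illustration that uses `types.MappingProxyType` in place of immutabledict; it is not the FrozenMarkup.freeze implementation.

```python
from types import MappingProxyType
from typing import Any


def freeze(value: Any) -> Any:
    """Recursively replace dicts with read-only mappings and lists/tuples with tuples."""
    if isinstance(value, dict):
        return MappingProxyType({key: freeze(item) for key, item in value.items()})
    if isinstance(value, (list, tuple)):
        return tuple(freeze(item) for item in value)
    return value  # scalars such as str, int and float are already immutable


frozen_sketch = freeze({
    'list': [1, 2, 3],
    'nested': {'tuple of lists': (['item'], ['another'])},
})
```

Unlike immutabledict, a read-only proxy over a plain dict is not hashable, so this sketch covers only the immutability requirement.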
Markup customization
FrozenMarkup is a simple immutable container that doesn't provide any methods for changing the stored markup.
For more convenient markup usage, you can implement your own markup class with a predefined structure.
The following example illustrates a markup class implemented for TextNode (but it could be used for other node types).
The text node markup (MyMarkup) contains an int pointer (along with any other markup) that should point to some character of the node's text.
from immutabledict import immutabledict
from typing_extensions import Self
from tdm.abstract.datamodel import AbstractMarkup
class MyMarkup(AbstractMarkup):
    """
    Custom markup class that stores int value and has additional methods for retrieving and modifying markup
    Custom markup classes should be immutable and hashable
    """

    def __init__(self, pointer: int, other: immutabledict):
        self._pointer = pointer
        self._other = other

    @property
    def markup(self) -> immutabledict:
        """
        this property should be defined to return correct immutabledict markup representation
        this representation is used for other markups construction (see `from_markup`) and for serialization
        """
        return immutabledict({
            'pointer': self._pointer,
            **self._other
        })

    @classmethod
    def from_markup(cls, markup: AbstractMarkup) -> Self:
        """
        this method should be defined for object construction from another markup object
        """
        kwargs: immutabledict = markup.markup
        pointer = kwargs.get('pointer', 0)
        other = immutabledict({k: v for k, v in kwargs.items() if k != 'pointer'})
        return MyMarkup(pointer, other)

    @property
    def pointer(self):
        """
        example property
        """
        return self._pointer

    def distance(self, pointer: int) -> int:
        """
        example non-modifier method
        """
        return abs(self._pointer - pointer)

    def set_pointer(self, pointer: int) -> Self:
        """
        example modifier method
        """
        if pointer < 0:
            raise ValueError
        return MyMarkup(pointer, self._other)
from tdm.datamodel.nodes import TextNode
from tdm.abstract.datamodel import FrozenMarkup
from dataclasses import replace
markup = FrozenMarkup.freeze({'pointer': 1, 'another': 'some other markup values'})
node = TextNode('some text', markup=markup)
my_markup = MyMarkup.from_markup(markup)
assert my_markup == markup
assert FrozenMarkup.from_markup(my_markup) == my_markup
my_node = replace(node, markup=my_markup)
assert my_node == node
def set_pointer(node: TextNode, pointer: int) -> TextNode:
    if len(node.content) <= pointer:
        raise ValueError
    return replace(node, markup=node.markup.set_pointer(pointer))

try:
    _ = set_pointer(my_node, 10)
except ValueError:
    pass
changed = set_pointer(my_node, 5)
To reduce boilerplate code, the talisman-dm library provides some useful decorators for creating node subclasses with the desired markup implementation.
All the decorators are located in the tdm.wrapper.node package.
First of all, define the markup (and node) interface.
The interface methods should be decorated with tdm.wrapper.node.getter and tdm.wrapper.node.modifier so that their implementations are generated automatically for the wrapper node class.
A node wrapper can be generated with the tdm.wrapper.node.generate_wrapper decorator.
It should decorate a class that is a subclass of some AbstractNode implementation, of the desired interface and of tdm.wrapper.node.AbstractNodeWrapper.
This decorator automatically generates implementations for all interface methods marked with the method decorators.
Additionally, a node wrapper can contain validators for modifier methods; they are run just before the markup changes.
These validators can have any name, but should be decorated with the tdm.wrapper.node.validate decorator and have the same signature as the validated method.
The results of any getter method can be post-processed with the tdm.wrapper.node.post_process decorator.
The decorated method is applied to the getter results.
A decorated post-processor can also have any name.
from abc import ABCMeta, abstractmethod
from immutabledict import immutabledict
from typing_extensions import Self
from tdm.abstract.datamodel import AbstractMarkup
from tdm.datamodel.nodes import TextNode
from tdm.wrapper.node import AbstractNodeWrapper, generate_wrapper, getter, modifier, post_process, validate
class MyMarkupInterface(metaclass=ABCMeta):
    """
    Special interface for both markup and node wrapper implementations with desired getters and modifiers
    """

    @property
    @abstractmethod
    def pointer(self) -> int:
        """
        properties could not be additionally decorated.
        Node wrapper will automatically delegate it to markup
        """
        pass

    @getter
    @abstractmethod
    def distance(self, pointer: int) -> int:
        """
        all getters (methods that don't change the markup object) should be decorated with `getter`
        """
        pass

    @modifier
    @abstractmethod
    def set_pointer(self, pointer: int) -> Self:
        """
        all modifiers (methods that create new markup object) should be decorated with `modifier`
        """
        pass
class _MyMarkup(AbstractMarkup):
    """
    Custom markup class that stores int pointer and has additional methods for retrieving and modifying markup
    It could be non-public as user should not use it directly.
    """

    def __init__(self, pointer: int, other: immutabledict):
        self._pointer = pointer
        self._other = other

    @property
    def markup(self) -> immutabledict:
        """
        this property should be defined to return correct immutabledict markup representation
        this representation is used for other markups construction (see `from_markup`) and for serialization
        """
        return immutabledict({
            'pointer': self._pointer,
            **self._other
        })

    @classmethod
    def from_markup(cls, markup: AbstractMarkup) -> Self:
        """
        this method should be defined for object construction from another markup object
        """
        kwargs: immutabledict = markup.markup
        pointer = kwargs.get('pointer', 0)
        other = immutabledict({k: v for k, v in kwargs.items() if k != 'pointer'})
        return _MyMarkup(pointer, other)

    @property
    def pointer(self):
        return self._pointer

    def distance(self, pointer: int) -> int:
        return abs(self._pointer - pointer)

    def set_pointer(self, pointer: int) -> Self:
        if pointer < 0:
            raise ValueError
        return _MyMarkup(pointer, self._other)
@generate_wrapper(_MyMarkup)
class TextNodeWrapper(TextNode, MyMarkupInterface, AbstractNodeWrapper[TextNode], metaclass=ABCMeta):
@validate(MyMarkupInterface.set_pointer)
def _validate_value(self, pointer: int) -> None:
"""
this method is called before markup update, so old markup object also could be used for validation.
this method could use node for validation
return value is ignored
"""
if pointer >= len(self.content):
raise ValueError
@post_process(MyMarkupInterface.distance)
def _post_process(self, result: int) -> int:
"""
this method is called for node.markup.distance method result
"""
return result + 1
from tdm.datamodel.nodes import TextNode
from tdm.abstract.datamodel import FrozenMarkup

markup = FrozenMarkup.freeze({'pointer': 1, 'another': 'some other markup values'})
node = TextNode('some text', markup=markup)
wrapped_node = TextNodeWrapper.wrap(node)
assert wrapped_node == node
try:
    _ = wrapped_node.set_pointer(10)
except ValueError:
    pass
changed = wrapped_node.set_pointer(5)
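The validate / modify / post-process flow described above can also be sketched with the standard library alone. This is only an illustration of the pattern, not how `generate_wrapper` is implemented: `Node` and `PointerView` are hypothetical stand-ins, and the hooks are called by hand instead of being registered through decorators.

```python
from dataclasses import dataclass, field, replace
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an immutable text node (not a tdm class)."""
    content: str
    markup: Mapping[str, object] = field(default_factory=lambda: MappingProxyType({}))


class PointerView:
    """Hypothetical wrapper: validates first, then returns a NEW node."""

    def __init__(self, node: Node):
        self._node = node

    @property
    def pointer(self) -> int:
        return self._node.markup.get('pointer', 0)

    def _validate(self, pointer: int) -> None:
        # Runs before the update and may inspect the node itself.
        if not 0 <= pointer < len(self._node.content):
            raise ValueError(f'pointer {pointer} is outside the node content')

    def distance(self, pointer: int) -> int:
        raw = abs(self.pointer - pointer)  # plain getter result
        return raw + 1                     # post-processing applied to that result

    def set_pointer(self, pointer: int) -> Node:
        self._validate(pointer)
        # Build a new read-only markup; all other keys are carried over unchanged.
        new_markup = MappingProxyType({**self._node.markup, 'pointer': pointer})
        return replace(self._node, markup=new_markup)


node = Node('some text', MappingProxyType({'pointer': 1, 'another': 'kept as is'}))
view = PointerView(node)
changed = view.set_pointer(5)
```

Here `node` keeps its original pointer, while `changed` is a separate object carrying the updated markup, which mirrors the frozen-dataclass behaviour of the nodes described earlier.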
Composite markup customization
A Talisman document node can carry several kinds of markup with different origins.
For example, a text node can have appearance markup (font, size, etc.) and text content markup (e.g. segmentation).
To keep these from mixing, it is convenient to store each kind of markup under its own top-level key.
For such cases the talisman-dm library provides the tdm.wrapper.node.composite_markup
decorator.
This decorator composes several markups under the specified keys,
so the wrapped node can work with several markup objects at the same time.
from abc import ABCMeta, abstractmethod
from typing import Tuple

from typing_extensions import Self
from immutabledict import immutabledict

from tdm.wrapper.node import modifier, composite_markup, generate_wrapper, AbstractNodeWrapper, validate
from tdm.datamodel.nodes import TextNode
from tdm.abstract.datamodel import AbstractMarkup


class AppearanceMarkup(metaclass=ABCMeta):
    """
    Interface for appearance markup
    """

    @property
    @abstractmethod
    def fonts(self) -> Tuple[str, ...]:
        pass

    @modifier
    @abstractmethod
    def add_font(self, start: int, end: int, font: str) -> Self:
        pass


class _AppearanceMarkupImpl(AbstractMarkup, AppearanceMarkup):
    """
    implementation could be non-public
    """

    def __init__(self, fonts: Tuple[Tuple[int, int, str], ...]):
        self._fonts = fonts

    @property
    def markup(self) -> immutabledict:
        return immutabledict({'fonts': self._fonts})

    @classmethod
    def from_markup(cls, markup: AbstractMarkup) -> Self:
        markup = markup.markup
        return cls(markup.get('fonts', ()))

    @property
    def fonts(self) -> Tuple[str, ...]:
        return tuple(f for _, _, f in self._fonts)

    def add_font(self, start: int, end: int, font: str) -> Self:
        return _AppearanceMarkupImpl(self._fonts + ((start, end, font),))


class TextMarkup(metaclass=ABCMeta):
    """
    Interface for text genre markup
    """

    @property
    @abstractmethod
    def genre(self) -> str:
        pass

    @modifier
    @abstractmethod
    def set_genre(self, genre: str) -> Self:
        pass


class _TextMarkupImpl(AbstractMarkup, TextMarkup):

    def __init__(self, genre: str):
        self._genre = genre

    @property
    def markup(self) -> immutabledict:
        return immutabledict({'genre': self._genre})

    @classmethod
    def from_markup(cls, markup: AbstractMarkup) -> Self:
        return cls(markup.markup['genre'])

    @property
    def genre(self) -> str:
        return self._genre

    def set_genre(self, genre: str) -> Self:
        return _TextMarkupImpl(genre)
@composite_markup(appearance=_AppearanceMarkupImpl, text=_TextMarkupImpl)
class _CompositeTextNodeMarkup(AbstractMarkup, AppearanceMarkup, TextMarkup, metaclass=ABCMeta):
    """
    This class is fully automatically generated.
    It implements both `AppearanceMarkup` and `TextMarkup` interfaces.
    Any markup stored outside the declared top-level keys remains untouched
    """
    pass


@generate_wrapper(_CompositeTextNodeMarkup)
class TextNodeWrapper(TextNode, AppearanceMarkup, TextMarkup, AbstractNodeWrapper[TextNode], metaclass=ABCMeta):

    @validate(AppearanceMarkup.add_font)
    def _validate_span(self, start: int, end: int, font: str) -> None:
        # `end` is treated as exclusive, so a span may end exactly at len(self.content)
        if end > len(self.content):
            raise ValueError
from tdm.abstract.datamodel import FrozenMarkup

node = TextNode('Some text', markup=FrozenMarkup.freeze({
    'appearance': {'fonts': ((0, 4, 'comic sans'),)},
    'text': {'genre': 'example'},
    'extra': 'some extra markup'
}))
wrapped = TextNodeWrapper.wrap(node)
assert node == wrapped
try:
    _ = wrapped.add_font(0, 10, 'out of bound span')
except ValueError:
    pass
modified = wrapped.add_font(5, 9, 'times new roman')
assert 'extra' in modified.markup.markup
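The key-separation idea behind composite markup can be illustrated without the library. The sketch below assumes markup is a plain mapping with one top-level key per markup kind; `update_section` and `add_font` are hypothetical helpers written for this example and are not part of talisman-dm.

```python
from types import MappingProxyType
from typing import Callable, Mapping

Section = Mapping[str, object]


def update_section(markup: Section, key: str, update: Callable[[Section], Section]) -> Section:
    """Return a new read-only markup in which only `markup[key]` is rebuilt."""
    section = markup.get(key, {})
    return MappingProxyType({**markup, key: MappingProxyType(dict(update(section)))})


def add_font(start: int, end: int, font: str) -> Callable[[Section], Section]:
    """Build an updater that appends one (start, end, font) span to the 'fonts' tuple."""
    def update(section: Section) -> Section:
        return {**section, 'fonts': section.get('fonts', ()) + ((start, end, font),)}
    return update


markup = MappingProxyType({
    'appearance': {'fonts': ((0, 4, 'comic sans'),)},
    'text': {'genre': 'example'},
    'extra': 'some extra markup',
})
modified = update_section(markup, 'appearance', add_font(5, 9, 'times new roman'))
```

Only the `appearance` section is rebuilt; the `text` section and the unrelated `extra` key are carried over as they were, and the original `markup` mapping is left unchanged.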