textmater

Extract Structured Data from text

0.1
PyPI

Maintainers: 1

Textmater

Don't need to know where you're going, just need to know where you've been

Extract structured data (key-value pairs, grouped into sections) from text. (It runs backwards through the text, hence the name.) Useful for building a configuration that extracts data from one document format, which can then be applied to large numbers of similar documents.

Overview

The general approach is to construct a configuration of the Textmater class that pulls details out of a particular text format. Further instances of that text can then be fed to the configured object to build up a structure of data, which can be saved to .json or .csv.

Example

Say we have an example of text like this

example_text = """-Shops-
Pete's: Grocers
KFC: Fast Food
Newsman: Newsagents
-Sports-
Football: Round Ball
AFL: Egg Ball
Cricket:Round Ball
"""

We want to get every key and value, with keys being anything before a : and values being anything after it. We also want them grouped under their headers, and we want the output as JSON. We could create an instance with

resource = Textmater(section_header_regex = '-[a-zA-Z]*-')

then run resource.drive(example_text). The resulting resource.section_dict would look like this:

{
    '-Shops-': [{"Pete's": "Grocers", "KFC": "Fast Food", "Newsman": "Newsagents"}],
    '-Sports-': [{"Football": "Round Ball", "AFL": "Egg Ball", "Cricket": "Round Ball"}]
}

If you ran it again on similarly formatted text, the new records would be appended to the '-Shops-' and '-Sports-' lists.

Then resource.write_results_to_json() would save the results as JSON, one file per section (that is, per key in the section_dict).
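Before handing a pattern to section_header_regex, it can help to check what it finds in a sample of the text. The sketch below uses only the standard re module and does not call Textmater; the pattern shown is an assumption written for the hyphen-delimited headers in the example, and how Textmater applies the pattern internally may differ.

```python
import re

# Sample text in the same shape as the example above.
example_text = """-Shops-
Pete's: Grocers
KFC: Fast Food
Newsman: Newsagents
-Sports-
Football: Round Ball
AFL: Egg Ball
Cricket:Round Ball
"""

# Candidate pattern (an assumption): a hyphen, a run of letters, a closing hyphen.
section_header_regex = "-[a-zA-Z]*-"

# re.findall returns every non-overlapping match, in order of appearance.
headers = re.findall(section_header_regex, example_text)
print(headers)
```

If the printed list contains exactly the headers you expect, the same list could also be passed to section_header_list instead of the pattern.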

Importing

from textmater import Textmater, tools

(tools is optional but has useful functions for working with text)

Configuring and running

resource = Textmater() will instantiate the class. There are a lot of options here, and all are optional. The options that take functions run in order of appearance.

filter_functions: [function] a list of functions used to decide whether to skip an instance of text passed in. Each must take a string and return True or False. E.g. you pass in a function that returns False if 'denied' is present anywhere in the text. Then, when you run drive over a corpus of documents, the ones with 'denied' in them are skipped.
transformation_functions: [function] a list of functions that are applied to transform the incoming text before further processing. Each function must take a string and return a string.
section_header_regex: str (regex pattern) the first of two ways of specifying section headers. The provided pattern is run through the text to build the list of headers. Not to be used in conjunction with section_header_list.
section_header_list: [str] the second of two ways of specifying section headers. Direct values that, if found in the text, will be used to divide the items found in the text. In the example, the same effect could have been achieved by passing ['-Shops-', '-Sports-'] to this parameter instead.
sections_to_skip: [str] a list of section headers that, if found, will prompt Textmater to skip over the values in that section. Useful for improving output when there is a large section of a text whose contents you don't require.
cleanup_functions: [function] a list of functions applied to each record before it is added to the section_dict. Each must take a current_record_dict (<section header>: {dict of items within it}) and return the same. There is no need to make deep copies, as this is done automatically before the dict is passed in.
overwrite_duplicate_keys: bool if set to False, a unique version is generated for any key that is already present when adding to the current_record_dict. It adds _i, where i is an integer starting at 2. In the unlikely event that <key>_i is also a collision, i is incremented until it is not.
spread_keys: [(str, str)] a list of tuples representing keys in sections that you want to spread (e.g. you find a value in one section and want it present in all of them, perhaps as an identifier). [0]: section name, [1]: key. For example, you have a key 'patient id' in a section 'identifiers' and you want this id shared across all the sections to use as a primary key. Your value for spread_keys would be [('identifiers', 'patient id')].
If you don't know which section a key is in but still want to spread it if it's found, leave the section name empty, which would look like ('', 'patient id'). Textmater will then search for the key across all sections and spread it.
delimiter: str the character(s) you want to use as the delimiter between keys and values.
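The function-valued options only need plain callables with the shapes described above. The following is a minimal sketch of one of each; the function names are hypothetical, and the cleanup function assumes every key and value in a record is a string. The Textmater call is left as a comment so the snippet runs without the package installed.

```python
def not_denied(text: str) -> bool:
    """Filter: return False for any text that contains 'denied'."""
    return "denied" not in text


def normalise_line_endings(text: str) -> str:
    """Transformation: rewrite CRLF and bare CR line endings as LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def strip_whitespace(current_record_dict: dict) -> dict:
    """Cleanup: trim whitespace around every key and value in every section.

    Assumes all keys and values are strings.
    """
    return {
        section: {key.strip(): value.strip() for key, value in items.items()}
        for section, items in current_record_dict.items()
    }


# Parameter names below come from the option list above:
# resource = Textmater(
#     filter_functions=[not_denied],
#     transformation_functions=[normalise_line_endings],
#     cleanup_functions=[strip_whitespace],
# )

# The functions can be checked on their own before being wired in.
print(not_denied("Application denied"))
print(normalise_line_endings("a\r\nb\rc") == "a\nb\nc")
print(strip_whitespace({"-Shops-": {" KFC ": " Fast Food "}}))
```

Keeping each function small and free of side effects makes it easy to test them separately from the extraction itself.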

Appendix

current_record_dict:

a dict where keys are section headers and values are dicts of items in that section:

{
    'section 1': {'key1' : 'value1', 'key2': 'value2', 'primary_key': '0'},
    'section 2': {'other key 1': 'value 1', 'other key 2': 'value 2', 'primary_key': '0'} 
}

resource.current_record_dict stores the result of the most recent extraction in this format
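The shared 'primary_key' in the example above is the kind of result spread_keys produces. As a rough illustration of that rule on a current_record_dict, here is a plain-Python sketch; the spread function is hypothetical and is not Textmater's implementation. It assumes the first match wins when the section name is empty, and that a spread value overwrites an existing key of the same name.

```python
import copy

_MISSING = object()  # sentinel: distinguishes "key not found" from a None value


def spread(current_record_dict, spread_keys):
    """Return a copy of the record with each (section, key) value copied
    into every section. An empty section name means "search all sections".
    """
    result = copy.deepcopy(current_record_dict)
    for section_name, key in spread_keys:
        # Sections to look in: the named one, or all of them if unnamed.
        candidates = [section_name] if section_name else list(current_record_dict)
        value = next(
            (current_record_dict[s][key]
             for s in candidates
             if key in current_record_dict.get(s, {})),
            _MISSING,
        )
        if value is _MISSING:
            continue  # nothing to spread for this pair
        for items in result.values():
            items[key] = value
    return result


record = {
    "identifiers": {"patient id": "0"},
    "results": {"test": "negative"},
}
named = spread(record, [("identifiers", "patient id")])
searched = spread(record, [("", "patient id")])
print(named)
print(searched)
```

Both calls give the same result here, since 'patient id' only appears in one section.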

section_dict:

dict for storing combined current_record_dicts. keys are section headers and values are lists of dicts

{
    'section 1' : [{'key1' : 'value1', 'key2': 'value2', 'primary_key': '0'},
                {'key1' : 'value3', 'key2': 'value4', 'primary_key': '1'}],
    'section 2' : [{'other key 1' : 'value 1', 'primary_key': '0'},
                    {'other key 1': 'value z', 'primary_key': '1'}] 
}

resource.section_dict stores this
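The relationship between the two structures can be sketched in a few lines of plain Python. The add_record helper below is hypothetical and only illustrates the shapes described here; it is not how Textmater is implemented.

```python
def add_record(section_dict, current_record_dict):
    """Append each section's items from one record to that section's list."""
    for section, items in current_record_dict.items():
        # Create the list on first sight of a section, then append a copy.
        section_dict.setdefault(section, []).append(dict(items))
    return section_dict


first = {
    "section 1": {"key1": "value1", "key2": "value2", "primary_key": "0"},
    "section 2": {"other key 1": "value 1", "primary_key": "0"},
}
second = {
    "section 1": {"key1": "value3", "key2": "value4", "primary_key": "1"},
    "section 2": {"other key 1": "value z", "primary_key": "1"},
}

section_dict = {}
add_record(section_dict, first)
add_record(section_dict, second)
print(section_dict)
```

After both calls, each section header maps to a list with one dict per record, in the order the records were added.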
