Runrex
Library to aid in organizing, running, and debugging regular expressions against large bodies of text.
About the Project
The goal of this library is to simplify the deployment of regular expressions on large bodies of text, in a variety of input formats.
Getting Started
To get a local copy up and running, follow these steps.
Prerequisites
Installation
- Clone the repo
git clone https://github.com/kpwhri/runrex.git
- Install requirements (requirements-dev is for test packages)
pip install -r requirements.txt -r requirements-dev.txt
- If you wish to read text from SAS or SQL, you will need to install additional requirements. These additional requirements files may be of use:
  - ODBC-connection: requirements-db.txt
  - Postgres: requirements-psql.txt
  - SAS: requirements-sas.txt
- Run tests.
export PYTHONPATH=src  # on Windows: set PYTHONPATH=src
pytest tests
Usage
Example Implementations
Build Customized Algorithm
- Create 4 files:
  - patterns.py: defines regular expressions of interest
    - See examples/example_patterns.py for some examples
  - test_patterns.py: tests for those regular expressions
    - Why? Make sure the patterns do what you think they do
  - algorithm.py: defines algorithm (how to use regular expressions); returns a Result
    - See examples/example_algorithm.py for guidance
  - config.(py|json|yaml): various configurations defined in schema.py
    - See example in examples/example_config.py for basic config
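As a rough illustration of the patterns.py and test_patterns.py pairing, the sketch below defines two patterns and a test for them. It uses only the standard-library `re` module rather than the library's own helper classes, and the names `SMOKER`, `NEGATED_SMOKER`, and `classify_sentence` are hypothetical, chosen for this example only. See examples/example_patterns.py for the project's actual conventions.

```python
import re

# Hypothetical patterns for illustration; plain `re` stands in for
# whatever pattern helpers the library itself provides.
SMOKER = re.compile(r'\b(?:current|former)\s+smoker\b', re.IGNORECASE)
NEGATED_SMOKER = re.compile(r'\b(?:never|non|not\s+a)[\s-]*smoker\b', re.IGNORECASE)


def classify_sentence(sentence):
    """Return 'negated', 'positive', or None for a single sentence."""
    # Check the negated form first so it takes priority over the positive one.
    if NEGATED_SMOKER.search(sentence):
        return 'negated'
    if SMOKER.search(sentence):
        return 'positive'
    return None


def test_smoker_patterns():
    # pytest-style test: confirm the patterns do what we think they do.
    assert classify_sentence('Patient is a current smoker.') == 'positive'
    assert classify_sentence('Pt is a non-smoker.') == 'negated'
    assert classify_sentence('No relevant history.') is None
```

Keeping the tests next to the patterns makes it cheap to add a failing sentence as a new test case whenever a pattern misfires on real text.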
Input Data
Runrex accepts a variety of input formats, but every input must supply at least a document_id and a document_text. The names are configurable.
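To make the idea of configurable names concrete, here is a minimal sketch of reading id/text pairs from a CSV file whose columns have different names. It uses only the standard library and is not the library's own reader; the function `iter_documents` and the column names `note_id` and `note_text` are assumptions made for this example.

```python
import csv
import io


def iter_documents(fh, id_col='document_id', text_col='document_text'):
    """Yield (document_id, document_text) pairs from an open CSV file handle.

    `id_col` and `text_col` name the columns holding the id and the text,
    so the same function works when the input uses other column names.
    """
    for row in csv.DictReader(fh):
        yield row[id_col], row[text_col]


# In-memory CSV standing in for a real input file with custom column names.
sample = io.StringIO('note_id,note_text\n1,First sentence.\n2,Second note.\n')
docs = list(iter_documents(sample, id_col='note_id', text_col='note_text'))
```

In an actual run, the mapping from source columns to document_id and document_text would be set in the config file rather than in code.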
Sentence Splitting
By default, the input document text is expected to have each sentence on a separate line. If a sentence splitting scheme is desired, it will need to be supplied to the application.
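If the source text is not already one sentence per line, a preprocessing step along the following lines could produce that layout. This is a deliberately naive sketch using the standard-library `re` module; `split_sentences` is a hypothetical helper, not part of the library, and it will wrongly split after abbreviations such as "Dr. Smith".

```python
import re

# Naive boundary: ., ! or ? followed by whitespace and then an
# uppercase letter or a digit.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')


def split_sentences(text):
    """Return `text` rewritten with one sentence per line."""
    # Collapse existing newlines and repeated spaces before splitting.
    collapsed = ' '.join(text.split())
    return '\n'.join(SENTENCE_BOUNDARY.split(collapsed))
```

For text with many abbreviations or unusual punctuation, a dedicated sentence tokenizer is likely to give better results than a single regular expression.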
Schema/Examples
For more details, see the example config or consult the schema.
Output Format
- Recommended output format is jsonl
- The data can be extracted using python:
import json
with open('output.jsonl') as fh:
    for line in fh:
        data = json.loads(line)  # each line holds one JSON record
Versions
Uses semantic versioning (SemVer).
See https://github.com/kpwhri/runrex/releases.
Roadmap
See the open issues for a list of proposed features (and known issues).
Contributing
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License.
See LICENSE or https://kpwhri.mit-license.org for more information.
Contact
Please use the issue tracker.
Acknowledgements