docdeid

Create your own document de-identifier using docdeid, a simple framework independent of language or domain.

  • 1.0.0
  • PyPI

Installation - Getting started - Features - Documentation - Development and contributing - Authors - License

Note that docdeid is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving docdeid, feel free to get in touch to coordinate.

Installation

Grab the latest version from PyPI:

pip install docdeid

Getting started

import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

deidentifier.processors.add_processor(
    "redactor", 
    SimpleRedactor()
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)

Find the relevant info in the Document object:

print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4), 
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})

print(doc.deidentified_text)

'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
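
To make the numbering in the output above concrete, here is a minimal sketch in plain Python of one way character spans could be turned into `[TAG-n]` placeholders. It is not docdeid's `SimpleRedactor` implementation: the `Span` tuple and the `redact` function are hypothetical names used only for illustration, and the sketch assumes that spans do not overlap and that identical texts with the same tag share a number.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an annotation: text, character offsets, tag."""
    text: str
    start_char: int
    end_char: int
    tag: str


def redact(text: str, spans: list[Span]) -> str:
    """Replace each span with [TAG-n], reusing n for repeated (tag, text) pairs."""
    numbers: dict[tuple[str, str], int] = {}  # (tag, text) -> assigned number
    counters: dict[str, int] = {}             # tag -> highest number used so far
    pieces: list[str] = []
    cursor = 0
    # Assumption: spans are non-overlapping, so a left-to-right pass suffices.
    for span in sorted(spans, key=lambda s: s.start_char):
        key = (span.tag, span.text)
        if key not in numbers:
            counters[span.tag] = counters.get(span.tag, 0) + 1
            numbers[key] = counters[span.tag]
        pieces.append(text[cursor:span.start_char])  # untouched text before the span
        pieces.append(f"[{span.tag.upper()}-{numbers[key]}]")
        cursor = span.end_char
    pieces.append(text[cursor:])  # remainder after the last span
    return "".join(pieces)


text = "John loves Mary, but Mary loves William."
spans = [
    Span("John", 0, 4, "name"),
    Span("Mary", 11, 15, "name"),
    Span("Mary", 21, 25, "name"),
    Span("William", 32, 39, "name"),
]
print(redact(text, spans))
```

Keying the counter on the (tag, text) pair is what lets both occurrences of "Mary" map to the same placeholder, which keeps the de-identified text readable.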

Features

Additionally, docdeid features:

  • Ability to create your own Annotator, AnnotationProcessor, Redactor and Tokenizer components
  • Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
  • Callable from one interface (DocDeid.deidentify())
  • String processing and filtering
  • Fast lookup based on sets or tries
  • Anything you add! PRs welcome.
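
As an illustration of the "sets or tries" item in the list above, the following is a rough sketch of the two lookup styles in plain Python. It does not use or reproduce docdeid's own data structures; the `TokenTrie` class and its methods are hypothetical names. The idea is that a set answers single-token membership in constant time, while a trie over token sequences finds the longest multi-token phrase starting at a given position.

```python
from __future__ import annotations


class TokenTrie:
    """Hypothetical minimal trie over token sequences (not docdeid's class)."""

    def __init__(self) -> None:
        self.children: dict[str, TokenTrie] = {}
        self.is_terminal = False  # True if a stored phrase ends at this node

    def add(self, tokens: list[str]) -> None:
        """Store one phrase, given as a list of tokens."""
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_terminal = True

    def longest_match(self, tokens: list[str], start: int = 0) -> list[str] | None:
        """Return the longest stored phrase beginning at tokens[start], if any."""
        node = self
        best_end = None
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i])
            if node is None:
                break  # no stored phrase continues with this token
            if node.is_terminal:
                best_end = i + 1
        return tokens[start:best_end] if best_end is not None else None


# Single-token lookup: a plain set is enough.
names = {"John", "Mary"}

# Multi-token lookup: the trie prefers the longest stored phrase.
trie = TokenTrie()
trie.add(["Mary", "Ann"])
trie.add(["Mary", "Ann", "Smith"])

tokens = ["Mary", "Ann", "Smith", "called"]
print(tokens[0] in names)
print(trie.longest_match(tokens))
```

The trie walks the input once per start position and stops at the first token that no stored phrase continues with, so the lookup cost depends on phrase length, not on the number of stored phrases.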

For a more in-depth tutorial, see: docs/tutorial

Documentation

For full documentation and API, see: https://docdeid.readthedocs.io/en/latest/

Development and contributing

For setting up a development environment, see: docs/environment

For contributing, see: docs/contributing

Authors

Vincent Menger - Author, maintainer

License

This project is licensed under the MIT license - see the LICENSE.md file for details.
