New Case Study:See how Anthropic automated 95% of dependency reviews with Socket.Learn More →

ctparse

Package Overview

Dependencies

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

ctparse

Parse natural language time expressions in python

0.3.6
PyPI

Maintainers: 2

=========================================================== ctparse - Parse natural language time expressions in python

.. image:: https://img.shields.io/pypi/v/ctparse.svg :target: https://pypi.python.org/pypi/ctparse :alt: PyPi

.. image:: https://readthedocs.org/projects/ctparse/badge/?version=latest :target: https://ctparse.readthedocs.io/en/latest/?badge=latest :alt: Documentation Status

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg :target: https://github.com/psf/black

Free software: MIT license
Documentation: https://ctparse.readthedocs.io.

Background

The package ctparse is a pure python package to parse time expressions from natural language (i.e. strings). In many ways it builds on similar concepts as Facebook’s duckling package (https://github.com/facebook/duckling). However, for the time being it only targets times and only German and English text.

In principle ctparse can be used to detect time expressions in a text, however its main use case is the semantic interpretation of such expressions. Detecting time expressions in the first place can - to our experience - be done more efficiently (and precisely) using e.g. CRFs or other models targeted at this specific task.

ctparse is designed with the use case in mind where interpretation of time expressions is done under the following assumptions:

All expressions are relative to some pre-defined reference times
Unless explicitly specified in the time expression, valid resolutions are in the future relative to the reference time (i.e. 12.5. will be the next 12th of May, but 12.5.2012 should correctly resolve to the 12th of May 2012).
If in doubt, resolutions in the near future are more likely than resolutions in the far future (not implemented yet, but any resolution more than i.e. 3 month in the future is extremely unlikely).

The specific comtravo use-case is resolving time expressions in booking requests which almost always refer to some point in time within the next 4-8 weeks.

ctparse currently is language agnostic and supports German and English expressions. This might get an extension in the future. The main reason is that in real world communication more often than not people write in one language (their business language) but use constructs to express times that are based on their mother tongue and/or what they believe to be the way to express dates in the target language. This leads to text in German with English time expressions and vice-versa. Using a language detection upfront on the complete original text is for obvious no solution - rather it would make the problem worse.

Example

.. code:: python

from ctparse import ctparse from datetime import datetime

Set reference time

ts = datetime(2018, 3, 12, 14, 30) ctparse('May 5th 2:30 in the afternoon', ts=ts)

This should return a Time object represented as Time[0-29]{2018-05-05 14:30 (X/X)}, indicating that characters 0-29 were used in the resolution, that the resolved date time is the 5th of May 2018 at 14:30 and that this resolution is neither based on a day of week (first X) nor a part of day (second X).

Latent time


Normally, ``ctparse`` will anchor time expressions to the reference time.
For example, when parsing the time expression ``8:00 pm``, ctparse will
resolve the expression to 8 pm after the reference time as follows

.. code:: python

   parse = ctparse("8:00 pm", ts=datetime(2020, 1, 1, 7, 0), latent_time=True) # default
   # parse.resolution -> Time(2020, 1, 1, 20, 00)

This behavior can be customized using the option ``latent_time=False``, which will
return a time resolution not anchored to a particular date

.. code:: python

   parse = ctparse("8:00 pm", ts=datetime(2020, 1, 1, 7, 0), latent_time=False)
   # parse.resolution -> Time(None, None, None, 20, 00)

Implementation
--------------

``ctparse`` - as ``duckling`` - is a mixture of a rule and regular
expression based system + some probabilistic modeling. In this sense it
resembles a PCFG.

Rules
~~~~~

At the core ``ctparse`` is a collection of production rules over
sequences of regular expressions and (intermediate) productions.

Productions are either of type ``Time``, ``Interval`` or ``Duration`` and can
have certain predicates (e.g. whether a ``Time`` is a part of day like
``'afternoon'``).

A typical rule than looks like this:

.. code:: python

   @rule(predicate('isDate'), dimension(Interval))

I.e. this rule is applicable when the intermediate production resulted
in something that has a date, followed by something that is in interval
(like e.g. in ``'May 5th 9-10'``).

The actual production is a python function with the following signature:

.. code:: python

   @rule(predicate('isDate'), dimension(Interval))
   def ruleDateInterval(ts, d, i):
     """
     param ts: datetime - the current refenrence time
     d: Time - a time that contains at least a full date
     i: Interval - some Interval
     """
     if not (i.t_from.isTOD and i.t_to.isTOD):
       return None
     return Interval(
       t_from=Time(year=d.year, month=d.month, day=d.day,
                   hour=i.t_from.hour, minute=i.t_from.minute),
       t_to=Time(year=d.year, month=d.month, day=d.day,
                 hour=i.t_to.hour, minute=i.t_to.minute))

This production will return a new interval at the date of
``predicate('isDate')`` spanning the time coded in
``dimension(Interval)``. If the latter does code for something else than
a time of day (TOD), no production is returned, e.g. the rule matched
but failed.


Technical Background

Some observations on the problem:

Each rule is a combination of regular expressions and productions.
Consequently, each production must originate in a sequence of regular expressions that must have matched (parts of) the text.
Hence, only subsequence of all regular expressions in all rules can lead to a successful production.

To this end the algorithm proceeds as follows:

Input a string and a reference time
Find all matches of all regular expressions from all rules in the input strings. Each regular expression is assigned an identifier.
Find all distinct sequences of these matches where two matches do not overlap nor have a gap inbetween
To each such subsequence apply all rules at all possible positions until no further rules can be applied - in which case one solution is produced

Obviously, not all sequences of matching expressions and not all sequences of rules applied on top lead to meaningful results. Here the P\ CFG kicks in:

Based on example data (corpus.py) a model is calibrated to predict how likely a production is to lead to a/the correct result. Instead of doing a breadth first search, the most promising productions are applied first.
Resolutions are produced until there are no more resolutions or a timeout is hit.
Based on the same model from all resolutions the highest scoring is returned.

Credits

This package was created with Cookiecutter_ and the audreyr/cookiecutter-pypackage_ project template.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter .. _audreyr/cookiecutter-pypackage: https://github.com/audreyr/cookiecutter-pypackage

======= History

0.3.4 (2022-07-28)

Add fuzzy matching on longer literals
[internal] De-tangle corpus tests into isolated test cases
Allow spaces around separators in ruleDDMMYYYY and ruleYYYYMMDD

0.3.3 (2022-07-18)

Add rule for straight forward US formatted dates (ruleYYYYMMDD)
Added rule ruleYearMonth
Added corpus cases for some open issues that now pass
Changed all internal imports to be absolute (i.e. from ctparse.x instead of from .x)
Dropped tox (now using github actions)

0.3.2 (2022-07-18)

Drop support for python 3.6, update dev requirements

0.3.1 (2021-07-07)

Add support for python 3.9 on travis and in manifest; update build config

0.3.0 (2021-02-01)

Removed latent rules regarding times (latent rules regarding dates are still present)
Added latent_time option to customize the new behavior, defauld behavior is backwards-compatible

0.2.1 (2020-05-27)

Update development dependencies
Add flake8-bugbear and fixed issues

0.2.0 (2020-04-23)

Implemented new type Duration, to handle lengths of time
Adapted the dataset to include Duration
Implemented basic rule to merge Duration, Time and Interval in simple cases.
Created a make target to train the model make train

0.1.0 (2020-03-20)

Major refactor of code underlying predictive model
Based on a contribution from @bharathi-srini: replace naive bayes from sklearn by own implementation
Thus remove dependencies on numpy, scipy, scikit-learn
Predictions are much faster: 97/s in the old vs. 239/s in the new code base
Performance identical
Deprecate support for python 3.5, add 3.8
Add more strict type checking rules (mypy.ini)
Force black code formatting, make this a linter step, "black" all code

0.0.47 (2020-02-28)

Allow overlapping matches of regular expression when generating inital stack of "tokens"

0.0.46 (2020-02-26)

Implemented heuristics to detect (albeit imperfectly) military times

0.0.44 (2019-11-05)

Released time corpus
Implemented training model using ctparse corpus

0.0.43 (2019-11-01)

Added slash as a general separator
Added ruleTODTOD (to support expression like afternoon/evening)

0.0.42 (2019-10-30)

Removed nb module
Fix for two digit years
Freshly retrained model binary file

0.0.41 (2019-10-29)

Fix run_corpus refactoring bug
Implemented retraining utilities

0.0.40 (2019-10-25)

update develop dependencies
remove unused Protocol import from typing_extensions

0.0.39 (2019-10-24)

split ctparse file into several different modules
added types to public interface
introduced the Scorer abstraction to implement richer scoring strategies

0.0.38 (2018-11-05)

Added python 3.7 to supported versions (fix on travis available)

0.0.8 (2018-06-07)

First release on PyPI.

Keywords

ctparse time parsing natural language

FAQs

What is ctparse?

Is ctparse well maintained?

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

ctparse

=========================================================== ctparse - Parse natural language time expressions in python

Background

Example

Set reference time

Credits

======= History

0.3.4 (2022-07-28)

0.3.3 (2022-07-18)

0.3.2 (2022-07-18)

0.3.1 (2021-07-07)

0.3.0 (2021-02-01)

0.2.1 (2020-05-27)

0.2.0 (2020-04-23)

0.1.0 (2020-03-20)

0.0.47 (2020-02-28)

0.0.46 (2020-02-26)

0.0.44 (2019-11-05)

0.0.43 (2019-11-01)

0.0.42 (2019-10-30)

0.0.41 (2019-10-29)

0.0.40 (2019-10-25)

0.0.39 (2019-10-24)

0.0.38 (2018-11-05)

0.0.8 (2018-06-07)

Keywords

Related posts

Ransomware in 2024: Record-Low Payment Rate Signals Changing Economics of Cybercrime

require(esm) Backported to Node.js 20, Paving the Way for ESM-Only Packages