
=============
``py3langid``
=============


``py3langid`` is a fork of the standalone language identification tool ``langid.py`` by Marco Lui.

Original license: BSD-2-Clause. Fork license: BSD-3-Clause.



Changes in this fork
--------------------

Execution speed has been improved and the code base has been optimized for Python 3.6+:

- Import: Loading the package (``import py3langid``) is about 30% faster
- Startup: Loading the default classification model is 25-30x faster
- Execution: Language detection with ``langid.classify`` is 5-6x faster on paragraphs (less on longer texts)

For implementation details see this blog post: `How to make language detection with langid.py faster <https://adrien.barbaresi.eu/blog/language-detection-langid-py-faster.html>`_.


Usage
-----

Drop-in replacement
~~~~~~~~~~~~~~~~~~~


1. Install the package:

   * ``pip3 install py3langid`` (or ``pip`` where applicable)

2. Use it:

   * with Python: ``import py3langid as langid``
   * on the command-line: ``langid``


With Python
~~~~~~~~~~~

Basics:

.. code-block:: python

    >>> import py3langid as langid

    >>> text = 'This text is in English.'
    # identified language and probability
    >>> langid.classify(text)
    ('en', -56.77429)
    # unpack the result tuple in variables
    >>> lang, prob = langid.classify(text)
    # all potential languages
    >>> langid.rank(text)

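The ranking can be post-processed like any list. The following is a minimal sketch, assuming ``rank`` returns a list of ``(language, score)`` tuples; ``top_candidates``, ``score_margin`` and the sample scores are hypothetical and only serve as an illustration:

.. code-block:: python

    def top_candidates(ranking, k=3):
        """Return the k best (language, score) pairs from a ranking."""
        # Sort defensively, in case the input is not already ordered by score.
        return sorted(ranking, key=lambda pair: pair[1], reverse=True)[:k]

    def score_margin(ranking):
        """Difference between the best and the second-best score."""
        best, runner_up = top_candidates(ranking, k=2)
        return best[1] - runner_up[1]

    # Made-up scores standing in for the output of langid.rank(text).
    sample_ranking = [("en", -56.8), ("nl", -91.2), ("de", -88.5), ("fr", -102.4)]
    print(top_candidates(sample_ranking, k=2))  # [('en', -56.8), ('de', -88.5)]
    print(score_margin(sample_ranking))

A small margin between the two best scores can hint that the input is ambiguous or too short.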

More options:

.. code-block:: python

    >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE

    # subset of target languages
    >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE)
    >>> identifier.set_languages(['de', 'en', 'fr'])
    # this won't work well...
    >>> identifier.classify('这样不好')
    ('en', -81.831665)

    # normalization of probabilities to an interval between 0 and 1
    >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
    >>> identifier.classify('This should be enough text.')
    ('en', 1.0)

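As the Chinese example shows, a restricted identifier always answers with one of the remaining languages. One possible workaround, sketched here under assumptions, is to classify with an unrestricted model first and then check the result against the wanted set. ``classify_within`` and ``fake_classify`` are hypothetical helpers and not part of the package; in practice ``langid.classify`` could be passed as the ``classify`` argument:

.. code-block:: python

    def classify_within(text, allowed, classify):
        """Classify without restrictions, then accept only allowed languages.

        `classify` is any callable returning a (language, score) tuple.
        Returns None when the detected language is outside `allowed`.
        """
        lang, score = classify(text)
        return (lang, score) if lang in allowed else None

    # Stub standing in for an unrestricted classifier; the scores are made up.
    def fake_classify(text):
        has_cjk = any("\u4e00" <= char <= "\u9fff" for char in text)
        return ("zh", -20.0) if has_cjk else ("en", -50.0)

    allowed = {"de", "en", "fr"}
    print(classify_within("这样不好", allowed, fake_classify))       # None
    print(classify_within("This is fine.", allowed, fake_classify))  # ('en', -50.0)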

Note: the NumPy data type for the feature vector has been changed to optimize for speed. If results are inconsistent, try restoring the original setting:

.. code-block:: python

    >>> langid.classify(text, datatype='uint32')


On the command-line
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # basic usage with probability normalization
    $ echo "This should be enough text." | langid -n
    ('en', 1.0)

    # define a subset of target languages
    $ echo "This won't be recognized properly." | langid -n -l fr,it,tr
    ('it', 0.97038305)

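The command-line output shown above has the form of a Python tuple literal, so it can be read back in a script. This is a minimal sketch using only the standard library; ``parse_langid_line`` is a hypothetical helper, and the input format is assumed to match the examples above:

.. code-block:: python

    import ast

    def parse_langid_line(line):
        """Parse one output line such as "('en', 1.0)" into (language, score)."""
        # literal_eval only evaluates Python literals, never arbitrary code.
        lang, score = ast.literal_eval(line.strip())
        return str(lang), float(score)

    print(parse_langid_line("('en', 1.0)\n"))        # ('en', 1.0)
    print(parse_langid_line("('it', 0.97038305)"))   # ('it', 0.97038305)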


Legacy documentation
--------------------


**The docs below are provided for reference only. Only some of the functions are currently tested and maintained.**


Introduction
------------

``langid.py`` is a standalone Language Identification (LangID) tool.

The design principles are as follows:

1. Fast
2. Pre-trained over a large number of languages (currently 97)
3. Not sensitive to domain-specific features (e.g. HTML/XML markup)
4. Single .py file with minimal dependencies
5. Deployable as a web service

All that is required to run ``langid.py`` is Python >= 3.6 and NumPy.

The accompanying training tools are still Python2-only.

``langid.py`` is WSGI-compliant.  ``langid.py`` will use ``fapws3`` as a web server if 
available, and default to ``wsgiref.simple_server`` otherwise.

``langid.py`` comes pre-trained on 97 languages (ISO 639-1 codes given):

    af, am, an, ar, as, az, be, bg, bn, br, 
    bs, ca, cs, cy, da, de, dz, el, en, eo, 
    es, et, eu, fa, fi, fo, fr, ga, gl, gu, 
    he, hi, hr, ht, hu, hy, id, is, it, ja, 
    jv, ka, kk, km, kn, ko, ku, ky, la, lb, 
    lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, 
    nb, ne, nl, nn, no, oc, or, pa, pl, ps, 
    pt, qu, ro, ru, rw, se, si, sk, sl, sq, 
    sr, sv, sw, ta, te, th, tl, tr, ug, uk, 
    ur, vi, vo, wa, xh, zh, zu

The training data was drawn from 5 different sources:

* JRC-Acquis 
* ClueWeb 09
* Wikipedia
* Reuters RCV2
* Debian i18n


Usage
-----

::

    langid [options]

    optional arguments:
      -h, --help            show this help message and exit
      -s, --serve           launch web service
      --host=HOST           host/ip to bind to
      --port=PORT           port to listen on
      -v                    increase verbosity (repeat for greater effect)
      -m MODEL              load model from file
      -l LANGS, --langs=LANGS
                            comma-separated set of target ISO639 language codes
                            (e.g en,de)
      -r, --remote          auto-detect IP address for remote access
      -b, --batch           specify a list of files on the command line
      -d, --dist            show full distribution over languages
      -u URL, --url=URL     langid of URL
      --line                process pipes line-by-line rather than as a document
      -n, --normalize       normalize confidence scores to probability values


The simplest way to use ``langid.py`` is as a command-line tool, which you can 
invoke with ``python langid.py``. If you installed ``langid.py`` as a Python 
module (e.g. via ``pip install langid``), you can invoke ``langid`` instead of 
``python langid.py`` (the two are equivalent).  This displays a prompt. 
Enter text to identify, and press Enter::

  >>> This is a test
  ('en', -54.41310358047485)
  >>> Questa e una prova
  ('it', -35.41771221160889)


``langid.py`` can also detect when the input is redirected (only tested under Linux). In this
case it processes the input until EOF rather than stopping at each newline as it does in interactive mode::

  python langid.py < README.rst 
  ('en', -22552.496054649353)


The value returned is the unnormalized probability estimate for the language. Calculating 
the exact probability estimate is disabled by default, but can be enabled through a flag::

  python langid.py -n < README.rst 
  ('en', 1.0)

More details are provided in this README in the section on `Probability Normalization`_.

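The difference between interactive and redirected input described earlier can be illustrated with ``isatty()``, a common way to tell a terminal from a pipe or file in Python. This is a minimal sketch of that behaviour and not the actual ``langid.py`` code; ``read_queries`` is a hypothetical helper:

.. code-block:: python

    import io

    def read_queries(stream):
        """Yield one query per line on a terminal, else the whole input at once."""
        if stream.isatty():
            # Interactive use: every line entered is a separate query.
            for line in stream:
                yield line.rstrip("\n")
        else:
            # Redirected input: read until EOF and treat it as one document.
            yield stream.read()

    # An in-memory stream is not a terminal, so it is read as a single document.
    fake_pipe = io.StringIO("first line\nsecond line\n")
    queries = list(read_queries(fake_pipe))
    print(len(queries))  # 1

In a script, ``sys.stdin`` would be passed in place of the in-memory stream.
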
You can also use ``langid.py`` as a Python library::

  >>> import langid
  >>> langid.classify("This is a test")
  ('en', -54.41310358047485)

Finally, ``langid.py`` can use Python's built-in ``wsgiref.simple_server`` (or ``fapws3`` if available) to
provide language identification as a web service. To do this, launch ``python langid.py -s``, and
access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed
with no data, a simple HTML forms interface is displayed.

The response is returned as JSON. Here is an example::

  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

A utility such as curl can be used to access the web service::

  # curl -d "q=This is a test" localhost:9008/detect
  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

You can also use HTTP PUT::

  # curl -T readme.rst localhost:9008/detect
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
  {"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no "q=XXX" key-value pair is present in the HTTP POST payload, ``langid.py`` will interpret the entire
file as a single query. This allows for redirection via curl::

  # echo "This is a test" | curl -d @- localhost:9008/detect
  {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

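On the client side, the JSON responses shown above can be decoded with the standard library. This is a minimal sketch that only relies on the fields visible in the examples; ``parse_detect_response`` is a hypothetical helper:

.. code-block:: python

    import json

    def parse_detect_response(body):
        """Extract (language, confidence) from a web service response body."""
        payload = json.loads(body)
        if payload["responseStatus"] != 200:
            raise ValueError(f"unexpected status: {payload['responseStatus']}")
        data = payload["responseData"]
        return data["language"], data["confidence"]

    # Sample body copied from the response example above.
    sample = (
        '{"responseData": {"confidence": -54.41310358047485, "language": "en"}, '
        '"responseDetails": null, "responseStatus": 200}'
    )
    print(parse_detect_response(sample))  # ('en', -54.41310358047485)

The body itself could come from any HTTP client, for example the ``curl`` calls shown above.
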
``langid.py`` will attempt to discover the host IP address automatically. Often, this is set to localhost (127.0.1.1), even 
though the machine has a different external IP address. To make ``langid.py`` try to discover the external
IP address instead, start it with the ``-r`` flag.

``langid.py`` supports constraining of the output language set using the ``-l`` flag and a comma-separated list of ISO639-1 
language codes (the ``-n`` flag enables probability normalization)::

  # python langid.py -n -l it,fr
  >>> Io non parlo italiano
  ('it', 0.99999999988965627)
  >>> Je ne parle pas français
  ('fr', 1.0)
  >>> I don't speak english
  ('it', 0.92210605672341062)

When using ``langid.py`` as a library, the ``set_languages`` method can be used to constrain the language set::

  >>> import langid
  >>> langid.classify("I do not speak english")
  ('en', 0.57133487679900674)
  >>> langid.set_languages(['de','fr','it'])
  >>> langid.classify("I do not speak english")
  ('it', 0.99999835791478453)
  >>> langid.set_languages(['en','it'])
  >>> langid.classify("I do not speak english")
  ('en', 0.99176190378750373)


Batch Mode
----------

``langid.py`` supports batch mode processing, which can be invoked with the ``-b`` flag.
In this mode, ``langid.py`` reads a list of paths to files to classify as arguments.
If no arguments are supplied, ``langid.py`` reads the list of paths from ``stdin``,
which is useful for combining ``langid.py`` with UNIX utilities such as ``find``.

In batch mode, ``langid.py`` uses ``multiprocessing`` to invoke multiple instances of
the classifier, utilizing all available CPUs to classify documents in parallel. 


Probability Normalization
-------------------------

The probabilistic model implemented by ``langid.py`` involves the multiplication of a
large number of probabilities. For computational reasons, the actual calculations are
implemented in the log-probability space (a common numerical technique for dealing with
vanishingly small probabilities). One side-effect of this is that it is not necessary to
compute a full probability in order to determine the most probable language in a set
of candidate languages. However, users sometimes find it helpful to have a "confidence"
score for the probability prediction. Thus, ``langid.py`` implements a re-normalization
that produces an output in the 0-1 range.

``langid.py`` disables probability normalization by default. For
command-line usages of ``langid.py``, it can be enabled by passing the ``-n`` flag. For
probability normalization in library use, the user must instantiate their own 
``LanguageIdentifier``. An example of such usage is as follows::

  >>> from py3langid.langid import LanguageIdentifier, MODEL_FILE
  >>> identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)
  >>> identifier.classify("This is a test")
  ('en', 0.9999999909903544)

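The idea behind such a re-normalization can be sketched in plain Python. This is an illustration of a standard softmax over log scores and not necessarily identical to the code in the package; ``normalize_log_scores`` and the sample scores are hypothetical:

.. code-block:: python

    import math

    def normalize_log_scores(log_scores):
        """Map per-language log scores to probabilities that sum to 1."""
        # Subtracting the highest score first keeps exp() away from underflow.
        highest = max(log_scores.values())
        weights = {lang: math.exp(score - highest)
                   for lang, score in log_scores.items()}
        total = sum(weights.values())
        return {lang: weight / total for lang, weight in weights.items()}

    # Made-up log scores for three candidate languages.
    scores = {"en": -54.4, "de": -60.1, "fr": -75.3}
    probs = normalize_log_scores(scores)
    best = max(probs, key=probs.get)
    print(best)  # en

Because the mapping is monotonic, the most probable language is the same before and after normalization. Only the scale of the score changes.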

Training a model
----------------

Training is currently supported on Python 2.7 only; see the `original instructions <https://github.com/saffsd/langid.py#training-a-model>`_.


Read more
---------

``langid.py`` is based on published research. [1] describes the LD feature selection technique in detail,
and [2] provides more detail about the module ``langid.py`` itself.

[1] Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, 
In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), 
Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062

[2] Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, 
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), 
Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005

