==================
textblob-de README
==================

.. image:: https://img.shields.io/pypi/v/textblob-de.svg
    :target: https://pypi.python.org/pypi/textblob-de/
    :alt: textblob_de - latest PyPI version

.. image:: https://travis-ci.org/markuskiller/textblob-de.png?branch=dev
    :target: https://travis-ci.org/markuskiller/textblob-de
    :alt: Travis-CI

.. image:: https://readthedocs.org/projects/textblob-de/badge/?version=latest
    :target: http://textblob-de.readthedocs.org/en/latest/
    :alt: Documentation Status

.. image:: https://img.shields.io/pypi/dm/textblob-de.svg
    :target: https://pypi.python.org/pypi/textblob-de/
    :alt: Number of PyPI downloads

.. image:: https://img.shields.io/github/license/markuskiller/textblob-de.svg
    :target: http://choosealicense.com/licenses/mit/
    :alt: LICENSE info

German language support for `TextBlob <http://textblob.readthedocs.org/en/dev/>`_ by Steven Loria.

This Python package is being developed as a ``TextBlob`` Language Extension.
See `Extension Guidelines <https://textblob.readthedocs.org/en/dev/contributing.html>`_ for details.

Features
--------

- NEW: Works with Python3.7
- All directly accessible ``textblob_de`` classes (e.g. ``Sentence()`` or ``Word()``) are initialized with default models for German
- Properties or methods that do not yet work for German raise a ``NotImplementedError``
- German sentence boundary detection and tokenization (``NLTKPunktTokenizer``)
- Consistent use of specified tokenizer for all tools (``NLTKPunktTokenizer`` or ``PatternTokenizer``)
- Part-of-speech tagging (``PatternTagger``) with keyword ``include_punc=True`` (defaults to ``False``)
- Tagset conversion in ``PatternTagger`` with keyword ``tagset='penn'|'universal'|'stts'`` (defaults to ``penn``)
- Parsing (``PatternParser``) with all ``pattern`` keywords, plus ``pprint=True`` (defaults to ``False``)
- Noun Phrase Extraction (``PatternParserNPExtractor``)
- Lemmatization (``PatternParserLemmatizer``)
- Polarity detection (``PatternAnalyzer``) - Still EXPERIMENTAL, does not yet have information on subjectivity
- Full ``pattern.text.de`` API support on Python3
- Supports Python 2 and 3
- See `working features overview <http://langui.ch/nlp/python/textblob-de-dev/>`_ for details

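Because unsupported properties and methods raise a ``NotImplementedError``, calling code can catch
that exception and fall back to a default. The following is a minimal sketch of that pattern only:
``FakeBlob``, ``unsupported_feature`` and ``get_or_default`` are hypothetical names used for
illustration and are not part of ``textblob_de``.

.. code-block:: python

    class FakeBlob(object):
        """Hypothetical stand-in for an object with a not-yet-supported feature."""

        def __init__(self, text):
            self.text = text

        @property
        def unsupported_feature(self):
            # Mirrors the documented behaviour: unsupported features raise.
            raise NotImplementedError("not yet available for German")


    def get_or_default(obj, attr_name, default=None):
        """Return obj.<attr_name>, or default if it raises NotImplementedError."""
        try:
            return getattr(obj, attr_name)
        except NotImplementedError:
            return default


    blob = FakeBlob("Das Auto ist sehr schön.")
    result = get_or_default(blob, "unsupported_feature", default="n/a")
    print(result)
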
Installing/Upgrading
--------------------
::

    $ pip install -U textblob-de
    $ python -m textblob.download_corpora

Or the latest development release (apparently this does not always work on Windows; see
`issues #1744/5 <https://github.com/pypa/pip/pull/1745>`_ for details)::

    $ pip install -U git+https://github.com/markuskiller/textblob-de.git@dev
    $ python -m textblob.download_corpora

.. note::

    ``TextBlob`` will be installed/upgraded automatically when running
    ``pip install``. The second line (``python -m textblob.download_corpora``)
    downloads/updates nltk corpora and language models used in ``TextBlob``.

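To confirm that both packages are visible to the current interpreter after installation, a quick
check along the following lines may help. This is only a sketch: it assumes Python 3.4 or later
(for ``importlib.util.find_spec``), and ``is_installed`` is a hypothetical helper, not something
shipped with ``textblob_de``. It only locates the modules and does not verify that the corpora
were downloaded.

.. code-block:: python

    import importlib.util


    def is_installed(module_name):
        """Return True if a top-level module can be located without importing it."""
        return importlib.util.find_spec(module_name) is not None


    for name in ("textblob", "textblob_de"):
        status = "found" if is_installed(name) else "missing"
        print("{0}: {1}".format(name, status))
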
Usage
-----

.. code-block:: python

    >>> from textblob_de import TextBlobDE as TextBlob
    >>> text = '''Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag.
    Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen. Aber leider
    habe ich nur noch EUR 3.50 in meiner Brieftasche.'''
    >>> blob = TextBlob(text)
    >>> blob.sentences
    [Sentence("Heute ist der 3. Mai 2014 und Dr. Meier feiert seinen 43. Geburtstag."),
     Sentence("Ich muss unbedingt daran denken, Mehl, usw. für einen Kuchen einzukaufen."),
     Sentence("Aber leider habe ich nur noch EUR 3.50 in meiner Brieftasche.")]
    >>> blob.tokens
    WordList(['Heute', 'ist', 'der', '3.', 'Mai', ...])
    >>> blob.tags
    [('Heute', 'RB'), ('ist', 'VB'), ('der', 'DT'), ('3.', 'LS'), ('Mai', 'NN'),
    ('2014', 'CD'), ...]
    # Default: Only noun_phrases that consist of two or more meaningful parts are displayed.
    # Not perfect, but a start (relies heavily on parser accuracy)
    >>> blob.noun_phrases
    WordList(['Mai 2014', 'Dr. Meier', 'seinen 43. Geburtstag', 'Kuchen einzukaufen',
    'meiner Brieftasche'])

.. code-block:: python

    >>> blob = TextBlob("Das Auto ist sehr schön.")
    >>> blob.parse()
    'Das/DT/B-NP/O Auto/NN/I-NP/O ist/VB/B-VP/O sehr/RB/B-ADJP/O schön/JJ/I-ADJP/O'
    >>> from textblob_de import PatternParser
    >>> blob = TextBlob("Das ist ein schönes Auto.", parser=PatternParser(pprint=True, lemmata=True))
    >>> blob.parse()
              WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA
               Das   DT     -       -      -      -      das
               ist   VB     VP      -      -      -      sein
               ein   DT     NP      -      -      -      ein
           schönes   JJ     NP ^    -      -      -      schön
              Auto   NN     NP ^    -      -      -      auto
                 .   .      -       -      -      -      .
    >>> from textblob_de import PatternTagger
    >>> blob = TextBlob("Das Auto ist sehr schön.", pos_tagger=PatternTagger(include_punc=True))
    >>> blob.tags
    [('Das', 'DT'), ('Auto', 'NN'), ('ist', 'VB'), ('sehr', 'RB'), ('schön', 'JJ'), ('.', '.')]

.. code-block:: python

    >>> blob = TextBlob("Das Auto ist sehr schön.")
    >>> blob.sentiment
    Sentiment(polarity=1.0, subjectivity=0.0)
    >>> blob = TextBlob("Das ist ein hässliches Auto.")
    >>> blob.sentiment
    Sentiment(polarity=-1.0, subjectivity=0.0)

.. warning::

    **WORK IN PROGRESS:** The German polarity lexicon contains only uninflected
    forms and there are no subjectivity scores yet. As of version 0.2.3, lemmatized
    word forms are submitted to the ``PatternAnalyzer``, increasing the accuracy
    of polarity values. New in version 0.2.7: return type of ``.sentiment`` is now
    adapted to the main `TextBlob <http://textblob.readthedocs.org/en/dev/>`_ library (``:rtype: namedtuple``).

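Since ``.sentiment`` is a namedtuple, its values can be read by field name or unpacked like a
plain tuple. The sketch below rebuilds an equivalent namedtuple with the standard library purely
for illustration; in real use the value would come from ``blob.sentiment``. The field names are
taken from the output shown above, and ``label`` is a hypothetical variable.

.. code-block:: python

    from collections import namedtuple

    # Stand-in with the same field names as the result shown above.
    Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

    result = Sentiment(polarity=-1.0, subjectivity=0.0)

    # Access by field name ...
    print(result.polarity)

    # ... or unpack like a regular tuple.
    polarity, subjectivity = result
    label = "negative" if polarity < 0 else "non-negative"
    print(label)
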
.. code-block:: python

    >>> blob.words.lemmatize()
    WordList(['das', 'sein', 'ein', 'hässlich', 'Auto'])
    >>> from textblob_de.lemmatizers import PatternParserLemmatizer
    >>> _lemmatizer = PatternParserLemmatizer()
    >>> _lemmatizer.lemmatize("Das ist ein hässliches Auto.")
    [('das', 'DT'), ('sein', 'VB'), ('ein', 'DT'), ('hässlich', 'JJ'), ('Auto', 'NN')]

.. note::

    Make sure that you use unicode strings on Python2 if your input contains
    non-ascii characters (e.g. ``word = u"schön"``).

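The difference between a unicode (text) string and its encoded bytes can be illustrated with the
standard library alone. This sketch uses the ``u"..."`` prefix, which is accepted on Python 2 and
on Python 3.3 or later; the variable names are chosen for illustration only.

.. code-block:: python

    # -*- coding: utf-8 -*-
    word = u"schön"                  # text (unicode) string
    encoded = word.encode("utf-8")   # byte string, e.g. as read from a UTF-8 file

    print(len(word))     # number of characters
    print(len(encoded))  # number of bytes; the umlaut takes two bytes in UTF-8

    # Decode bytes back to text before passing them on as input.
    decoded = encoded.decode("utf-8")
    print(decoded == word)
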
Access to ``pattern`` API in Python3
------------------------------------

.. code-block:: python

    >>> from textblob_de.packages import pattern_de as pd
    >>> print(pd.attributive("neugierig", gender=pd.FEMALE, role=pd.INDIRECT, article="die"))
    neugierigen

.. note::

    Alternatively, the path to ``textblob_de/ext`` can be added to the ``PYTHONPATH``, which allows
    the use of ``pattern.de`` in almost the same way as described in its
    `Documentation <http://www.clips.ua.ac.be/pages/pattern-de>`_.
    The only difference is that you will have to prepend an underscore:
    ``from _pattern.de import ...``. This is a precautionary measure in case the ``pattern``
    library gets native Python3 support in the future.

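Instead of setting the ``PYTHONPATH`` environment variable, the same directory can be appended to
``sys.path`` at runtime. The following is a sketch of that alternative, not the project's
documented procedure: ``add_ext_to_path`` is a hypothetical helper and the directory passed to it
is a placeholder that has to be replaced with the actual location of the installed
``textblob_de`` package.

.. code-block:: python

    import os
    import sys


    def add_ext_to_path(package_dir):
        """Append <package_dir>/ext to sys.path if it is not there already."""
        ext_dir = os.path.join(os.path.abspath(package_dir), "ext")
        if ext_dir not in sys.path:
            sys.path.append(ext_dir)
        return ext_dir


    # Placeholder path: adjust to where textblob_de is installed.
    ext_dir = add_ext_to_path("/path/to/site-packages/textblob_de")
    print(ext_dir in sys.path)

    # With a real path in place, imports of the form
    # ``from _pattern.de import ...`` should then resolve.
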
Documentation and API Reference
-------------------------------

Requirements
------------

TODO
----

- `Planned Extensions <http://textblob-de.readthedocs.org/en/latest/extensions.html>`_
- Additional PoS tagging options, e.g. NLTK tagging (``NLTKTagger``)
- Improve noun phrase extraction (e.g. based on ``RFTagger`` output)
- Improve sentiment analysis (find suitable subjectivity scores)
- Improve functionality of ``Sentence()`` and ``Word()`` objects
- Adapt more tests from the main `TextBlob <http://textblob.readthedocs.org/en/dev/>`_ library (esp. for ``TextBlobDE()`` in ``test_blob.py``)

License
-------

MIT licensed. See the bundled `LICENSE <https://github.com/markuskiller/textblob-de/blob/master/LICENSE>`_ file for more details.

Thanks
------

Coded with Wing IDE (free open source developer license)

.. image:: https://wingware.com/images/wingware-logo-180x58.png
    :target: https://wingware.com/store/free
    :alt: Python IDE for Python - wingware.com

Changelog
---------

0.4.3 (03/01/2019)
++++++++++++++++++

- Added support for Python3.7 (``StopIteration --> return``) `Pull Request #18 <https://github.com/markuskiller/textblob-de/pull/18>`_ (thanks @andrewmfiorillo)
- Fixed tests for Google translation examples
- Updated tox/Travis-CI config files to include latest Python & pypy versions
- Updated sphinx_rtd_theme to version 0.4.2 to fix rendering problems on `RTD <http://textblob-de.readthedocs.org>`_
- Updated ``setup.py publish`` commands, ``Makefile`` & ``Manifest.in`` to new PyPI (using ``twine``)

0.4.2 (02/05/2015)
++++++++++++++++++

- Removed dependency on `NLTK <https://github.com/nltk/nltk/>`_, as it already is a `TextBlob <http://textblob.readthedocs.org/en/dev/>`_ dependency
- Temporary workaround for `NLTK Issue #824 <https://github.com/nltk/nltk/issues/824>`_ for tox/Travis-CI
- (update 13/01/2015) `NLTK Issue #824 <https://github.com/nltk/nltk/issues/824>`_ fixed, workaround removed
- Enabled ``pattern`` tagset conversion (``'penn'|'universal'|'stts'``) for ``PatternTagger``
- Added tests for tagset conversion
- Fixed test for Arabic translation example (Google translation has changed)
- Added tests for lemmatizer
- Bugfix: ``PatternAnalyzer`` no longer breaks on subsequent occurrences of the same ``(word, tag)`` pairs on Python3; see comments to `Pull Request #11 <https://github.com/markuskiller/textblob-de/pull/11>`_
- Bugfix/performance enhancement: Sentiment dictionary in ``PatternAnalyzer`` no longer reloaded for every sentence `Pull Request #11 <https://github.com/markuskiller/textblob-de/pull/11>`_ (thanks @Arttii)

0.4.1 (03/10/2014)
++++++++++++++++++

- Docs hosted on `RTD <http://textblob-de.readthedocs.org>`_
- Removed dependency on nltk's deprecated ``PunktWordTokenizer`` and replaced it with ``TreebankWordTokenizer``; see `nltk/nltk#746 (comment) <https://github.com/nltk/nltk/pull/746#issuecomment-57625756>`_ for details

0.4.0 (17/09/2014)
++++++++++++++++++

- Fixed `Issue #7 <https://github.com/markuskiller/textblob-de/issues/7>`_ (restore ``textblob>=0.9.0`` compatibility)
- Depend on ``nltk3``. Vendorized ``nltk`` was removed in ``textblob>=0.9.0``
- Fixed ``ImportError`` on Python2 (``unicodecsv``)

0.3.1 (29/08/2014)
++++++++++++++++++

- Improved ``PatternParserNPExtractor`` (fewer false positives in verb filter)
- Made sure that all keyword arguments with default ``None`` are checked with ``is not None``
- Fixed shortcut to ``_pattern.de`` in vendorized library
- Added ``Makefile`` to facilitate development process
- Added docs and API reference

0.3.0 (14/08/2014)
++++++++++++++++++

- Fixed `Issue #5 <https://github.com/markuskiller/textblob-de/issues/5>`_ (text + space + period)

0.2.9 (14/08/2014)
++++++++++++++++++

- Fixed tokenization in ``PatternParser`` (if initialized manually, punctuation was not always separated from words)
- Improved handling of empty strings (Issue #3) and of strings containing single punctuation marks (Issue #4) in ``PatternTagger`` and ``PatternParser``
- Added tests for empty strings and for strings containing single punctuation marks

0.2.8 (14/08/2014)
++++++++++++++++++

- Fixed `Issue #3 <https://github.com/markuskiller/textblob-de/issues/3>`_ (empty string)
- Fixed `Issue #4 <https://github.com/markuskiller/textblob-de/issues/4>`_ (space + punctuation)

0.2.7 (13/08/2014)
++++++++++++++++++

- Fixed `Issue #1 <https://github.com/markuskiller/textblob-de/issues/1>`_ lemmatization of strings containing a forward slash (``/``)
- Enhancement `Issue #2 <https://github.com/markuskiller/textblob-de/issues/2>`_ use the same rtype as ``textblob`` for sentiment detection.
- Fixed tokenization in ``PatternParserLemmatizer``

0.2.6 (04/08/2014)
++++++++++++++++++

- Fixed ``MANIFEST.in`` for package data in ``sdist``

0.2.5 (04/08/2014)
++++++++++++++++++

- ``sdist`` is non-functional as important files are missing due to a misconfiguration in ``MANIFEST.in`` - does not affect wheels
- Major internal refactoring (but no backwards-incompatible API changes) with the aim of restoring complete compatibility to original ``pattern>=2.6`` library on Python2
- Separation of ``textblob`` and ``pattern`` code
- On Python2 the vendorized version of ``pattern.text.de`` is only used if original is not installed (same as ``nltk``)
- Made ``pattern.de.pprint`` function and all parser keywords accessible to customise parser output
- Access to complete ``pattern.text.de`` API on Python2 and Python3 ``from textblob_de.packages import pattern_de as pd``
- ``tox`` passed on all major platforms (Win/Linux/OSX)

0.2.3 (26/07/2014)
++++++++++++++++++

- Lemmatizer: ``PatternParserLemmatizer()`` extracts lemmata from Parser output
- Improved polarity analysis through look-up of lemmatised word forms

0.2.2 (22/07/2014)
++++++++++++++++++

- Option: Include punctuation in ``tags``/``pos_tags`` properties (``b = TextBlobDE(text, tagger=PatternTagger(include_punc=True))``)
- Added ``BlobberDE()`` class initialized with German models
- ``TextBlobDE()``, ``Sentence()``, ``WordList()`` and ``Word()`` classes are now all initialized with German models
- Restored complete API compatibility with ``textblob.tokenizers`` module of the main `TextBlob <http://textblob.readthedocs.org/en/dev/>`_ library

0.2.1 (20/07/2014)
++++++++++++++++++

- Noun Phrase Extraction: ``PatternParserNPExtractor()`` extracts NPs from Parser output
- Refactored the way ``TextBlobDE()`` passes on arguments and keyword arguments to individual tools
- Backwards-incompatible: Deprecate ``parser_show_lemmata=True`` keyword in ``TextBlob()``. Use ``parser=PatternParser(lemmata=True)`` instead.

0.2.0 (18/07/2014)
++++++++++++++++++

- vastly improved tokenization (``NLTKPunktTokenizer`` and ``PatternTokenizer`` with tests)
- consistent use of specified tokenizer for all tools
- ``TextBlobDE`` with initialized default models for German
- Parsing (``PatternParser``) plus ``test_parsers.py``
- EXPERIMENTAL implementation of Polarity detection (``PatternAnalyzer``)
- first attempt at extracting German Polarity clues into ``de-sentiment.xml``
- tox tests passing for py26, py27, py33 and py34

0.1.3 (09/07/2014)
++++++++++++++++++

0.1.0 - 0.1.2 (09/07/2014)
++++++++++++++++++++++++++

- First release on github
- A number of experimental releases for testing purposes
- Adapted version badges, tests & travis-ci config
- Code adapted from sample extension `textblob-fr <https://github.com/sloria/textblob-fr>`_
- Language specific linguistic resources copied from `pattern-de <https://github.com/clips/pattern/tree/master/pattern/text/de>`_