maskouk-pysqlite

maskouk: Arabic Dictionary for Collocations - python + sqlite

Maskouk-pysqlite مكتبة مسكوك
==============================

Arabic collocations library and data, with a Python + SQLite API. |maskouk logo|

|downloads| |downloads2|

Developers: Taha Zerrouki, http://tahadz.com (taha dot zerrouki at gmail dot com)

.. list-table::
   :header-rows: 1

   * - Features
     - Value
   * - Authors
     - `Authors.md <https://github.com/linuxscout/maskouk-pysqlite/master/AUTHORS.md>`__
   * - Release
     - 0.1
   * - License
     - `GPL <https://github.com/linuxscout/maskouk-pysqlite/master/LICENSE>`__
   * - Tracker
     - `linuxscout/maskouk/Issues <https://github.com/linuxscout/maskouk-pysqlite/issues>`__
   * - Website
     - `http://maskouk.sourceforge.net <http://maskouk-pysqlite.sourceforge.net>`__
   * - Source
     - `Github <http://github.com/linuxscout/maskouk-pysqlite>`__
   * - Download
     - `sourceforge <http://maskouk.sourceforge.net>`__
   * - Feedback
     - `Comments <https://github.com/linuxscout/maskouk-pysqlite/>`__
   * - Accounts
     - @Twitter, @Sourceforge

Description
-----------

Maskouk is a database of Arabic collocations extracted from Wikipedia.

The data comes from the Arabic Wikipedia database dated 2011-Jun-21.

Install
~~~~~~~


.. code:: shell

    pip install maskouk-pysqlite

Usage
~~~~~

import
^^^^^^

.. code:: python

    >>> import pyarabic.araby as araby
    >>> import maskouk.collocations as msk
    >>> mydict = msk.CollocationClass()

Test if collocation exists in database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    >>> wlist = [u"كرة", u"القدم"]
    >>> # test if collocation exists
    >>> results = mydict.is_collocated(wlist)
    >>> print("input:", wlist)
    input: ['كرة', 'القدم']
    >>> print("output:", results)
    output: كرة القدم
    >>> wlist = [u"شمس", u"النهار"]
    >>> results = mydict.is_collocated(wlist)
    >>> print("input:", wlist)
    input: ['شمس', 'النهار']
    >>> print("output:", results)
    output: False

Test if a word has collocations in database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    >>> # get all collocations for a specific word
    >>> word1 = u"كرة"
    >>> results = mydict.is_collocated_word(word1)
    >>> print("input:", word1)
    input: كرة
    >>> print("output:", results)
    output: {'القدم': 'كُرَة الْقَدَمِ'}
    >>> word = u"بيت"
    >>> # get all collocations for a specific word
    >>> results = mydict.is_collocated_word(word)
    >>> print("input:", word)
    input: بيت
    >>> print("output:", results)
    output: {'العدة': 'بَيْت الْعِدَّةِ', 'المستأجر': 'بَيْت الْمُسْتَأْجِرِ', 'المشتري': 'بَيْتِ الْمُشْتَرِي', 'الرجل': 'بَيْت الرَّجُلِ', 'البناء': 'بَيْت الْبِنَاءِ', 'الزوج': 'بَيْت الزَّوْجِ', 'المال': 'بيت المال', 'المقدس': 'بَيْت الْمَقْدِسِ', 'البائع': 'بَيْت الْبَائِعِ', 'الخلاء': 'بَيْت الْخَلَاءِ', 'الأب': 'بَيْت الْأَبِ', 'الله': 'بَيْت اللّهِ'}
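
The two lookups above behave like queries on a nested mapping from a first word to its known second words. The following is a minimal sketch of that idea, not the library's implementation: ``TOY_INDEX``, ``word_lookup`` and ``pair_lookup`` are hypothetical names, and the entries are copied from the outputs above. Unlike ``is_collocated``, this sketch returns the vocalized form.

.. code:: python

    # Toy index: first word -> {second word: vocalized collocation}.
    # Entries are copied from the example outputs; the real data lives in SQLite.
    TOY_INDEX = {
        "كرة": {"القدم": "كُرَة الْقَدَمِ"},
        "بيت": {"المال": "بيت المال", "المقدس": "بَيْت الْمَقْدِسِ"},
    }


    def word_lookup(first):
        """Return every known second word for `first` (empty dict if none)."""
        return TOY_INDEX.get(first, {})


    def pair_lookup(words):
        """Return the stored collocation for a two-word list, or False."""
        if len(words) != 2:
            return False
        first, second = words
        return word_lookup(first).get(second, False)


    print(pair_lookup(["كرة", "القدم"]))
    print(pair_lookup(["شمس", "النهار"]))
    print(sorted(word_lookup("بيت")))

Returning ``False`` for an unknown pair mirrors the behaviour shown for ``is_collocated``; an empty dict for an unknown word is a choice made for this sketch only.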

Detect collocation in a phrase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

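The result can be presented either as a list in which each collocation is merged into one token (``ngramfinder``) or as a pair of lists holding the vocalized words and their tags (``lookup``).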

.. code:: python

    >>> # detect collocations in a phrase, merged form
    >>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
    >>> wordlist = araby.tokenize(text)
    >>> results = mydict.ngramfinder(2, wordlist)
    >>> print("input:", text)
    input: لعبنا مباراة كرة القدم في بيت المقدس
    >>> print("output:", results)
    output: ['لعبنا', 'مباراة', 'كرة القدم', 'في', 'بيت المقدس']
    >>> # detect collocations in a phrase, tagged form
    >>> text = u"لعبنا مباراة كرة القدم في بيت المقدس"
    >>> wordlist = araby.tokenize(text)
    >>> results = mydict.lookup(wordlist)
    >>> print("input:", text)
    input: لعبنا مباراة كرة القدم في بيت المقدس
    >>> print("output:", results)
    output: (['لعبنا', 'مباراة', 'كُرَة', 'الْقَدَمِ', 'في', 'بَيْت', 'الْمَقْدِسِ'], ['CO', 'CO', 'CB', 'CI', 'CO', 'CB', 'CI'])
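
To make the two output forms concrete, here is a minimal sketch of a left-to-right bigram scan. It is an illustration under assumptions, not the library's code: ``KNOWN_BIGRAMS``, ``merge_bigrams`` and ``tag_bigrams`` are hypothetical names, and the reading of the tags (``CB`` for the first word of a collocation, ``CI`` for the second, ``CO`` for any other word) is inferred from the output above.

.. code:: python

    # Toy set of known bigrams (hypothetical stand-in for the SQLite database).
    KNOWN_BIGRAMS = {("كرة", "القدم"), ("بيت", "المقدس")}


    def merge_bigrams(words):
        """Join each known adjacent pair into one token, scanning left to right."""
        merged, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) in KNOWN_BIGRAMS:
                merged.append(words[i] + " " + words[i + 1])
                i += 2  # skip both words of the collocation
            else:
                merged.append(words[i])
                i += 1
        return merged


    def tag_bigrams(words):
        """Tag each word: CB = first of a pair, CI = second, CO = other."""
        tags, i = [], 0
        while i < len(words):
            if i + 1 < len(words) and (words[i], words[i + 1]) in KNOWN_BIGRAMS:
                tags += ["CB", "CI"]
                i += 2
            else:
                tags.append("CO")
                i += 1
        return tags


    words = "لعبنا مباراة كرة القدم في بيت المقدس".split()
    print(merge_bigrams(words))
    print(tag_bigrams(words))

The sketch splits on whitespace instead of calling ``araby.tokenize`` so that it runs without extra packages, and it does not vocalize the words as ``lookup`` does.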

Detect long collocations in a phrase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some collocations are too long to be stored in a bigram database, for
example "بسم الله الرحمن الرحيم", "السلام عليكم ورحمة الله وبركاته" and
"أهلا وسهلا بكم".

.. code:: python

    >>> # get long collocations
    >>> text = u" قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت"
    >>> results = mydict.lookup4long_collocations(text)
    >>> print("input:", text)
    input:  قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت
    >>> print("output:", results)
    output:  قلت لهم السّلامُ عَلَيكُمْ وَرَحْمَةُ اللهِ تَعَالَى وبركاته ثم رجعت
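
One simple way to picture this step is a phrase table applied to the raw text. The sketch below assumes plain substring replacement, which may differ from what ``lookup4long_collocations`` does internally; ``LONG_COLLOCATIONS`` and ``vocalize_long_collocations`` are hypothetical names, and the single entry is taken from the example output above.

.. code:: python

    # Hypothetical table of long collocations: unvocalized -> vocalized text.
    # The single entry is taken from the example output above.
    LONG_COLLOCATIONS = {
        "السلام عليكم ورحمة الله تعالى": "السّلامُ عَلَيكُمْ وَرَحْمَةُ اللهِ تَعَالَى",
    }


    def vocalize_long_collocations(text):
        """Replace every known long collocation in `text` by its vocalized form."""
        # Try longer phrases first so a short phrase never pre-empts a longer one.
        for plain in sorted(LONG_COLLOCATIONS, key=len, reverse=True):
            text = text.replace(plain, LONG_COLLOCATIONS[plain])
        return text


    sample = "قلت لهم السلام عليكم ورحمة الله تعالى وبركاته ثم رجعت"
    print(vocalize_long_collocations(sample))

Text that contains no known phrase is returned unchanged.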

Detect candidate collocations in phrase
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A candidate collocation is one that does not exist in the database. This
feature extracts possible collocations based on rules. It returns a rule
code, with 100 as the default (no collocation).

.. code:: python

    >>> text = u"ظهر رئيس الوزراء السيد عبد الملك بن عامر ومعه أمير دولة غرناطة ونهر النيل انطلاق السباق"
    >>> wordlist = araby.tokenize(text)
    >>> previous = "__"
    >>> for wrd in wordlist:
    ...     wlist = [previous, wrd]
    ...     # the keyword is spelled "lenght", as in the original example
    ...     results = mydict.is_possible_collocation(wlist, lenght=2)
    ...     print("input:", wlist)
    ...     print("output:", results)
    ...     previous = wrd
    ...
    input: ['__', 'ظهر']
    output: 100
    input: ['ظهر', 'رئيس']
    output: 100
    input: ['رئيس', 'الوزراء']
    output: 100
    input: ['الوزراء', 'السيد']
    output: 20
    input: ['السيد', 'عبد']
    output: 100
    input: ['عبد', 'الملك']
    output: 15
    input: ['الملك', 'بن']
    output: 100
    input: ['بن', 'عامر']
    output: 15
    input: ['عامر', 'ومعه']
    output: 100
    input: ['ومعه', 'أمير']
    output: 100
    input: ['أمير', 'دولة']
    output: 100
    input: ['دولة', 'غرناطة']
    output: 10
    input: ['غرناطة', 'ونهر']
    output: 100
    input: ['ونهر', 'النيل']
    output: 100
    input: ['النيل', 'انطلاق']
    output: 100
    input: ['انطلاق', 'السباق']
    output: 100
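
The loop above can be wrapped in a small helper that keeps only the flagged pairs. This is a sketch under assumptions: ``padded_bigrams``, ``candidates`` and ``stub_scorer`` are hypothetical names, and the stub only replays the rule codes printed above so that the example runs without the database. With the real library you would pass a function that calls ``mydict.is_possible_collocation`` instead.

.. code:: python

    # Rule codes copied from the example output above; other pairs get 100.
    EXAMPLE_CODES = {
        ("الوزراء", "السيد"): 20,
        ("عبد", "الملك"): 15,
        ("بن", "عامر"): 15,
        ("دولة", "غرناطة"): 10,
    }
    NO_COLLOCATION = 100


    def padded_bigrams(words, pad="__"):
        """Yield [previous, current] pairs, starting with a padding token."""
        previous = pad
        for word in words:
            yield [previous, word]
            previous = word


    def stub_scorer(pair):
        """Hypothetical stand-in for mydict.is_possible_collocation."""
        return EXAMPLE_CODES.get(tuple(pair), NO_COLLOCATION)


    def candidates(words, scorer):
        """Return (pair, code) for every pair the scorer flags as a candidate."""
        found = []
        for pair in padded_bigrams(words):
            code = scorer(pair)
            if code != NO_COLLOCATION:
                found.append((pair, code))
        return found


    words = "أمير دولة غرناطة".split()
    print(candidates(words, stub_scorer))

The meaning of the individual codes (10, 15, 20) is not documented here, so the sketch treats them as opaque values.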

Requirements
^^^^^^^^^^^^

::

    1. pyarabic
    2. sqlite

Data Structure:
---------------

Collocations database

.. code:: sql

    CREATE TABLE "collocations" (
        "id" INTEGER PRIMARY KEY NOT NULL,
        "vocalized" VARCHAR,
        "unvocalized" VARCHAR,
        "rule" VARCHAR,
        "category" VARCHAR,
        "note" VARCHAR,
        "first" VARCHAR,
        "second" VARCHAR
    );
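
As a hedged illustration of how this table can be queried, the sketch below builds it in an in-memory database with Python's standard ``sqlite3`` module and looks up a pair by its ``first`` and ``second`` columns. The helper ``find_pair`` is a hypothetical name, the inserted row reuses values from the usage examples, and nothing here reflects how the packaged database file is opened.

.. code:: python

    import sqlite3

    SCHEMA = """
    CREATE TABLE "collocations" (
        "id" INTEGER PRIMARY KEY NOT NULL,
        "vocalized" VARCHAR,
        "unvocalized" VARCHAR,
        "rule" VARCHAR,
        "category" VARCHAR,
        "note" VARCHAR,
        "first" VARCHAR,
        "second" VARCHAR
    );
    """

    conn = sqlite3.connect(":memory:")  # throwaway database, not the packaged file
    conn.execute(SCHEMA)
    conn.execute(
        'INSERT INTO collocations ("vocalized", "unvocalized", "first", "second") '
        "VALUES (?, ?, ?, ?)",
        ("كُرَة الْقَدَمِ", "كرة القدم", "كرة", "القدم"),
    )


    def find_pair(connection, first, second):
        """Return the vocalized collocation for (first, second), or None."""
        row = connection.execute(
            'SELECT "vocalized" FROM collocations '
            'WHERE "first" = ? AND "second" = ?',
            (first, second),
        ).fetchone()
        return row[0] if row else None


    print(find_pair(conn, "كرة", "القدم"))
    print(find_pair(conn, "شمس", "النهار"))

Parameter placeholders (``?``) keep the Arabic strings out of the SQL text, which avoids quoting problems.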

CSV Structure:

  • id: unique identifier in the database
  • vocalized: vocalized collocation
  • unvocalized: unvocalized collocation
  • rule: the extraction rule number
  • category: collocation category
  • note:
  • first: first word
  • second: second word
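
A CSV export with these columns can be read with Python's standard ``csv`` module. The sketch below is only an illustration: the tab delimiter, the missing header row, the sample row and the name ``load_index`` are all assumptions, not facts about the distributed data files.

.. code:: python

    import csv
    import io

    # Column order follows the field list above; the tab delimiter and the
    # absence of a header row are assumptions made for this sketch.
    FIELDS = ["id", "vocalized", "unvocalized", "rule",
              "category", "note", "first", "second"]

    # One made-up row with empty rule, category and note fields.
    SAMPLE = "1\tكُرَة الْقَدَمِ\tكرة القدم\t\t\t\tكرة\tالقدم\n"


    def load_index(handle):
        """Build {first: {second: vocalized}} from the CSV rows in `handle`."""
        index = {}
        reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
        for row in reader:
            index.setdefault(row["first"], {})[row["second"]] = row["vocalized"]
        return index


    index = load_index(io.StringIO(SAMPLE))
    print(index)

The resulting mapping has the same shape as the dictionary returned by ``is_collocated_word`` in the usage examples.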

.. |maskouk logo| image:: doc/maskouk_header.png
.. |downloads| image:: https://img.shields.io/sourceforge/dt/maskouk.svg
   :target: http://sourceforge.org/projects/maskouk
.. |downloads2| image:: https://img.shields.io/sourceforge/dm/maskouk.svg
   :target: http://sourceforge.org/projects/maskouk
