pystempel

:Name: pystempel
:Version: 2.0.0
:Summary: Polish stemmer.
:Upload time: 2024-07-10 17:16:07
:Author: Maciej Gawinecki
:Requires Python: <4.0,>=3.8
:License: See documentation
:Keywords: nlp, natural language processing, computational linguistics, stemming, linguistics, language, natural language, text analytics

Stempel Stemmer
===============

.. image:: https://badge.fury.io/py/pystempel.svg
    :target: https://badge.fury.io/py/pystempel

Python port of Stempel, an algorithmic stemmer for the Polish language, originally written in Java.

The original stemmer was implemented as part of the `Egothor Project`_, carried over virtually unchanged
into the `Stempel Stemmer Java library`_ by Andrzej Białecki, and later included in `Apache Lucene`_,
a free and open-source search engine library. It is also used by the `Elastic Search`_ search engine.

.. _Egothor Project: https://www.egothor.org/product/egothor2/
.. _Stempel Stemmer Java library: http://www.getopt.org/stempel/index.html
.. _Apache Lucene: https://lucene.apache.org/core/3_1_0/api/contrib-stempel/index.html
.. _Elastic Search: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-stempel.html

This package also includes high-quality stemming tables for Polish: the original one, pretrained by
Andrzej Białecki on 20,000 training sets, and a new one, pretrained by me on 259,080 training sets
from the `Polimorf dictionary`_.

.. _Polimorf dictionary: https://clarin-pl.eu/dspace/handle/11321/577

The port does not include code for compiling stemming tables.

.. _sjp.pl: https://sjp.pl/slownik/en/

How to use
----------

Install in your local environment:

.. code:: console

   pip install pystempel

Use in your code:

.. code:: python

   from pystempel import Stemmer

Choose the original (default) version of the stemmer:

.. code:: python

   stemmer = Stemmer.default()

or the version with the new stemming table, pretrained on training sets from the Polimorf dictionary:

.. code:: python

   stemmer = Stemmer.polimorf()

Stem:

.. code:: python

  >>> for word in ['książka', 'książki', 'książkami', 'książkowa', 'książkowymi']:
  ...   print(stemmer(word))
  ...
  książek
  książek
  książek
  książkowy
  książkowy
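
Because the stemmer is a plain callable, it can be passed to ordinary Python helpers. The sketch
below groups words by their stem. The helper ``group_by_stem`` and the stand-in dictionary are
hypothetical names used only for illustration; the stand-in replays the outputs shown above so that
the snippet runs without loading a stemming table.

.. code:: python

   from collections import defaultdict
   from typing import Callable, Dict, Iterable, List


   def group_by_stem(words: Iterable[str], stem: Callable[[str], str]) -> Dict[str, List[str]]:
       """Map each stem to the list of input words that produced it."""
       groups: Dict[str, List[str]] = defaultdict(list)
       for word in words:
           groups[stem(word)].append(word)
       return dict(groups)


   # Stand-in for a real stemmer (assumption for illustration): it replays the
   # outputs from the example above. With pystempel installed, pass
   # Stemmer.default() or Stemmer.polimorf() as the ``stem`` argument instead.
   EXAMPLE_STEMS = {
       "książka": "książek",
       "książki": "książek",
       "książkami": "książek",
       "książkowa": "książkowy",
       "książkowymi": "książkowy",
   }

   groups = group_by_stem(EXAMPLE_STEMS, EXAMPLE_STEMS.__getitem__)
   for stem, members in groups.items():
       print(stem, members)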

Choosing stemming table
-----------------------

The two stemming tables perform quite differently. With the original (default) table the stemmer
tends to *understem*: multiple forms of the same lemma get different stems more often (63%) than
with the Polimorf-based table (13%). On the other hand, the Polimorf-based table has a larger file
footprint (2.2MB vs. 0.3MB) and takes longer to load (7.5 seconds vs. 1.3 seconds), though loading
happens only once, when a stemmer is created. Stemming itself is also slightly faster with the
original table: ~60,000 vs. ~51,000 words per second.
See `Evaluation Jupyter Notebook`_ for the detailed evaluation results.
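
To make the notion of understemming concrete, here is a minimal sketch of one way to measure it:
the fraction of lemmas whose inflected forms end up with more than one distinct stem. This is an
assumption about the metric, and the evaluation notebook may define it differently. The function
name, the toy data, and the toy truncating stemmer are all hypothetical.

.. code:: python

   from typing import Callable, Dict, Iterable


   def understemming_rate(
       forms_by_lemma: Dict[str, Iterable[str]],
       stem: Callable[[str], str],
   ) -> float:
       """Fraction of lemmas whose forms receive more than one distinct stem."""
       if not forms_by_lemma:
           return 0.0
       split_lemmas = sum(
           1
           for forms in forms_by_lemma.values()
           if len({stem(form) for form in forms}) > 1
       )
       return split_lemmas / len(forms_by_lemma)


   # Toy data and a toy stemmer (keeps the first four characters), for
   # illustration only. Pass a real pystempel stemmer as ``stem`` in practice.
   toy_forms = {
       "książka": ["książka", "książki", "książkami"],
       "dom": ["dom", "domu", "domem"],
   }


   def toy_stem(word: str) -> str:
       return word[:4]


   rate = understemming_rate(toy_forms, toy_stem)
   print(f"{rate:.0%} of lemmas are split across several stems")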

.. _Evaluation Jupyter Notebook: http://htmlpreview.github.io/?https://github.com/dzieciou/pystempel/blob/master/Evaluation.html

Please also note that the two stemming tables are licensed differently, and so is the
data generated with each of them. See the "Licensing" section for the details.

Choosing between port and wrapper
---------------------------------

If you work on an NLP project in Python, you can choose between a Python port and a Python wrapper.
A Python port is what pystempel aims to be: a translation of the Java implementation into Python.
A Python wrapper is what I used in the `tests`_: Python functions that call the original Java
implementation of the stemmer. You can read more about wrappers and ports in this
`Stackoverflow comparison post`_. Here, I compare both approaches to help you decide:

* **Same accuracy**. I verified the Python port by comparing its output
  with the output of the original Java implementation for 331,224 words from the Free Polish dictionary
  (`sjp.pl`_): the two return the same output for 100% of the words.
* **Similar performance**. On that dataset, both versions performed comparably: the Python port
  completed stemming in 4.4 seconds and the Python wrapper in 5 seconds (Intel Core
  i5-6000 3.30 GHz, 16GB RAM, Windows 10, OpenJDK).
* **Different setup**. The Python wrapper additionally requires installing Cython and pyjnius.
  It also makes `debugging harder`_ (switching between two programming languages).
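
The accuracy check described above boils down to running two stemmers over the same word list and
collecting every disagreement. The following is a minimal sketch of that idea, not the project's
actual test code; ``find_mismatches`` and both toy stemmers are hypothetical stand-ins for the port
and the Java-backed wrapper.

.. code:: python

   from typing import Callable, Iterable, List, Optional, Tuple

   StemFn = Callable[[str], Optional[str]]


   def find_mismatches(
       words: Iterable[str],
       port: StemFn,
       reference: StemFn,
   ) -> List[Tuple[str, Optional[str], Optional[str]]]:
       """Return (word, port_stem, reference_stem) for every disagreement."""
       mismatches = []
       for word in words:
           port_stem, reference_stem = port(word), reference(word)
           if port_stem != reference_stem:
               mismatches.append((word, port_stem, reference_stem))
       return mismatches


   # Toy stand-ins for illustration only, not the real stemmers.
   def toy_port(word: str) -> str:
       return word[:4]


   def toy_reference(word: str) -> str:
       return "dom" if word == "domem" else word[:4]


   mismatches = find_mismatches(["książka", "domu", "domem"], toy_port, toy_reference)
   print(mismatches)

An empty result means the two implementations agree on every word in the list.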

.. _Stackoverflow comparison post: https://stackoverflow.com/questions/10113218/how-to-decide-when-to-wrap-port-write-from-scratch
.. _debugging harder: https://stackoverflow.com/questions/6970359/find-an-efficient-way-to-integrate-different-language-libraries-into-one-project
.. _tests: tests/

Options
-------

To disable the progress bar shown while stemming tables are loading, set the environment variable ``DISABLE_TQDM=True``.
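
The variable can also be set from Python. The sketch below sets it before pystempel is imported;
doing it that early is a precaution, since this document does not say at which point the variable
is read. The helper name ``disable_progress_bar`` is hypothetical.

.. code:: python

   import os


   def disable_progress_bar() -> None:
       """Ask pystempel not to show a progress bar while loading stemming tables."""
       os.environ["DISABLE_TQDM"] = "True"


   # Call this first, then import and create the stemmer as usual:
   #     from pystempel import Stemmer
   #     stemmer = Stemmer.default()
   disable_progress_bar()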

Development setup
-----------------

To set up a development environment, you need `poetry`_ 1.4.0 or higher installed.

.. _poetry: https://python-poetry.org/

.. code:: console

    poetry install
    poetry shell
    pre-commit install

To run the tests, download the original stemmer in Java:

.. code:: console

    curl https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-stempel/8.1.1/lucene-analyzers-stempel-8.1.1.jar > stempel-8.1.1.jar

and run:

.. code:: console

    poetry run pytest

To run the performance benchmark:

.. code:: console

    PYTHONPATH=$PWD poetry run python tests/test_benchmark.py
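
For a quick throughput check on your own word list, a rough measurement along the following lines
may be enough. This is a sketch under stated assumptions, not a copy of
``tests/test_benchmark.py``; the function name ``words_per_second`` and the toy stemmer are
hypothetical.

.. code:: python

   import time
   from typing import Callable, Sequence


   def words_per_second(
       words: Sequence[str],
       stem: Callable[[str], object],
       repeats: int = 3,
   ) -> float:
       """Stem all words ``repeats`` times and report the best observed rate."""
       best = float("inf")
       for _ in range(repeats):
           start = time.perf_counter()
           for word in words:
               stem(word)
           best = min(best, time.perf_counter() - start)
       # Guard against a zero reading on very short inputs.
       return len(words) / max(best, 1e-9)


   # Toy stemmer for illustration; pass a real pystempel stemmer in practice.
   sample = ["książka", "książki", "książkami"] * 1000
   rate = words_per_second(sample, lambda word: word[:4])
   print(f"~{rate:,.0f} words per second")

Taking the best of several runs reduces the influence of one-off delays such as other processes
competing for the CPU.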

Licensing
---------

* **Code**. Most of the code is covered by the `Egothor`_ Open Source License, an Apache-style license.
  The `Apache License 2.0`_ covers the rest of the code. The preamble of each file should make
  clear which license applies.

* **Data**.

  * The original pretrained stemming table is covered by the `Apache License 2.0`_.

  * The new pretrained stemming table is covered by the `2-Clause BSD License`_, like the
    `Polimorf dictionary copy`_ it was derived from. The copyright owner of both the stemming table
    and the dictionary is the `Institute of Computer Science of the Polish Academy of Sciences`_ (IPI PAN).

  * The Polish dictionary used by the unit tests comes from `sjp.pl`_ and is covered by the
    `Apache License 2.0`_ as well.

.. _Egothor: https://www.egothor.org/product/egothor2/
.. _Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
.. _Polimorf dictionary copy: dicts/
.. _2-Clause BSD License: data/polimorf/LICENSE.txt
.. _Institute of Computer Science of the Polish Academy of Sciences: https://ipipan.waw.pl/en/

Alternatives
------------

* `Estem`_ is an Erlang wrapper (not a port) for the Stempel stemmer.
* `pl_stemmer`_ is a Python stemmer based on Porter's Algorithm.
* `polish-stem`_ is a Python stemmer using Finite State Transducers.

.. _Estem: https://github.com/arcusfelis/estem
.. _pl_stemmer: https://github.com/Tutanchamon/pl_stemmer
.. _polish-stem: https://github.com/eugeniashurko/polish-stem

Release notes
-------------

2.0.0: Backward-incompatible API changes

- Refactor ``stempel`` to ``pystempel`` package (#26)
- Refactor ``StempelStemmer`` to ``Stemmer`` and ``Stemmer.stem`` to callable (#26)

1.2.0: Stable version


Maciej Gawinecki