:Name: snowballstemmer
:Version: 2.2.0
:Home page: https://github.com/snowballstem/snowball
:Summary: This package provides 29 stemmers for 28 languages generated from Snowball algorithms.
:Upload time: 2021-11-16 18:38:38
:Author: Snowball Developers
:License: BSD-3-Clause
:Keywords: stemmer

Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported.  We no longer actively support Python 2 as
the Python developers stopped supporting it at the start of 2020.  Snowball
2.1.0 was the last release to officially support Python 2.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*.  So a search for *connected*
would also find documents which only have the other forms.
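
As a small illustration of the conflation described above, the following
sketch stems the five example forms with the English stemmer.  It assumes the
``snowballstemmer`` package is installed; the variable names are only
illustrative.

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer("english")

   forms = ["connection", "connections", "connective", "connected", "connecting"]
   stems = stemmer.stemWords(forms)

   # Each form is printed next to its stem; a search index keyed on stems
   # would treat forms that share a stem as one term.
   for form, stem in zip(forms, stems):
       print(form, "->", stem)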

The stem is often a word itself, but not always, since text search systems
(the intended field of use) don't require it to be.  We also aim to conflate
words with the same meaning, rather than all words with a common linguistic
root (so *awe* and *awful* don't have the same stem).  Over-stemming is more
problematic than under-stemming, so we tend not to stem in cases that are hard
to resolve.  If you want to always reduce words to a root form, or to get a
root form which is itself a word, then Snowball's stemming algorithms likely
aren't the right answer.

How to use the library
----------------------

The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.

The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method, which stems a
single word, and a ``Stemmer.stemWords(words)`` method, which stems a list of
words.

.. code-block:: python

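   import snowballstemmer

   # Look up the English stemmer by its algorithm name.
   stemmer = snowballstemmer.stemmer('english')
   # stemWords() takes a list of words and returns a list of stems.
   print(stemmer.stemWords("We are the world".split()))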
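
The two module-level functions can be combined as in the sketch below, which
lists the available algorithm names and then stems a single word.
Lower-casing the input before stemming is a choice made for this sketch, on
the assumption that the stemmers are written for lower-case input.

.. code-block:: python

   import snowballstemmer

   # algorithms() lists the names that stemmer() accepts.
   names = snowballstemmer.algorithms()
   print(sorted(names))

   language = "english"
   if language in names:
       stemmer = snowballstemmer.stemmer(language)
       # stemWord() handles one word at a time; the input is lower-cased
       # first (an assumption of this sketch, not something the API does).
       print(stemmer.stemWord("Connections".lower()))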

Automatic Acceleration
----------------------

`PyStemmer <https://pypi.org/project/PyStemmer/>`_ is a wrapper module for
Snowball's ``libstemmer_c`` and should provide results 100% compatible with
**snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generated Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a ``PyStemmer``
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.
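
To see which implementation is in use on a given machine, a sketch such as
the following may help.  It assumes PyStemmer is importable under the module
name ``Stemmer``, and it only inspects the returned object rather than
relying on any documented switch.

.. code-block:: python

   import importlib.util

   import snowballstemmer

   # PyStemmer is assumed to install a top-level module named "Stemmer".
   has_pystemmer = importlib.util.find_spec("Stemmer") is not None

   stemmer = snowballstemmer.stemmer("english")
   print("PyStemmer importable:", has_pystemmer)
   print("Stemmer object type:", type(stemmer))

   # Calling code is the same whichever implementation was returned.
   print(stemmer.stemWord("connected"))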

Benchmark
~~~~~~~~~

This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages).  It's
not a realistic test of normal use as a real application would do much more
than just stemming.  It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.

* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s

For reference, the equivalent test for C runs in 9 seconds.

These results are for Snowball 2.0.0.  They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve (for
a much older test over a different set of stemmers using Python 2.7,
**snowballstemmer** was 30 times slower than **PyStemmer**, or 9 times slower
with **PyPy**).

The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.
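
For a rough timing on your own machine, something like the sketch below can
be run with and without PyStemmer installed.  The word list is made up for
illustration and is far smaller than the sample vocabularies used for the
figures above, so the numbers are not comparable with them.

.. code-block:: python

   import time

   import snowballstemmer

   stemmer = snowballstemmer.stemmer("english")

   # A small, artificial workload: four words repeated many times.
   words = ["connection", "connected", "stemming", "libraries"] * 5000

   start = time.perf_counter()
   stems = stemmer.stemWords(words)
   elapsed = time.perf_counter() - start

   print("stemmed %d words in %.3fs" % (len(words), elapsed))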

The TestApp example
-------------------

The ``testapp.py`` example program allows you to run any of the stemmers
on a sample vocabulary.

Usage::

   testapp.py <algorithm> "sentences ... "

.. code-block:: bash

   $ python testapp.py English "sentences... "
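
If ``testapp.py`` is not to hand, a stand-in with the same command-line shape
can be sketched as below.  This is not the shipped ``testapp.py``; the file
name ``stem_sentence.py`` and the function ``stem_sentence`` are hypothetical,
and lower-casing the algorithm name and the input is a choice made for this
sketch.

.. code-block:: python

   import sys

   import snowballstemmer


   def stem_sentence(algorithm, sentence):
       """Stem each whitespace-separated word of ``sentence``."""
       # The algorithm name is lower-cased so that "English" and "english"
       # are treated alike (an assumption of this sketch).
       stemmer = snowballstemmer.stemmer(algorithm.lower())
       return stemmer.stemWords(sentence.lower().split())


   if __name__ == "__main__":
       if len(sys.argv) == 3:
           print(" ".join(stem_sentence(sys.argv[1], sys.argv[2])))
       else:
           print('usage: stem_sentence.py <algorithm> "sentences ... "')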
