nskipgrams
==========
.. image:: https://badge.fury.io/py/nskipgrams.svg
:target: https://pypi.python.org/pypi/nskipgrams
:alt: PyPI version
.. image:: https://img.shields.io/pypi/pyversions/nskipgrams.svg
:target: https://pypi.python.org/pypi/nskipgrams
:alt: Supported Python versions
.. image:: https://circleci.com/gh/jacksonllee/nskipgrams.svg?style=shield
:target: https://circleci.com/gh/jacksonllee/nskipgrams
:alt: CircleCI Build
.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.4002095.svg
:target: https://doi.org/10.5281/zenodo.4002095
:alt: DOI
``nskipgrams`` is a lightweight Python package to work with ngrams and skipgrams.
Fields of study using ngrams and skipgrams from sequential data, especially
computational linguistics and natural language processing, will find
this package helpful.
Highlights:
* Simple: Store, access, and count ngrams and skipgrams -- that's it!
* Memory-efficient: Tries are used for internal storage.
* Hassle-free: No dependencies. Written in pure Python. Today is a great day.
Download and Install
--------------------
To download and install the most recent version::
$ pip install --upgrade nskipgrams
Usage
-----
The following are defined:
- Ngrams
- The class ``Ngrams`` handles a collection of ngrams.
- The function ``ngrams_from_seq`` yields ngrams for a given sequence.
- Skipgrams
- The class ``Skipgrams`` handles a collection of skipgrams.
- The function ``skipgrams_from_seq`` yields skipgrams for a given sequence.
Getting Ngrams from a Sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you simply need ngrams from a sequence, ``ngrams_from_seq`` is what you're looking for:
.. code-block:: python
>>> from nskipgrams import ngrams_from_seq
>>> for ngram in ngrams_from_seq("abcdef", n=2):
... print(ngram)
('a', 'b')
('b', 'c')
('c', 'd')
('d', 'e')
('e', 'f')
Initializing an Ngram Collection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python
>>> from nskipgrams import Ngrams
>>> char_ngrams = Ngrams(n=3) # handles unigrams, bigrams, and trigrams
Adding Ngrams
^^^^^^^^^^^^^
.. code-block:: python
>>> char_ngrams.add_from_seq("my cats")
>>> char_ngrams.add_from_seq("your cat", count=2)
Here, a sequence is anything that can be iterated over,
and the corresponding ngrams are extracted from the individual elements
off of the sequence.
For example, if the sequence is a text string like ``"my cats"`` above,
then the ngrams are character-based (hence the chosen variable name ``char_ngrams``).
To add a single ngram:
.. code-block:: python
>>> char_ngrams.add(("y", "o", "u"))
As a best practice, it is recommended that an ngram be represented as a ``tuple``
regardless of what the individual elements are,
e.g., ``("y", "o", "u")`` for character-based ngrams.
As output examples show below, the ``tuple`` data type is also what this package
uses to represent ngrams.
Accessing Ngrams
^^^^^^^^^^^^^^^^
.. code-block:: python
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=1): # unigrams
... print(ngram, count)
...
('m',), 1
('y',), 3
(' ',), 3
('c',), 3
('a',), 3
('t',), 3
('s',), 1
('o',), 2
('u',), 2
('r',), 2
>>>
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=2): # bigrams
... print(ngram, count)
...
('m', 'y'), 1
('y', ' '), 1
('y', 'o'), 2
(' ', 'c'), 3
('c', 'a'), 3
('a', 't'), 3
('t', 's'), 1
('o', 'u'), 2
('u', 'r'), 2
('r', ' '), 2
>>>
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=3): # trigrams
... print(ngram, count)
...
('m', 'y', ' '), 1
('y', ' ', 'c'), 1
('y', 'o', 'u'), 3
(' ', 'c', 'a'), 3
('c', 'a', 't'), 3
('a', 't', 's'), 1
('o', 'u', 'r'), 2
('u', 'r', ' '), 2
('r', ' ', 'c'), 2
Accessing Ngrams with a Specific Prefix
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python
>>> for ngram, count in char_ngrams.ngrams_with_counts(n=3, prefix=("y",)):
... print(ngram, count)
...
('y', ' ', 'c'), 1
('y', 'o', 'u'), 3
Accessing the Count of a Specific Ngram
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python
>>> char_ngrams.count(("c", "a", "t"))
3
Checking Membership
^^^^^^^^^^^^^^^^^^^
To check if an ngram has an exact match in the collection so far:
.. code-block:: python
>>> ("c", "a", "t") in char_ngrams
True
Combining Collections of Ngrams
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To combine collections of ngrams (e.g., when you process data sources in parallel
and have multiple ``Ngrams`` objects):
.. code-block:: python
>>> char_ngrams1 = Ngrams(n=2)
>>> char_ngrams1.add_from_seq("my cat")
>>> set(char_ngrams1.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
(('a', 't'), 1),
(('c', 'a'), 1),
(('m', 'y'), 1),
(('y', ' '), 1)}
>>>
>>> char_ngrams2 = Ngrams(n=2)
>>> char_ngrams2.add_from_seq("your cats")
>>> set(char_ngrams2.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
(('a', 't'), 1),
(('c', 'a'), 1),
(('o', 'u'), 1),
(('r', ' '), 1),
(('t', 's'), 1),
(('u', 'r'), 1),
(('y', 'o'), 1)}
>>>
>>> char_ngrams3 = Ngrams(n=2)
>>> char_ngrams3.add_from_seq("her cats")
>>> set(char_ngrams3.ngrams_with_counts(n=2))
{((' ', 'c'), 1),
(('a', 't'), 1),
(('c', 'a'), 1),
(('e', 'r'), 1),
(('h', 'e'), 1),
(('r', ' '), 1),
(('t', 's'), 1)}
>>>
>>> char_ngrams1.combine(char_ngrams2, char_ngrams3) # `combine` takes as many Ngrams objects as desired
>>> set(char_ngrams1.ngrams_with_counts(n=2))
{((' ', 'c'), 3),
(('a', 't'), 3),
(('c', 'a'), 3),
(('e', 'r'), 1),
(('h', 'e'), 1),
(('m', 'y'), 1),
(('o', 'u'), 1),
(('r', ' '), 2),
(('t', 's'), 2),
(('u', 'r'), 1),
(('y', ' '), 1),
(('y', 'o'), 1)}
If you don't want to mutate any of the ``Ngrams`` instances
(the ``combine`` method works in-place and mutates ``these_ngrams``
when ``these_ngrams.combine`` is called),
then you can create an empty ngram collection and combine into it
all of your ngrams:
.. code-block:: python
>>> collections = [char_ngrams1, char_ngrams2, char_ngrams3]
>>> all_ngrams = Ngrams(n=2) # A new, empty collection of ngrams
>>> all_ngrams.combine(*collections)
Any "Sequences" and their Corresponding "Ngrams" Work
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
While the examples above use text strings as sequences and character-based ngrams,
another common usage in computational linguistics and NLP is to have
segmented phrases/sentences as sequences and word-based ngrams:
.. code-block:: python
>>> from nskipgrams import Ngrams
>>> word_ngrams = Ngrams(n=2)
>>> word_ngrams.add_from_seq(("in", "the", "beginning"))
>>> word_ngrams.add_from_seq(("in", "the", "end"))
>>> for ngram, count in word_ngrams.ngrams_with_counts(n=2):
... print(ngram, count)
...
('in', 'the'), 2
('the', 'beginning'), 1
('the', 'end'), 1
Skipgrams
^^^^^^^^^
Ngrams are a special case of skipgrams, with skip = 0.
The class ``Skipgrams`` works the same as ``Ngrams``, with the following differences:
* ``Skipgrams`` has the method ``skipgrams_with_counts`` rather than ``ngrams_with_counts``.
``skipgrams_with_counts`` also has the keyword argument ``skip``
(in addition to ``n`` and ``prefix``).
* For ``Skipgrams``, the methods ``add`` and ``count``, as well as collection instantiation
(i.e., ``__init__``), also have a meaningful ``skip`` keyword argument.
The function ``skipgrams_from_seq`` works the same as ``ngrams_from_seq``, but has
the ``skip`` keyword argument (in addition to ``seq`` and ``n``).
Citation
--------
Lee, Jackson L. 2023. nskipgrams: A lightweight Python package to work with ngrams and skipgrams. https://doi.org/10.5281/zenodo.4002095
.. code-block:: latex
@software{leengrams,
author = {Jackson L. Lee},
title = {nskipgrams: A lightweight Python package to work with ngrams and skipgrams},
year = 2021,
doi = {10.5281/zenodo.4002095},
url = {https://doi.org/10.5281/zenodo.4002095}
}
License
-------
MIT License. Please see ``LICENSE.txt`` in the GitHub source code for details.
Changelog
---------
Please see ``CHANGELOG.md`` in the GitHub source code.
Raw data
{
"_id": null,
"home_page": "",
"name": "nskipgrams",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "computational linguistics,natural language processing,NLP,linguistics,language,ngrams,skipgrams",
"author": "",
"author_email": "\"Jackson L. Lee\" <jacksonlunlee@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/01/77/39683e1f8f47f261eab4059c8e392e24b55beefaf3cc08940577f75d5c19/nskipgrams-0.5.0.tar.gz",
"platform": null,
"description": "nskipgrams\n==========\n\n.. image:: https://badge.fury.io/py/nskipgrams.svg\n :target: https://pypi.python.org/pypi/nskipgrams\n :alt: PyPI version\n\n.. image:: https://img.shields.io/pypi/pyversions/nskipgrams.svg\n :target: https://pypi.python.org/pypi/nskipgrams\n :alt: Supported Python versions\n\n.. image:: https://circleci.com/gh/jacksonllee/nskipgrams.svg?style=shield\n :target: https://circleci.com/gh/jacksonllee/nskipgrams\n :alt: CircleCI Build\n\n.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.4002095.svg\n :target: https://doi.org/10.5281/zenodo.4002095\n :alt: DOI\n\n``nskipgrams`` is a lightweight Python package to work with ngrams and skipgrams.\nFields of study using ngrams and skipgrams from sequential data, especially\ncomputational linguistics and natural language processing, will find\nthis package helpful.\n\nHighlights:\n\n* Simple: Store, access, and count ngrams and skipgrams -- that's it!\n* Memory-efficient: Tries are used for internal storage.\n* Hassle-free: No dependencies. Written in pure Python. Today is a great day.\n\nDownload and Install\n--------------------\n\nTo download and install the most recent version::\n\n $ pip install --upgrade nskipgrams\n\nUsage\n-----\n\nThe following are defined:\n\n- Ngrams\n - The class ``Ngrams`` handles a collection of ngrams.\n - The function ``ngrams_from_seq`` yields ngrams for a given sequence.\n- Skipgrams\n - The class ``Skipgrams`` handles a collection of skipgrams.\n - The function ``skipgrams_from_seq`` yields skipgrams for a given sequence.\n\nGetting Ngrams from a Sequence\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nIf you simply need ngrams from a sequence, ``ngrams_from_seq`` is what you're looking for:\n\n.. code-block:: python\n\n >>> from nskipgrams import ngrams_from_seq\n >>> for ngram in ngrams_from_seq(\"abcdef\", n=2):\n ... print(ngram)\n ('a', 'b')\n ('b', 'c')\n ('c', 'd')\n ('d', 'e')\n ('e', 'f')\n\nInitializing an Ngram Collection\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n.. code-block:: python\n\n >>> from nskipgrams import Ngrams\n >>> char_ngrams = Ngrams(n=3) # handles unigrams, bigrams, and trigrams\n\nAdding Ngrams\n^^^^^^^^^^^^^\n\n.. code-block:: python\n\n >>> char_ngrams.add_from_seq(\"my cats\")\n >>> char_ngrams.add_from_seq(\"your cat\", count=2)\n\nHere, a sequence is anything that can be iterated over,\nand the corresponding ngrams are extracted from the individual elements\noff of the sequence.\nFor example, if the sequence is a text string like ``\"my cats\"`` above,\nthen the ngrams are character-based (hence the chosen variable name ``char_ngrams``).\n\nTo add a single ngram:\n\n.. code-block:: python\n\n >>> char_ngrams.add((\"y\", \"o\", \"u\"))\n\nAs a best practice, it is recommended that an ngram be represented as a ``tuple``\nregardless of what the individual elements are,\ne.g., ``(\"y\", \"o\", \"u\")`` for character-based ngrams.\nAs output examples show below, the ``tuple`` data type is also what this package\nuses to represent ngrams.\n\nAccessing Ngrams\n^^^^^^^^^^^^^^^^\n\n.. code-block:: python\n\n >>> for ngram, count in char_ngrams.ngrams_with_counts(n=1): # unigrams\n ... print(ngram, count)\n ...\n ('m',), 1\n ('y',), 3\n (' ',), 3\n ('c',), 3\n ('a',), 3\n ('t',), 3\n ('s',), 1\n ('o',), 2\n ('u',), 2\n ('r',), 2\n >>>\n >>> for ngram, count in char_ngrams.ngrams_with_counts(n=2): # bigrams\n ... print(ngram, count)\n ...\n ('m', 'y'), 1\n ('y', ' '), 1\n ('y', 'o'), 2\n (' ', 'c'), 3\n ('c', 'a'), 3\n ('a', 't'), 3\n ('t', 's'), 1\n ('o', 'u'), 2\n ('u', 'r'), 2\n ('r', ' '), 2\n >>>\n >>> for ngram, count in char_ngrams.ngrams_with_counts(n=3): # trigrams\n ... print(ngram, count)\n ...\n ('m', 'y', ' '), 1\n ('y', ' ', 'c'), 1\n ('y', 'o', 'u'), 3\n (' ', 'c', 'a'), 3\n ('c', 'a', 't'), 3\n ('a', 't', 's'), 1\n ('o', 'u', 'r'), 2\n ('u', 'r', ' '), 2\n ('r', ' ', 'c'), 2\n\nAccessing Ngrams with a Specific Prefix\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n.. code-block:: python\n\n >>> for ngram, count in char_ngrams.ngrams_with_counts(n=3, prefix=(\"y\",)):\n ... print(ngram, count)\n ...\n ('y', ' ', 'c'), 1\n ('y', 'o', 'u'), 3\n\nAccessing the Count of a Specific Ngram\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n.. code-block:: python\n\n >>> char_ngrams.count((\"c\", \"a\", \"t\"))\n 3\n\nChecking Membership\n^^^^^^^^^^^^^^^^^^^\n\nTo check if an ngram has an exact match in the collection so far:\n\n.. code-block:: python\n\n >>> (\"c\", \"a\", \"t\") in char_ngrams\n True\n\nCombining Collections of Ngrams\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nTo combine collections of ngrams (e.g., when you process data sources in parallel\nand have multiple ``Ngrams`` objects):\n\n.. code-block:: python\n\n >>> char_ngrams1 = Ngrams(n=2)\n >>> char_ngrams1.add_from_seq(\"my cat\")\n >>> set(char_ngrams1.ngrams_with_counts(n=2))\n {((' ', 'c'), 1),\n (('a', 't'), 1),\n (('c', 'a'), 1),\n (('m', 'y'), 1),\n (('y', ' '), 1)}\n >>>\n >>> char_ngrams2 = Ngrams(n=2)\n >>> char_ngrams2.add_from_seq(\"your cats\")\n >>> set(char_ngrams2.ngrams_with_counts(n=2))\n {((' ', 'c'), 1),\n (('a', 't'), 1),\n (('c', 'a'), 1),\n (('o', 'u'), 1),\n (('r', ' '), 1),\n (('t', 's'), 1),\n (('u', 'r'), 1),\n (('y', 'o'), 1)}\n >>>\n >>> char_ngrams3 = Ngrams(n=2)\n >>> char_ngrams3.add_from_seq(\"her cats\")\n >>> set(char_ngrams3.ngrams_with_counts(n=2))\n {((' ', 'c'), 1),\n (('a', 't'), 1),\n (('c', 'a'), 1),\n (('e', 'r'), 1),\n (('h', 'e'), 1),\n (('r', ' '), 1),\n (('t', 's'), 1)}\n >>>\n >>> char_ngrams1.combine(char_ngrams2, char_ngrams3) # `combine` takes as many Ngrams objects as desired\n >>> set(char_ngrams1.ngrams_with_counts(n=2))\n {((' ', 'c'), 3),\n (('a', 't'), 3),\n (('c', 'a'), 3),\n (('e', 'r'), 1),\n (('h', 'e'), 1),\n (('m', 'y'), 1),\n (('o', 'u'), 1),\n (('r', ' '), 2),\n (('t', 's'), 2),\n (('u', 'r'), 1),\n (('y', ' '), 1),\n (('y', 'o'), 1)}\n\nIf you don't want to mutate any of the ``Ngrams`` instances\n(the ``combine`` method works in-place and mutates ``these_ngrams``\nwhen ``these_ngrams.combine`` is called),\nthen you can create an empty ngram collection and combine into it\nall of your ngrams:\n\n.. code-block:: python\n\n >>> collections = [char_ngrams1, char_ngrams2, char_ngrams3]\n >>> all_ngrams = Ngrams(n=2) # A new, empty collection of ngrams\n >>> all_ngrams.combine(*collections)\n\nAny \"Sequences\" and their Corresponding \"Ngrams\" Work\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nWhile the examples above use text strings as sequences and character-based ngrams,\nanother common usage in computational linguistics and NLP is to have\nsegmented phrases/sentences as sequences and word-based ngrams:\n\n.. code-block:: python\n\n >>> from nskipgrams import Ngrams\n >>> word_ngrams = Ngrams(n=2)\n >>> word_ngrams.add_from_seq((\"in\", \"the\", \"beginning\"))\n >>> word_ngrams.add_from_seq((\"in\", \"the\", \"end\"))\n >>> for ngram, count in word_ngrams.ngrams_with_counts(n=2):\n ... print(ngram, count)\n ...\n ('in', 'the'), 2\n ('the', 'beginning'), 1\n ('the', 'end'), 1\n\nSkipgrams\n^^^^^^^^^\n\nNgrams are a special case of skipgrams, with skip = 0.\nThe class ``Skipgrams`` works the same as ``Ngrams``, with the following differences:\n\n* ``Skipgrams`` has the method ``skipgrams_with_counts`` rather than ``ngrams_with_counts``.\n ``skipgrams_with_counts`` also has the keyword argument ``skip``\n (in addition to ``n`` and ``prefix``).\n* For ``Skipgrams``, the methods ``add`` and ``count``, as well as collection instantiation\n (i.e., ``__init__``), also have a meaningful ``skip`` keyword argument.\n\nThe function ``skipgrams_from_seq`` works the same as ``ngrams_from_seq``, but has\nthe ``skip`` keyword argument (in addition to ``seq`` and ``n``).\n\nCitation\n--------\n\nLee, Jackson L. 2023. nskipgrams: A lightweight Python package to work with ngrams and skipgrams. https://doi.org/10.5281/zenodo.4002095\n\n.. code-block:: latex\n\n @software{leengrams,\n author = {Jackson L. Lee},\n title = {nskipgrams: A lightweight Python package to work with ngrams and skipgrams},\n year = 2021,\n doi = {10.5281/zenodo.4002095},\n url = {https://doi.org/10.5281/zenodo.4002095}\n }\n\nLicense\n-------\n\nMIT License. Please see ``LICENSE.txt`` in the GitHub source code for details.\n\nChangelog\n---------\n\nPlease see ``CHANGELOG.md`` in the GitHub source code.\n",
"bugtrack_url": null,
"license": "MIT License",
"summary": "A lightweight Python package to work with ngrams and skipgrams",
"version": "0.5.0",
"split_keywords": [
"computational linguistics",
"natural language processing",
"nlp",
"linguistics",
"language",
"ngrams",
"skipgrams"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f87e637bf86ee93dc81c6760b1765ab28f644bcbca81d0750e42b0033cec4ac5",
"md5": "cb22c17ee19f5745e89cb35c208bc791",
"sha256": "bf9b50453a623254fff1e4d953a06969e23207b74d433458dd99ad4e1bd82b68"
},
"downloads": -1,
"filename": "nskipgrams-0.5.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cb22c17ee19f5745e89cb35c208bc791",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 7782,
"upload_time": "2023-03-11T21:42:11",
"upload_time_iso_8601": "2023-03-11T21:42:11.860825Z",
"url": "https://files.pythonhosted.org/packages/f8/7e/637bf86ee93dc81c6760b1765ab28f644bcbca81d0750e42b0033cec4ac5/nskipgrams-0.5.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "017739683e1f8f47f261eab4059c8e392e24b55beefaf3cc08940577f75d5c19",
"md5": "01689079babca9f3281d0609830c527a",
"sha256": "8eabf61270e7ac12569d1630360f0421ca9603ce77d8106776369781559e56a7"
},
"downloads": -1,
"filename": "nskipgrams-0.5.0.tar.gz",
"has_sig": false,
"md5_digest": "01689079babca9f3281d0609830c527a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 8027,
"upload_time": "2023-03-11T21:42:45",
"upload_time_iso_8601": "2023-03-11T21:42:45.357276Z",
"url": "https://files.pythonhosted.org/packages/01/77/39683e1f8f47f261eab4059c8e392e24b55beefaf3cc08940577f75d5c19/nskipgrams-0.5.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-11 21:42:45",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "nskipgrams"
}