:Name: ipatok
:Version: 0.4.2
:Summary: IPA tokeniser
:Upload time: 2024-04-07 13:51:48
:Keywords: ipa, tokeniser, tokenizer

======
ipatok
======

A simple IPA tokeniser, as simple as in:

>>> from ipatok import tokenise
>>> tokenise('ˈtiːt͡ʃə')
['t', 'iː', 't͡ʃ', 'ə']
>>> tokenise('ʃːjeq͡χːʼjer')
['ʃː', 'j', 'e', 'q͡χːʼ', 'j', 'e', 'r']
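To give a rough idea of what a token is, here is a minimal sketch that attaches modifier letters and combining marks to the preceding letter and lets a tie bar pull in the next letter. It is not the package's implementation: ``naive_tokenise`` is a hypothetical name, the Unicode category test only approximates the IPA spec, and stress marks are the only suprasegmentals it knows how to skip.

```python
import unicodedata

TIE_BARS = {'\u0361', '\u035c'}      # combining tie bar above / below
STRESS_MARKS = {'\u02c8', '\u02cc'}  # primary and secondary stress


def naive_tokenise(string):
    """Split an IPA string into letter-plus-diacritics tokens (rough sketch)."""
    tokens = []
    tie_open = False  # True right after a tie bar: the next letter joins in
    for char in string:
        if char.isspace() or char in STRESS_MARKS:
            tie_open = False
            continue
        if char in TIE_BARS:
            if tokens:
                tokens[-1] += char
                tie_open = True
            continue
        # Mn = combining marks, Lm = modifier letters (incl. the length mark)
        is_diacritic = unicodedata.category(char) in ('Mn', 'Lm')
        if tokens and (tie_open or is_diacritic):
            tokens[-1] += char
        else:
            tokens.append(char)
        tie_open = False
    return tokens
```

For the two strings above this sketch yields the same lists as ``tokenise``; for anything beyond such simple input, use the real function.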


api
===

``tokenise(string, strict=False, replace=False, diphthongs=False, tones=False,
unknown=False, merge=None)`` takes an IPA string and returns a list of tokens.
A token usually consists of a single letter together with its accompanying
diacritics. If two letters are connected by a tie bar, they are also considered
a single token. Except for length markers, suprasegmentals are excluded from
the output. Whitespace is also ignored. The function accepts the following
keyword arguments:

- ``strict``: if set to ``True``, the function ensures that ``string`` complies
  with the IPA spec (`the 2015 revision`_); a ``ValueError`` is raised if it
  does not. If set to ``False`` (the default), the role of non-IPA characters
  is guessed based on their Unicode category (cf. the pitfalls section below).
- ``replace``: if set to ``True``, the function replaces some common
  substitutes with their IPA-compliant counterparts, e.g. ``g → ɡ``, ``ɫ → l̴``,
  ``ʦ → t͡s``. Refer to ``ipatok/data/replacements.tsv`` for a full list. If
  both ``strict`` and ``replace`` are set to ``True``, replacing is done before
  checking for spec compliance.
- ``diphthongs``: if set to ``True``, the function groups together non-syllabic
  vowels with their syllabic neighbours (e.g. ``aɪ̯`` would form a single
  token). If set to ``False`` (the default), vowels are not tokenised together
  unless there is a connecting tie bar (e.g. ``a͡ɪ``).
- ``tones``: if set to ``True``, tone and word accents are included in the
  output (accent markers as diacritics and Chao letters as separate tokens). If
  set to ``False`` (the default), these are ignored.
- ``unknown``: if set to ``True``, the output includes (as separate tokens)
  symbols that cannot be classified as letters, diacritics or suprasegmentals
  (e.g. ``-``, ``/``, ``$``). If set to ``False`` (the default), such symbols
  are ignored. This has no effect if ``strict`` is set to ``True``.
- ``merge``: expects a ``str, str → bool`` function to be applied to each
  pair of consecutive tokens; those for which the output is ``True`` are merged
  together. You can use this to, e.g., plug in your own diphthong detection
  algorithm:

  >>> tokenise(string, diphthongs=False, merge=custom_func)
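As an illustration of what such a ``merge`` function might look like, the sketch below merges a vowel token with a following high-vowel token. The two inventories are hypothetical and deliberately tiny; only the ``str, str → bool`` shape comes from the description above.

```python
# Hypothetical, deliberately tiny inventories; extend them for real data.
NUCLEI = set('aeoɛɔ')
OFFGLIDES = set('iuɪʊ')


def custom_func(left, right):
    """Return True if two consecutive tokens should be merged into one."""
    # Only the first character (the base letter) of each token is inspected,
    # so diacritics and length marks on either token do not matter here.
    return left[:1] in NUCLEI and right[:1] in OFFGLIDES
```

It would then be passed as ``merge=custom_func``, as in the call above.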

``tokenize`` is an alias for ``tokenise``.
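Of the keyword arguments above, ``replace`` is perhaps the easiest to picture in isolation. The sketch below covers only the three substitutions named in its description; ``replace_substitutes`` is a hypothetical helper, and the package's own table in ``ipatok/data/replacements.tsv`` is longer.

```python
# Only the three substitutions mentioned in the text, not the full table.
REPLACEMENTS = str.maketrans({
    'g': '\u0261',         # g → ɡ (Latin small letter script g)
    '\u026b': 'l\u0334',   # ɫ → l̴ (l + combining tilde overlay)
    '\u02a6': 't\u0361s',  # ʦ → t͡s (t + tie bar + s)
})


def replace_substitutes(string):
    """Swap common substitute characters for IPA-compliant sequences."""
    return string.translate(REPLACEMENTS)
```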

other functions
---------------

``replace_digits_with_chao(string, inverse=False)`` takes an IPA string and
replaces the digits 1-5 (also in superscript) with Chao tone letters. If
``inverse=True``, smaller digits are converted into higher tones; otherwise,
they are converted into lower tones (the default).  Equal consecutive digits
are collapsed into a single Chao letter (e.g. ``55 → ˥``).

>>> tokenise(replace_digits_with_chao('ɕia⁵¹ɕyɛ²¹⁴'), tones=True)
['ɕ', 'i', 'a', '˥˩', 'ɕ', 'y', 'ɛ', '˨˩˦']
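The digit conversion described above can be sketched in a few lines. This is a stand-alone approximation under stated assumptions, not the package's code: ``digits_to_chao`` is a hypothetical name, and a plain digit and its superscript form are treated as equal when collapsing repeats.

```python
CHAO = '˩˨˧˦˥'  # index 0 is the lowest tone letter, index 4 the highest

# Plain and superscript digits 1-5 mapped to their numeric value.
DIGITS = {
    '1': 1, '2': 2, '3': 3, '4': 4, '5': 5,
    '¹': 1, '²': 2, '³': 3, '⁴': 4, '⁵': 5,
}


def digits_to_chao(string, inverse=False):
    """Replace tone digits with Chao letters, collapsing equal neighbours."""
    output = []
    previous = None  # value of the digit seen immediately before, if any
    for char in string:
        level = DIGITS.get(char)
        if level is None:
            output.append(char)
        elif level != previous:
            # By default 5 is the highest tone; inverse flips the scale.
            index = 5 - level if inverse else level - 1
            output.append(CHAO[index])
        previous = level
    return ''.join(output)
```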


``clusterise(string, strict=False, replace=False, diphthongs=False,
tones=False, unknown=False, merge=None)`` takes an IPA string and lists its
consonant and vowel clusters. The keyword arguments are the same as for
``tokenise``:

>>> from ipatok import clusterise
>>> clusterise("kiaːltaːʃ")
['k', 'iaː', 'lt', 'aː', 'ʃ']

``clusterize`` is an alias for ``clusterise``.
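Conceptually, clustering amounts to joining runs of consecutive vowel tokens and runs of consecutive consonant tokens. The sketch below shows that idea on an already tokenised list; ``cluster_tokens`` and the small vowel inventory are hypothetical and make no claim about how ``clusterise`` decides what a vowel is.

```python
from itertools import groupby

# Hypothetical vowel inventory, just large enough for the example.
VOWELS = set('aeiouyɛɔəɪʊ')


def is_vowel(token):
    """Classify a token by its first character (the base letter)."""
    return token[:1] in VOWELS


def cluster_tokens(tokens):
    """Join consecutive tokens of the same kind (vowel or consonant)."""
    return [''.join(group) for _, group in groupby(tokens, key=is_vowel)]
```

Applied to the tokens of the example above, ``['k', 'i', 'aː', 'l', 't', 'aː', 'ʃ']``, it gives the same five clusters.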

pitfalls
========

When ``strict=True``, each symbol is looked up in the spec and there is no
ambiguity as to how the input should be tokenised.

With ``strict=False``, IPA symbols are still handled correctly. A non-IPA
symbol is treated as follows:

- if it is a non-modifier letter (e.g. ``Γ``), it is considered a consonant;
- if it is a modifier (e.g. ``ˀ``) or a combining mark (e.g. ``ə̇``), it is
  considered a diacritic;
- if it is a `modifier tone letter`_ (e.g. ``꜍``), it is considered a tone
  symbol;
- if it is none of the above, it is considered an unknown symbol.
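The rules above can be approximated with the standard ``unicodedata`` module. The sketch assumes that "modifier" means general category ``Lm``, that "combining mark" means any ``M*`` category, and that modifier tone letters are the block U+A700 to U+A71F; ``guess_role`` is a hypothetical name and the package may draw these lines differently.

```python
import unicodedata


def guess_role(char):
    """Guess the role of a single non-IPA character from its Unicode data."""
    # Checked first: part of the Modifier Tone Letters block is itself
    # categorised as Lm and would otherwise count as a diacritic.
    if '\ua700' <= char <= '\ua71f':
        return 'tone'
    category = unicodedata.category(char)
    if category == 'Lm' or category.startswith('M'):
        return 'diacritic'
    if category.startswith('L'):
        return 'consonant'
    return 'unknown'
```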

Regardless of the value of ``strict``, whitespace characters and underscores
are treated as word boundaries: symbols separated by these characters are never
grouped into the same token, even though the boundary characters themselves are
not included in the output.
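One way to picture the boundary rule is to split the input into chunks first and tokenise each chunk on its own. The helper below is a hypothetical illustration of that splitting step only.

```python
import re


def split_at_boundaries(string):
    """Split on runs of whitespace and underscores, dropping empty chunks."""
    return [chunk for chunk in re.split(r'[\s_]+', string) if chunk]
```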


installation
============

This is a Python 3 package without dependencies and it is offered at the
`Cheese Shop`_::

    # usually within a virtual environment
    pip install ipatok


other IPA packages
==================

- lingpy_ is a historical linguistics suite that includes an ipa2tokens_
  function.
- loanpy_ is another historical linguistics suite which works with IPA strings.
- ipapy_ is a package for working with IPA strings.
- ipalint_ provides a command-line tool for checking IPA datasets for errors
  and inconsistencies.
- asjp_ provides functions for converting between IPA and ASJP.


licence
=======

MIT. Do as you please and praise the snake gods.


.. _`the 2015 revision`: https://www.internationalphoneticassociation.org/sites/default/files/phonsymbol.pdf
.. _`modifier tone letter`: http://www.unicode.org/charts/PDF/UA700.pdf
.. _`Cheese Shop`: https://pypi.python.org/pypi/ipatok/
.. _`lingpy`: https://pypi.python.org/pypi/lingpy/
.. _`ipa2tokens`: http://lingpy.org/reference/lingpy.sequence.html#lingpy.sequence.sound_classes.ipa2tokens
.. _`loanpy`: https://pypi.org/project/loanpy/
.. _`ipapy`: https://pypi.python.org/pypi/ipapy/
.. _`ipalint`: https://pypi.python.org/pypi/ipalint/
.. _`asjp`: https://pypi.python.org/pypi/asjp/

            

Raw data

:Author: pavelsof <mail@pavelsof.com>
:Home: https://github.com/pavelsof/ipatok
:Source: https://github.com/pavelsof/ipatok
:Changelog: https://github.com/pavelsof/ipatok/blob/master/CHANGELOG.rst
:Tracker: https://github.com/pavelsof/ipatok/issues
:Wheel: https://files.pythonhosted.org/packages/6a/23/6310fb6f98f50eb00bcb1e593f3d91ff04e1e8b6eeac2d077d3cdbcf57a3/ipatok-0.4.2-py2.py3-none-any.whl
:Sdist: https://files.pythonhosted.org/packages/9d/21/d4b74ab75562dbf06a455070965b424c216f2c7de754b2058aa51e487665/ipatok-0.4.2.tar.gz