:Name: uniseg
:Version: 0.10.0
:Summary: Determine Unicode text segmentations
:Author: Masaaki Shibata <mshibata@emptypage.jp>
:Requires Python: >=3.9
:Keywords: text, unicode
:Uploaded: 2025-01-23 14:10:45
======
uniseg
======

A Python package to determine Unicode text segmentations.


- `uniseg · PyPI <https://pypi.org/project/uniseg/>`_
- `emptypage / uniseg-py — Bitbucket <https://bitbucket.org/emptypage/uniseg-py/>`_
- `uniseg documentation — Read the Docs <https://uniseg-py.readthedocs.io/>`_


News
====

We released version 0.9.0 in November 2024, and it is the first release
ever to pass all of the Unicode breaking tests (congrats!).  I am now
going to bump the version number to 1.0, with some breaking API changes,
soon.  Thank you.


Features
========

This package provides:

- Functions to get Unicode Character Database (UCD) properties concerned with
  text segmentations.
- Functions to determine segmentation boundaries of Unicode strings.
- Classes that help implement Unicode-aware text wrapping on both console
  (monospace) and graphical (monospace / proportional) font environments.

The supported segmentations are:

*code point*
  `Code point <https://www.unicode.org/glossary/#code_point>`_ is *"any value
  in the Unicode codespace."* It is the basic unit for processing Unicode
  strings.

  Historically, the unit that made up a Unicode string object in older
  versions of Python was build-dependent.  Some builds used UTF-16 as the
  internal representation and treated each code point greater than U+FFFF
  as a "surrogate pair", i.e. a pair of two special code points.  The
  `uniseg` package used to provide utility functions for processing
  Unicode strings per proper code point on every platform.

  Since Python 3.3, Unicode strings have been implemented with the
  "flexible string representation", which gives full access to code
  points as well as space efficiency `[PEP 393]`_.  So you no longer need
  to worry about multi-code-unit issues.  If you want to process a
  Unicode string per code point, just iterate over it: ``for c in s:``.
  The ``uniseg.codepoint`` module has therefore been deprecated and
  removed.

  .. _[PEP 393]: https://peps.python.org/pep-0393/
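
  Since plain iteration now works everywhere, here is a stdlib-only
  illustration (no ``uniseg`` calls assumed):

  .. code:: python

     # On Python 3.3+, a str is a sequence of code points, so an astral
     # character (above U+FFFF) counts as a single item, not a
     # surrogate pair.
     s = 'a\U0001F600'   # 'a' followed by U+1F600 GRINNING FACE
     assert len(s) == 2  # two code points, not three UTF-16 code units
     assert [hex(ord(c)) for c in s] == ['0x61', '0x1f600']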

*grapheme cluster*
  `Grapheme cluster <https://www.unicode.org/glossary/#grapheme_cluster>`_
  approximately represents a *"user-perceived character."*  A grapheme
  cluster may be made up of one or more Unicode code points.  For
  example, "g̈" ("g" + *combining diaeresis*) is a single *user-perceived
  character*, but it is represented with two code points, U+0067 LATIN
  SMALL LETTER G and U+0308 COMBINING DIAERESIS.
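
  A stdlib-only illustration of that example (no ``uniseg`` calls
  assumed):

  .. code:: python

     # "g" + COMBINING DIAERESIS displays as one user-perceived
     # character but consists of two code points.
     s = 'g\u0308'
     assert len(s) == 2
     assert list(s) == ['g', '\u0308']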

*word break*
  Word boundaries are a familiar segmentation used in many common text
  operations, e.g. as the unit for text highlighting, cursor jumping and
  so on.  Note that in some languages, *words* cannot be determined by
  spaces or punctuation alone; languages such as Thai or Japanese
  require dictionaries to determine appropriate word boundaries.  Though
  this package only provides a simple, script-based word breaking
  implementation that does not use any dictionaries, it also provides
  ways to customize its default behavior.
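
  The Thai case can be demonstrated with the stdlib alone (illustrative
  only; no ``uniseg`` calls assumed):

  .. code:: python

     # Thai is written without spaces between words, so whitespace
     # splitting returns the whole phrase as a single token; finding
     # the word boundaries inside it requires a dictionary.
     thai = 'สวัสดีชาวโลก'   # "hello, world", written without spaces
     assert thai.split() == [thai]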

*sentence break*
  Sentence breaks are also common in text processing, but they are more
  contextual and less formal.  The sentence breaking implementation in
  this package (which follows the rules specified in the relevant
  Unicode Standard Annex, UAX #29) is likewise simple and formal, but it
  should still be useful in many situations.
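
  A stdlib sketch of why purely formal rules fall short (illustrative
  only; no ``uniseg`` calls assumed):

  .. code:: python

     import re

     # A naive rule "break after a period followed by whitespace"
     # over-splits on the abbreviation "Mr.", showing that sentence
     # breaking needs more context than punctuation alone.
     text = 'Mr. Smith arrived. He sat down.'
     parts = re.split(r'(?<=\.)\s+', text)
     assert parts == ['Mr.', 'Smith arrived.', 'He sat down.']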

*line break*
  Implementing the line breaking algorithm is one of the key features of
  this package.  The feature is important for many general text
  presentation tasks in both CLI and GUI applications.
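
  A stdlib contrast showing why a Unicode-aware algorithm matters
  (illustrative only; no ``uniseg`` calls assumed):

  .. code:: python

     import textwrap

     # textwrap only understands spaces, so unspaced Japanese text is
     # treated as one long word and chopped at the width.  The second
     # line then starts with the full stop U+3002, a break position
     # that UAX #14 (and Japanese typography) forbids.
     lines = textwrap.wrap('こんにちは。世界', width=5)
     assert lines == ['こんにちは', '。世界']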


Requirements
============

Python 3.9 or later.


Install
=======

.. code:: console

  $ pip install uniseg


Changes
=======

0.9.1 (2025-01-16)
  - Fix: the ``ambiguous_as_wide`` option was not working in ``uniseg.wrap``.

0.9.0 (2024-11-07)
  - Unicode 16.0.0.
  - Rule-based grapheme cluster segmentation is back.
  - This is the first release ever to pass all of the Unicode breaking tests!


0.8.1 (2024-08-13)
  - Fix: ``sentence_break('/')`` raised an exception. (Thanks to Nathaniel Mills)

0.8.0 (2024-02-08)
  - Unicode 15.0.0.
  - Regex-based grapheme cluster segmentation.
  - Quit supporting Python versions < 3.8.

0.7.2 (2022-09-20)
  - Improve performance of Unicode lookups. `PR by Max Bachmann
    <https://bitbucket.org/emptypage/uniseg-py/pull-requests/1>`_.

0.7.1 (2015-05-02)
  - CHANGE: ``wrap.Wrapper.wrap()`` now returns the count of lines.
  - Separate LICENSE from README.txt for the packaging-related reason in some
    environments.

0.7.0 (2015-02-27)
  - CHANGE: Stopped gathering all submodules' members at the top-level
    ``uniseg`` module.
  - CHANGE: Reform the ``uniseg.wrap`` module and sample scripts.
  - Maintained the ``uniseg.wrap`` module; the sample scripts work again.

0.6.4 (2015-02-10)
  - Add the ``uniseg-dbpath`` console command, which just prints the path
    of ``ucd.sqlite3``.
  - Include sample scripts under the package's subdirectory.

0.6.3 (2015-01-25)
  - Python 3.4
  - Support modern setuptools, pip and wheel.

0.6.2 (2013-06-09)
  - Python 3.3

0.6.1 (2013-06-08)
  - Unicode 6.2.0


References
==========

- `UAX #29: Unicode Text Segmentation (16.0.0)
  <https://www.unicode.org/reports/tr29/tr29-45.html>`_
- `UAX #14: Unicode Line Breaking Algorithm (16.0.0)
  <https://www.unicode.org/reports/tr14/tr14-53.html>`_


Related / Similar Projects
==========================

`PyICU <https://pypi.python.org/pypi/PyICU>`_ - Python extension wrapping the ICU C++ API
  *PyICU* is a Python extension wrapping the International Components
  for Unicode library (ICU).  It also provides text segmentation
  support, which is both richer and faster than ours.  However, PyICU is
  an extension library, so it requires the ICU dynamic library (binary
  files) and a compiler to build.  Our package is written in pure
  Python; it runs slower but is more portable.

`pytextseg <https://pypi.python.org/pypi/pytextseg>`_ - Python module for text segmentation
  The *pytextseg* package pursues a goal very similar to ours: it
  provides Unicode-aware text wrapping features.  It is designed around
  its own string class (rather than the built-in ``unicode`` / ``str``
  classes) for that purpose, whereas our modules process text as
  ordinary built-in ``unicode`` / ``str`` objects.

            
