unicategories


| Field           | Value                                                 |
|-----------------|-------------------------------------------------------|
| Name            | unicategories                                         |
| Version         | 0.1.2                                                 |
| Home page       | https://gitlab.com/ergoithz/unicategories             |
| Summary         | Unicode category database                             |
| Upload time     | 2023-04-02 13:27:47                                   |
| Author          | Felipe A. Hernandez                                   |
| Requires Python | `>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*`  |
| License         | MIT                                                   |
| Keywords        | unicode                                               |
| Requirements    | No requirements were recorded.                        |

# unicategories

Unicode category database, generated and cached on setup.

This module exposes a `categories` dictionary that maps each Unicode category
to a `RangeGroup` instance holding all the character ranges detected for that
category on your system.
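
As a rough illustration of how such ranges could be detected, the sketch below groups consecutive code points by the category that the interpreter's own `unicodedata` module reports. It is an assumption about the general technique, not the package's actual setup code; `build_category_ranges` and its `limit` parameter are hypothetical names, and the code targets Python 3.

```python
import unicodedata
from itertools import groupby


def build_category_ranges(limit=0x110000):
    """Map each category to a list of half-open (start, end) code point ranges."""
    ranges = {}
    by_category = groupby(
        range(limit),
        key=lambda code: unicodedata.category(chr(code)),
    )
    for category, run in by_category:
        start = end = next(run)  # every run holds at least one code point
        for end in run:  # advance to the last code point of this run
            pass
        ranges.setdefault(category, []).append((start, end + 1))
    return ranges


# Restrict the scan to ASCII so the result is small and stable.
ascii_ranges = build_category_ranges(limit=128)
print(ascii_ranges['Lu'])  # [(65, 91)]
print(ascii_ranges['Nd'])  # [(48, 58)]
```

Scanning the full code space instead of the first 128 code points gives results that depend on the Unicode version bundled with the interpreter, which is why the ranges are described as detected on your system.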

## Example

```python
from unicategories import categories

upperchars = categories['Lu'].characters()  # iterator
print('Unicode uppercase characters are "%s"' % ''.join(upperchars))
# Unicode uppercase characters are "ABCDEF..."
```

## RangeGroup

`RangeGroup` is an immutable iterable (a tuple subclass with some useful
methods) of `(start, end)` tuples. Like Python's `range`, each range is
half-open: `start` is included and `end` is excluded.

This representation was chosen for memory efficiency, since storing every
character individually would take a lot of memory.
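
To make the half-open convention concrete, here is a minimal sketch using only built-ins. `expand` is a hypothetical helper written for this illustration and is not part of unicategories.

```python
import string

# 'Z' is code point 90, so the exclusive end of the range is 91.
uppercase_range = (ord('A'), ord('Z') + 1)


def expand(start, end):
    """Turn one half-open (start, end) range into the characters it covers."""
    return ''.join(chr(code) for code in range(start, end))


expanded = expand(*uppercase_range)
print(expanded == string.ascii_uppercase)  # True
print(len(expanded))  # 26 characters described by a single 2-tuple
```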

The `RangeGroup` class provides the following methods:

### `range_group.characters()`
`type: () -> typing.Iterator[str]`
```rst
Get an iterator over all characters in this range group.

:returns: iterator of characters (str of size 1)
```

### `range_group.codes()`
`type: () -> typing.Iterator[int]`
```rst
Get an iterator over all Unicode code points contained in this range group.

:returns: iterator of code points (int)
```

### `range_group.has(character)`
`type: (typing.Union[str, int]) -> bool`
```rst
Check whether a character (or its code point) is part of this range group.

:param character: character or unicode code point to look for
:returns: True if character is contained by any range, False otherwise
```
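
The behaviour of the three methods can be approximated with a small tuple subclass. This is an illustrative sketch and not the library's implementation: `RangeGroupSketch` is a hypothetical name, and the code assumes Python 3, where `str` holds Unicode text.

```python
class RangeGroupSketch(tuple):
    """Immutable tuple of half-open (start, end) code point ranges."""

    def codes(self):
        """Yield every code point covered by any range, in stored order."""
        for start, end in self:
            for code in range(start, end):
                yield code

    def characters(self):
        """Yield every covered code point as a one-character string."""
        for code in self.codes():
            yield chr(code)

    def has(self, character):
        """Accept a character or an int code point and test membership."""
        code = ord(character) if isinstance(character, str) else character
        return any(start <= code < end for start, end in self)


group = RangeGroupSketch([(65, 68), (97, 99)])
print(list(group.codes()))          # [65, 66, 67, 97, 98]
print(''.join(group.characters()))  # ABCab
print(group.has('A'), group.has(68))  # True False
```

The membership test walks the ranges linearly, which keeps the sketch short; a real implementation could use a binary search over sorted ranges instead.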

## Unicode categories

Taken from [Wikipedia](https://en.wikipedia.org/wiki/Template:General_Category_(Unicode)).

| Value | Category (major, minor)    | Basic type   | Character assigned | Fixed                                                      | Remarks |
|-------|----------------------------|--------------|--------------------|------------------------------------------------------------|---------|
| Lu    | Letter, uppercase          | Graphic      | Character          |                                                            |         |
| Ll    | Letter, lowercase          | Graphic      | Character          |                                                            |         |
| Lt    | Letter, titlecase          | Graphic      | Character          |                                                            | Ligatures containing uppercase followed by lowercase letters (e.g., `ǅ`, `ǈ`, `ǋ`, and `ǲ`) |
| Lm    | Letter, modifier           | Graphic      | Character          |                                                            |         |
| Lo    | Letter, other              | Graphic      | Character          |                                                            |         |
| Mn    | Mark, nonspacing           | Graphic      | Character          |                                                            |         |
| Mc    | Mark, spacing combining    | Graphic      | Character          |                                                            |         |
| Me    | Mark, enclosing            | Graphic      | Character          |                                                            |         |
| Nd    | Number, decimal digit      | Graphic      | Character          |                                                            | All these, and only these, have Numeric Type = De |
| Nl    | Number, letter             | Graphic      | Character          |                                                            | Numerals composed of letters or letterlike symbols (e.g., Roman numerals) |
| No    | Number, other              | Graphic      | Character          |                                                            | E.g., vulgar fractions, superscript and subscript digits |
| Pc    | Punctuation, connector     | Graphic      | Character          |                                                            | Includes "_" underscore |
| Pd    | Punctuation, dash          | Graphic      | Character          |                                                            | Includes several hyphen characters |
| Ps    | Punctuation, open          | Graphic      | Character          |                                                            | Opening bracket characters |
| Pe    | Punctuation, close         | Graphic      | Character          |                                                            | Closing bracket characters |
| Pi    | Punctuation, initial quote | Graphic      | Character          |                                                            | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
| Pf    | Punctuation, final quote   | Graphic      | Character          |                                                            | Closing quotation mark. May behave like Ps or Pe depending on usage |
| Po    | Punctuation, other         | Graphic      | Character          |                                                            |         |
| Sm    | Symbol, math               | Graphic      | Character          |                                                            |         |
| Sc    | Symbol, currency           | Graphic      | Character          |                                                            |         |
| Sk    | Symbol, modifier           | Graphic      | Character          |                                                            |         |
| So    | Symbol, other              | Graphic      | Character          |                                                            |         |
| Zs    | Separator, space           | Graphic      | Character          |                                                            | Includes the space, but not TAB, CR, or LF, which are Cc |
| Zl    | Separator, line            | Format       | Character          |                                                            | Only U+2028 LINE SEPARATOR (LSEP) |
| Zp    | Separator, paragraph       | Format       | Character          |                                                            | Only U+2029 PARAGRAPH SEPARATOR (PSEP) |
| Cc    | Other, control             | Control      | Character          | Fixed 65                                                   | No name, `<control>` |
| Cf    | Other, format              | Format       | Character          |                                                            | Includes the soft hyphen, control characters to support bi-directional text, and language tag characters |
| Cs    | Other, surrogate           | Surrogate    | Not (but abstract) | Fixed 2,048                                                | No name, `<surrogate>` |
| Co    | Other, private use         | Private-use  | Not (but abstract) | Fixed 137,468 total: 6,400 in BMP, 131,068 in Planes 15–16 | No name, `<private-use>` |
| Cn    | Other, not assigned        | Noncharacter | Not                | Fixed 66                                                   | No name, `<noncharacter>` |
| Cn    | Other, not assigned        | Reserved     | Not                | Not fixed                                                  | No name, `<reserved>` |

In addition, unicategories provides the general categories `L`, `M`, `N`, `P`, `S`, `Z` and `C`.
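
The relationship between the two-letter values in the table and the one-letter general categories can be checked with the standard library alone. The sketch below does not use unicategories; `major_category` is a hypothetical helper, and it relies only on the fact that a general category is the first letter of the two-letter value.

```python
import unicodedata


def major_category(character):
    """Return the one-letter general category of a single character."""
    return unicodedata.category(character)[0]


# A few characters mentioned in the table remarks above.
samples = ['A', '1', '_', '+', ' ', '\t', '\u2028']
for sample in samples:
    print(repr(sample), unicodedata.category(sample), major_category(sample))
```

For example, the underscore is reported as `Pc` and therefore falls under `P`, while TAB is `Cc` and falls under `C`, matching the remarks for `Pc` and `Zs`.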

            

Raw data

{
    "_id": null,
    "home_page": "https://gitlab.com/ergoithz/unicategories",
    "name": "unicategories",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
    "maintainer_email": "",
    "keywords": "unicode",
    "author": "Felipe A. Hernandez",
    "author_email": "ergoithz@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/91/0a/3fb12e60f0a0cfe659e1527aafd1aef99d80dbe5c8f2efa7b33e4a41f943/unicategories-0.1.2.tar.gz",
    "platform": "any",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Unicode category database",
    "version": "0.1.2",
    "split_keywords": [
        "unicode"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "910a3fb12e60f0a0cfe659e1527aafd1aef99d80dbe5c8f2efa7b33e4a41f943",
                "md5": "c0749f5458daa4518ea7654b6a250241",
                "sha256": "8e005e80ed156da58eb584c26e6f4b073b15a86963b4afa4cc37045e926a9591"
            },
            "downloads": -1,
            "filename": "unicategories-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "c0749f5458daa4518ea7654b6a250241",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
            "size": 12397,
            "upload_time": "2023-04-02T13:27:47",
            "upload_time_iso_8601": "2023-04-02T13:27:47.794647Z",
            "url": "https://files.pythonhosted.org/packages/91/0a/3fb12e60f0a0cfe659e1527aafd1aef99d80dbe5c8f2efa7b33e4a41f943/unicategories-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-02 13:27:47",
    "github": false,
    "gitlab": true,
    "bitbucket": false,
    "gitlab_user": "ergoithz",
    "gitlab_project": "unicategories",
    "lcname": "unicategories"
}
        