# unicategories
Unicode category database, generated and cached on setup.
This module exposes a `categories` dictionary that maps each Unicode category to a
`RangeGroup` instance holding all character ranges of that category detected on your system.
## Example
```python
from unicategories import categories
upperchars = categories['Lu'].characters() # iterator
print('Unicode uppercase characters are "%s"' % ''.join(upperchars))
# Unicode uppercase characters are "ABCDEF..."
```
## RangeGroup
An immutable iterable (based on `tuple`, with some useful methods) of `(start, end)`
tuples which, like Python's `range`, are open at the end.
This representation was chosen for memory efficiency: storing every character
individually would take a lot of memory.
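
To illustrate why half-open ranges are compact, here is a minimal sketch (Python 3) that compresses code points into `(start, end)` tuples. The helper `to_ranges` is hypothetical and uses the standard `unicodedata` module; it is not necessarily how unicategories builds its database.

```python
import unicodedata


def to_ranges(codes):
    """Compress sorted code points into half-open (start, end) tuples."""
    ranges = []
    for code in codes:
        if ranges and ranges[-1][1] == code:
            # Adjacent to the last range: move its open end one step forward.
            ranges[-1] = (ranges[-1][0], code + 1)
        else:
            # Not adjacent: start a new range covering just this code point.
            ranges.append((code, code + 1))
    return tuple(ranges)


# Uppercase letters among the first 128 code points (ASCII).
ascii_upper = to_ranges(
    code for code in range(128)
    if unicodedata.category(chr(code)) == 'Lu'
)
print(ascii_upper)  # ((65, 91),)
```

The 26 letters `A` to `Z` collapse into the single pair `(65, 91)`, where 91 is excluded just as with `range(65, 91)`.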
The `RangeGroup` class provides the following methods:
### `range_group.characters()`
`type: () -> typing.Iterator[str]`
```rst
Get iterator with all characters in this range group.
:returns: iterator of characters (str of size 1)
```
### `range_group.codes()`
`type: () -> typing.Iterator[int]`
```rst
Get iterator for all unicode code points contained in this range group.
:returns: iterator of character indexes (int)
```
### `range_group.has(character)`
`type: (typing.Union[str, int]) -> bool`
```rst
Check whether character (or character code point) is part of this range group.
:param character: character or unicode code point to look for
:returns: True if character is contained by any range, False otherwise
```
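
The semantics of these three methods can be sketched with a small stand-in class. `SketchRangeGroup` below is hypothetical (Python 3) and only mirrors the documented signatures over half-open ranges; the real `RangeGroup` implementation may differ.

```python
class SketchRangeGroup(tuple):
    """Hypothetical stand-in: a tuple of half-open (start, end) pairs."""

    def codes(self):
        # Yield every code point covered by each range, end excluded.
        for start, end in self:
            for code in range(start, end):
                yield code

    def characters(self):
        # Same iteration as codes(), converted to one-character strings.
        return (chr(code) for code in self.codes())

    def has(self, character):
        # Accept either a character or an integer code point.
        code = ord(character) if isinstance(character, str) else character
        return any(start <= code < end for start, end in self)


group = SketchRangeGroup(((65, 68), (97, 99)))
print(list(group.codes()))            # [65, 66, 67, 97, 98]
print(''.join(group.characters()))    # ABCab
print(group.has('A'), group.has(68))  # True False
```

Because each range is open at the end, code point 68 (`D`) is not part of `(65, 68)`.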
## Unicode categories
Taken from [Wikipedia](https://en.wikipedia.org/wiki/Template:General_Category_(Unicode)).
| Value | Category Major, minor | Basic type | Character assigned | Fixed | Remarks |
|--------|----------------------------|----------------|------------------------|-------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| Lu | Letter, uppercase | Graphic | Character | | |
| Ll | Letter, lowercase | Graphic | Character | | |
| Lt | Letter, titlecase | Graphic | Character | | Ligatures containing uppercase followed by lowercase letters (e.g., `ǅ`, `ǈ`, `ǋ`, and `ǲ`) |
| Lm | Letter, modifier | Graphic | Character | | |
| Lo | Letter, other | Graphic | Character | | |
| Mn | Mark, nonspacing | Graphic | Character | | |
| Mc | Mark, spacing combining | Graphic | Character | | |
| Me | Mark, enclosing | Graphic | Character | | |
| Nd | Number, decimal digit | Graphic | Character | | All these, and only these, have Numeric Type = De |
| Nl | Number, letter | Graphic | Character | | Numerals composed of letters or letterlike symbols (e.g., Roman numerals) |
| No | Number, other | Graphic | Character | | E.g., vulgar fractions, superscript and subscript digits |
| Pc | Punctuation, connector | Graphic | Character | | Includes "_" underscore |
| Pd | Punctuation, dash | Graphic | Character | | Includes several hyphen characters |
| Ps | Punctuation, open | Graphic | Character | | Opening bracket characters |
| Pe | Punctuation, close | Graphic | Character | | Closing bracket characters |
| Pi | Punctuation, initial quote | Graphic | Character | | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
| Pf | Punctuation, final quote | Graphic | Character | | Closing quotation mark. May behave like Ps or Pe depending on usage |
| Po | Punctuation, other | Graphic | Character | | |
| Sm | Symbol, math | Graphic | Character | | |
| Sc | Symbol, currency | Graphic | Character | | |
| Sk | Symbol, modifier | Graphic | Character | | |
| So | Symbol, other | Graphic | Character | | |
| Zs | Separator, space | Graphic | Character | | Includes the space, but not TAB, CR, or LF, which are Cc |
| Zl | Separator, line | Format | Character | | Only U+2028 LINE SEPARATOR (LSEP) |
| Zp | Separator, paragraph | Format | Character | | Only U+2029 PARAGRAPH SEPARATOR (PSEP) |
| Cc | Other, control | Control | Character | Fixed 65 | No name, `<control>` |
| Cf | Other, format | Format | Character | | Includes the soft hyphen, control characters to support bi-directional text, and language tag characters |
| Cs | Other, surrogate | Surrogate | Not (but abstract) | Fixed 2,048 | No name, `<surrogate>` |
| Co | Other, private use | Private-use | Not (but abstract) | Fixed 137,468 total: 6,400 in BMP, 131,068 in Planes 15–16 | No name, `<private-use>` |
| Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name, `<noncharacter>` |
| Cn | Other, not assigned | Reserved | Not | Not fixed | No name, `<reserved>` |
In addition, unicategories provides the general categories `L`, `M`, `N`, `P`, `S`, `Z` and `C`.
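
A general category is the first letter of the two-letter value in the table above. The following sketch (Python 3, standard `unicodedata` only) shows how minor-category ranges could be grouped under their general category; the helpers `major_category` and `group_by_major` are hypothetical and are not part of the unicategories API.

```python
import unicodedata


def major_category(character):
    """Return the one-letter general category of a character."""
    return unicodedata.category(character)[0]


def group_by_major(minor_ranges):
    """Merge {minor: ranges} into {major: ranges}, joining touching ranges."""
    grouped = {}
    for minor, ranges in minor_ranges.items():
        grouped.setdefault(minor[0], []).extend(ranges)
    merged = {}
    for major, ranges in grouped.items():
        result = []
        for start, end in sorted(ranges):
            if result and result[-1][1] >= start:
                # Overlapping or adjacent half-open ranges become one range.
                result[-1] = (result[-1][0], max(result[-1][1], end))
            else:
                result.append((start, end))
        merged[major] = tuple(result)
    return merged


print(major_category('A'), major_category('_'), major_category('\t'))  # L P C
print(group_by_major({'Lu': ((65, 91),), 'Ll': ((97, 123),), 'Nd': ((48, 58),)}))
# {'L': ((65, 91), (97, 123)), 'N': ((48, 58),)}
```

The same module can spot-check remarks from the table, for example that the space is `Zs` while TAB is `Cc`.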
Raw data
{
"_id": null,
"home_page": "https://gitlab.com/ergoithz/unicategories",
"name": "unicategories",
"maintainer": "",
"docs_url": null,
"requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
"maintainer_email": "",
"keywords": "unicode",
"author": "Felipe A. Hernandez",
"author_email": "ergoithz@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/91/0a/3fb12e60f0a0cfe659e1527aafd1aef99d80dbe5c8f2efa7b33e4a41f943/unicategories-0.1.2.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "MIT",
"summary": "Unicode category database",
"version": "0.1.2",
"split_keywords": [
"unicode"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "910a3fb12e60f0a0cfe659e1527aafd1aef99d80dbe5c8f2efa7b33e4a41f943",
"md5": "c0749f5458daa4518ea7654b6a250241",
"sha256": "8e005e80ed156da58eb584c26e6f4b073b15a86963b4afa4cc37045e926a9591"
},
"downloads": -1,
"filename": "unicategories-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "c0749f5458daa4518ea7654b6a250241",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
"size": 12397,
"upload_time": "2023-04-02T13:27:47",
"upload_time_iso_8601": "2023-04-02T13:27:47.794647Z",
"url": "https://files.pythonhosted.org/packages/91/0a/3fb12e60f0a0cfe659e1527aafd1aef99d80dbe5c8f2efa7b33e4a41f943/unicategories-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-02 13:27:47",
"github": false,
"gitlab": true,
"bitbucket": false,
"gitlab_user": "ergoithz",
"gitlab_project": "unicategories",
"lcname": "unicategories"
}