# py-ministring
Experimental compact UTF-8 string type for CPython as a C-extension.
## Description
py-ministring implements a new string-like type `Utf8String` with efficient Unicode indexing and slicing. This prototype is designed to reduce memory footprint when working with texts containing predominantly ASCII characters with occasional multi-byte characters (like emojis).
## Why py-ministring?
- **Compact Storage**: Stores original UTF-8 bytes instead of wide characters
- **O(1) Indexing**: Uses offset table for fast character access
- **Hash Caching**: Speeds up comparison operations and dictionary usage
- **Protocol Compatibility**: Implements core Python string protocols (indexing, slicing, equality, hashing)
## Installation
```bash
git clone https://github.com/AI-Stratov/py-ministring
cd py-ministring
python setup.py build_ext --inplace
```
## Usage
```python
from ministring import ministr
# Create a string
s = ministr("hello 😃 world")
# Length in codepoints
print(len(s)) # 13
# Indexing
print(s[6]) # "😃"
print(s[0]) # "h"
print(s[-1]) # "d"
# Slicing
print(str(s[0:5])) # "hello"
print(str(s[6:7])) # "😃"
print(str(s[8:])) # "world"
# Convert to regular string
print(str(s)) # "hello 😃 world"
# Comparison
assert s == "hello 😃 world"
assert "hello 😃 world" == s
# Hashing (can use in dict/set)
d = {s: "value"}
s2 = ministr("hello 😃 world")
print(d[s2]) # "value"
```
## API
### Constructor
- `ministr(obj)` - creates a new Utf8String object from a string or str()-convertible object
### Methods
- `len(s)` - returns the number of Unicode codepoints
- `s[i]` - returns character at index as a regular Python string
- `s[start:stop]` - returns a new Utf8String with slice
- `str(s)` - converts to regular Python string
- `repr(s)` - string representation for debugging
- `hash(s)` - hash value (cached)
- `s == other` - comparison with other Utf8String or regular strings
## Data Structure
```c
typedef struct {
PyObject_HEAD
char *utf8_data; // UTF-8 bytes
Py_ssize_t utf8_size; // size in bytes
int32_t *offsets; // offset table: codepoint → byte
Py_ssize_t length; // number of codepoints
Py_hash_t hash; // cached hash
} Utf8StringObject;
```
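To get a feel for the trade-off these fields imply, the sketch below estimates payload sizes for a given text. It is plain Python and does not touch the extension. The 4-byte entry size comes from the `int32_t` field above; the helper name `estimate_footprint` and the assumption of exactly one table entry per codepoint are illustrative only.

```python
import sys

OFFSET_ENTRY_BYTES = 4  # sizeof(int32_t), per the struct above


def estimate_footprint(text: str) -> dict:
    """Rough payload sizes in bytes; ignores object headers of the C struct."""
    utf8_bytes = len(text.encode("utf-8"))
    # Assumption: one offset entry per codepoint once the table exists.
    offset_table_bytes = OFFSET_ENTRY_BYTES * len(text)
    return {
        "utf8_bytes": utf8_bytes,
        "offset_table_bytes": offset_table_bytes,
        "str_bytes": sys.getsizeof(text),  # CPython's own str, header included
    }


sizes = estimate_footprint("a" * 1000 + "😃")
print(sizes)
```

For mostly-ASCII text with a single emoji, the UTF-8 payload is roughly a quarter of what CPython's `str` uses, because one character outside the Basic Multilingual Plane makes `str` store four bytes per character. Under the assumptions above, a fully built offset table adds four bytes per codepoint, so the saving would mainly apply to strings whose table has not been built.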
## Testing
Run tests with pytest:
```bash
pip install pytest
pytest -v
```
## Limitations
⚠️ **WARNING**: This is an experimental prototype, not intended for production use!
- Missing support for many string methods (`find`, `replace`, etc.)
- May be slower than regular strings for some operations
- No support for step slicing (`s[::2]`)
- Limited handling of invalid UTF-8
- No optimizations for very long strings
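Until step slicing is supported, one possible workaround is to convert to a regular `str` first, at the cost of a temporary copy. The helper below is hypothetical (`step_slice` is not part of py-ministring) and relies only on `str(s)`, which the API section documents. The demo uses a plain `str` so that it runs without the extension installed.

```python
def step_slice(s, start=None, stop=None, step=1):
    """Slice with a step by going through a regular Python str."""
    return str(s)[start:stop:step]


print(step_slice("hello world", step=2))   # hlowrd
print(step_slice("hello world", step=-1))  # dlrow olleh
```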
## Technical Details
### C API
Core functions for working with Utf8String:
- `Utf8String_FromUTF8(data, size)` - create from UTF-8 data
- `utf8_codepoint_count(data, size)` - count codepoints
- `build_offset_table(self)` - build offset table
- `utf8_char_length(first_byte)` - determine UTF-8 character length
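The two low-level helpers can be illustrated in pure Python. This is a sketch of the standard UTF-8 lead-byte rules, not the extension's C code: the Python names mirror the C ones for readability, raising `ValueError` is an assumption, and the sketch does not reject overlong or out-of-range sequences.

```python
def utf8_char_length(first_byte: int) -> int:
    """Return the byte length of a UTF-8 sequence from its lead byte."""
    if first_byte < 0x80:          # 0xxxxxxx: ASCII
        return 1
    if 0xC0 <= first_byte < 0xE0:  # 110xxxxx
        return 2
    if 0xE0 <= first_byte < 0xF0:  # 1110xxxx
        return 3
    if 0xF0 <= first_byte < 0xF8:  # 11110xxx
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {first_byte:#04x}")


def utf8_codepoint_count(data: bytes) -> int:
    """Count codepoints: every byte that is not a continuation byte (10xxxxxx) starts one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


data = "hello 😃 world".encode("utf-8")
print(utf8_char_length(data[6]))   # 4
print(utf8_codepoint_count(data))  # 13
```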
### Architecture
1. **Data Storage**: Original UTF-8 bytes are preserved unchanged
2. **Indexing**: Offset table built on-demand for O(1) access
3. **Caching**: Hash values cached for faster comparisons
4. **Compatibility**: Implements the Python sequence and mapping protocols for length, indexing, and slicing (step slicing is not supported)
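The storage and indexing points above can be modelled in a few lines of pure Python. This is a sketch under assumptions, not the extension's implementation: the class name `Utf8Model`, the `_build_offsets` method, and the trailing sentinel entry are all hypothetical.

```python
class Utf8Model:
    """Toy model: UTF-8 bytes plus a lazily built codepoint -> byte offset table."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")  # bytes are kept unchanged
        self._offsets = None               # table is built on first use

    def _build_offsets(self):
        # Every byte that is not a continuation byte (10xxxxxx) starts a codepoint.
        starts = [i for i, b in enumerate(self._data) if (b & 0xC0) != 0x80]
        self._offsets = starts + [len(self._data)]  # sentinel marks the end

    def __len__(self):
        if self._offsets is None:
            self._build_offsets()
        return len(self._offsets) - 1

    def __getitem__(self, index: int) -> str:
        n = len(self)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError("index out of range")
        # Two table lookups give the byte range, independent of string length.
        start, end = self._offsets[index], self._offsets[index + 1]
        return self._data[start:end].decode("utf-8")


m = Utf8Model("Hello 世界 🌍 Мир")
print(len(m), m[6], m[9])  # 14 世 🌍
```

The sentinel entry lets the end of the last character be read the same way as any other. Whether the C extension stores such an entry is not stated in this README.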
## Usage Examples
### Working with Emojis
```python
s = ministr("Hello 👋 world 🌍!")
print(f"Length: {len(s)}") # Length: 16
print(f"Emojis: {s[6]}, {s[14]}") # Emojis: 👋, 🌍
```
### Multi-language Text Processing
```python
s = ministr("Hello 世界 🌍 Мир")
print(f"English: {str(s[0:5])}") # Hello
print(f"Chinese: {str(s[6:8])}") # 世界
print(f"Emoji: {s[9]}") # 🌍
print(f"Russian: {str(s[11:14])}") # Мир
```
### Performance
```python
# Creating many strings with emojis
texts = [ministr(f"Text {i} 😀") for i in range(1000)]
text_set = set(texts) # Fast thanks to cached hash
```
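The effect of a cached hash can be illustrated with a small pure-Python stand-in. The class `CachedHashText` and its counter are hypothetical and exist only to show that the hash is computed once per object, however many set or dict operations follow. Reusing the hash of the equivalent `str` is also an assumption made for the demo.

```python
class CachedHashText:
    """Stand-in for a string type that computes its hash once and reuses it."""

    def __init__(self, text: str):
        self._text = text
        self._hash = None
        self.hash_computations = 0  # instrumentation for the demo only

    def __hash__(self):
        if self._hash is None:
            self.hash_computations += 1
            self._hash = hash(self._text)
        return self._hash

    def __eq__(self, other):
        if isinstance(other, CachedHashText):
            return self._text == other._text
        if isinstance(other, str):
            return self._text == other
        return NotImplemented


t = CachedHashText("Text 1 😀")
seen = {t}
lookup = {t: "value"}
print(t in seen, lookup[t], t.hash_computations)  # True value 1
```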
## License
MIT License. Experimental code for educational purposes.
Raw data
{
"_id": null,
"home_page": null,
"name": "py-ministring",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "string, utf8, unicode, c-extension, performance",
"author": null,
"author_email": "AI-Stratov <workistratov@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/ae/03/21da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70/py_ministring-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Experimental compact UTF-8 string type for CPython",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/AI-Stratov/py-ministring",
"Issues": "https://github.com/AI-Stratov/py-ministring/issues",
"Repository": "https://github.com/AI-Stratov/py-ministring.git"
},
"split_keywords": [
"string",
" utf8",
" unicode",
" c-extension",
" performance"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "5ea43d0bcfea66f16358a5852e2ead590c3bfb4d472dda275035c2dee358663a",
"md5": "391800eb532cab151cc39264cd0ef15b",
"sha256": "2dbcf958c875227011f686caef0b57102a8c8fb7b08dfbea155b563895e2f916"
},
"downloads": -1,
"filename": "py_ministring-0.1.1-cp313-cp313-win_amd64.whl",
"has_sig": false,
"md5_digest": "391800eb532cab151cc39264cd0ef15b",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.8",
"size": 10797,
"upload_time": "2025-08-26T19:04:05",
"upload_time_iso_8601": "2025-08-26T19:04:05.057936Z",
"url": "https://files.pythonhosted.org/packages/5e/a4/3d0bcfea66f16358a5852e2ead590c3bfb4d472dda275035c2dee358663a/py_ministring-0.1.1-cp313-cp313-win_amd64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ae0321da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70",
"md5": "be5312e79aacbc85413eff6a65f4f38c",
"sha256": "6ea9dc2fb7ca1c05bdd480b30459332dc524cd6c6a45e7a36b4e17e94b41c81b"
},
"downloads": -1,
"filename": "py_ministring-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "be5312e79aacbc85413eff6a65f4f38c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 12321,
"upload_time": "2025-08-26T19:04:05",
"upload_time_iso_8601": "2025-08-26T19:04:05.886305Z",
"url": "https://files.pythonhosted.org/packages/ae/03/21da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70/py_ministring-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-26 19:04:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "AI-Stratov",
"github_project": "py-ministring",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "py-ministring"
}