# py-ministring

- Name: py-ministring
- Version: 0.1.1
- Summary: Experimental compact UTF-8 string type for CPython
- Upload time: 2025-08-26 19:04:05
- Requires Python: >=3.8
- License: MIT
- Keywords: string, utf8, unicode, c-extension, performance

Experimental compact UTF-8 string type for CPython as a C-extension.

## Description

py-ministring implements a string-like type, `Utf8String` (constructed in Python via `ministr`), with efficient Unicode indexing and slicing. The prototype aims to reduce memory footprint for text that is predominantly ASCII with occasional multi-byte characters, such as emoji.

## Why py-ministring?

- **Compact Storage**: Stores original UTF-8 bytes instead of wide characters
- **O(1) Indexing**: Uses an offset table for fast character access
- **Hash Caching**: Speeds up comparison operations and dictionary usage
- **Protocol Compatibility**: Implements core Python string protocols (indexing, slicing, equality, hashing)
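
To put rough numbers on the storage claim, the sketch below compares the size of a built-in CPython `str` with the size of the same text encoded as UTF-8. It uses only the standard library. `utf8_footprint` is a hypothetical helper, not part of py-ministring, and the 4-bytes-per-codepoint figure for the offset table is an assumption derived from the `int32_t *offsets` field shown under Data Structure.

```python
import sys


def utf8_footprint(text: str) -> dict:
    """Compare the size of a CPython str with its UTF-8 encoding."""
    codepoints = len(text)
    return {
        "codepoints": codepoints,
        # Bytes needed to store the text as raw UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Total size of the built-in str object, header included (CPython).
        "str_bytes": sys.getsizeof(text),
        # Assumed cost of an int32 offset table: 4 bytes per codepoint.
        "offset_table_bytes": 4 * codepoints,
    }


# Mostly ASCII text with a single emoji at the end.
report = utf8_footprint("hello " * 100 + "\U0001F603")
print(report)
```

On CPython a single character outside the Basic Multilingual Plane makes the built-in `str` use 4 bytes for every character, so the UTF-8 payload here is much smaller. Under the assumption above, an offset table adds roughly 4 bytes per codepoint once it exists, which offsets part of that saving.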

## Installation

```bash
git clone https://github.com/AI-Stratov/py-ministring
cd py-ministring
python setup.py build_ext --inplace
```

## Usage

```python
from ministring import ministr

# Create a string
s = ministr("hello 😃 world")

# Length in codepoints
print(len(s))      # 13

# Indexing
print(s[6])        # "😃"
print(s[0])        # "h"
print(s[-1])       # "d"

# Slicing
print(str(s[0:5]))    # "hello"
print(str(s[6:7]))    # "😃"
print(str(s[8:]))     # "world"

# Convert to regular string
print(str(s))      # "hello 😃 world"

# Comparison
assert s == "hello 😃 world"
assert "hello 😃 world" == s

# Hashing (can use in dict/set)
d = {s: "value"}
s2 = ministr("hello 😃 world")
print(d[s2])       # "value"
```

## API

### Constructor

- `ministr(obj)` - creates a new Utf8String object from a string or str()-convertible object

### Methods

- `len(s)` - returns the number of Unicode codepoints
- `s[i]` - returns character at index as a regular Python string
- `s[start:stop]` - returns a new Utf8String with slice
- `str(s)` - converts to regular Python string
- `repr(s)` - string representation for debugging
- `hash(s)` - hash value (cached)
- `s == other` - comparison with other Utf8String or regular strings

## Data Structure

```c
typedef struct {
    PyObject_HEAD
    char *utf8_data;        // UTF-8 bytes
    Py_ssize_t utf8_size;   // size in bytes
    int32_t *offsets;       // offset table: codepoint → byte
    Py_ssize_t length;      // number of codepoints
    Py_hash_t hash;         // cached hash
} Utf8StringObject;
```
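
The relationship between `utf8_data`, `offsets`, and `length` can be modeled in pure Python. This is only an illustration of the idea, not the extension's C code: `build_offsets` and `char_at` are hypothetical names, and how the real table marks the end of the last character is not stated in this README.

```python
from typing import List


def build_offsets(data: bytes) -> List[int]:
    """Return the byte offset at which each codepoint starts."""
    # Every byte that is not a continuation byte (0b10xxxxxx) starts a codepoint.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def char_at(data: bytes, offsets: List[int], index: int) -> str:
    """Decode the codepoint at `index` with two table lookups."""
    if index < 0:
        index += len(offsets)
    if not 0 <= index < len(offsets):
        raise IndexError("index out of range")
    start = offsets[index]
    # The character ends where the next one starts, or at the end of the data.
    end = offsets[index + 1] if index + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


data = "hello \U0001F603 world".encode("utf-8")
offsets = build_offsets(data)
print(len(data))                   # 16 bytes
print(len(offsets))                # 13 codepoints
print(offsets[6], offsets[7])      # 6 10 (the emoji occupies 4 bytes)
print(char_at(data, offsets, -1))  # d
```

Building the table is a single O(n) pass over the bytes; after that, each lookup costs the same regardless of where the character sits in the string.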

## Testing

Run tests with pytest:

```bash
pip install pytest
pytest -v
```

## Limitations

⚠️ **WARNING**: This is an experimental prototype, not intended for production use!

- Missing support for many string methods (`find`, `replace`, etc.)
- May be slower than regular strings for some operations
- No support for step slicing (`s[::2]`)
- Limited handling of invalid UTF-8
- No optimizations for very long strings
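
Until these gaps are filled, one possible workaround is to fall back to the built-in `str` for unsupported operations. The helpers below are hypothetical and rely only on the documented `str(s)` conversion, so they accept any object that converts to a string. Each call makes a full copy of the text.

```python
def step_slice(s, start=None, stop=None, step=1) -> str:
    """Slice with a step by converting to a built-in str first."""
    return str(s)[start:stop:step]


def find_in(s, needle: str) -> int:
    """Return the codepoint index of `needle`, or -1 if it is absent."""
    return str(s).find(needle)


# Demonstrated with plain strings; a ministr object would be passed the same way.
print(step_slice("hello", step=2))                    # hlo
print(find_in("hello \U0001F603 world", "world"))     # 8
```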

## Technical Details

### C API

Core functions for working with Utf8String:

- `Utf8String_FromUTF8(data, size)` - create from UTF-8 data
- `utf8_codepoint_count(data, size)` - count codepoints
- `build_offset_table(self)` - build offset table
- `utf8_char_length(first_byte)` - determine UTF-8 character length
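
For readers unfamiliar with UTF-8 framing, here is a minimal Python sketch of what functions with these names typically do. It borrows the names `utf8_char_length` and `utf8_codepoint_count` for readability, but the bodies are assumptions about the general technique, not a transcription of the extension's C source. It also does not check continuation bytes or overlong encodings.

```python
def utf8_char_length(first_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte < 0x80:              # 0xxxxxxx: ASCII
        return 1
    if (first_byte & 0xE0) == 0xC0:    # 110xxxxx: 2-byte sequence
        return 2
    if (first_byte & 0xF0) == 0xE0:    # 1110xxxx: 3-byte sequence
        return 3
    if (first_byte & 0xF8) == 0xF0:    # 11110xxx: 4-byte sequence
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {first_byte:#04x}")


def utf8_codepoint_count(data: bytes) -> int:
    """Count codepoints by hopping from lead byte to lead byte."""
    count = 0
    pos = 0
    while pos < len(data):
        step = utf8_char_length(data[pos])
        if pos + step > len(data):
            raise ValueError("truncated UTF-8 sequence")
        pos += step
        count += 1
    return count


sample = "Hello \u4e16\u754c \U0001F30D \u041c\u0438\u0440".encode("utf-8")
print(utf8_codepoint_count(sample))   # 14
```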

### Architecture

1. **Data Storage**: Original UTF-8 bytes are preserved unchanged
2. **Indexing**: Offset table built on-demand for O(1) access
3. **Caching**: Hash values cached for faster comparisons
4. **Compatibility**: Implements the Python sequence and mapping protocols for indexing and slicing (see Limitations for what is not supported)
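
The on-demand table and the cached hash can be sketched as a small Python class. `LazyUtf8` is a toy model invented for this illustration; in particular, hashing like the equivalent `str` is an assumption made here so that both kinds of key can find each other in a dictionary, and the README does not say how the extension computes its hash.

```python
from typing import List, Optional


class LazyUtf8:
    """Toy model: offsets are built on first use, the hash is cached."""

    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-8")           # bytes kept unchanged
        self._offsets: Optional[List[int]] = None   # built on demand
        self._hash: Optional[int] = None            # filled on first hash()

    def _ensure_offsets(self) -> List[int]:
        if self._offsets is None:
            # Record the position of every non-continuation byte.
            self._offsets = [i for i, b in enumerate(self._data)
                             if (b & 0xC0) != 0x80]
        return self._offsets

    def __len__(self) -> int:
        return len(self._ensure_offsets())

    def __str__(self) -> str:
        return self._data.decode("utf-8")

    def __eq__(self, other: object):
        if isinstance(other, LazyUtf8):
            return self._data == other._data
        if isinstance(other, str):
            return str(self) == other
        return NotImplemented

    def __hash__(self) -> int:
        if self._hash is None:
            # Assumption: hash like the equivalent str.
            self._hash = hash(str(self))
        return self._hash


s = LazyUtf8("hello \U0001F603 world")
print(s._offsets is None)                 # True: nothing built yet
print(len(s))                             # 13, which builds the table
d = {s: "value"}
print(d[LazyUtf8("hello \U0001F603 world")])   # value
print(d["hello \U0001F603 world"])             # value
```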

## Usage Examples

### Working with Emojis

```python
from ministring import ministr

s = ministr("Hello 👋 world 🌍!")
print(f"Length: {len(s)}")           # Length: 16
print(f"Emojis: {s[6]}, {s[14]}")    # Emojis: 👋, 🌍
```

### Multi-language Text Processing

```python
from ministring import ministr

s = ministr("Hello 世界 🌍 Мир")
print(f"English: {str(s[0:5])}")     # Hello
print(f"Chinese: {str(s[6:8])}")     # 世界
print(f"Emoji: {s[9]}")              # 🌍
print(f"Russian: {str(s[11:14])}")   # Мир
```

### Performance

```python
from ministring import ministr

# Create many strings with emojis
texts = [ministr(f"Text {i} 😀") for i in range(1000)]
text_set = set(texts)  # Each object's hash is computed once, then cached
```

## License

MIT. Experimental code for educational purposes.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "py-ministring",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "string, utf8, unicode, c-extension, performance",
    "author": null,
    "author_email": "AI-Stratov <workistratov@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ae/03/21da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70/py_ministring-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Experimental compact UTF-8 string type for CPython",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/AI-Stratov/py-ministring",
        "Issues": "https://github.com/AI-Stratov/py-ministring/issues",
        "Repository": "https://github.com/AI-Stratov/py-ministring.git"
    },
    "split_keywords": [
        "string",
        " utf8",
        " unicode",
        " c-extension",
        " performance"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5ea43d0bcfea66f16358a5852e2ead590c3bfb4d472dda275035c2dee358663a",
                "md5": "391800eb532cab151cc39264cd0ef15b",
                "sha256": "2dbcf958c875227011f686caef0b57102a8c8fb7b08dfbea155b563895e2f916"
            },
            "downloads": -1,
            "filename": "py_ministring-0.1.1-cp313-cp313-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "391800eb532cab151cc39264cd0ef15b",
            "packagetype": "bdist_wheel",
            "python_version": "cp313",
            "requires_python": ">=3.8",
            "size": 10797,
            "upload_time": "2025-08-26T19:04:05",
            "upload_time_iso_8601": "2025-08-26T19:04:05.057936Z",
            "url": "https://files.pythonhosted.org/packages/5e/a4/3d0bcfea66f16358a5852e2ead590c3bfb4d472dda275035c2dee358663a/py_ministring-0.1.1-cp313-cp313-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ae0321da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70",
                "md5": "be5312e79aacbc85413eff6a65f4f38c",
                "sha256": "6ea9dc2fb7ca1c05bdd480b30459332dc524cd6c6a45e7a36b4e17e94b41c81b"
            },
            "downloads": -1,
            "filename": "py_ministring-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "be5312e79aacbc85413eff6a65f4f38c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 12321,
            "upload_time": "2025-08-26T19:04:05",
            "upload_time_iso_8601": "2025-08-26T19:04:05.886305Z",
            "url": "https://files.pythonhosted.org/packages/ae/03/21da5c48f58cf036996380c603b53183514b352f4897b4b8723d89305a70/py_ministring-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-26 19:04:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "AI-Stratov",
    "github_project": "py-ministring",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "py-ministring"
}
        