ultra-tokenizer

Name: ultra-tokenizer
Version: 0.1.3
Home page: https://github.com/pranav271103/Ultra-Tokenizer.git
Summary: Advanced tokenizer with support for BPE, WordPiece, and Unigram algorithms
Upload time: 2025-08-16 09:58:51
Maintainer: None
Docs URL: None
Author: Pranav Singh
Requires Python: >=3.8
License: Apache-2.0
Keywords: tokenizer, nlp, bpe, wordpiece, unigram
Requirements: numpy, tqdm, pytest, pytest-cov, black, isort, mypy, sentencepiece, tokenizers, transformers, torch, pandas, scikit-learn, matplotlib, seaborn
Travis-CI: none
Coveralls test coverage: none
# Ultra-Tokenizer

[![PyPI version](https://img.shields.io/pypi/v/ultra-tokenizer.svg)](https://pypi.org/project/ultra-tokenizer/)
[![Python Version](https://img.shields.io/pypi/pyversions/ultra-tokenizer.svg)](https://pypi.org/project/ultra-tokenizer/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Documentation Status](https://readthedocs.org/projects/ultra-tokenizer/badge/?version=latest)](https://pranav271103.github.io/Ultra-Tokenizer/)

## Features

- **Multiple Tokenization Algorithms**: Supports BPE, WordPiece, and Unigram (selecting one is sketched just after this list)
- **High Performance**: Optimized for speed and memory efficiency
- **Easy Integration**: Simple API for training and using tokenizers
- **Production Ready**: Battle-tested with comprehensive test coverage
- **Type Hints**: Full Python type support for better development experience
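
The README itself does not show how the algorithm is chosen. The sketch below assumes a hypothetical `algorithm` keyword on `TokenizerTrainer` purely for illustration; check the project documentation for the real parameter name and accepted values.

```python
from ultra_tokenizer import TokenizerTrainer

# NOTE: the `algorithm` keyword is an assumption made for this sketch;
# the actual way to select BPE / WordPiece / Unigram may differ.
trainer = TokenizerTrainer(
    vocab_size=30000,
    min_frequency=2,
    algorithm="wordpiece",  # assumed values: "bpe", "wordpiece", "unigram"
)
```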

## Installation

Install the latest stable version from PyPI:

```bash
pip install ultra-tokenizer
```

## Quick Start

### Basic Usage

```python
from ultra_tokenizer import Tokenizer

# Initialize tokenizer with default settings
tokenizer = Tokenizer()

# Tokenize text
text = "Hello, world! This is Ultra-Tokenizer in action."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['Hello', ',', 'world', '!', 'This', 'is', 'Ultra', '-', 'Token', '##izer', 'in', 'action', '.']
```
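
The `##` prefix in the sample output marks WordPiece-style continuation pieces. A minimal helper for merging them back into surface words (plain Python, not part of the library API) could look like this:

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed continuation pieces onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

print(merge_wordpieces(["Ultra", "-", "Token", "##izer"]))
# ['Ultra', '-', 'Tokenizer']
```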

### Training a New Tokenizer

```python
from ultra_tokenizer import TokenizerTrainer

# Initialize trainer
trainer = TokenizerTrainer(
    vocab_size=30000,
    min_frequency=2,
    show_progress=True
)

# Train on text files
tokenizer = trainer.train_from_files(["file1.txt", "file2.txt"])

# Save tokenizer
tokenizer.save("my_tokenizer.json")

# Load tokenizer
from ultra_tokenizer import Tokenizer
tokenizer = Tokenizer.from_file("my_tokenizer.json")
```
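
`train_from_files` takes file paths. If your corpus only exists in memory, one workaround (a sketch using the standard library, not a built-in feature) is to write it to a temporary file first:

```python
import tempfile

from ultra_tokenizer import TokenizerTrainer

corpus = ["First training sentence.", "Second training sentence."]

# Spill the in-memory corpus to a temporary file so train_from_files can read it.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(corpus))
    corpus_path = tmp.name

trainer = TokenizerTrainer(vocab_size=30000, min_frequency=2)
tokenizer = trainer.train_from_files([corpus_path])
```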

## Advanced Usage

### Custom Tokenization Rules

```python
from ultra_tokenizer import Tokenizer
import re

# Custom tokenization with regex pattern
custom_tokenizer = Tokenizer(
    tokenization_pattern=r"\b\w+\b|\S"  # Words or non-whitespace characters
)
```
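
To see what that pattern yields on its own, independent of the tokenizer, you can run it through `re.findall` directly:

```python
import re

pattern = r"\b\w+\b|\S"
print(re.findall(pattern, "Hello, world! It's 2025."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2025', '.']
```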

### Batch Processing

```python
# Process multiple texts efficiently
texts = ["First sentence.", "Second sentence.", "Third sentence."]
all_tokens = [tokenizer.tokenize(text) for text in texts]
```
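
For large batches you can fan the same call out over a thread pool with the standard library. This is a generic sketch rather than a built-in batch API, and the speed-up depends on whether the underlying tokenizer releases the GIL, so benchmark it on your data:

```python
from concurrent.futures import ThreadPoolExecutor

texts = ["First sentence.", "Second sentence.", "Third sentence."]

# Thread-based fan-out over tokenizer.tokenize; measure before relying on it.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_tokens = list(pool.map(tokenizer.tokenize, texts))
```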

## Customization

### Special Tokens

```python
from ultra_tokenizer import Tokenizer

# Initialize with custom special tokens
tokenizer = Tokenizer(
    special_tokens={
        "unk_token": "[UNK]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "sep_token": "[SEP]"
    }
)
```
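
How the tokenizer injects these tokens internally is not shown in the README; a manual, framework-agnostic sketch that builds a BERT-style input from the configured tokens would be:

```python
MAX_LEN = 16

# Wrap the tokenized text with the configured [CLS]/[SEP] tokens,
# then truncate and pad to a fixed length with [PAD].
tokens = ["[CLS]"] + tokenizer.tokenize("Hello, world!") + ["[SEP]"]
tokens = tokens[:MAX_LEN]
tokens += ["[PAD]"] * (MAX_LEN - len(tokens))
print(tokens)
```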

### Custom Preprocessing

```python
import re

def custom_preprocessor(text):
    # Your custom preprocessing logic here
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

tokenizer = Tokenizer(preprocessing_fn=custom_preprocessor)
```
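
As a quick sanity check (plain Python, no library calls involved), the preprocessor above lowercases the text and collapses whitespace like so:

```python
print(custom_preprocessor("  Hello,   WORLD!\nNew   line.  "))
# hello, world! new line.
```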

## Fine-tuning

### Updating an Existing Tokenizer

```python
# Continue training on new data
trainer = TokenizerTrainer(vocab_size=35000)  # Slightly larger vocab than before
tokenizer = trainer.train_from_files(
    ["new_data.txt"],
    initial_tokenizer=tokenizer  # Start from the existing tokenizer
)
```

### Domain-Specific Fine-tuning

```python
# Fine-tune on domain-specific data
domain_trainer = TokenizerTrainer(
    vocab_size=32000,
    min_frequency=1,  # Include rare terms
    special_tokens={"additional_special_tokens": ["[MED]", "[DISEASE]", "[TREATMENT]"]}
)

domain_tokenizer = domain_trainer.train_from_files(
    ["medical_corpus.txt"],
    initial_tokenizer=tokenizer  # Start from base tokenizer
)
```

### Performance Optimization

```python
# Optimize for inference
tokenizer.enable_caching()  # Cache tokenization results
fast_tokens = tokenizer.tokenize("Optimized for speed!")
```
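
Whether caching pays off depends on how often the same strings recur in your workload. A rough timing harness using only the calls shown above (a sketch; the numbers will vary) is:

```python
import time

def time_tokenize(tok, texts, repeats=5):
    """Rough wall-clock timing of repeatedly tokenizing the same texts."""
    start = time.perf_counter()
    for _ in range(repeats):
        for text in texts:
            tok.tokenize(text)
    return time.perf_counter() - start

texts = ["Optimized for speed!"] * 1000
uncached = time_tokenize(tokenizer, texts)
tokenizer.enable_caching()
cached = time_tokenize(tokenizer, texts)
print(f"without cache: {uncached:.3f}s  with cache: {cached:.3f}s")
```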

## Documentation

For detailed documentation, examples, and API reference, please visit:

[Ultra-Tokenizer Documentation](https://pranav271103.github.io/Ultra-Tokenizer/)

## Contributing

Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) to get started.

## License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.

## Contact

For questions or feedback, please [open an issue](https://github.com/pranav271103/Ultra-Tokenizer/issues) or contact [pranav.singh01010101@gmail.com](mailto:pranav.singh01010101@gmail.com).


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/pranav271103/Ultra-Tokenizer.git",
    "name": "ultra-tokenizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "tokenizer, nlp, bpe, wordpiece, unigram",
    "author": "Pranav Singh",
    "author_email": "Pranav Singh <pranav.singh01010101@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f6/6b/d3391cd6509696482a9fbb2023cad8717c9949918340f1576ded92fec734/ultra_tokenizer-0.1.3.tar.gz",
    "platform": null,
    "description": "# Ultra-Tokenizer\r\n\r\n[![PyPI version](https://img.shields.io/pypi/v/ultra-tokenizer.svg)](https://pypi.org/project/ultra-tokenizer/)\r\n[![Python Version](https://img.shields.io/pypi/pyversions/ultra-tokenizer.svg)](https://pypi.org/project/ultra-tokenizer/)\r\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\r\n[![Documentation Status](https://readthedocs.org/projects/ultra-tokenizer/badge/?version=latest)](https://pranav271103.github.io/Ultra-Tokenizer/)\r\n\r\n## Features\r\n\r\n- **Multiple Tokenization Algorithms**: Supports BPE, WordPiece, and Unigram algorithms\r\n- **High Performance**: Optimized for speed and memory efficiency\r\n- **Easy Integration**: Simple API for training and using tokenizers\r\n- **Production Ready**: Battle-tested with comprehensive test coverage\r\n- **Type Hints**: Full Python type support for better development experience\r\n\r\n## Installation\r\n\r\nInstall the latest stable version from PyPI:\r\n\r\n```bash\r\npip install ultra-tokenizer\r\n```\r\n\r\n## Quick Start\r\n\r\n### Basic Usage\r\n\r\n```python\r\nfrom ultra_tokenizer import Tokenizer\r\n\r\n# Initialize tokenizer with default settings\r\ntokenizer = Tokenizer()\r\n\r\n# Tokenize text\r\ntext = \"Hello, world! This is Ultra-Tokenizer in action.\"\r\ntokens = tokenizer.tokenize(text)\r\nprint(tokens)\r\n# Output: ['Hello', ',', 'world', '!', 'This', 'is', 'Ultra', '-', 'Token', '##izer', 'in', 'action', '.']\r\n```\r\n\r\n### Training a New Tokenizer\r\n\r\n```python\r\nfrom ultra_tokenizer import TokenizerTrainer\r\n\r\n# Initialize trainer\r\ntrainer = TokenizerTrainer(\r\n    vocab_size=30000,\r\n    min_frequency=2,\r\n    show_progress=True\r\n)\r\n\r\n# Train on text files\r\ntokenizer = trainer.train_from_files([\"file1.txt\", \"file2.txt\"])\r\n\r\n# Save tokenizer\r\ntokenizer.save(\"my_tokenizer.json\")\r\n\r\n# Load tokenizer\r\nfrom ultra_tokenizer import Tokenizer\r\ntokenizer = Tokenizer.from_file(\"my_tokenizer.json\")\r\n```\r\n\r\n## Advanced Usage\r\n\r\n### Custom Tokenization Rules\r\n\r\n```python\r\nfrom ultra_tokenizer import Tokenizer\r\nimport re\r\n\r\n# Custom tokenization with regex pattern\r\ncustom_tokenizer = Tokenizer(\r\n    tokenization_pattern=r\"\\b\\w+\\b|\\S\"  # Words or non-whitespace characters\r\n)\r\n```\r\n\r\n### Batch Processing\r\n\r\n```python\r\n# Process multiple texts efficiently\r\ntexts = [\"First sentence.\", \"Second sentence.\", \"Third sentence.\"]\r\nall_tokens = [tokenizer.tokenize(text) for text in texts]\r\n```\r\n\r\n## Customization\r\n\r\n### Special Tokens\r\n\r\n```python\r\nfrom ultra_tokenizer import Tokenizer\r\n\r\n# Initialize with custom special tokens\r\ntokenizer = Tokenizer(\r\n    special_tokens={\r\n        \"unk_token\": \"[UNK]\",\r\n        \"pad_token\": \"[PAD]\",\r\n        \"cls_token\": \"[CLS]\",\r\n        \"sep_token\": \"[SEP]\"\r\n    }\r\n)\r\n```\r\n\r\n### Custom Preprocessing\r\n\r\n```python\r\ndef custom_preprocessor(text):\r\n    # Your custom preprocessing logic here\r\n    text = text.lower()\r\n    text = re.sub(r'\\s+', ' ', text).strip()\r\n    return text\r\n\r\ntokenizer = Tokenizer(preprocessing_fn=custom_preprocessor)\r\n```\r\n\r\n## Fine-tuning\r\n\r\n### Updating an Existing Tokenizer\r\n\r\n```python\r\n# Continue training on new data\r\nwith open(\"new_data.txt\", \"r\", encoding=\"utf-8\") as f:\r\n    trainer = TokenizerTrainer(vocab_size=35000)  # Slightly larger vocab\r\n    
tokenizer = trainer.train_from_files(\r\n        [\"new_data.txt\"],\r\n        initial_tokenizer=tokenizer  # Start from existing tokenizer\r\n    )\r\n```\r\n\r\n### Domain-Specific Fine-tuning\r\n\r\n```python\r\n# Fine-tune on domain-specific data\r\ndomain_trainer = TokenizerTrainer(\r\n    vocab_size=32000,\r\n    min_frequency=1,  # Include rare terms\r\n    special_tokens={\"additional_special_tokens\": [\"[MED]\", \"[DISEASE]\", \"[TREATMENT]\"]}\r\n)\r\n\r\ndomain_tokenizer = domain_trainer.train_from_files(\r\n    [\"medical_corpus.txt\"],\r\n    initial_tokenizer=tokenizer  # Start from base tokenizer\r\n)\r\n```\r\n\r\n### Performance Optimization\r\n\r\n```python\r\n# Optimize for inference\r\ntokenizer.enable_caching()  # Cache tokenization results\r\nfast_tokens = tokenizer.tokenize(\"Optimized for speed!\")\r\n```\r\n\r\n## Documentation\r\n\r\nFor detailed documentation, examples, and API reference, please visit:\r\n\r\n[Ultra-Tokenizer Documentation](https://pranav271103.github.io/Ultra-Tokenizer/)\r\n\r\n## Contributing\r\n\r\nContributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) to get started.\r\n\r\n## License\r\n\r\nThis project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.\r\n\r\n## Contact\r\n\r\nFor questions or feedback, please [open an issue](https://github.com/pranav271103/Ultra-Tokenizer/issues) or contact [pranav.singh01010101@gmail.com](mailto:pranav.singh01010101@gmail.com).\r\n\r\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Advanced tokenizer with support for BPE, WordPiece, and Unigram algorithms",
    "version": "0.1.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/pranav271103/Ultra-Tokenizer/issues",
        "Documentation": "https://github.com/pranav271103/Ultra-Tokenizer.git#readme",
        "Homepage": "https://github.com/pranav271103/Ultra-Tokenizer.git",
        "Source Code": "https://github.com/pranav271103/Ultra-Tokenizer.git"
    },
    "split_keywords": [
        "tokenizer",
        " nlp",
        " bpe",
        " wordpiece",
        " unigram"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "feaf183266aaba6e894f73d007c6fe9b0a4148b3879e796446a3a933a7aa2561",
                "md5": "3a5966b28802e3534810bf1dfadd2e03",
                "sha256": "5bde95e48f2a4632b0d0651e9b020fb845afe55a657e7bc4a7ddcae5dfdc8326"
            },
            "downloads": -1,
            "filename": "ultra_tokenizer-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3a5966b28802e3534810bf1dfadd2e03",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 31779,
            "upload_time": "2025-08-16T09:58:49",
            "upload_time_iso_8601": "2025-08-16T09:58:49.741252Z",
            "url": "https://files.pythonhosted.org/packages/fe/af/183266aaba6e894f73d007c6fe9b0a4148b3879e796446a3a933a7aa2561/ultra_tokenizer-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f66bd3391cd6509696482a9fbb2023cad8717c9949918340f1576ded92fec734",
                "md5": "99dea39a1c1a10838e44ee3c5a54f10f",
                "sha256": "961255ad0e1dadadafd1d9dde21516ff42d2aeb8f9b02b10fb33e007d475e8d5"
            },
            "downloads": -1,
            "filename": "ultra_tokenizer-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "99dea39a1c1a10838e44ee3c5a54f10f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 39129,
            "upload_time": "2025-08-16T09:58:51",
            "upload_time_iso_8601": "2025-08-16T09:58:51.327444Z",
            "url": "https://files.pythonhosted.org/packages/f6/6b/d3391cd6509696482a9fbb2023cad8717c9949918340f1576ded92fec734/ultra_tokenizer-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-16 09:58:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pranav271103",
    "github_project": "Ultra-Tokenizer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.62.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "6.2.5"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "2.12.0"
                ]
            ]
        },
        {
            "name": "black",
            "specs": [
                [
                    ">=",
                    "21.12b0"
                ]
            ]
        },
        {
            "name": "isort",
            "specs": [
                [
                    ">=",
                    "5.10.1"
                ]
            ]
        },
        {
            "name": "mypy",
            "specs": [
                [
                    ">=",
                    "0.910"
                ]
            ]
        },
        {
            "name": "sentencepiece",
            "specs": [
                [
                    ">=",
                    "0.1.96"
                ]
            ]
        },
        {
            "name": "tokenizers",
            "specs": [
                [
                    ">=",
                    "0.12.1"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.12.0"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "1.9.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "0.24.2"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.4.0"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    ">=",
                    "0.11.0"
                ]
            ]
        }
    ],
    "lcname": "ultra-tokenizer"
}
        