# Ultra-Tokenizer
[PyPI](https://pypi.org/project/ultra-tokenizer/) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0) · [Documentation](https://pranav271103.github.io/Ultra-Tokenizer/)
## Features
- **Multiple Tokenization Algorithms**: Supports BPE, WordPiece, and Unigram algorithms
- **High Performance**: Optimized for speed and memory efficiency
- **Easy Integration**: Simple API for training and using tokenizers
- **Production Ready**: Battle-tested with comprehensive test coverage
- **Type Hints**: Full Python type support for better development experience
## Installation
Install the latest stable version from PyPI:
```bash
pip install ultra-tokenizer
```
## Quick Start
### Basic Usage
```python
from ultra_tokenizer import Tokenizer
# Initialize tokenizer with default settings
tokenizer = Tokenizer()
# Tokenize text
text = "Hello, world! This is Ultra-Tokenizer in action."
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['Hello', ',', 'world', '!', 'This', 'is', 'Ultra', '-', 'Token', '##izer', 'in', 'action', '.']
```
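The returned tokens are a plain Python list, so standard tooling works on them directly; for example, counting token frequencies with `collections.Counter`:

```python
from collections import Counter

# Count how often each token appears in a passage
tokens = tokenizer.tokenize("the quick brown fox jumps over the lazy dog and the cat")
print(Counter(tokens).most_common(3))
```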
### Training a New Tokenizer
```python
from ultra_tokenizer import TokenizerTrainer
# Initialize trainer
trainer = TokenizerTrainer(
    vocab_size=30000,
    min_frequency=2,
    show_progress=True
)
# Train on text files
tokenizer = trainer.train_from_files(["file1.txt", "file2.txt"])
# Save tokenizer
tokenizer.save("my_tokenizer.json")
# Load tokenizer
from ultra_tokenizer import Tokenizer
tokenizer = Tokenizer.from_file("my_tokenizer.json")
```
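As a quick sanity check, you can reload the saved file and confirm the round trip preserves behavior. This sketch uses only the `save`, `from_file`, and `tokenize` calls shown above:

```python
from ultra_tokenizer import Tokenizer

# Reload the tokenizer that was just saved and compare its output
# against the in-memory instance on a sample sentence.
reloaded = Tokenizer.from_file("my_tokenizer.json")

sample = "Tokenizers should survive a save/load round trip."
assert reloaded.tokenize(sample) == tokenizer.tokenize(sample)
```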
## Advanced Usage
### Custom Tokenization Rules
```python
from ultra_tokenizer import Tokenizer
import re
# Custom tokenization with regex pattern
custom_tokenizer = Tokenizer(
    tokenization_pattern=r"\b\w+\b|\S"  # Whole words, or any single non-whitespace character
)
```
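To preview what this pattern captures on its own, independent of the tokenizer, you can run it through the standard `re` module:

```python
import re

pattern = r"\b\w+\b|\S"
print(re.findall(pattern, "Hello, world! It's 2024."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```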
### Batch Processing
```python
# Process multiple texts efficiently
texts = ["First sentence.", "Second sentence.", "Third sentence."]
all_tokens = [tokenizer.tokenize(text) for text in texts]
```
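For large batches, the list comprehension above can be parallelized with the standard library. This is a generic sketch, and it assumes `tokenize` is thread-safe, which the documentation does not state explicitly:

```python
from concurrent.futures import ThreadPoolExecutor

texts = ["First sentence.", "Second sentence.", "Third sentence."]

# Fan the texts out across worker threads; results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_tokens = list(pool.map(tokenizer.tokenize, texts))
```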
## Customization
### Special Tokens
```python
from ultra_tokenizer import Tokenizer
# Initialize with custom special tokens
tokenizer = Tokenizer(
    special_tokens={
        "unk_token": "[UNK]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "sep_token": "[SEP]",
    }
)
```
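If your downstream model expects BERT-style inputs, one way to assemble them is to wrap the token list by hand. Whether Ultra-Tokenizer also offers built-in post-processing for this is not shown here, so this sketch does it manually with the tokens configured above:

```python
# Wrap a tokenized sentence in classifier-style markers by hand.
tokens = tokenizer.tokenize("Special tokens frame the sequence.")
sequence = ["[CLS]"] + tokens + ["[SEP]"]
print(sequence)
```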
### Custom Preprocessing
```python
import re

def custom_preprocessor(text):
    # Lowercase and collapse runs of whitespace before tokenization
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

tokenizer = Tokenizer(preprocessing_fn=custom_preprocessor)
```
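Because the preprocessor is plain Python, it is easy to exercise in isolation and confirm it normalizes input the way you expect before wiring it into the tokenizer:

```python
# Quick standalone check of the preprocessing function
print(custom_preprocessor("  Hello   WORLD\t!  "))
# hello world !
```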
## Fine-tuning
### Updating an Existing Tokenizer
```python
# Continue training on new data
trainer = TokenizerTrainer(vocab_size=35000)  # Slightly larger vocab
tokenizer = trainer.train_from_files(
    ["new_data.txt"],
    initial_tokenizer=tokenizer  # Start from the existing tokenizer
)
```
### Domain-Specific Fine-tuning
```python
# Fine-tune on domain-specific data
domain_trainer = TokenizerTrainer(
    vocab_size=32000,
    min_frequency=1,  # Include rare terms
    special_tokens={"additional_special_tokens": ["[MED]", "[DISEASE]", "[TREATMENT]"]}
)

domain_tokenizer = domain_trainer.train_from_files(
    ["medical_corpus.txt"],
    initial_tokenizer=tokenizer  # Start from the base tokenizer
)
```
### Performance Optimization
```python
# Optimize for inference
tokenizer.enable_caching() # Cache tokenization results
fast_tokens = tokenizer.tokenize("Optimized for speed!")
```
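Whether caching helps depends on how often identical strings recur in your workload. A rough way to check on your own data is to time repeated calls; the cache behavior itself is assumed here rather than specified in detail by the docs:

```python
import time

tokenizer.enable_caching()

text = "Optimized for speed!"

start = time.perf_counter()
tokenizer.tokenize(text)          # first call pays the full tokenization cost
first = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1_000):
    tokenizer.tokenize(text)      # repeated identical input should hit the cache
repeated = (time.perf_counter() - start) / 1_000

print(f"first call: {first:.6f}s, repeated average: {repeated:.6f}s")
```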
## Documentation
For detailed documentation, examples, and API reference, please visit:
[Ultra-Tokenizer Documentation](https://pranav271103.github.io/Ultra-Tokenizer/)
## Contributing
Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) to get started.
## License
This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
## Contact
For questions or feedback, please [open an issue](https://github.com/pranav271103/Ultra-Tokenizer/issues) or contact [pranav.singh01010101@gmail.com](mailto:pranav.singh01010101@gmail.com).