# UniTok V4
The documentation for the previous version, v3, is available [here](README_v3.md) (in Chinese).
## Overview
[![PyPI version](https://badge.fury.io/py/unitok.svg)](https://badge.fury.io/py/unitok)
Welcome to UniTok v4!
This library provides a unified preprocessing solution for machine learning datasets, handling diverse data types like text, categorical features, and numerical values.
Please refer to [UniTok Handbook](https://unitok.qijiong.work) for more detailed information.
## Road from V3 to V4
### Changes and Comparisons
| Feature | UniTok v3 | UniTok v4 | Comments |
|---------------------------------|-------------------------------------------------------------|-----------------------------------------------------|-------------------------------------------------------------------------------|
| `UniTok` class | Solely for tokenization | Manages the entire preprocessing lifecycle | |
| `UniDep` class | Data loading and combining | Removed | V4 combines the functionalities of `UniTok` and `UniDep` into a single class. |
| `Column` class | The column name is shared by both the original and the tokenized datasets | N/A | Replaced by the `Job` class in v4. |
| `Job` class | N/A | Defines how a specific column should be tokenized | |
| `Tokenizer` class | Ambiguous return type definition | `return_list` parameter must be of type `bool` | |
| `Tokenizer` class | Only supports `BertTokenizer` for text processing | Supports all Tokenizers in the transformers library | New `TransformersTokenizer` class |
| `analyse` method | Supported | Not supported currently | |
| `Meta` class | Only for human-friendly displaying | Manager for `Job`, `Tokenizer`, and `Vocab` | |
| `unitok` command | Visualization in the terminal | More colorful and detailed output | |
| `Vocab` class (unitok >= 4.1.0) | Save and load vocabulary using text files | Save and load vocabulary using pickle files | Avoids issues with special characters in text files |
### How to Migrate the Processed Data
Data processed with UniTok v3 (the `UniDep` format) can be upgraded in place with:
```bash
unidep-upgrade-v4 <path>
```
## Installation
**Requirements**
- Python 3.7 or later
- Dependencies:
  - pandas
  - transformers
  - tqdm
  - rich
**Install UniTok via pip**
```bash
pip install unitok
```
## Core Concepts
**States**
- `initialized`: The initial state after creating a UniTok instance.
- `tokenized`: Achieved after applying tokenization to the dataset.
- `organized`: Reached after combining multiple datasets via operations like union.
**Components**
- UniTok: Manages the dataset preprocessing lifecycle.
- Job: Defines how a specific column should be tokenized.
- Tokenizer: Encodes data using various methods (e.g., BERT, splitting by delimiters).
- Vocabulary: Stores and manages unique tokens across datasets.
**Primary Key (key_job)**
The `key_job` acts as the primary key for operations like `getitem` and `union`, ensuring consistency across datasets.
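The state transitions and the role of the key job can be seen in a minimal sketch. It reuses only the calls demonstrated in the Usage Guide below; the two-row DataFrame and its column values are purely illustrative.
```python
import pandas as pd
from unitok import UniTok, Vocab
from unitok.tokenizer import EntityTokenizer

# Purely illustrative two-row frame; 'nid' will act as the primary key column.
items = pd.DataFrame({'nid': ['N1', 'N2'], 'category': ['sports', 'finance']})

with UniTok() as ut:  # state: initialized
    ut.add_job(tokenizer=EntityTokenizer(vocab=Vocab(name='nid')), column='nid', key=True)
    ut.add_job(tokenizer=EntityTokenizer(vocab='category'), column='category')

ut.tokenize(items)  # state: tokenized
print(ut[0])  # getitem resolves samples through the key job ('nid')
# A later union() with another UniTok that shares the key vocabulary
# would move the instance to the organized state.
```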
## Usage Guide
### Loading Data
Load datasets using pandas:
```python
import pandas as pd

item = pd.read_csv(
    filepath_or_buffer='news-sample.tsv',
    sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract', 'url', 'title_entities', 'abstract_entities'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('')  # handle missing values

user = pd.read_csv(
    filepath_or_buffer='user-sample.tsv',
    sep='\t',
    names=['uid', 'history'],
)

interaction = pd.read_csv(
    filepath_or_buffer='interaction-sample.tsv',
    sep='\t',
    names=['uid', 'nid', 'click'],
)
```
### Defining and Adding Jobs
Define tokenization jobs for different columns:
```python
from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer
item_vocab = Vocab(name='nid')  # will be used across datasets
user_vocab = Vocab(name='uid')  # will be used across datasets

with UniTok() as item_ut:
    bert_tokenizer = BertTokenizer(vocab='bert')
    llama_tokenizer = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_job(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
    item_ut.add_job(tokenizer=bert_tokenizer, column='title', name='title@bert', truncate=20)
    item_ut.add_job(tokenizer=llama_tokenizer, column='title', name='title@llama', truncate=20)
    item_ut.add_job(tokenizer=bert_tokenizer, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_job(tokenizer=llama_tokenizer, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_job(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_job(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

with UniTok() as user_ut:
    user_ut.add_job(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    user_ut.add_job(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)

with UniTok() as inter_ut:
    inter_ut.add_index_job(name='index')
    inter_ut.add_job(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
    inter_ut.add_job(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
    inter_ut.add_job(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')
```
### Tokenizing Data
Tokenize and save the processed data:
```python
item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit()  # lock the item vocabulary: tokenizing the user or interaction data will raise an error if unseen item IDs appear
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')
```
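If the processed data is needed back in Python later (rather than only inspected in the terminal, as shown further below), a saved UniTok can typically be restored from its save directory. This is only a sketch: it assumes a `UniTok.load` classmethod mirroring `save`; please confirm the exact API in the UniTok Handbook.
```python
from unitok import UniTok

# Assumption: UniTok.load(path) restores an instance previously written with .save(path).
restored_item_ut = UniTok.load('sample-ut/item')
print(restored_item_ut[0])
```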
### Combining Datasets
Combine datasets using union:
```python
# => {'category': 0, 'nid': 0, 'title@bert': [1996, 9639, 3035, 3870, ...], 'title@llama': [450, 1771, 4167, 10470, ...], 'abstract@bert': [4497, 1996, 14960, 2015, ...], 'abstract@llama': [1383, 459, 278, 451, ...], 'subcategory': 0}
print(item_ut[0])

# => {'uid': 0, 'history': [0, 1, 2]}
print(user_ut[0])

# => {'uid': 0, 'nid': 7, 'index': 0, 'click': 1}
print(inter_ut[0])

with inter_ut:
    inter_ut.union(user_ut)

    # => {'index': 0, 'click': 1, 'uid': 0, 'nid': 7, 'history': [0, 1, 2]}
    print(inter_ut[0])
```
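Since the combined instance is still a regular UniTok, it can be persisted with the same `save` call used earlier. This is a sketch under the assumption that `save` is also available after `union` (the organized state); the target directory is illustrative.
```python
# Illustrative path; assumes save() also works on an organized (unioned) UniTok.
inter_ut.save('sample-ut/interaction-with-history')
```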
### Glance at the Terminal
```bash
unitok sample-ut/item
```
```text
UniTok (4beta)
Sample Size: 10
ID Column: nid
Jobs
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Tokenizer ┃ Tokenizer ID ┃ Column Mapping ┃ Vocab ┃ Max Length ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ TransformersTokenizer │ auto_2VN5Ko │ abstract -> abstract@llama │ llama (size=32024) │ 50 │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer │ auto_C0b9Du │ subcategory -> subcategory │ subcategory (size=8) │ N/A │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ TransformersTokenizer │ auto_2VN5Ko │ title -> title@llama │ llama (size=32024) │ 20 │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer │ auto_4WQYxo │ category -> category │ category (size=4) │ N/A │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ BertTokenizer │ auto_Y9tADT │ abstract -> abstract@bert │ bert (size=30522) │ 46 │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ BertTokenizer │ auto_Y9tADT │ title -> title@bert │ bert (size=30522) │ 16 │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer │ auto_qwQALc │ nid -> nid │ nid (size=10) │ N/A │
└──────────────────────────────────────┴───────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────┴───────────────────┘
```
## Contributing
We welcome contributions to UniTok! We appreciate your feedback, bug reports, and pull requests.
Our TODO list includes:
- [ ] More detailed documentation
- [ ] More examples and tutorials
- [ ] More SQL-like operations
- [ ] Analysis and visualization tools
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.