UniTok

Name: UniTok
Version: 4.3.6
Home page: https://github.com/Jyonn/UnifiedTokenizer
Summary: Unified Tokenizer
Upload time: 2025-01-30 14:28:25
Author: Jyonn Liu
License: MIT Licence
Keywords: token, tokenizer, nlp, transformers, glove, bert, llama
Requirements: pandas, transformers, termplot, numpy, tqdm, setuptools, pigmento, rich, nltk
# UniTok V4

The documentation for the older v3 release is available [here](README_v3.md) (in Chinese).

## Overview

[![PyPI version](https://badge.fury.io/py/unitok.svg)](https://badge.fury.io/py/unitok)

Welcome to UniTok v4!
This library provides a unified preprocessing solution for machine learning datasets, handling diverse data types such as text, categorical features, and numerical values.

Please refer to the [UniTok Handbook](https://unitok.qijiong.work) for more detailed information.

## Road from V3 to V4

### Changes and Comparisons

| Feature                         | UniTok v3                                                   | UniTok v4                                           | Comments                                                                      |
|---------------------------------|-------------------------------------------------------------|-----------------------------------------------------|-------------------------------------------------------------------------------|
| `UniTok` class                  | Solely for tokenization                                     | Manages the entire preprocessing lifecycle          |                                                                               |
| `UniDep` class                  | Data loading and combining                                  | Removed                                             | V4 combines the functionalities of `UniTok` and `UniDep` into a single class. |
| `Column` class                  | A single column name serves both the original and tokenized datasets | N/A                                        | V4 introduces the `Job` class.                                                |
| `Job` class                     | N/A                                                         | Defines how a specific column should be tokenized   |                                                                               |
| `Tokenizer` class               | Ambiguous return type definition                            | `return_list` parameter must be of type `bool`      |                                                                               |
| `Tokenizer` class               | Only supports `BertTokenizer` for text processing           | Supports all Tokenizers in the transformers library | New `TransformersTokenizer` class                                             |
| `analyse` method                | Supported                                                   | Not currently supported                             |                                                                               |
| `Meta` class                    | Only for human-friendly displaying                          | Manager for `Job`, `Tokenizer`, and `Vocab`         |                                                                               |
| `unitok` command                | Visualization in the terminal                               | More colorful and detailed output                   |                                                                               |
| `Vocab` class (unitok >= 4.1.0) | Save and load vocabulary using text files                   | Save and load vocabulary using pickle files         | Avoids issues with special characters in text files                           |

### How to Migrate Processed Data

Data processed with UniTok v3 can be upgraded in place with the provided command:
```bash
unidep-upgrade-v4 <path>
```

## Installation

**Requirements**

- Python 3.7 or later
- Dependencies:
  - pandas
  - transformers
  - tqdm
  - rich

**Install UniTok via pip**

```bash
pip install unitok
```

## Core Concepts

**States**

- `initialized`: The initial state after creating a UniTok instance.
- `tokenized`: Achieved after applying tokenization to the dataset.
- `organized`: Reached after combining multiple datasets via operations like union.

**Components**

- UniTok: Manages the dataset preprocessing lifecycle.
- Job: Defines how a specific column should be tokenized.
- Tokenizer: Encodes data using various methods (e.g., BERT, splitting by delimiters).
- Vocabulary: Stores and manages unique tokens across datasets.

**Primary Key (key_job)**

The `key_job` acts as the primary key for operations like `getitem` and `union`, ensuring consistency across datasets.
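As a rough sketch of how these states line up with the API used in the Usage Guide below (the toy DataFrame and the `'id'` vocabulary name are placeholders introduced here for illustration, not part of the library):

```python
import pandas as pd

from unitok import UniTok
from unitok.tokenizer import EntityTokenizer

df = pd.DataFrame({'id': ['a', 'b', 'c']})  # placeholder data

with UniTok() as ut:  # state: initialized
    ut.add_job(tokenizer=EntityTokenizer(vocab='id'), column='id', key=True)

ut.tokenize(df)  # state: tokenized

# Combining with another UniTok instance would move it to `organized`:
# with ut:
#     ut.union(other_ut)
```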

## Usage Guide

### Loading Data

Load datasets using pandas:

```python
import pandas as pd

item = pd.read_csv(
    filepath_or_buffer='news-sample.tsv',
    sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract', 'url', 'title_entities', 'abstract_entities'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('')  # Handle missing values

user = pd.read_csv(
    filepath_or_buffer='user-sample.tsv',
    sep='\t',
    names=['uid', 'history'],
)

interaction = pd.read_csv(
    filepath_or_buffer='interaction-sample.tsv',
    sep='\t',
    names=['uid', 'nid', 'click'],
)
```
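If the sample files are not at hand, a minimal set of stand-in TSVs can be generated as a sketch like the one below; all rows are made up for illustration, and only the column layout matches the `names=` arguments above:

```python
import pandas as pd

# Made-up rows; only the column layout matches the read_csv calls above.
pd.DataFrame({
    'nid': ['N1', 'N2', 'N3'],
    'category': ['news', 'sports', 'news'],
    'subcategory': ['world', 'soccer', 'finance'],
    'title': ['A sample headline', 'Another headline', 'A third headline'],
    'abstract': ['A short abstract.', '', 'Another abstract.'],
    'url': ['', '', ''],
    'title_entities': ['', '', ''],
    'abstract_entities': ['', '', ''],
}).to_csv('news-sample.tsv', sep='\t', header=False, index=False)

pd.DataFrame({
    'uid': ['U1', 'U2'],
    'history': ['N1,N2', 'N3'],
}).to_csv('user-sample.tsv', sep='\t', header=False, index=False)

pd.DataFrame({
    'uid': ['U1', 'U1', 'U2'],
    'nid': ['N1', 'N3', 'N2'],
    'click': [1, 0, 1],
}).to_csv('interaction-sample.tsv', sep='\t', header=False, index=False)
```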

### Defining and Adding Jobs

Define tokenization jobs for different columns. The `item_vocab` and `user_vocab` objects are created once and reused across the `UniTok` instances below, so item and user IDs stay consistent between datasets:

```python
from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer

item_vocab = Vocab(name='nid')  # will be used across datasets
user_vocab = Vocab(name='uid')  # will be used across datasets

with UniTok() as item_ut:
    bert_tokenizer = BertTokenizer(vocab='bert')
    llama_tokenizer = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_job(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
    item_ut.add_job(tokenizer=bert_tokenizer, column='title', name='title@bert', truncate=20)
    item_ut.add_job(tokenizer=llama_tokenizer, column='title', name='title@llama', truncate=20)
    item_ut.add_job(tokenizer=bert_tokenizer, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_job(tokenizer=llama_tokenizer, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_job(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_job(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

with UniTok() as user_ut:
    user_ut.add_job(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    user_ut.add_job(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)

with UniTok() as inter_ut:
    inter_ut.add_index_job(name='index')
    inter_ut.add_job(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
    inter_ut.add_job(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
    inter_ut.add_job(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')
```

### Tokenizing Data

Tokenize and save the processed data:

```python
item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit()  # will raise an error if new items are detected in the user or interaction datasets
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')
```

### Combining Datasets

Combine datasets using union:

```python
# => {'category': 0, 'nid': 0, 'title@bert': [1996, 9639, 3035, 3870, ...], 'title@llama': [450, 1771, 4167, 10470, ...], 'abstract@bert': [4497, 1996, 14960, 2015, ...], 'abstract@llama': [1383, 459, 278, 451, ...], 'subcategory': 0}
print(item_ut[0])

# => {'uid': 0, 'history': [0, 1, 2]}
print(user_ut[0])

# => {'uid': 0, 'nid': 7, 'index': 0, 'click': 1}
print(inter_ut[0])

with inter_ut:
    inter_ut.union(user_ut)

    # => {'index': 0, 'click': 1, 'uid': 0, 'nid': 7, 'history': [0, 1, 2]}
    print(inter_ut[0])
```
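As a hypothetical follow-up (not part of the documented API), the united interaction table can be materialised into plain Python structures for a training loop. Integer indexing is shown above; support for `len()` on a `UniTok` instance is an assumption here:

```python
# A minimal sketch, assuming len(inter_ut) returns the number of samples;
# if it does not, substitute the known sample count.
samples = [inter_ut[i] for i in range(len(inter_ut))]

labels = [sample['click'] for sample in samples]        # e.g. [1, ...]
histories = [sample['history'] for sample in samples]   # e.g. [[0, 1, 2], ...]
```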

### A Glance at the Terminal

Inspect a saved dataset directly from the command line:
```bash
unitok sample-ut/item
```

```text
UniTok (4beta)
Sample Size: 10
ID Column: nid

                                                                                 Jobs                                                                                  
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Tokenizer                            ┃     Tokenizer ID      ┃ Column Mapping                               ┃ Vocab                             ┃    Max Length     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ TransformersTokenizer                │      auto_2VN5Ko      │ abstract -> abstract@llama                   │ llama (size=32024)                │        50         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer                      │      auto_C0b9Du      │ subcategory -> subcategory                   │ subcategory (size=8)              │        N/A        │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ TransformersTokenizer                │      auto_2VN5Ko      │ title -> title@llama                         │ llama (size=32024)                │        20         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer                      │      auto_4WQYxo      │ category -> category                         │ category (size=4)                 │        N/A        │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ BertTokenizer                        │      auto_Y9tADT      │ abstract -> abstract@bert                    │ bert (size=30522)                 │        46         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ BertTokenizer                        │      auto_Y9tADT      │ title -> title@bert                          │ bert (size=30522)                 │        16         │
├──────────────────────────────────────┼───────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────┼───────────────────┤
│ EntityTokenizer                      │      auto_qwQALc      │ nid -> nid                                   │ nid (size=10)                     │        N/A        │
└──────────────────────────────────────┴───────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────┴───────────────────┘
```

## Contributing

We welcome contributions to UniTok, including feedback, bug reports, and pull requests.

Our TODO list includes:

- [ ] More detailed documentation
- [ ] More examples and tutorials
- [ ] More SQL-like operations
- [ ] Analysis and visualization tools

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
