# UniTok

- Version: 3.5.3
- Summary: Unified Tokenizer
- Homepage: https://github.com/Jyonn/UnifiedTokenizer
- Author: Jyonn Liu
- License: MIT Licence
- Keywords: token, tokenizer
- Requirements: pandas, transformers, termplot, numpy, tqdm, prettytable, setuptools
- Upload time: 2024-11-24 23:56:42
# UniTok V3: A SQL-like Data Preprocessing Toolkit

Updated on 2023.11.04

## 1. Introduction

UniTok is the first SQL-like data preprocessing toolkit, providing a complete suite of tools for packaging and editing data.

UniTok consists of two major components: `UniTok`, which handles unified data processing, and `UniDep`, which handles data loading and further editing:
- `UniTok` uses components such as tokenizers (Tokenizers) and data columns (Columns) to tokenize raw data and map it to IDs, finally storing the result as a single data table in numpy array format.
- `UniDep` reads the data table and metadata (such as vocabulary information) generated by `UniTok`. It can be used directly with PyTorch's Dataset, and also supports further editing, merging with other data tables, exporting, and more.
- Since version 3.1.9, we also provide the `Fut` component, a drop-in replacement for `UniTok` that completes data preprocessing faster.
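
Condensed to its essentials, a round trip through both components looks like the following sketch (column setup elided; Section 3.1 walks through it in full):

```python
from UniTok import UniTok, UniDep

ut = UniTok()
# ... add_col(...) calls go here; see Section 3.1 for the full column setup ...
ut.read('news.tsv', sep='\t')  # load the raw tab-separated data
ut.tokenize()                  # tokenize every column and map tokens to IDs
ut.store('data/news')          # persist the numpy data table plus vocabularies

dep = UniDep('data/news')      # load it back for training or further editing
```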

## 2. Installation

Install with pip (the quotes keep the shell from interpreting `>=` as a redirection):

```bash
pip install "unitok>=3.4.8"
```

## 3. Main Features

### 3.1 UniTok

UniTok provides a complete set of data preprocessing tools, covering tokenizers of different types, data column management, and more. Specifically, UniTok ships multiple tokenizer types to cover the tokenization needs of different kinds of data; every tokenizer inherits from the `BaseTok` class.

In addition, UniTok provides the `Column` class to manage data columns. Each `Column` object holds a tokenizer (Tokenizer) and a sequence operator (SeqOperator).

Take a news recommendation scenario as an example. The dataset may contain the following parts:

- News content data (`news.tsv`): each line is one news article with multiple features, including the news ID, title, abstract, category, and subcategory, separated by `\t`.
- User history data (`user.tsv`): each line is one user, containing the user ID and the list of news IDs the user clicked in the past, with news IDs separated by spaces.
- Interaction data: training (`train.tsv`), validation (`dev.tsv`), and test (`test.tsv`) data. Each line is one interaction record, containing the user ID, the news ID, and whether the user clicked, separated by `\t`.
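
For concreteness, rows of these files might look as follows (hypothetical samples assembled from the attribute table below; fields are tab-separated):

```text
# news.tsv: nid \t title \t abstract \t category \t subcat
N1234	After 10 years, the iPhone is still the best smartphone in the world	The iPhone 11 Pro is the best smartphone you can buy right now.	Technology	Mobile

# user.tsv: uid \t history (space-separated news IDs)
U1234	N1234 N1235 N1236

# train.tsv / dev.tsv / test.tsv: uid \t nid \t label
U1234	N1234	1
```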

We first analyze the data type of each of these attributes:

| File      | Attribute | Type | Example                                                               | Notes                                            |
|-----------|-----------|------|-----------------------------------------------------------------------|--------------------------------------------------|
| news.tsv  | nid       | str  | N1234                                                                 | News ID, unique identifier                       |
| news.tsv  | title     | str  | After 10 years, the iPhone is still the best smartphone in the world  | News title, usually tokenized with BertTokenizer |
| news.tsv  | abstract  | str  | The iPhone 11 Pro is the best smartphone you can buy right now.       | News abstract, usually tokenized with BertTokenizer |
| news.tsv  | category  | str  | Technology                                                            | News category, atomic (not splittable)           |
| news.tsv  | subcat    | str  | Mobile                                                                | News subcategory, atomic (not splittable)        |
| user.tsv  | uid       | str  | U1234                                                                 | User ID, unique identifier                       |
| user.tsv  | history   | str  | N1234 N1235 N1236                                                     | User history, space-separated                    |
| train.tsv | uid       | str  | U1234                                                                 | User ID, consistent with `user.tsv`              |
| train.tsv | nid       | str  | N1234                                                                 | News ID, consistent with `news.tsv`              |
| train.tsv | label     | int  | 1                                                                     | Click label: 0 = not clicked, 1 = clicked        |

We can classify these attributes as follows:

| Attribute        | Type | Preset tokenizer | Notes                                              |
|------------------|------|------------------|----------------------------------------------------|
| nid, uid, index  | str  | IdTok            | Unique identifier                                  |
| title, abstract  | str  | BertTok          | Pass `vocab_dir="bert-base-uncased"`               |
| category, subcat | str  | EntTok           | Atomic, not split further                          |
| history          | str  | SplitTok         | Pass `sep=' '`                                     |
| label            | int  | NumberTok        | Pass `vocab_size=2`; only the values 0 and 1 occur |

With the following code, we build one UniTok object per file:

```python
from UniTok import UniTok, Column, Vocab
from UniTok.tok import IdTok, BertTok, EntTok, SplitTok, NumberTok

# Create a news id vocab, commonly used in news data, history data, and interaction data.
nid_vocab = Vocab('nid')

# Create a bert tokenizer, commonly used in tokenizing title and abstract.
eng_tok = BertTok(vocab_dir='bert-base-uncased', name='eng')

# Create a news UniTok object.
news_ut = UniTok()

# Add columns to the news UniTok object.
news_ut.add_col(Column(
    # Specify the vocab. The column name will be set to 'nid' automatically if not specified.
    tok=IdTok(vocab=nid_vocab),
)).add_col(Column(
    # The column name will be set to 'title', rather than the name of eng_tok 'eng'.
    name='title',
    tok=eng_tok,
    max_length=20,  # Specify the max length. The exceeding part will be truncated.
)).add_col(Column(
    name='abstract',
    tok=eng_tok,  # Abstract and title use the same tokenizer.
    max_length=30,
)).add_col(Column(
    name='category',
    tok=EntTok,  # Vocab will be created automatically, and the vocab name will be set to 'category'.
)).add_col(Column(
    name='subcat',
    tok=EntTok,  # Vocab will be created automatically, and the vocab name will be set to 'subcat'.
))

# Read the data file.
news_ut.read('news.tsv', sep='\t')

# Tokenize the data.
news_ut.tokenize() 

# Store the tokenized data.
news_ut.store('data/news')

# Create a user id vocab, commonly used in user data and interaction data.
uid_vocab = Vocab('uid')

# Create a user UniTok object.
user_ut = UniTok()

# Add columns to the user UniTok object.
user_ut.add_col(Column(
    tok=IdTok(vocab=uid_vocab),
)).add_col(Column(
    name='history',
    tok=SplitTok(sep=' '),  # The news id in the history data is separated by space.
))

# Read the data file.
user_ut.read('user.tsv', sep='\t') 

# Tokenize the data.
user_ut.tokenize() 

# Store the tokenized data.
user_ut.store('data/user')


def inter_tokenize(mode):
    # Create an interaction UniTok object.
    inter_ut = UniTok()
    
    # Add columns to the interaction UniTok object.
    inter_ut.add_index_col(
        # The index column in the interaction data is automatically generated, and the tokenizer does not need to be specified.
    ).add_col(Column(
        # Align with the uid column in user_ut.
        tok=EntTok(vocab=uid_vocab), 
    )).add_col(Column(
        # Align with the nid column in news_ut.
        tok=EntTok(vocab=nid_vocab),  
    )).add_col(Column(
        name='label',
        # The label column in the interaction data only has two values, 0 and 1.
        tok=NumberTok(vocab_size=2),  # NumberTok is supported by UniTok >= 3.0.11.
    ))

    # Read the data file.
    inter_ut.read(f'{mode}.tsv', sep='\t')
    
    # Tokenize the data.
    inter_ut.tokenize() 
    
    # Store the tokenized data.
    inter_ut.store(mode)

    
inter_tokenize('data/train')
inter_tokenize('data/dev')
inter_tokenize('data/test')
```

### 3.2 UniDep

UniDep is a data-dependency class for loading and accessing data preprocessed by UniTok. It manages, among other things, the vocabularies (Vocabs) and the metadata (Meta).

The `Vocabs` class centrally manages all vocabularies. Each `Vocab` object contains an object-to-index mapping, an index-to-object mapping, and some other attributes and methods.

The `Meta` class manages metadata, including loading, saving, and upgrading it.

Here is a simple usage example:

```python
from UniTok import UniDep

# Load the data.
dep = UniDep('data/news')

# Get sample size.
print(len(dep))

# Get the first sample.
print(dep[0])
```
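
As the introduction notes, `UniDep` can be combined directly with PyTorch's Dataset. A minimal sketch of such a wrapper follows; the `NewsDataset` class and the fixed-length collation assumption are ours, not part of the library:

```python
import torch.utils.data

from UniTok import UniDep


class NewsDataset(torch.utils.data.Dataset):
    """Illustrative wrapper exposing a UniDep store as a PyTorch Dataset."""

    def __init__(self, path: str):
        self.dep = UniDep(path)

    def __len__(self) -> int:
        return len(self.dep)

    def __getitem__(self, index: int) -> dict:
        # One sample: the tokenized columns of one row, as returned by dep[index].
        return self.dep[index]


# Default collation stacks samples cleanly when the tokenized columns have
# fixed lengths (e.g. title truncated to max_length=20 as configured above).
loader = torch.utils.data.DataLoader(NewsDataset('data/news'), batch_size=32)
```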



            
