kin-tokenizer

Name: kin-tokenizer
Version: 3.3.1
Home page: https://github.com/Nschadrack/Kin-Tokenizer
Summary: Kinyarwanda tokenizer for encoding and decoding Kinyarwanda language text
Upload time: 2024-08-16 10:14:03
Author: Schadrack Niyibizi
Requires Python: >=3.7
License: None
Keywords: tokenizer, kinyarwanda, kingpt
Requirements: No requirements were recorded.
# Kin-Tokenizer

`kin-tokenizer` is a Python library for tokenizing Kinyarwanda text. It can also tokenize other languages such as English or French, but with a lower compression rate, since the pretrained tokenizer was trained on Kinyarwanda text only.
You can also train your own custom tokenizer on a dataset in a different language using the provided training function; this is covered in the section on training your own tokenizer.
The tokenizer is based on the Byte Pair Encoding (BPE) algorithm.

It can both encode and decode Kinyarwanda text. The metrics used to evaluate this tokenizer are the **compression rate** and its ability to encode and decode text.

**Compression rate** is the ratio of the number of characters in the original text to the number of tokens in the encoded text.

**Example:**

For the sentence: `"Nagiye gusura abanyeshuri."`

- The sentence has **26 characters**.
- Suppose the sentence is tokenized into the following tokens: `[23, 45, 67, 89, 23, 123, 44, 22, 55, 22, 45]`.
- The total number of tokens is **11**.

$$ \text{Compression Rate} = \frac{26}{11} \approx 2.36 $$

So, the **compression rate is approximately 2.36X**, i.e. about 2.36 characters per token on average.
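
As a quick sanity check, here is the same calculation in Python (a minimal sketch; the token IDs are the made-up ones from the example above, not real tokenizer output):

```python
text = "Nagiye gusura abanyeshuri."
tokens = [23, 45, 67, 89, 23, 123, 44, 22, 55, 22, 45]  # illustrative token IDs

compression_rate = len(text) / len(tokens)  # characters per token
print(f"{compression_rate:.2f}X")  # 2.36X
```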

## Special Tokens

**<|PAD|>** is a special token used for padding; it is represented by **0** in the vocab.

**<|EOS|>** is a special token marking the end of a sequence; it is represented by **vocab_size - 1** in the vocab.
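
As an illustration (not part of the library's API), the snippet below shows how these IDs could be used to append an end-of-sequence marker and right-pad encoded sequences to a fixed length for batching; the helper `pad_sequence` and the sample token IDs are made up for this sketch.

```python
PAD_ID = 0               # <|PAD|> is token 0
VOCAB_SIZE = 4001        # vocab size of the pretrained 3.3.1 tokenizer
EOS_ID = VOCAB_SIZE - 1  # <|EOS|> is the last ID in the vocab

def pad_sequence(tokens, max_len):
    """Append <|EOS|> and right-pad with <|PAD|> up to max_len (illustrative helper)."""
    seq = tokens[:max_len - 1] + [EOS_ID]
    return seq + [PAD_ID] * (max_len - len(seq))

print(pad_sequence([23, 45, 67, 89], max_len=8))
# [23, 45, 67, 89, 4000, 0, 0, 0]
```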

## Kin-Tokenizer 3.3

Version 3.3 has a vocabulary size of 20,257 and was trained on a dataset of over 445 million characters. It was the first version released for testing, but it tokenized text poorly: the compression rate was low due to a bug in the training algorithm.

## Kin-Tokenizer 3.3.1

Version 3.3.1 has a vocabulary size of 4,001 with improved words and subwords. It was trained on 1 million Kinyarwanda characters and shows a significantly improved compression rate compared to the previous version; the bug in the previous training algorithm has been fixed.

## Installation

You can install the package using pip:

```bash
pip install kin-tokenizer
```

## Basic Usage

```python
from kin_tokenizer import KinTokenizer  # Importing the Tokenizer class
from kin_tokenizer.utils import create_sequences  # Needed below for creating sequences

# Creating an instance of the tokenizer
tokenizer = KinTokenizer()

# Loading the state of the tokenizer (pretrained tokenizer)
tokenizer.load()

# Encoding
text = """
Nuko Semugeshi akunda uwo mutwe w'abatwa, bitwaga Ishabi, awugira intore ze. Bukeye
bataha biyereka ingabo; dore ko hambere nta mihamirizo yindi yabaga mu Rwanda; guhamiriza
byaje vuba biturutse i Burundi. Ubwo bataha Semugeshi n'abatware be barabitegereza basanga
ari abahanga bose, ariko Waga akaba umuhanga w'imena muri bo; nyamara muri ubwo buhanga
bwe akagiramo intege nke ku mpamvu yo kunanuka, yari afite uruti ruke."""

tokens = tokenizer.encode(text)
print(tokens)

# Calculating the compression rate
text_len = len(text)
tokens_len = len(tokens)

compression_rate = text_len / tokens_len
print(f"Compression rate: {compression_rate:.2f}X")

# Creating sequences from the encoded tokens
x_seq, y_seq = create_sequences(tokens, seq_len=20)

# Decoding
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

# Decoding one sequence from the created sequences
print(f"Decoded sequence:\n {tokenizer.decode(x_seq[0])}")

# Printing the vocab size
print(tokenizer.vocab_size)

# Printing the vocabulary (first 200 items)
count = 0
for k, v in tokenizer.vocab.items():
    print("{} : {}".format(k, v))
    count += 1
    if count > 200:
        break
```

## Training Your Own Tokenizer

You can also train your own tokenizer using the `utils` module, which provides two functions: one for training and one for creating sequences after encoding your text.
**N.B.**: Whether your chosen `vocab_size` is actually reached depends on the amount of data used for training. `vocab_size` is a hyperparameter to tune for a better vocabulary; the size and diversity of your dataset also matter (see the illustrative BPE sketch after the example below). The vocabulary is initialized with 256 entries, covering byte values 1-255 plus **0** for <|PAD|>; one further entry is reserved for the special token <|EOS|>, which marks the end of a sequence and is assigned the last ID in the vocabulary (`vocab_size - 1`).

```python
from kin_tokenizer import KinTokenizer
from kin_tokenizer.utils import train_kin_tokenizer, create_sequences

# Training the tokenizer
# training_text: your training corpus as a string
# SAVE_PATH_ROOT: directory where the trained tokenizer state will be saved
tokenizer = train_kin_tokenizer(training_text, vocab_size=512, save=True, tokenizer_path=SAVE_PATH_ROOT)

# Encoding text using the custom trained tokenizer
tokens = tokenizer.encode(text)

# Creating sequences
x_seq, y_seq = create_sequences(tokens, seq_len=20)
```
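
To see why a small or repetitive dataset can keep the final vocabulary below the requested `vocab_size`, here is a minimal, generic BPE training loop. It is an illustration only, not the library's actual implementation, and the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`) are made up for this sketch: merging stops early once no adjacent pair occurs more than once, so no further merges are worthwhile.

```python
def get_pair_counts(ids):
    """Count occurrences of each adjacent token pair."""
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Generic BPE sketch: start from raw bytes and repeatedly merge the most frequent pair."""
    ids = list(text.encode("utf-8"))
    merges = {}
    next_id = 256
    while next_id < vocab_size:
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        if counts[pair] < 2:  # nothing repeats any more: the requested vocab_size cannot be reached
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges

merges = train_bpe("abanyeshuri baragiye, abanyeshuri barageze", vocab_size=512)
print(len(merges))  # far fewer than the 256 merges needed to reach vocab_size on such a tiny text
```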

## Contributing

The project is still being updated and contributions are welcome. You can contribute by:

- Reporting bugs
- Suggesting features
- Writing or improving documentation
- Submitting pull requests
  

            
