budoux


- Name: budoux
- Version: 0.6.3
- Summary: BudouX is the successor of Budou
- upload_time: 2024-10-22 05:57:18
- author: Shuhei Iitsuka
- requires_python: >=3.8
- license: Apache-2.0
- home_page: None
- maintainer: None
- docs_url: None
- requirements: No requirements were recorded.

<!-- markdownlint-disable MD014 -->
# BudouX

[![PyPI](https://img.shields.io/pypi/v/budoux?color=blue)](https://pypi.org/project/budoux/) [![npm](https://img.shields.io/npm/v/budoux?color=yellow)](https://www.npmjs.com/package/budoux) [![Maven Central](https://img.shields.io/maven-central/v/com.google.budoux/budoux)](https://mvnrepository.com/artifact/com.google.budoux/budoux)

Standalone. Small. Language-neutral.

BudouX is the successor to [Budou](https://github.com/google/budou), the machine learning-powered line break organizer tool.

![Example](https://raw.githubusercontent.com/google/budoux/main/example.png)

It is **standalone**. It works with no dependency on third-party word segmenters such as the Google Cloud Natural Language API.

It is **small**. It takes only around 15 KB including its machine learning model, so it is reasonable to use even on the client side.

It is **language-neutral**. You can train a model for any language by feeding a dataset to BudouX’s training script.

Last but not least, BudouX supports HTML inputs.

## Demo

<https://google.github.io/budoux>

## Natural languages supported by pretrained models

- Japanese
- Simplified Chinese
- Traditional Chinese
- Thai

### Korean support?
Korean uses spaces between words, so you can generally prevent words from being split across lines by applying the CSS property `word-break: keep-all` to the paragraph, which should be much more performant than installing BudouX.
That said, we're happy to explore dedicated Korean language support if the above solution proves insufficient.

## Supported programming languages

- Python
- [JavaScript](https://github.com/google/budoux/tree/main/javascript/)
- [Java](https://github.com/google/budoux/tree/main/java/)

## Python module

### Install

```shellsession
$ pip install budoux
```

### Usage

#### Library

You can get a list of phrases by feeding a sentence to the parser.
The easiest way to get a parser is to load the default parser for each language.

**Japanese:**

```python
import budoux
parser = budoux.load_default_japanese_parser()
print(parser.parse('今日は天気です。'))
# ['今日は', '天気です。']
```

**Simplified Chinese:**

```python
import budoux
parser = budoux.load_default_simplified_chinese_parser()
print(parser.parse('今天是晴天。'))
# ['今天', '是', '晴天。']
```

**Traditional Chinese:**

```python
import budoux
parser = budoux.load_default_traditional_chinese_parser()
print(parser.parse('今天是晴天。'))
# ['今天', '是', '晴天。']
```

**Thai:**

```python
import budoux
parser = budoux.load_default_thai_parser()
print(parser.parse('วันนี้อากาศดี'))
# ['วัน', 'นี้', 'อากาศ', 'ดี']
```
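
The phrase list is what makes phrase-aware line breaking possible. As a rough illustration, the sketch below packs phrases greedily into lines of a maximum width. `wrap_phrases` is a hypothetical helper, not part of BudouX, and it assumes that width can be approximated by counting characters, which ignores fonts and glyph widths.

```python
from typing import List


def wrap_phrases(phrases: List[str], max_width: int) -> List[str]:
    """Greedily pack phrases into lines without splitting any phrase.

    Hypothetical helper for illustration; width is counted in characters.
    A phrase longer than max_width is kept whole on its own line.
    """
    lines: List[str] = []
    current = ''
    for phrase in phrases:
        if current and len(current) + len(phrase) > max_width:
            # The next phrase does not fit, so start a new line.
            lines.append(current)
            current = phrase
        else:
            current += phrase
    if current:
        lines.append(current)
    return lines


# Phrases as returned for '今日は天気です。' in the Japanese example above.
phrases = ['今日は', '天気です。']
print(wrap_phrases(phrases, max_width=5))
print(wrap_phrases(phrases, max_width=8))
```

In a browser the layout engine does this work; the sketch only shows why a list of unbreakable phrases is a convenient unit for it.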

You can also translate an HTML string to wrap phrases with non-breaking markup.
The default parser uses a zero-width space (U+200B) to separate phrases.

```python
import budoux
parser = budoux.load_default_japanese_parser()
print(parser.translate_html_string('今日は<b>とても天気</b>です。'))
# <span style="word-break: keep-all; overflow-wrap: anywhere;">今日は<b>\u200bとても\u200b天気</b>です。</span>
```

Please note that separators are denoted as `\u200b` in the example above for
illustrative purposes. In the actual output they are invisible, because U+200B
is a zero-width space.
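
Because the separators are invisible, it can help to make them visible while debugging. The following is a minimal sketch using only the standard library; `reveal_separators` is a hypothetical helper name, and the sample string is written by hand to mirror the output shown above rather than produced by BudouX.

```python
ZWSP = '\u200b'  # zero-width space (U+200B)


def reveal_separators(text: str) -> str:
    """Replace each zero-width space with the visible text \\u200b."""
    return text.replace(ZWSP, '\\u200b')


# A hand-written string shaped like the translate_html_string output above.
html = (
    '<span style="word-break: keep-all; overflow-wrap: anywhere;">'
    '今日は<b>\u200bとても\u200b天気</b>です。</span>'
)
print(html.count(ZWSP))         # number of phrase separators in the string
print(reveal_separators(html))  # separators shown as literal \u200b
```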

If you have a custom model, you can use it as follows.

```python
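import json
import budoux
# Load a custom model from a JSON file and build a parser from it.
with open('/path/to/your/model.json', encoding='utf-8') as f:
  model = json.load(f)
parser = budoux.Parser(model)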
```

A model file for BudouX is a JSON file that contains pairs of a feature and its score extracted by machine learning training.
Each score represents the significance of the feature in determining whether to break the sentence at a specific point.
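
To get a feel for a model file, you can load it and look at its scores. The sketch below assumes the JSON is an object that maps feature-group names to `{feature: score}` mappings; that layout, the group names, and the scores in the toy model are all assumptions for illustration, so check your own model file before relying on them.

```python
import json
from typing import Dict, Tuple

Model = Dict[str, Dict[str, int]]

# Hypothetical toy model. The group names and the nested layout are
# assumptions, and the scores are made up.
toy_model_json = (
    '{"unigram_before": {"は": -120, "、": -300},'
    ' "bigram_around": {"は天": 450}}'
)


def summarize(model: Model) -> Dict[str, int]:
    """Return the number of scored features in each feature group."""
    return {group: len(scores) for group, scores in model.items()}


def strongest_feature(model: Model) -> Tuple[str, str, int]:
    """Return (group, feature, score) with the largest absolute score."""
    return max(
        ((group, feature, score)
         for group, scores in model.items()
         for feature, score in scores.items()),
        key=lambda item: abs(item[2]),
    )


model = json.loads(toy_model_json)
print(summarize(model))
print(strongest_feature(model))
```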

For more details of the JavaScript model, please refer to [JavaScript module README](https://github.com/google/budoux/tree/main/javascript/README.md).

#### CLI

You can also format inputs on your terminal with the `budoux` command.

```shellsession
$ budoux 本日は晴天です。 # default: japanese
本日は
晴天です。

$ budoux -l ja 本日は晴天です。
本日は
晴天です。

$ budoux -l zh-hans 今天天气晴朗。
今天
天气
晴朗。

$ budoux -l zh-hant 今天天氣晴朗。
今天
天氣
晴朗。

$ budoux -l th วันนี้อากาศดี
วัน
นี้
อากาศ
ดี
```

```shellsession
$ echo $'本日は晴天です。\n明日は曇りでしょう。' | budoux
本日は
晴天です。
---
明日は
曇りでしょう。
```
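
The piped example above prints the phrases of each input line and puts the delimiter between consecutive inputs. A rough Python equivalent might look like the sketch below. `format_inputs` is a hypothetical helper, and `fake_parse` is a stand-in segmenter that exists only so the block runs without BudouX installed; in practice you would pass the `parse` method of a loaded parser instead.

```python
from typing import Callable, Iterable, List


def format_inputs(lines: Iterable[str],
                  parse: Callable[[str], List[str]],
                  delim: str = '---') -> str:
    """Return one phrase per line, with delim between consecutive inputs."""
    blocks = ['\n'.join(parse(line)) for line in lines]
    return f'\n{delim}\n'.join(blocks)


def fake_parse(sentence: str) -> List[str]:
    """Stand-in segmenter (an assumption): break after every 'は'."""
    parts = sentence.split('は')
    phrases = [part + 'は' for part in parts[:-1]]
    if parts[-1]:
        phrases.append(parts[-1])
    return phrases


print(format_inputs(['本日は晴天です。', '明日は曇りでしょう。'], fake_parse))
```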

```shellsession
$ budoux 本日は晴天です。 -H
<span style="word-break: keep-all; overflow-wrap: anywhere;">本日は\u200b晴天です。</span>
```

Please note that separators are denoted as `\u200b` in the example above for
illustrative purposes. In the actual output they are invisible, because U+200B
is a zero-width space.

If you want to see help, run `budoux -h`.

```shellsession
$ budoux -h
usage: budoux [-h] [-H] [-m JSON | -l LANG] [-d STR] [-V] [TXT]

BudouX is the successor to Budou,
the machine learning powered line break organizer tool.

positional arguments:
  TXT                    text (default: None)

optional arguments:
  -h, --help             show this help message and exit
  -H, --html             HTML mode (default: False)
  -m JSON, --model JSON  custom model file path (default: /path/to/budoux/models/ja.json)
  -l LANG, --lang LANG   language of custom model (default: None)
  -d STR, --delim STR    output delimiter in TEXT mode (default: ---)
  -V, --version          show program's version number and exit

supported languages of `-l`, `--lang`:
- ja
- zh-hans
- zh-hant
- th
```

## Caveat

BudouX supports HTML inputs and outputs HTML strings with markup that wraps phrases, but it's not meant to be used as an HTML sanitizer. **BudouX doesn't sanitize any inputs.** Malicious HTML inputs yield malicious HTML outputs. Please use it with an appropriate sanitizer library if you don't trust the input.
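
When the untrusted input is meant to be plain text rather than markup, one option is to segment it as text and escape each phrase before building the HTML yourself. The sketch below does that with the standard library only. `phrases_to_safe_html` is a hypothetical helper, the phrase list is hand-written rather than produced by BudouX, and the wrapper style is copied from the output shown earlier. Input that must keep its own markup still needs a dedicated sanitizer library.

```python
import html
from typing import List

ZWSP = '\u200b'  # zero-width space (U+200B)


def phrases_to_safe_html(phrases: List[str]) -> str:
    """Escape each phrase, then join the phrases with zero-width spaces."""
    escaped = ZWSP.join(html.escape(phrase) for phrase in phrases)
    return (
        '<span style="word-break: keep-all; overflow-wrap: anywhere;">'
        f'{escaped}</span>'
    )


# Hand-written phrases standing in for the result of parser.parse(...).
untrusted_phrases = ['今日は', '<script>alert(1)</script>', '天気です。']
safe = phrases_to_safe_html(untrusted_phrases)
print(safe.replace(ZWSP, '\\u200b'))  # show separators visibly
```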

## Background

English text has many clues, like spacing and hyphenation, that enable beautiful and readable line breaks. However, some CJK languages lack these clues, and so are notoriously more difficult to process. Without a more careful approach, line breaks can fall at arbitrary positions, often in the middle of a word or a phrase. This is a long-standing issue in typography on the Web, which results in a degradation of readability.

Budou was proposed as a solution to this problem in 2016. It automatically translates CJK sentences into HTML with lexical phrases wrapped in non-breaking markup, so as to semantically control line breaks. Budou has solved this problem to some extent, but it still has some problems integrating with modern web production workflows.

The biggest barrier to applying Budou to a website is its dependency on third-party word segmenters. A word segmenter is usually a large program that is infeasible to download for every web page request. Making a request to a cloud-based word segmentation service for every sentence is also undesirable, considering the speed and cost. That’s why we need a standalone line break organizer tool equipped with its own segmentation engine, small enough to be bundled in client-side JavaScript code.

Budou*X* is the successor to Budou, designed to be integrated with your website with no hassle.

## How it works

BudouX uses the [AdaBoost algorithm](https://en.wikipedia.org/wiki/AdaBoost) to segment a sentence into phrases by treating the task as a binary classification problem: for each position between adjacent characters, predict whether to break or not. It uses features such as the characters around the break point, their Unicode blocks, and combinations of them to make a prediction. The output machine learning model, which is encoded as a JSON file, stores pairs of the feature and its significance score. The BudouX parser takes a model file to construct a segmenter and translates input sentences into a list of phrases.
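
The idea of scoring each boundary can be sketched in a few lines. This is a simplified illustration, not the BudouX implementation: the feature names, the three-entry toy model, and the "break when the sum is positive" rule are all assumptions chosen to keep the example small.

```python
from typing import Dict, List

# Hypothetical toy model: feature -> score. Positive scores vote for a
# break and negative scores vote against one. Real models are learned.
TOY_MODEL: Dict[str, int] = {
    'L:は': 2,     # the character just before the boundary is "は"
    'LR:は天': 1,  # the pair of characters around the boundary
    'R:で': -1,    # the character just after the boundary is "で"
}


def boundary_features(sentence: str, i: int) -> List[str]:
    """Features for the boundary between sentence[i - 1] and sentence[i]."""
    left, right = sentence[i - 1], sentence[i]
    return [f'L:{left}', f'R:{right}', f'LR:{left}{right}']


def segment(sentence: str, model: Dict[str, int]) -> List[str]:
    """Break wherever the summed feature scores are positive."""
    if not sentence:
        return []
    chunks = [sentence[0]]
    for i in range(1, len(sentence)):
        score = sum(model.get(f, 0) for f in boundary_features(sentence, i))
        if score > 0:
            chunks.append(sentence[i])   # start a new phrase
        else:
            chunks[-1] += sentence[i]    # extend the current phrase
    return chunks


print(segment('今日は天気です。', TOY_MODEL))
```

With this toy model the only boundary whose score is positive is the one after "は", so the sentence is split into two phrases.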

## Building a custom model

You can build your own custom model for any language by preparing training data in the target language.
A training dataset is a large text file of sentences in which phrases are separated by the symbol "▁" (U+2581), like below.

```text
私は▁遅刻魔で、▁待ち合わせに▁いつも▁遅刻してしまいます。
メールで▁待ち合わせ▁相手に▁一言、▁「ごめんね」と▁謝れば▁どうにか▁なると▁思っていました。
海外では▁ケータイを▁持っていない。
```
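
To connect this format to the binary classification view, the sketch below turns one annotated line into the plain sentence plus a break/no-break label for every boundary between adjacent characters. It is only an illustration under that framing; `to_labeled_boundaries` is a hypothetical helper and does not describe how `scripts/encode_data.py` works internally.

```python
from typing import List, Tuple

SEP = '\u2581'  # "▁", the phrase separator used in the training data


def to_labeled_boundaries(line: str) -> Tuple[str, List[int]]:
    """Return the plain sentence and one label per character boundary.

    labels[k] is 1 if a phrase break falls between sentence[k] and
    sentence[k + 1], otherwise 0.
    """
    phrases = [p for p in line.strip().split(SEP) if p]
    sentence = ''.join(phrases)
    labels = [0] * max(len(sentence) - 1, 0)
    position = 0
    for phrase in phrases[:-1]:
        position += len(phrase)
        labels[position - 1] = 1  # break right after this phrase
    return sentence, labels


sentence, labels = to_labeled_boundaries('海外では▁ケータイを▁持っていない。')
print(sentence)
print(labels)
```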

Assuming the text file is saved as `mysource.txt`, you can build your own custom model by running the following commands.

```shellsession
$ pip install .[dev]
$ python scripts/encode_data.py mysource.txt -o encoded_data.txt
$ python scripts/train.py encoded_data.txt -o weights.txt
$ python scripts/build_model.py weights.txt -o mymodel.json
```

Please note that `train.py` takes time to complete depending on your computer resources.
The good news is that the training algorithm is an [anytime algorithm](https://en.wikipedia.org/wiki/Anytime_algorithm), so you still get a weights file if you interrupt the execution. Even then, you can build a valid model file by passing that weights file to `build_model.py`.

## Constructing a training dataset from the KNBC corpus for Japanese

The default model for Japanese (`budoux/models/ja.json`) is built using the [KNBC corpus](https://nlp.ist.i.kyoto-u.ac.jp/kuntt/).
You can create a training dataset from the corpus, named `source_knbc.txt` in the example below, by running the following commands:

```shellsession
$ curl -o knbc.tar.bz2 https://nlp.ist.i.kyoto-u.ac.jp/kuntt/KNBC_v1.0_090925_utf8.tar.bz2
$ tar -xf knbc.tar.bz2  # outputs KNBC_v1.0_090925_utf8 directory
$ python scripts/prepare_knbc.py KNBC_v1.0_090925_utf8 -o source_knbc.txt
```

## Author

[Shuhei Iitsuka](https://tushuhei.com)

## Disclaimer

This is not an officially supported Google product.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "budoux",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Shuhei Iitsuka",
    "author_email": "tushuhei@google.com",
    "download_url": "https://files.pythonhosted.org/packages/41/50/40244252c6e9785a618c3ef460cc833e39e16d315e095036098669e96f16/budoux-0.6.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "BudouX is the successor of Budou",
    "version": "0.6.3",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1004c13233ad73e4acf66ef22934204cfd42c53a0b7d3c62a5597ece3e627750",
                "md5": "d1ee82e38ad2dd5fd2c637fb7ef37152",
                "sha256": "3a8b98d511138a23d67e0f888dc7dfff1395dbd77bd136db98a24634e2b29cf5"
            },
            "downloads": -1,
            "filename": "budoux-0.6.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d1ee82e38ad2dd5fd2c637fb7ef37152",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 118532,
            "upload_time": "2024-10-22T05:57:16",
            "upload_time_iso_8601": "2024-10-22T05:57:16.743028Z",
            "url": "https://files.pythonhosted.org/packages/10/04/c13233ad73e4acf66ef22934204cfd42c53a0b7d3c62a5597ece3e627750/budoux-0.6.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "415040244252c6e9785a618c3ef460cc833e39e16d315e095036098669e96f16",
                "md5": "563e35d1fdd3f7fab0958b80e4bdf9ee",
                "sha256": "66bdaadf7ac0a130e58994d4ea329733c3c0c2716ca05c97819e1c9c9065c886"
            },
            "downloads": -1,
            "filename": "budoux-0.6.3.tar.gz",
            "has_sig": false,
            "md5_digest": "563e35d1fdd3f7fab0958b80e4bdf9ee",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 117251,
            "upload_time": "2024-10-22T05:57:18",
            "upload_time_iso_8601": "2024-10-22T05:57:18.513747Z",
            "url": "https://files.pythonhosted.org/packages/41/50/40244252c6e9785a618c3ef460cc833e39e16d315e095036098669e96f16/budoux-0.6.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-22 05:57:18",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "budoux"
}
        