| Field | Value |
| --- | --- |
| Name | split-lang |
| Version | 2.0.5 |
| Home page | https://github.com/DoodleBears/langsplit |
| Summary | A package for splitting text by languages through concatenating over split substrings based on their language |
| Upload time | 2024-10-31 15:35:25 |
| Maintainer | None |
| Docs URL | None |
| Author | DoodleBear |
| Requires Python | >=3.9 |
| License | MIT |
| Keywords | None |
| Requirements | None recorded |
<div align="center">
<img alt="VisActor Logo" width=50% src="https://github.com/DoodleBears/split-lang/blob/main/.github/profile/split-lang-logo.svg"/>
<img alt="VisActor Logo" width=70% src="https://github.com/DoodleBears/split-lang/blob/main/.github/profile/split-lang-banner.svg"/>
</div>
<div align="center">
<h1>split-lang</h1>
**English** | [**中文简体**](./docs/zh/README.md) | [**日本語**](./docs/ja/README.md)
Split text by language: over-split it into substrings, then concatenate the substrings based on their detected language. Powered by:
splitting: [`budoux`](https://github.com/google/budoux) and rule-based splitting
language detection: [`fast-langdetect`](https://github.com/LlmKira/fast-langdetect) and [`wordfreq`](https://github.com/rspeer/wordfreq)
</div>
<br/>
<div align="center">
[![PyPI version](https://badge.fury.io/py/split-lang.svg)](https://badge.fury.io/py/split-lang)
[![Downloads](https://static.pepy.tech/badge/split-lang)](https://pepy.tech/project/split-lang)
[![Downloads](https://static.pepy.tech/badge/split-lang/month)](https://pepy.tech/project/split-lang)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DoodleBears/split-lang/blob/main/split-lang-demo.ipynb)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/DoodleBears/split-lang/blob/main/LICENSE)
![GitHub Repo stars](https://img.shields.io/github/stars/DoodleBears/split-lang)
[![wakatime](https://wakatime.com/badge/user/5728d95a-5cfb-4acb-b600-e34c2fc231b6/project/e06e0a00-9ba1-453d-8c62-a0b2604aaaad.svg)](https://wakatime.com/badge/user/5728d95a-5cfb-4acb-b600-e34c2fc231b6/project/e06e0a00-9ba1-453d-8c62-a0b2604aaaad)
</div>
# 1. 💡How it works
**Stage 1**: rule-based split (separating characters, punctuation, and digits)
- `hello, how are you` -> `hello` | `,` | `how are you`
**Stage 2**: over-split the text into substrings: [`budoux`](https://github.com/google/budoux) for Chinese mixed with Japanese, and ` ` (space) for languages **not** written in [scriptio continua](https://en.wikipedia.org/wiki/Scriptio_continua)
- `你喜欢看アニメ吗` -> `你` | `喜欢` | `看` | `アニメ` | `吗`
- `昨天見た映画はとても感動的でした` -> `昨天` | `見た` | `映画` | `は` | `とても` | `感動` | `的` | `で` | `した`
- `how are you` -> `how ` | `are ` | `you`
**Stage 3**: concatenate substrings based on their languages using [`fast-langdetect`](https://github.com/LlmKira/fast-langdetect), [`wordfreq`](https://github.com/rspeer/wordfreq) and regex (rule-based)
- `你` | `喜欢` | `看` | `アニメ` | `吗` -> `你喜欢看` | `アニメ` | `吗`
- `昨天` | `見た` | `映画` | `は` | `とても` | `感動` | `的` | `で` | `した` -> `昨天` | `見た映画はとても感動的でした`
- `how ` | `are ` | `you` -> `how are you`
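As a rough illustration, here is a minimal sketch of the Stage 1 and Stage 3 logic. This is not the package's actual implementation, and `detect_lang` is a hypothetical stand-in for a real detector such as `fast-langdetect`:

```python
import re
from typing import Callable

def rule_based_split(text: str) -> list[str]:
    # Stage 1 sketch: keep digit runs, word runs (internal spaces allowed),
    # and individual punctuation marks as separate substrings.
    return re.findall(r"\d+|[^\W\d_]+(?:\s+[^\W\d_]+)*|[^\w\s]", text)

def merge_by_lang(substrings: list[str], detect_lang: Callable[[str], str]) -> list[str]:
    # Stage 3 sketch: concatenate adjacent substrings that share a detected language.
    merged: list[str] = []
    last_lang = None
    for sub in substrings:
        lang = detect_lang(sub)
        if merged and lang == last_lang:
            merged[-1] += sub
        else:
            merged.append(sub)
        last_lang = lang
    return merged

print(rule_based_split("hello, how are you"))
# ['hello', ',', 'how are you']
print(merge_by_lang(["你", "喜欢", "看", "アニメ", "吗"],
                    detect_lang=lambda s: "ja" if s == "アニメ" else "zh"))
# ['你喜欢看', 'アニメ', '吗']
```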
<details>
<summary>More split examples</summary>
```
correct_substrings : ['x|我是 ', 'x|VGroupChatBot', 'punctuation|,', 'x|一个旨在支持多人通信的助手', 'punctuation|,', 'x|通过可视化消息来帮助团队成员更好地交流', 'punctuation|。', 'x|我可以帮助团队成员更好地整理和共享信息', 'punctuation|,', 'x|特别是在讨论', 'punctuation|、', 'x|会议和', 'x|Brainstorming', 'x|等情况下', 'punctuation|。', 'x|你好我的名字是', 'x|西野くまです', 'x|my name is bob', 'x|很高兴认识你', 'x|どうぞよろしくお願いいたします', 'punctuation|「', 'x|こんにちは', 'punctuation|」', 'x|是什么意思', 'punctuation|。']
test_split_substrings: ['zh|我是 ', 'en|VGroupChatBot', 'punctuation|,', 'zh|一个旨在支持多人通信的助手', 'punctuation|,', 'zh|通过可视化消息来帮助团队成员更好地交流', 'punctuation|。', 'zh|我可以帮助团队成员更好地整理和共享信息', 'punctuation|,', 'zh|特别是在讨论', 'punctuation|、', 'zh|会议和', 'en|Brainstorming', 'zh|等情况下', 'punctuation|。', 'zh|你好我的名字是', 'ja|西野くまです', 'en|my name is bob', 'zh|很高兴认识你', 'ja|どうぞよろしくお願いいたします', 'punctuation|「', 'ja|こんにち は', 'punctuation|」', 'zh|是什么意思', 'punctuation|。']
acc : 25/25
--------------------------
correct_substrings : ['x|我的名字是', 'x|西野くまです', 'punctuation|。', 'x|I am from Tokyo', 'punctuation|, ', 'x|日本の首都', 'punctuation|。', 'x|今天的天气非常好']
test_split_substrings: ['zh|我的名字是', 'ja|西野くまです', 'punctuation|。', 'en|I am from Tokyo', 'punctuation|, ', 'ja|日本の首都', 'punctuation|。', 'zh|今天的天气非常好']
acc : 8/8
--------------------------
correct_substrings : ['x|你好', 'punctuation|,', 'x|今日はどこへ行きますか', 'punctuation|?']
test_split_substrings: ['zh|你好', 'punctuation|,', 'ja|今日はどこへ行きますか', 'punctuation|?']
acc : 4/4
--------------------------
correct_substrings : ['x|你好', 'x|今日はどこへ行きますか', 'punctuation|?']
test_split_substrings: ['zh|你好', 'ja|今日はどこへ行きますか', 'punctuation|?']
acc : 3/3
--------------------------
correct_substrings : ['x|我的名字是', 'x|田中さんです', 'punctuation|。']
test_split_substrings: ['zh|我的名字是田中', 'ja|さんです', 'punctuation|。']
acc : 1/3
--------------------------
correct_substrings : ['x|我喜欢吃寿司和拉面', 'x|おいしいです', 'punctuation|。']
test_split_substrings: ['zh|我喜欢吃寿司和拉面', 'ja|おいしいです', 'punctuation|。']
acc : 3/3
--------------------------
correct_substrings : ['x|今天', 'x|の天気はとてもいいですね', 'punctuation|。']
test_split_substrings: ['zh|今天', 'ja|の天気はとてもいいですね', 'punctuation|。']
acc : 3/3
--------------------------
correct_substrings : ['x|我在学习', 'x|日本語少し難しいです', 'punctuation|。']
test_split_substrings: ['zh|我在学习日本語少', 'ja|し難しいです', 'punctuation|。']
acc : 1/3
--------------------------
correct_substrings : ['x|日语真是', 'x|おもしろい', 'x|啊']
test_split_substrings: ['zh|日语真是', 'ja|おもしろい', 'zh|啊']
acc : 3/3
--------------------------
correct_substrings : ['x|你喜欢看', 'x|アニメ', 'x|吗', 'punctuation|?']
test_split_substrings: ['zh|你喜欢看', 'ja|アニメ', 'zh|吗', 'punctuation|?']
acc : 4/4
--------------------------
correct_substrings : ['x|我想去日本旅行', 'punctuation|、', 'x|特に京都に行きたいです', 'punctuation|。']
test_split_substrings: ['zh|我想去日本旅行', 'punctuation|、', 'ja|特に京都に行きたいです', 'punctuation|。']
acc : 4/4
--------------------------
correct_substrings : ['x|昨天', 'x|見た映画はとても感動的でした', 'punctuation|。', 'x|我朋友是日本人', 'x|彼はとても優しいです', 'punctuation|。']
test_split_substrings: ['zh|昨天', 'ja|見た映画はとても感動的でした', 'punctuation|。', 'zh|我朋友是日本人', 'ja|彼はとても優しいです', 'punctuation|。']
acc : 6/6
--------------------------
correct_substrings : ['x|我们一起去', 'x|カラオケ', 'x|吧', 'punctuation|、', 'x|楽しそうです', 'punctuation|。']
test_split_substrings: ['zh|我们一起去', 'ja|カラオケ', 'zh|吧', 'punctuation|、', 'ja|楽しそうです', 'punctuation|。']
acc : 6/6
--------------------------
correct_substrings : ['x|我的家在北京', 'punctuation|、', 'x|でも', 'punctuation|、', 'x|仕事で東京に住んでいます', 'punctuation|。']
test_split_substrings: ['ja|我的家在北京', 'punctuation|、', 'ja|でも', 'punctuation|、', 'ja|仕事で東京に住んでいます', 'punctuation|。']
acc : 6/6
--------------------------
correct_substrings : ['x|我在学做日本料理', 'punctuation|、', 'x|日本料理を作るのを習っています', 'punctuation|。']
test_split_substrings: ['ja|我在学做日本料理', 'punctuation|、', 'ja|日本料理を作るのを習っています', 'punctuation|。']
acc : 4/4
--------------------------
correct_substrings : ['x|你会说几种语言', 'punctuation|、', 'x|何ヶ国語話せますか', 'punctuation|?']
test_split_substrings: ['zh|你会说几种语言', 'punctuation|、', 'ja|何ヶ国語話せますか', 'punctuation|?']
acc : 4/4
--------------------------
correct_substrings : ['x|我昨天看了一本书', 'punctuation|、', 'x|その本はとても面白かったです', 'punctuation|。']
test_split_substrings: ['zh|我昨天看了一本书', 'punctuation|、', 'ja|その本はとても面白かったです', 'punctuation|。']
acc : 4/4
--------------------------
correct_substrings : ['x|你最近好吗', 'punctuation|、', 'x|最近どうですか', 'punctuation|?']
test_split_substrings: ['zh|你最近好吗', 'punctuation|、', 'ja|最近どうですか', 'punctuation|?']
acc : 4/4
--------------------------
correct_substrings : ['x|你最近好吗', 'x|最近どうですか', 'punctuation|?']
test_split_substrings: ['zh|你最近好吗最近', 'ja|どうですか', 'punctuation|?']
acc : 1/3
--------------------------
correct_substrings : ['x|我在学做日本料理', 'x|와 한국 요리', 'punctuation|、', 'x|日本料理を作るのを習っています', 'punctuation|。']
test_split_substrings: ['ja|我在学做日本料理', 'ko|와 한국 요리', 'punctuation|、', 'ja|日本料理を作るのを習っています', 'punctuation|。']
acc : 5/5
--------------------------
correct_substrings : ['x|你会说几种语言', 'punctuation|、', 'x|何ヶ国語話せますか', 'punctuation|?', 'x|몇 개 언어를 할 수 있어요', 'punctuation|?']
test_split_substrings: ['zh|你会说几种语言', 'punctuation|、', 'ja|何ヶ国語話せますか', 'punctuation|?', 'ko|몇 개 언어를 할 수 있어요', 'punctuation|?']
acc : 6/6
--------------------------
correct_substrings : ['x|我昨天看了一本书', 'punctuation|、', 'x|その本はとても面白かったです', 'punctuation|。', 'x|어제 책을 읽었는데', 'punctuation|, ', 'x|정말 재미있었어요', 'punctuation|。']
test_split_substrings: ['zh|我昨天看了一本书', 'punctuation|、', 'ja|その本はとても面白かったです', 'punctuation|。', 'ko|어제 책을 읽었는데', 'punctuation|, ', 'ko|정말 재미있었어요', 'punctuation|。']
acc : 8/8
--------------------------
correct_substrings : ['x|我们一起去逛街', 'x|와 쇼핑', 'punctuation|、', 'x|買い物に行きましょう', 'punctuation|。', 'x|쇼핑하러 가요', 'punctuation|。']
test_split_substrings: ['zh|我们一起去逛街', 'ko|와 쇼핑', 'punctuation|、', 'ja|買い物に行きましょう', 'punctuation|。', 'ko|쇼핑하러 가요', 'punctuation|。']
acc : 7/7
--------------------------
correct_substrings : ['x|你最近好吗', 'punctuation|、', 'x|最近どうですか', 'punctuation|?', 'x|요즘 어떻게 지내요', 'punctuation|?']
test_split_substrings: ['zh|你最近好吗', 'punctuation|、', 'ja|最近どうですか', 'punctuation|?', 'ko|요즘 어떻게 지내요', 'punctuation|?']
acc : 6/6
--------------------------
correct_substrings : ['x|Bonjour', 'punctuation|, ', "x|wie geht's dir ", 'x|today', 'punctuation|?']
test_split_substrings: ['fr|Bonjour', 'punctuation|, ', "de|wie geht's dir ", 'en|today', 'punctuation|?']
acc : 5/5
--------------------------
correct_substrings : ['x|Vielen Dank ', 'x|merci beaucoup ', 'x|for your help', 'punctuation|.']
test_split_substrings: ['de|Vielen ', 'fr|Dank merci beaucoup ', 'en|for your help', 'punctuation|.']
acc : 2/4
--------------------------
correct_substrings : ['x|Ich bin müde ', 'x|je suis fatigué ', 'x|and I need some rest', 'punctuation|.']
test_split_substrings: ['de|Ich ', 'en|bin ', 'de|müde ', 'fr|je suis fatigué ', 'en|and I need some rest', 'punctuation|.']
acc : 3/4
--------------------------
correct_substrings : ['x|Ich mag dieses Buch ', 'x|ce livre est intéressant ', 'x|and it has a great story', 'punctuation|.']
test_split_substrings: ['de|Ich mag dieses Buch ', 'fr|ce livre est intéressant ', 'en|and it has a great story', 'punctuation|.']
acc : 4/4
--------------------------
correct_substrings : ['x|Ich mag dieses Buch', 'punctuation|, ', 'x|ce livre est intéressant', 'punctuation|, ', 'x|and it has a great story', 'punctuation|.']
test_split_substrings: ['de|Ich mag dieses Buch', 'punctuation|, ', 'fr|ce livre est intéressant', 'punctuation|, ', 'en|and it has a great story', 'punctuation|.']
acc : 6/6
--------------------------
correct_substrings : ['x|The shirt is ', 'x|9.15 ', 'x|dollars', 'punctuation|.']
test_split_substrings: ['en|The shirt is ', 'digit|9', 'punctuation|.', 'digit|15 ', 'en|dollars', 'punctuation|.']
acc : 3/4
--------------------------
correct_substrings : ['x|The shirt is ', 'digit|233 ', 'x|dollars', 'punctuation|.']
test_split_substrings: ['en|The shirt is ', 'digit|233 ', 'en|dollars', 'punctuation|.']
acc : 4/4
--------------------------
correct_substrings : ['x|lang', 'punctuation|-', 'x|split']
test_split_substrings: ['en|lang', 'punctuation|-', 'en|split']
acc : 3/3
--------------------------
correct_substrings : ['x|I have ', 'digit|10', 'punctuation|, ', 'x|€']
test_split_substrings: ['en|I have ', 'digit|10', 'punctuation|, ', 'fr|€']
acc : 4/4
--------------------------
correct_substrings : ['x|日本のメディアでは', 'punctuation|「', 'x|匿名掲示板', 'punctuation|」', 'x|であると紹介されることが多いが', 'punctuation|、', 'x|2003年1月7日から全書き込みについて', 'x|IP', 'x|アドレスの記録・保存を始めており', 'punctuation|、', 'x|厳密には匿名掲示板ではなくなっていると', 'x|CNET Japan', 'x|は報じている']
test_split_substrings: ['ja|日本のメディアでは', 'punctuation|「', 'ja|匿名掲示板', 'punctuation|」', 'ja|であると紹介されることが多いが', 'punctuation|、', 'digit|2003', 'ja|年', 'digit|1', 'ja|月', 'digit|7', 'ja|日から全書き込みについて', 'en|IP', 'ja|アドレスの記録・保存を始めており', 'punctuation|、', 'ja|厳密には匿名掲示板ではなくなっていると', 'en|CNET Japan', 'ja|は報じている']
acc : 12/13
--------------------------
correct_substrings : ['x|日本語', 'punctuation|(', 'x|にほんご', 'punctuation|、', 'x|にっぽんご', 'punctuation|)', 'x|は', 'punctuation|、', 'x|日本国内や', 'punctuation|、', 'x|かつての日本領だった国', 'punctuation|、', 'x|そして国外移民や移住者を含む日本人同士の間で使用されている言語', 'punctuation|。', 'x|日本は法令によって公用語を規定していないが', 'punctuation|、', 'x|法令その他の公用文は全て日本語で記述され', 'punctuation|、', 'x|各種法令において日本語を用いることが規定され', 'punctuation|、', 'x|学校教育においては「国語」の教科として学習を行うなど', 'punctuation|、', 'x|事実上日本国内において唯一の公用語となっている', 'punctuation|。']
test_split_substrings: ['ja|日本語', 'punctuation|(', 'ja|にほんご', 'punctuation|、', 'ja|にっぽんご', 'punctuation|)', 'ja|は', 'punctuation|、', 'ja|日本国内や', 'punctuation|、', 'ja|かつての日本領だった国', 'punctuation|、', 'ja|そして国外移民 や移住者を含む日本人同士の間で使用されている言語', 'punctuation|。', 'ja|日本は法令によって公用語を規定していないが', 'punctuation|、', 'ja|法令その他の公用文は全て日本語で記述され', 'punctuation|、', 'ja|各種法令において日本語を用いることが規定され', 'punctuation|、', 'ja|学校教育においては', 'punctuation|「', 'ja|国語', 'punctuation|」', 'ja|の教科として学習を行うなど', 'punctuation|、', 'ja|事実上日本国内において唯一の公用語となっている', 'punctuation|。']
acc : 23/24
--------------------------
correct_substrings : ['x|日语是日本通用语及事实上的官方语言', 'punctuation|。', 'x|没有精确的日语使用人口的统计', 'punctuation|,', 'x|如果计算日本人口以及居住在日本以外的日本人', 'punctuation|、', 'x|日侨和日裔', 'punctuation|,', 'x|日语使用者应超过一亿三千万人', 'punctuation|。']
test_split_substrings: ['zh|日语是日本通用语及事实上的官方语言', 'punctuation|。', 'zh|没有精确的日语使用人口的统计', 'punctuation|,', 'zh|如果计算日本人口以及居住在日本以外的日本人', 'punctuation|、', 'zh|日侨和日裔', 'punctuation|,', 'zh|日语使用 者应超过一亿三千万人', 'punctuation|。']
acc : 10/10
--------------------------
total substring num: 217
test total substring num: 230
text acc num: 205
precision: 0.9447004608294931
recall: 0.8913043478260869
F1 Score: 0.9172259507829977
time: 0.3573117256164551
```
</details>
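From the summary counts above, the reported metrics follow arithmetically: precision = 205/217 ≈ 0.9447, recall = 205/230 ≈ 0.8913, and F1 = 2·P·R / (P + R) ≈ 0.9172.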
# 2. 🪨Motivation
- `TTS (Text-To-Speech)` models often **fail** at multi-language speech generation; there are two ways to address this:
  - Train a model that can pronounce multiple languages
  - **(This Package)** Separate the sentence by language first, then use a different model for each language
- Existing models in NLP toolkits (e.g. `SpaCy`, `jieba`) are usually designed to handle text in **ONE** language per model, which means multi-language texts need preprocessing, like the texts below (a routing sketch follows the examples):
```
你喜欢看アニメ吗?
Vielen Dank merci beaucoup for your help.
你最近好吗、最近どうですか?요즘 어떻게 지내요?sky is clear and sunny。
```
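For the TTS case, here is a minimal sketch of the separate-then-route idea; the `synthesize_*` functions are hypothetical placeholders for real per-language TTS models:

```python
from split_lang import LangSplitter

def synthesize_zh(text: str) -> None:
    print(f"[zh-TTS] {text}")  # placeholder for a Chinese TTS model

def synthesize_ja(text: str) -> None:
    print(f"[ja-TTS] {text}")  # placeholder for a Japanese TTS model

tts_by_lang = {"zh": synthesize_zh, "ja": synthesize_ja}

lang_splitter = LangSplitter()
for item in lang_splitter.split_by_lang(text="你喜欢看アニメ吗"):
    # Route each substring to its language's engine; fall back to the
    # Chinese engine for anything without a dedicated model.
    tts_by_lang.get(item.lang, synthesize_zh)(item.text)
```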
- [1. 💡How it works](#1-how-it-works)
- [2. 🪨Motivation](#2-motivation)
- [3. 📕Usage](#3-usage)
  - [3.1. 🚀Installation](#31-installation)
  - [3.2. Basic](#32-basic)
    - [3.2.1. `split_by_lang`](#321-split_by_lang)
    - [3.2.2. `merge_across_digit`](#322-merge_across_digit)
  - [3.3. Advanced](#33-advanced)
    - [3.3.1. usage of `lang_map` and `default_lang` (for your languages)](#331-usage-of-lang_map-and-default_lang-for-your-languages)
- [4. Acknowledgement](#4-acknowledgement)
- [5. ✨Star History](#5-star-history)
# 3. 📕Usage
## 3.1. 🚀Installation
You can install the package using pip:
```bash
pip install split-lang
```
## 3.2. Basic
### 3.2.1. `split_by_lang`
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DoodleBears/split-lang/blob/main/split-lang-demo.ipynb)
```python
from split_lang import LangSplitter
lang_splitter = LangSplitter()
text = "你喜欢看アニメ吗"
substr = lang_splitter.split_by_lang(
    text=text,
)
for index, item in enumerate(substr):
    print(f"{index}|{item.lang}:{item.text}")
```
```
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗
```
```python
from split_lang import LangSplitter
import time

lang_splitter = LangSplitter(merge_across_punctuation=True)
texts = [
    "你喜欢看アニメ吗?我也喜欢看",
    "Please star this project on GitHub, Thanks you. I love you请加星这个项目,谢谢你。我爱你この項目をスターしてください、ありがとうございます!愛してる",
]
time1 = time.time()
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
    print("----------------------")
time2 = time.time()
print(time2 - time1)
```
```
0|zh:你喜欢看
1|ja:アニメ
2|zh:吗?我也喜欢看
----------------------
0|en:Please star this project on GitHub, Thanks you. I love you
1|zh:请加星这个项目,谢谢你。我爱你
2|ja:この項目をスターしてください、ありがとうございます!愛してる
----------------------
0.007998466491699219
```
### 3.2.2. `merge_across_digit`
```python
lang_splitter.merge_across_digit = False
texts = [
    "衬衫的价格是9.15便士",
]
for text in texts:
    substr = lang_splitter.split_by_lang(
        text=text,
    )
    for index, item in enumerate(substr):
        print(f"{index}|{item.lang}:{item.text}")
```
```
0|zh:衬衫的价格是
1|digit:9.15
2|zh:便士
```
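For comparison, here is a sketch of re-enabling the flag; the exact output is an assumption based on the flag's name and its apparent default of `True`:

```python
# Assumption: with merge_across_digit set back to True (its apparent default),
# the digit run is merged into the surrounding same-language text.
lang_splitter.merge_across_digit = True
for index, item in enumerate(lang_splitter.split_by_lang(text="衬衫的价格是9.15便士")):
    print(f"{index}|{item.lang}:{item.text}")
# Expected (unverified): 0|zh:衬衫的价格是9.15便士
```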
## 3.3. Advanced
### 3.3.1. usage of `lang_map` and `default_lang` (for your languages)
> [!IMPORTANT]
> Add language codes for your use case if other languages are needed. [See supported languages](https://github.com/zafercavdar/fasttext-langdetect#supported-languages)
- the default `lang_map` looks like the one below
  - if `lingua-py`, `fasttext`, or any other language detector detects a language that is NOT included in `lang_map`, it will be set to `default_lang`
  - if you set `default_lang`, or the `value` of a `key:value` pair in `lang_map`, to `x`, that substring will be merged into a neighboring substring
    - `zh` | `x` | `ja` -> `zh` | `ja` (the `x` is merged into one side)
  - in the example below, `zh-tw` is mapped to `x` because characters in `zh` and `ja` text are sometimes detected as Traditional Chinese
- the default `default_lang` is `x`
```python
DEFAULT_LANG_MAP = {
    "zh": "zh",
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu Chinese
    "zh-cn": "zh",
    "zh-tw": "x",
    "ko": "ko",
    "ja": "ja",
    "de": "de",
    "fr": "fr",
    "en": "en",
    "hr": "en",
}
DEFAULT_LANG = "x"
```
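To keep a language that the default map folds into `x`, or to change the fallback, one can supply a custom map. Below is a sketch assuming `LangSplitter` exposes `lang_map` and `default_lang` as plain attributes, in the same style as `merge_across_digit` in section 3.2.2:

```python
from split_lang import LangSplitter

lang_splitter = LangSplitter()
# Hypothetical customization: keep Russian as its own language and fall back
# to English (instead of `x`) for any detection outside the map.
lang_splitter.lang_map = {
    "zh": "zh",
    "ja": "ja",
    "en": "en",
    "ru": "ru",
}
lang_splitter.default_lang = "en"
```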
# 4. Acknowledgement
- Inspired by [LlmKira/fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- Text segmentation depends on [google/budoux](https://github.com/google/budoux)
- Language detection depends on [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect) and [rspeer/wordfreq](https://github.com/rspeer/wordfreq)
# 5. ✨Star History
[![Star History Chart](https://api.star-history.com/svg?repos=DoodleBears/split-lang&type=Timeline)](https://star-history.com/#DoodleBears/split-lang&Timeline)