bilingual


- Name: bilingual
- Version: 1.0.0
- Summary: High-quality Bangla and English NLP toolkit for production use
- Upload time: 2025-11-15 17:12:12
- Requires Python: >=3.8
- License: Apache-2.0
- Keywords: nlp, bangla, bengali, bilingual, multilingual, tokenization, translation, language-model, ai, machine-learning
- Requirements: numpy, sentencepiece, regex, tqdm, requests, fastapi, uvicorn, pydantic, pydantic-settings, torch, transformers, accelerate, peft, bitsandbytes, datasets, tensorboard, onnx, onnxruntime, optimum, huggingface_hub, gradio, beautifulsoup4, lxml, fake-useragent, wikiextractor, indic-nlp-library, nltk, matplotlib, scikit-learn, typer, rich, prometheus-client, pytest, pytest-cov, black, isort, flake8, mypy

# Bilingual | দ্বিভাষিক

<div align="center">

**High-quality Bangla + English NLP toolkit for production use**

**প্রোডাকশন ব্যবহারের জন্য উচ্চমানের বাংলা + ইংরেজি NLP টুলকিট**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

[English](#english) | [বাংলা](#বাংলা)

</div>

---

## English

### Overview

**bilingual** is a Python package providing production-ready tools for Bangla and English natural language processing. It focuses on:

- 🌍 **Bilingual Support**: Equal treatment for Bangla and English
- 👶 **Child-Friendly Content**: Special focus on educational and age-appropriate material
- 🚀 **Production Ready**: Easy installation, comprehensive docs, robust testing
- 🔧 **Flexible**: From tokenization to translation, generation to classification
- 📚 **Well-Documented**: Full documentation in both English and Bangla

### Features

- **Text Normalization**: Unicode normalization, punctuation handling, script cleaning
- **Tokenization**: Shared SentencePiece tokenizer optimized for Bangla + English
- **Language Models**: Bilingual pretrained and fine-tuned models for generation
- **Translation**: Bangla ↔ English translation assistance
- **Classification**: Readability scoring, age-level detection, safety filtering
- **Utilities**: Dataset tools, evaluation metrics, preprocessing pipelines
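
To make the text normalization feature concrete, here is a minimal sketch of Unicode normalization and whitespace cleanup using only the Python standard library. It is not the package's own implementation; the function name `normalize_sketch` and the choice of which invisible characters to strip are assumptions made for illustration.

```python
import re
import unicodedata

# Zero-width space and byte-order mark. ZWJ/ZWNJ are deliberately kept,
# because they can change how Bangla conjuncts are rendered.
_INVISIBLE = re.compile("[\u200b\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def normalize_sketch(text: str) -> str:
    """Hypothetical helper: NFC-normalize, drop invisible debris, collapse spaces."""
    # NFC gives canonically equivalent sequences one representation,
    # e.g. U+09C7 + U+09BE becomes the single vowel sign U+09CB.
    text = unicodedata.normalize("NFC", text)
    text = _INVISIBLE.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


print(normalize_sketch("  I  am\u200b going \n to school. "))
```

Normalizing before tokenization means that visually identical strings typed with different keyboards or input methods map to the same token sequence.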

### Quick Start

#### Installation

```bash
pip install bilingual
```

For development:

```bash
git clone https://github.com/kothagpt/bilingual.git
cd bilingual
pip install -e ".[dev]"
```
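
If you download the wheel by hand instead of letting `pip` fetch it, you may want to check the file against the SHA-256 digest published in the package metadata (shown under Raw data on this page). The sketch below uses only `hashlib`; the helper name `sha256_of_file` and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Digest listed for bilingual-1.0.0-py3-none-any.whl in the package metadata.
EXPECTED_SHA256 = "8136ed21120f5b0b781984efa157c3bac860f538ac47978d3ec5b363551aa209"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hash a file in chunks so large wheels fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the wheel was saved to the current directory:
# assert sha256_of_file("bilingual-1.0.0-py3-none-any.whl") == EXPECTED_SHA256
```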

#### Basic Usage

```python
from bilingual import bilingual_api as bb

# Load tokenizer
tokenizer = bb.load_tokenizer("bilingual-tokenizer")

# Normalize text
text_bn = bb.normalize_text("আমি স্কুলে যাচ্ছি।", lang="bn")
text_en = bb.normalize_text("I am going to school.", lang="en")

# Generate text
prompt = "A short story about a brave rabbit / সাহসী খরগোশের একটি ছোট গল্প"
story = bb.generate(prompt, model_name="bilingual-small-lm", max_tokens=150)

# Translate
translation = bb.translate("আমি বই পড়তে ভালোবাসি।", src="bn", tgt="en")
print(translation)  # "I love to read books."

# Check readability
level = bb.readability_check(text_bn, lang="bn")
print(f"Reading level: {level}")
```
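
This README does not describe how `readability_check` computes its result. Purely as an illustration of what a readability signal can look like for mixed Bangla and English text, the sketch below computes average words per sentence, treating the Bangla danda (`।`, U+0964) as a sentence terminator. The function name and the metric are assumptions, not the package's method.

```python
import re

# Sentence terminators: ASCII punctuation plus the Bangla danda (U+0964).
_SENTENCE_END = re.compile(r"[.!?\u0964]+")


def avg_words_per_sentence(text: str) -> float:
    """Hypothetical proxy metric: longer sentences are usually harder to read."""
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


print(avg_words_per_sentence("I am going to school. I love books."))
```

A real age-level classifier would need more than sentence length, but splitting on the danda is a detail that any Bangla-aware metric has to handle.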

#### CLI Usage

```bash
# Tokenize text
bilingual tokenize --lang bn --text "আমি ভাত খাই।"

# Generate text
bilingual generate --model bilingual-small-lm --prompt "Once upon a time..." --max-tokens 100

# Translate
bilingual translate --src bn --tgt en --text "আমি তোমাকে ভালোবাসি।"

# Evaluate model
bilingual evaluate --dataset data/test.jsonl --model bilingual-small-lm
```
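
When the CLI is driven from a script, passing the arguments as a list avoids shell-quoting problems with Bangla text. The sketch below builds the `bilingual translate` command from the flags shown above and runs it with `subprocess`. The helper names are hypothetical, and the sketch assumes the command prints its result to standard output, which this README does not confirm.

```python
import subprocess
from typing import List


def build_translate_command(text: str, src: str = "bn", tgt: str = "en") -> List[str]:
    """Hypothetical helper: argv list for `bilingual translate`."""
    return ["bilingual", "translate", "--src", src, "--tgt", tgt, "--text", text]


def run_translate(text: str, src: str = "bn", tgt: str = "en") -> str:
    """Run the CLI and return its stdout (assumed to hold the translation)."""
    result = subprocess.run(
        build_translate_command(text, src, tgt),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.strip()
```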

### Project Structure

```
bilingual/
├── bilingual/              # Main package
│   ├── __init__.py
│   ├── api.py             # High-level API
│   ├── tokenizer.py       # Tokenization utilities
│   ├── normalize.py       # Text normalization
│   ├── models/            # Model implementations
│   │   ├── loader.py
│   │   ├── lm.py
│   │   └── translate.py
│   ├── evaluation.py      # Evaluation metrics
│   ├── data_utils.py      # Dataset utilities
│   └── cli.py             # Command-line interface
├── scripts/               # Training and data scripts
├── tests/                 # Test suite
├── docs/                  # Documentation
│   ├── en/               # English docs
│   └── bn/               # Bangla docs
├── datasets/              # Dataset storage
└── models/                # Model storage
```

### Documentation

- 📖 [Full Documentation](docs/en/README.md)
- 🚀 [Quick Start Guide](docs/en/quickstart.md)
- 🔧 [API Reference](docs/en/api.md)
- 🤝 [Contributing Guide](CONTRIBUTING.md)
- 🗺️ [Roadmap](ROADMAP.md)

### Development

```bash
# Run tests
pytest tests/

# Format code
black bilingual/ tests/

# Type checking
mypy bilingual/

# Lint
flake8 bilingual/
```

### Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

Areas where we need help:
- 📊 Dataset collection and curation
- 🤖 Model training and fine-tuning
- 📝 Documentation and translation
- 🧪 Testing and quality assurance
- 🐛 Bug fixes and improvements

### License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

### Citation

If you use this package in your research, please cite:

```bibtex
@software{bilingual2025,
  title = {Bilingual: High-quality Bangla and English NLP Toolkit},
  author = {Bilingual Project Contributors},
  year = {2025},
  url = {https://github.com/kothagpt/bilingual}
}
```

### Acknowledgments

This project is built with support from the open-source community and aims to advance Bangla language technology for everyone.

---

## বাংলা

### সংক্ষিপ্ত বিবরণ

**bilingual** হল একটি Python প্যাকেজ যা বাংলা এবং ইংরেজি প্রাকৃতিক ভাষা প্রক্রিয়াকরণের জন্য প্রোডাকশন-রেডি টুল প্রদান করে। এটি ফোকাস করে:

- 🌍 **দ্বিভাষিক সমর্থন**: বাংলা এবং ইংরেজির জন্য সমান আচরণ
- 👶 **শিশু-বান্ধব কন্টেন্ট**: শিক্ষামূলক এবং বয়স-উপযুক্ত উপাদানের উপর বিশেষ ফোকাস
- 🚀 **প্রোডাকশন রেডি**: সহজ ইনস্টলেশন, ব্যাপক ডক্স, শক্তিশালী টেস্টিং
- 🔧 **নমনীয়**: টোকেনাইজেশন থেকে অনুবাদ, জেনারেশন থেকে শ্রেণীবিভাগ
- 📚 **ভালভাবে ডকুমেন্টেড**: ইংরেজি এবং বাংলা উভয় ভাষায় সম্পূর্ণ ডকুমেন্টেশন

### বৈশিষ্ট্য

- **টেক্সট নরমালাইজেশন**: ইউনিকোড নরমালাইজেশন, বিরামচিহ্ন হ্যান্ডলিং, স্ক্রিপ্ট পরিষ্কার করা
- **টোকেনাইজেশন**: বাংলা + ইংরেজির জন্য অপ্টিমাইজড শেয়ারড SentencePiece টোকেনাইজার
- **ভাষা মডেল**: জেনারেশনের জন্য দ্বিভাষিক প্রিট্রেইনড এবং ফাইন-টিউনড মডেল
- **অনুবাদ**: বাংলা ↔ ইংরেজি অনুবাদ সহায়তা
- **শ্রেণীবিভাগ**: পঠনযোগ্যতা স্কোরিং, বয়স-স্তর সনাক্তকরণ, নিরাপত্তা ফিল্টারিং
- **ইউটিলিটি**: ডেটাসেট টুল, মূল্যায়ন মেট্রিক্স, প্রিপ্রসেসিং পাইপলাইন

### দ্রুত শুরু

#### ইনস্টলেশন

```bash
pip install bilingual
```

ডেভেলপমেন্টের জন্য:

```bash
git clone https://github.com/kothagpt/bilingual.git
cd bilingual
pip install -e ".[dev]"
```

#### মৌলিক ব্যবহার

```python
from bilingual import bilingual_api as bb

# টোকেনাইজার লোড করুন
tokenizer = bb.load_tokenizer("bilingual-tokenizer")

# টেক্সট নরমালাইজ করুন
text_bn = bb.normalize_text("আমি স্কুলে যাচ্ছি।", lang="bn")
text_en = bb.normalize_text("I am going to school.", lang="en")

# টেক্সট জেনারেট করুন
prompt = "A short story about a brave rabbit / সাহসী খরগোশের একটি ছোট গল্প"
story = bb.generate(prompt, model_name="bilingual-small-lm", max_tokens=150)

# অনুবাদ করুন
translation = bb.translate("আমি বই পড়তে ভালোবাসি।", src="bn", tgt="en")
print(translation)  # "I love to read books."

# পঠনযোগ্যতা চেক করুন
level = bb.readability_check(text_bn, lang="bn")
print(f"Reading level: {level}")
```

#### CLI ব্যবহার

```bash
# টেক্সট টোকেনাইজ করুন
bilingual tokenize --lang bn --text "আমি ভাত খাই।"

# টেক্সট জেনারেট করুন
bilingual generate --model bilingual-small-lm --prompt "Once upon a time..." --max-tokens 100

# অনুবাদ করুন
bilingual translate --src bn --tgt en --text "আমি তোমাকে ভালোবাসি।"

# মডেল মূল্যায়ন করুন
bilingual evaluate --dataset data/test.jsonl --model bilingual-small-lm
```

### ডকুমেন্টেশন

- 📖 [সম্পূর্ণ ডকুমেন্টেশন](docs/bn/README.md)
- 🚀 [দ্রুত শুরু গাইড](docs/bn/quickstart.md)
- 🔧 [API রেফারেন্স](docs/bn/api.md)
- 🤝 [অবদান গাইড](CONTRIBUTING.md)
- 🗺️ [রোডম্যাপ](ROADMAP.md)

### অবদান রাখা

আমরা অবদান স্বাগত জানাই! বিস্তারিত জানার জন্য অনুগ্রহ করে আমাদের [অবদান গাইড](CONTRIBUTING.md) দেখুন।

যেসব ক্ষেত্রে আমাদের সাহায্য প্রয়োজন:
- 📊 ডেটাসেট সংগ্রহ এবং কিউরেশন
- 🤖 মডেল ট্রেনিং এবং ফাইন-টিউনিং
- 📝 ডকুমেন্টেশন এবং অনুবাদ
- 🧪 টেস্টিং এবং কোয়ালিটি অ্যাসিউরেন্স
- 🐛 বাগ ফিক্স এবং উন্নতি

### লাইসেন্স

এই প্রকল্পটি Apache License 2.0 এর অধীনে লাইসেন্সপ্রাপ্ত - বিস্তারিত জানার জন্য [LICENSE](LICENSE) ফাইল দেখুন।

### স্বীকৃতি

এই প্রকল্পটি ওপেন-সোর্স কমিউনিটির সমর্থনে তৈরি এবং সবার জন্য বাংলা ভাষা প্রযুক্তি এগিয়ে নিয়ে যাওয়ার লক্ষ্যে কাজ করে।

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "bilingual",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "KhulnaSoft Ltd <info@khulnasoft.com>, Md Sulaiman <dev.sulaiman@icloud.com>",
    "keywords": "nlp, bangla, bengali, bilingual, multilingual, tokenization, translation, language-model, ai, machine-learning",
    "author": null,
    "author_email": "KhulnaSoft Ltd <info@khulnasoft.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "High-quality Bangla and English NLP toolkit for production use",
    "version": "1.0.0",
    "project_urls": {
        "Bug Reports": "https://github.com/kothagpt/bilingual/issues",
        "Changelog": "https://github.com/kothagpt/bilingual/releases",
        "Documentation": "https://bilingual.readthedocs.io",
        "Homepage": "https://github.com/kothagpt/bilingual",
        "Issues": "https://github.com/kothagpt/bilingual/issues",
        "Repository": "https://github.com/kothagpt/bilingual",
        "Source Code": "https://github.com/kothagpt/bilingual"
    },
    "split_keywords": [
        "nlp",
        " bangla",
        " bengali",
        " bilingual",
        " multilingual",
        " tokenization",
        " translation",
        " language-model",
        " ai",
        " machine-learning"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "23ae21cc772e2d66252ca829bce0e2f1ed7c3f0ed8e88fed700b55dd55bcb162",
                "md5": "82af4582263c9813bc3ed9fb248a47e3",
                "sha256": "8136ed21120f5b0b781984efa157c3bac860f538ac47978d3ec5b363551aa209"
            },
            "downloads": -1,
            "filename": "bilingual-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "82af4582263c9813bc3ed9fb248a47e3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 121373,
            "upload_time": "2025-11-15T17:12:12",
            "upload_time_iso_8601": "2025-11-15T17:12:12.436707Z",
            "url": "https://files.pythonhosted.org/packages/23/ae/21cc772e2d66252ca829bce0e2f1ed7c3f0ed8e88fed700b55dd55bcb162/bilingual-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-15 17:12:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "kothagpt",
    "github_project": "bilingual",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        },
        {
            "name": "sentencepiece",
            "specs": [
                [
                    ">=",
                    "0.1.96"
                ]
            ]
        },
        {
            "name": "regex",
            "specs": [
                [
                    ">=",
                    "2021.0.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.62.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.26.0"
                ]
            ]
        },
        {
            "name": "fastapi",
            "specs": [
                [
                    ">=",
                    "0.100.0"
                ]
            ]
        },
        {
            "name": "uvicorn",
            "specs": [
                [
                    ">=",
                    "0.23.0"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pydantic-settings",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.44.0"
                ]
            ]
        },
        {
            "name": "accelerate",
            "specs": [
                [
                    ">=",
                    "0.20.0"
                ]
            ]
        },
        {
            "name": "peft",
            "specs": [
                [
                    ">=",
                    "0.5.0"
                ]
            ]
        },
        {
            "name": "bitsandbytes",
            "specs": [
                [
                    ">=",
                    "0.41.0"
                ]
            ]
        },
        {
            "name": "datasets",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "tensorboard",
            "specs": [
                [
                    ">=",
                    "2.13.0"
                ]
            ]
        },
        {
            "name": "onnx",
            "specs": [
                [
                    ">=",
                    "1.14.0"
                ]
            ]
        },
        {
            "name": "onnxruntime",
            "specs": [
                [
                    ">=",
                    "1.15.0"
                ]
            ]
        },
        {
            "name": "optimum",
            "specs": [
                [
                    ">=",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "huggingface_hub",
            "specs": [
                [
                    ">=",
                    "0.25.0"
                ]
            ]
        },
        {
            "name": "gradio",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.11.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.9.0"
                ]
            ]
        },
        {
            "name": "fake-useragent",
            "specs": [
                [
                    ">=",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "wikiextractor",
            "specs": [
                [
                    ">=",
                    "3.0.6"
                ]
            ]
        },
        {
            "name": "indic-nlp-library",
            "specs": [
                [
                    ">=",
                    "0.92"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.8.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.7.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "typer",
            "specs": [
                [
                    ">=",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "13.0.0"
                ]
            ]
        },
        {
            "name": "prometheus-client",
            "specs": [
                [
                    ">=",
                    "0.17.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "black",
            "specs": [
                [
                    ">=",
                    "22.0.0"
                ]
            ]
        },
        {
            "name": "isort",
            "specs": [
                [
                    ">=",
                    "5.10.0"
                ]
            ]
        },
        {
            "name": "flake8",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "mypy",
            "specs": [
                [
                    ">=",
                    "0.950"
                ]
            ]
        }
    ],
    "lcname": "bilingual"
}
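
The `requirements` array above stores each dependency as a name plus a list of `[operator, version]` pairs. As a rough sketch, the structure can be flattened into `requirements.txt`-style lines with the standard library. The function name `to_requirement_lines` is hypothetical, and the inline sample reproduces only two entries from the data above.

```python
import json
from typing import Any, Dict, List


def to_requirement_lines(metadata: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: turn name/specs entries into pip-style specifiers."""
    lines = []
    for req in metadata.get("requirements", []):
        # Multiple constraints on one package are comma-separated, e.g. ">=1.0,<2.0".
        spec = ",".join(op + version for op, version in req.get("specs", []))
        lines.append(req["name"] + spec)
    return lines


sample = json.loads(
    '{"requirements": ['
    '{"name": "numpy", "specs": [[">=", "1.20.0"]]},'
    '{"name": "mypy", "specs": [[">=", "0.950"]]}]}'
)
print("\n".join(to_requirement_lines(sample)))
```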
        