# 🇵🇰 romanRekhta
**romanRekhta** is a lightweight, flexible, and extensible Python library for **Roman Urdu Natural Language Processing (NLP)**. It provides essential tools for text preprocessing, tokenization, and stopword removal tailored to Roman Urdu, a widely used informal script across Pakistan and South Asia.
---
## Features
- Modular preprocessing (lowercasing, punctuation removal, emoji handling)
- Flexible stopword filtering (from `.txt` file or custom list)
- Tokenization (word and sentence level)
- Easily extendable and community-contributable
---
## Installation
```bash
pip install romanRekhta
# Or, for local development, from a source checkout:
pip install -e .
```
---
## Quick Start
```python
from romanRekhta.preprocessing import Preprocessor
text = "Apka service boht achi hai! 😍🔥"
# Remove emojis and punctuation
pre = Preprocessor(lowercase=True, emoji_handling="remove")
cleaned = pre.process(text)
print(cleaned) # Output: apka service boht achi hai
```
### Emoji Options:
- `"remove"` → Deletes emojis
- `"replace"` → Converts 😊 → `smiling_face`
- `"ignore"` → Leaves emojis untouched
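The `"replace"` strategy can be sketched in plain Python as a lookup from emoji codepoints to descriptive names. This is an illustrative sketch, not the library's implementation; the `EMOJI_NAMES` mapping and `replace_emojis` helper below are hypothetical, and the actual names romanRekhta emits may differ.

```python
# Hypothetical sketch of the "replace" emoji strategy:
# map each emoji codepoint to a descriptive name, e.g. 😊 -> smiling_face.
EMOJI_NAMES = {
    "\U0001F60A": "smiling_face",  # 😊
    "\U0001F525": "fire",          # 🔥
}

def replace_emojis(text: str) -> str:
    # Substitute every known emoji with its name; unknown characters pass through.
    for emoji, name in EMOJI_NAMES.items():
        text = text.replace(emoji, name)
    return text

print(replace_emojis("Shukriya \U0001F60A"))  # Shukriya smiling_face
```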
---
## Stopword Removal
```python
from romanRekhta.stopwords import StopwordHandler
# Load from file and extend with custom stopwords
stop_handler = StopwordHandler(
filepath="stopwords.txt",
custom_stopwords={"mera", "nahi"}
)
tokens = ["ye", "mera", "kaam", "nahi", "acha", "hai"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'kaam', 'acha']
# Load only from file
stop_handler = StopwordHandler(filepath="stopwords.txt")
tokens = ["ye", "bohat", "acha", "kaam", "hai"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'bohat', 'acha', 'kaam']
# Custom stopword list only
custom_words = {"mera", "tumhara"}
stop_handler = StopwordHandler(custom_stopwords=custom_words)
tokens = ["ye", "mera", "bohat", "acha", "kaam", "nahi"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'bohat', 'acha', 'kaam']
# Combine file + custom stopwords
stop_handler = StopwordHandler(
filepath="stopwords.txt",
custom_stopwords={"nahi", "kya"}
)
tokens = ["ye", "nahi", "kya", "acha", "kaam"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'acha', 'kaam']
```
---
## Tokenization
```python
from romanRekhta.tokenizer import word_tokenize, sentence_tokenize
text = "Yeh idea bohat acha hai. Shukriya!"
# Word tokenization
tokens = word_tokenize(text)
print(tokens) # ['Yeh', 'idea', 'bohat', 'acha', 'hai', 'Shukriya']
# Sentence tokenization
sentences = sentence_tokenize(text)
print(sentences) # ['Yeh idea bohat acha hai', ' Shukriya']
# Advanced tokenization method
tokens = word_tokenize(text, method="regex")
```
---
## Configurable Preprocessor Options
| Option | Type | Default | Description |
|--------------------------|------|---------|----------------------------------------------|
| `lowercase` | bool | True | Convert to lowercase |
| `punctuation` | bool | True | Remove punctuation |
| `emoji_handling` | str | remove | Options: `'remove'`, `'replace'`, `'ignore'` |
| `normalize_space` | bool | True | Remove extra whitespaces |
| `remove_non_ascii_chars` | bool | False | Remove emojis and symbols |
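For intuition, the options in the table above can be sketched as a plain-Python pipeline. This is a sketch of the described behaviors under stated assumptions (e.g. that `normalize_space` collapses runs of whitespace), not the library's code; use `Preprocessor` itself in practice.

```python
import re
import string

def preprocess(text, lowercase=True, punctuation=True,
               normalize_space=True, remove_non_ascii_chars=False):
    # Illustrative pipeline mirroring the table of options; not romanRekhta's code.
    if remove_non_ascii_chars:
        # Drop emojis and other non-ASCII symbols.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if lowercase:
        text = text.lower()
    if punctuation:
        # Strip standard punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if normalize_space:
        # Collapse runs of whitespace into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("Apka  service boht achi hai!"))  # apka service boht achi hai
```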
---
## File-Based Stopwords
Place a `stopwords.txt` file in your project or repository.
Each line should contain **one Roman Urdu stopword**.
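A one-word-per-line file like this can be loaded into a set and used for filtering. The snippet below is a self-contained sketch of that file format (the example stopwords are illustrative), not `StopwordHandler`'s internal loader.

```python
from pathlib import Path

# Write a sample stopword file: one Roman Urdu stopword per line.
Path("stopwords.txt").write_text("hai\nka\nki\n", encoding="utf-8")

# Load it into a set, skipping blank lines.
stopwords = {
    line.strip()
    for line in Path("stopwords.txt").read_text(encoding="utf-8").splitlines()
    if line.strip()
}

tokens = ["ye", "kaam", "hai"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['ye', 'kaam']
```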
---
## Contributing
We welcome contributions to improve this library! Here's how you can help:
- Add more Roman Urdu stopwords to `stopwords.txt`
- Suggest or implement new features (normalization, spell checker, sentiment analysis)
- Report bugs or edge cases
### To contribute:
1. Fork the repo
2. Create a new branch
3. Commit your changes
4. Submit a Pull Request
---
Made with ❤️ in Pakistan 🇵🇰 for the Roman Urdu NLP community.