# 🇵🇰 romanRekhta
**romanRekhta** is a lightweight, flexible, and extensible Python library for **Roman Urdu Natural Language Processing (NLP)**. It provides essential tools for text preprocessing, tokenization, and stopword removal tailored to Roman Urdu, a widely used informal script across Pakistan and South Asia.
---
## Features
- Modular preprocessing (lowercasing, punctuation removal, emoji handling)
- Flexible stopword filtering (from `.txt` file or custom list)
- Tokenization (word and sentence level)
- Easily extendable and community-contributable
---
## Installation
```bash
pip install romanRekhta
# Or, for local development, from a source checkout:
pip install -e .
```
---
## Quick Start
```python
from romanRekhta.preprocessing import Preprocessor
text = "Apka service boht achi hai! 😍🔥"
# Remove emojis and punctuation
pre = Preprocessor(lowercase=True, emoji_handling="remove")
cleaned = pre.process(text)
print(cleaned) # Output: apka service boht achi hai
```
### Emoji Options:
- `"remove"` → Deletes emojis
- `"replace"` → Converts 😊 → `smiling_face`
- `"ignore"` → Leaves emojis untouched
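The `"replace"` strategy can be sketched in plain Python as a lookup from emoji codepoints to descriptive names. This is an illustrative sketch, not the library's implementation; the `EMOJI_NAMES` mapping and `replace_emojis` helper below are hypothetical, and the actual names romanRekhta emits may differ.

```python
# Hypothetical sketch of the "replace" emoji strategy:
# map each emoji codepoint to a descriptive name, e.g. 😊 -> smiling_face.
EMOJI_NAMES = {
    "\U0001F60A": "smiling_face",  # 😊
    "\U0001F525": "fire",          # 🔥
}

def replace_emojis(text: str) -> str:
    # Substitute every known emoji with its name; unknown characters pass through.
    for emoji, name in EMOJI_NAMES.items():
        text = text.replace(emoji, name)
    return text

print(replace_emojis("Shukriya \U0001F60A"))  # Shukriya smiling_face
```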
---
## Stopword Removal
```python
from romanRekhta.stopwords import StopwordHandler
# Load from file and extend with custom stopwords
stop_handler = StopwordHandler(
filepath="stopwords.txt",
custom_stopwords={"mera", "nahi"}
)
tokens = ["ye", "mera", "kaam", "nahi", "acha", "hai"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'kaam', 'acha']
# Load only from file
stop_handler = StopwordHandler(filepath="stopwords.txt")
tokens = ["ye", "bohat", "acha", "kaam", "hai"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'bohat', 'acha', 'kaam']
# Custom stopword list only
custom_words = {"mera", "tumhara"}
stop_handler = StopwordHandler(custom_stopwords=custom_words)
tokens = ["ye", "mera", "bohat", "acha", "kaam", "nahi"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'bohat', 'acha', 'kaam']
# Combine file + custom stopwords
stop_handler = StopwordHandler(
filepath="stopwords.txt",
custom_stopwords={"nahi", "kya"}
)
tokens = ["ye", "nahi", "kya", "acha", "kaam"]
filtered = stop_handler.remove_stopwords(tokens)
print(filtered) # Output: ['ye', 'acha', 'kaam']
```
---
## Tokenization
```python
from romanRekhta.tokenizer import word_tokenize, sentence_tokenize
text = "Yeh idea bohat acha hai. Shukriya!"
# Word tokenization
tokens = word_tokenize(text)
print(tokens) # ['Yeh', 'idea', 'bohat', 'acha', 'hai', 'Shukriya']
# Sentence tokenization
sentences = sentence_tokenize(text)
print(sentences) # ['Yeh idea bohat acha hai', ' Shukriya']
# Advanced tokenization method
tokens = word_tokenize(text, method="regex")
```
---
## Configurable Preprocessor Options
| Option | Type | Default | Description |
|--------------------------|------|---------|----------------------------------------------|
| `lowercase` | bool | True | Convert to lowercase |
| `punctuation` | bool | True | Remove punctuation |
| `emoji_handling` | str | remove | Options: `'remove'`, `'replace'`, `'ignore'` |
| `normalize_space` | bool | True | Remove extra whitespaces |
| `remove_non_ascii_chars` | bool | False | Remove emojis and symbols |
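For intuition, the options in the table above can be sketched as a plain-Python pipeline. This is a sketch of the described behaviors under stated assumptions (e.g. that `normalize_space` collapses runs of whitespace), not the library's code; use `Preprocessor` itself in practice.

```python
import re
import string

def preprocess(text, lowercase=True, punctuation=True,
               normalize_space=True, remove_non_ascii_chars=False):
    # Illustrative pipeline mirroring the table of options; not romanRekhta's code.
    if remove_non_ascii_chars:
        # Drop emojis and other non-ASCII symbols.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if lowercase:
        text = text.lower()
    if punctuation:
        # Strip standard punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if normalize_space:
        # Collapse runs of whitespace into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("Apka  service boht achi hai!"))  # apka service boht achi hai
```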
---
## File-Based Stopwords
Place a `stopwords.txt` file in your project or repository.
Each line should contain **one Roman Urdu stopword**.
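A one-word-per-line file like this can be loaded into a set and used for filtering. The snippet below is a self-contained sketch of that file format (the example stopwords are illustrative), not `StopwordHandler`'s internal loader.

```python
from pathlib import Path

# Write a sample stopword file: one Roman Urdu stopword per line.
Path("stopwords.txt").write_text("hai\nka\nki\n", encoding="utf-8")

# Load it into a set, skipping blank lines.
stopwords = {
    line.strip()
    for line in Path("stopwords.txt").read_text(encoding="utf-8").splitlines()
    if line.strip()
}

tokens = ["ye", "kaam", "hai"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['ye', 'kaam']
```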
---
## Contributing
We welcome contributions to improve this library! Here's how you can help:
- Add more Roman Urdu stopwords to `stopwords.txt`
- Suggest or implement new features (normalization, spell checker, sentiment analysis)
- Report bugs or edge cases
### To contribute:
1. Fork the repo
2. Create a new branch
3. Commit your changes
4. Submit a Pull Request
---
Made with ❤️ in Pakistan 🇵🇰 for the Roman Urdu NLP community.