punjabi-stemmer


Name: punjabi-stemmer
Version: 1.0.1
Home page: https://github.com/gurpejsingh13/Punjabi_Stemmer.git
Summary: A Python library for stemming Punjabi language words, including preprocessing for noise removal.
Upload time: 2024-03-13 13:35:30
Author: Gurpej Singh
Requires Python: >=3.7
License: MIT
Keywords: stemmer, punjabi, nlp, punjabi language, gurmukhi language, natural language processing, text processing, noise removal, stemming, brute force algorithm, suffix striping, under-stemming, over-stemming, stemming algorithm
Requirements: No requirements were recorded.

# PunjabiStemmer
## Introduction
PunjabiStemmer is a stemmer for Punjabi, a major regional language of India, intended for Natural Language Processing (NLP) work. It combines rule-based and dictionary-based methods to handle the morphological complexity of Punjabi.

With over 300 suffix rules and a dictionary of over 50,000 words, the stemmer covers nouns, pronouns, adjectives, adverbs, verbs, and other word classes, and aims to strip suffixes without changing a word's core meaning.

PunjabiStemmer is available on PyPI, and contributions are welcome.
It was developed by Gurpej Singh, a researcher and software engineer, as part of his research on Natural Language Processing for Punjabi.
## Main Features

The Punjabi_Stemmer package offers the following:
1. **Stemming a single word,** which is useful for testing.
2. **Stemming a sentence or paragraph.**
3. **Stemming the content of a large text file.**
4. **Hybrid approach (rule-based + dictionary-based):** Combines over 300 specific rules with a dictionary of over 50,000 words to handle the varied word forms of Punjabi, including suffixes and prefixes.
5. **High accuracy:** Designed to minimize overstemming and understemming errors, which makes downstream NLP tasks more reliable.
6. **Open source:** Fully open source, so you can contribute, improve, and customize it to suit your needs.


## Installation
Punjabi_Stemmer can be installed directly from PyPI. Before installing, ensure you have Python 3.7 or newer on your system. The stemmer also needs the third-party `regex` library:
```bash
pip install regex
```
The `os`, `sys`, and `collections` modules are part of the Python standard library and do not need to be installed with pip.
Install `Punjabi_Stemmer` using pip:

```bash
pip install punjabi_stemmer
```

# Usage
Here's how to use the Punjabi_Stemmer Package in your Python projects:

## Stemming a Single Word
To stem a single Punjabi word, use the `stem_word` method:
```python
from Punjabi_Stemmer.Stemmer import PunjabiStemmer

# Now you can simply initialize the stemmer without specifying file paths
stemmer = PunjabiStemmer()

word = "ਭੱਜਣਾ"  # Example Punjabi word
stemmed_word = stemmer.stem_word(word)
print(f"Original: {word}, Stemmed: {stemmed_word}")

```
Output:
```
Original: ਭੱਜਣਾ, Stemmed: ਭੱਜ
```
This prints the original word alongside its stemmed form.

## Stemming a Sentence or Paragraph
To stem a longer piece of text, such as a sentence or paragraph, use the `stem_text` method:
```python
from Punjabi_Stemmer.Stemmer import PunjabiStemmer

# Now you can simply initialize the stemmer without specifying file paths
stemmer = PunjabiStemmer()

text = "ਪੜਾਉਂਦਾ ਪੜਾਉਂਦੀ ਪੜਾਉਂਦੇ  ਪੜਾਉਣੀਆਂ  ਪੜਾਉਣੀ  ਪੜਾਉਣੇ ਪੜਾਂਦਾ ਪੜਾਂਦੀ"
stemmed_text = stemmer.stem_text(text)
print(f"Original: {text}\nStemmed: {stemmed_text}")

```
Output:
```
Original: ਪੜਾਉਂਦਾ ਪੜਾਉਂਦੀ ਪੜਾਉਂਦੇ  ਪੜਾਉਣੀਆਂ  ਪੜਾਉਣੀ  ਪੜਾਉਣੇ ਪੜਾਂਦਾ ਪੜਾਂਦੀ
Stemmed: ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ ਪੜਾ
```
This prints the original text followed by the stemmed version.


## Stemming Content from a Text File
To stem the contents of a file, use the `stem_file` method. It preprocesses and stems the text from the input file and writes the result to the output file:
```python
from Punjabi_Stemmer.Stemmer import PunjabiStemmer

# Now you can simply initialize the stemmer without specifying file paths
stemmer = PunjabiStemmer()

input_file_path = 'path/to/your/input.txt'  # Path to your input file
output_file_path = 'path/to/your/output.txt'  # Path where you want to save the output

# Stem the content of the input file and save it to the output file
stemmer.stem_file(input_file_path, output_file_path)

```
Output:
```
Processed text has been saved to path/to/your/output.txt
```
Make sure the input file exists at the specified path; otherwise, you'll receive an error message.
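
If you would rather fail early with a clear exception than rely on the library's own error message, you could wrap the call in a small helper. The sketch below is an illustration only: `stem_file_checked` is a hypothetical name, not part of the package, and it assumes nothing about `stem_file` beyond the two-argument call shown above.

```python
from pathlib import Path


def stem_file_checked(stemmer, input_path, output_path):
    """Call stemmer.stem_file only after confirming the input file exists.

    `stemmer` is any object with a stem_file(input_path, output_path) method,
    for example a PunjabiStemmer instance. This helper is hypothetical and
    is not provided by the package.
    """
    if not Path(input_path).is_file():
        raise FileNotFoundError(f"Input file not found: {input_path}")
    # Assumption: callers want the output directory created if it is missing.
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    stemmer.stem_file(input_path, output_path)
```

With a stemmer created as above, you would call `stem_file_checked(stemmer, input_file_path, output_file_path)` in place of `stemmer.stem_file(...)`.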

The examples above cover the three main use cases of the Punjabi Stemmer package: stemming individual words, longer text, and whole files.

## Punjabi Stemmer Algorithm
```text
Step 1: Initialization

  - Load the lists of pronouns, adverbs, postpositions, vocabulary, names, and suffixes.
  - Load the dictionary, organized alphabetically.

Step 2: Input Processing

  - Split the input text into individual words.
  - For each word in the split text, apply Steps 3 to 6.

Step 3: List Category Checking

  a. If the length of the word is 1, output the word and move to the next one.
  b. If the word is found in any loaded list (pronouns, adverbs, postpositions, vocabulary, names, suffixes), output the word and move to the next one.

Step 4: Apply Suffix Rule

  a. If a suffix rule matches, apply it to stem the word.
  b. If no suffix rule matches, output the original word and continue.

Step 5: Single-Character Check

  If the length of the stemmed word is 1, output the original word and continue.

Step 6: Dictionary Validation and Further Stemming

  a. If the stemmed word is validated in the dictionary, output the stemmed word.
  b. If the stemmed word is not validated but the original word is:
     i. Apply further suffix rules iteratively to the original word.
     ii. If a valid stemmed word is found during iteration, output it.
     iii. If no valid stem is found after all iterations, output the original word.
  c. If neither the stemmed word nor the original word is validated in the dictionary, output the stemmed word.

```
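
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual implementation: the word lists, suffix rules, and dictionary below are tiny made-up stand-ins, the function names are hypothetical, and Step 6b is only one possible reading of "apply further suffix rules iteratively".

```python
# Toy data for illustration only; the real lists, rules, and dictionary ship with the package.
PROTECTED = {"ਮੈਂ", "ਨੇ"}  # stand-in for the pronoun, adverb, postposition, name lists
SUFFIXES = sorted(["ਾ", "ਣਾ", "ਉਂਦਾ"], key=len, reverse=True)  # hypothetical rules, longest first
DICTIONARY = {"ਭੱਜ", "ਪੜਾ"}  # stand-in for the 50,000-word dictionary


def strip_first_suffix(word):
    """Step 4: strip the first matching suffix, or return None if no rule matches."""
    for suffix in SUFFIXES:
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)]
    return None


def stem_word(word):
    # Step 3: single characters and listed words pass through unchanged.
    # Note: len() counts Unicode code points here, which is an assumption.
    if len(word) == 1 or word in PROTECTED:
        return word
    stem = strip_first_suffix(word)
    if stem is None:  # Step 4b: no rule matched
        return word
    if len(stem) == 1:  # Step 5: never reduce a word to one character
        return word
    if stem in DICTIONARY:  # Step 6a
        return stem
    if word in DICTIONARY:  # Step 6b: try every rule, keep the first valid stem
        for suffix in SUFFIXES:
            if len(word) > len(suffix) and word.endswith(suffix):
                candidate = word[: -len(suffix)]
                if len(candidate) > 1 and candidate in DICTIONARY:
                    return candidate
        return word  # Step 6b-iii: no valid stem found
    return stem  # Step 6c: neither form is in the dictionary


def stem_text(text):
    """Step 2: split on whitespace, stem each word, and rejoin with single spaces."""
    return " ".join(stem_word(w) for w in text.split())


print(stem_text("ਪੜਾਉਂਦਾ ਭੱਜਣਾ"))
```

In this sketch the dictionary word ਪੜਾ is returned unchanged even though it ends in a listed suffix, because none of its candidate stems is in the dictionary. That is the case Step 6b is meant to protect.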

## Rules Overview
In developing PunjabiStemmer, one of our core objectives was an accurate and versatile tool that can handle the rich morphology of the Punjabi language. To achieve this, we built the stemming process on an extensive set of rules.

PunjabiStemmer incorporates over 300 specific rules covering a wide range of grammatical scenarios. The rules are categorized by the aspect of the language they address, including proper nouns and names, pronouns, verbs, adverbs, and adjectives. This structure helps the stemmer handle each word class appropriately and reduces overstemming and understemming errors.

Our research paper details about 160 of these rules and shows how they apply across linguistic categories. For transparency, the entire rule set is included in the stemmer file in this repository: https://github.com/gurpejsingh13/Punjabi_Stemmer.git

The rules are arranged from longest to shortest, so longer suffixes are tried before shorter ones. This ordering is intended to improve both accuracy and efficiency in stemming.
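
To see why the ordering matters, consider a first-match suffix stripper. The sketch below uses three made-up rules and a hypothetical helper, `strip_first_match`; it is not taken from the package's rule file. With the rules sorted longest first, the whole suffix is removed, whereas an unsorted list can stop at a shorter, less specific match.

```python
def strip_first_match(word, suffixes):
    """Strip the first suffix in `suffixes` that the word ends with."""
    for suffix in suffixes:
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)]
    return word


rules = ["ਾ", "ਦਾ", "ਉਂਦਾ"]  # hypothetical rules, shortest first
longest_first = sorted(rules, key=len, reverse=True)

word = "ਪੜਾਉਂਦਾ"
short_match = strip_first_match(word, rules)         # stops at the one-character rule
long_match = strip_first_match(word, longest_first)  # removes the full four-character suffix

print(len(word) - len(short_match), len(word) - len(long_match))
```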

We encourage users and developers to explore the full rule set, which documents how PunjabiStemmer handles each case.


## Contributing
Contributions to Punjabi_Stemmer are welcome! If you have suggestions for additional rules or improvements to the existing rule set, please feel free to contribute.

## License
This project is licensed under the MIT License - see the LICENSE file for details.



            
