bambara-normalizer


| Field | Value |
| --- | --- |
| Name | bambara-normalizer |
| Version | 0.0.1 |
| home_page | None |
| Summary | A python package for normalizing Bambara text for NLP |
| upload_time | 2025-01-17 11:46:23 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | None |
| keywords | nlp, bambara, diacritic removal, natural language processing, text normalization, text preprocessing |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# bambara-normalizer

`bambara-normalizer` is a Python package for normalizing Bambara text, tailored for Natural Language Processing (NLP) tasks. It provides tools that preprocess text by removing symbols and diacritics and by applying the additional transformations that various NLP applications require.

## Features

- **BasicTextNormalizer**: A generic text normalization class that removes symbols, diacritics, and optionally splits letters.
- **BasicBambaraNormalizer**: Extends `BasicTextNormalizer` with specific rules for Bambara text, such as preserving hyphens in compound words and handling apostrophes.
- **BambaraASRNormalizer**: A specialized normalizer for Automatic Speech Recognition (ASR) tasks in Bambara, designed to retain parenthetical and bracketed text that might appear in spoken transcriptions.

## Installation

### Install from PyPI

To install the package, run:

```bash
pip install bambara-normalizer
```

### Install from Source

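To install the package from source, clone the repository, build a wheel, and install it. The build step relies on the PyPA `build` tool, so install that first if it is not already available:

```bash
git clone https://github.com/diarray-hub/bambara-normalizer.git
cd bambara-normalizer
pip install build  # provides the `python -m build` command
python -m build --wheel
pip install dist/bambara_normalizer-0.0.1-py3-none-any.whl
```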
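
After either installation method, you may want to confirm that the distribution is visible to your interpreter. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "bambara-normalizer") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "bambara-normalizer is not installed")
```

`importlib.metadata` is available from Python 3.8 onward, which matches the package's minimum supported version.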

## Usage

### BasicTextNormalizer

```python
from bambara_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer(remove_diacritics=True, split_letters=False)
text = "Cliché text with symbols & diacritics!"
normalized_text = normalizer(text)
print(normalized_text)  # Output: "cliche text with symbols diacritics"
```

### BasicBambaraNormalizer

```python
from bambara_normalizer import BasicBambaraNormalizer

normalizer = BasicBambaraNormalizer()
text = "à tɔ́gɔ kó : sìrajɛ."
normalized_text = normalizer(text)
print(normalized_text)  # Output: "a togoko siraje"

# Example with hyphens
text_with_hyphens = "- bɛ̀n-kɛ́nɛfisɛ."
normalized_text = normalizer(text_with_hyphens)
print(normalized_text)  # Output: "bɛn-kɛ́nɛfisɛ"
```

### BambaraASRNormalizer

```python
from bambara_normalizer import BambaraASRNormalizer

normalizer = BambaraASRNormalizer()
text = "sìrajɛ, - í ni tìle !"
normalized_text = normalizer(text)
print(normalized_text)  # Output: "siraje i ni tile"

# Example with words in parentheses and brackets
text_with_brackets = "(à kán) [kɛ̀nɛ]."
normalized_text = normalizer(text_with_brackets)
print(normalized_text)  # Output: "a kán kɛ̀nɛ"
```
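
To make the difference concrete, here is a standalone sketch contrasting two ways of treating bracketed text: deleting the whole span, or deleting only the delimiters so the enclosed words survive, which is the behavior described for `BambaraASRNormalizer`. The regular expressions and function names are assumptions made for illustration; they are not taken from the package source.

```python
import re

# Hypothetical patterns, written for this illustration only.
_BRACKETED_SPAN = re.compile(r"\[[^\]]*\]|\([^)]*\)")
_DELIMITERS = re.compile(r"[\[\]()]")


def drop_bracketed(text: str) -> str:
    """Delete bracketed and parenthetical spans, including their contents."""
    return " ".join(_BRACKETED_SPAN.sub(" ", text).split())


def keep_bracketed(text: str) -> str:
    """Delete only the delimiters, so the enclosed words are retained."""
    return " ".join(_DELIMITERS.sub(" ", text).split())


sample = "(a kan) [kɛnɛ] i ni tile"
print(drop_bracketed(sample))  # i ni tile
print(keep_bracketed(sample))  # a kan kɛnɛ i ni tile
```

Keeping the enclosed words matters when the reference transcript marks spoken material with brackets: dropping the span would remove words that were actually uttered.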

### BambaraASRNormalizer with Split Letters

```python
from bambara_normalizer import BambaraASRNormalizer

normalizer = BambaraASRNormalizer(split_letters=True)
text = "ǹsé, í ni tìle !"
normalized_text = normalizer(text)
print(normalized_text)  # Output: "n s e i ni tile"
```

## Customization

Each normalizer supports optional parameters for:

- **Removing diacritics**: Converts characters like `é` to `e`.
- **Splitting letters**: Converts `abc` to `a b c`.
- **Preserving specific symbols**: Lets you choose which characters to retain (e.g., hyphens or apostrophes).
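
The three options above can be sketched with the standard library alone. This is an illustration of the general technique (Unicode decomposition plus category filtering), not the package's implementation. The function `sketch_normalize` and its `keep` parameter are hypothetical names; only `remove_diacritics` and `split_letters` appear in the usage examples.

```python
import unicodedata
from typing import List


def _clusters(word: str) -> List[str]:
    """Split a word into letters, keeping combining marks attached to their base."""
    out: List[str] = []
    for ch in word:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def sketch_normalize(
    text: str,
    remove_diacritics: bool = True,
    split_letters: bool = False,
    keep: str = "-'",  # hypothetical parameter: symbols to preserve inside words
) -> str:
    text = text.lower()
    if remove_diacritics:
        # NFKD turns "é" into "e" + U+0301; dropping category Mn removes the mark.
        # Letters such as "ɛ" and "ɔ" have no decomposition and pass through unchanged.
        text = "".join(
            ch for ch in unicodedata.normalize("NFKD", text)
            if unicodedata.category(ch) != "Mn"
        )
    # Anything that is not a letter, number, mark, or kept symbol becomes a space.
    cleaned = "".join(
        ch if ch in keep or unicodedata.category(ch)[0] in "LNM" else " "
        for ch in text
    )
    # Kept symbols are only meaningful inside a word, so trim them at word edges.
    words = [w for w in (raw.strip(keep) for raw in cleaned.split()) if w]
    if split_letters:
        words = [" ".join(_clusters(w)) for w in words]
    return " ".join(words)


print(sketch_normalize("Cliché text with symbols & diacritics!"))
# cliche text with symbols diacritics
print(sketch_normalize("- bɛ̀n-kɛ́nɛfisɛ."))  # bɛn-kɛnɛfisɛ
print(sketch_normalize("abc", split_letters=True))  # a b c
```

Passing `keep=""` in this sketch treats hyphens and apostrophes like any other symbol, so a compound word is split into two words.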

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request.

## License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## Authors

- [Yacouba Diarra @ RobotsMali AI4D Lab](https://github.com/diarray-hub)

---

Feel free to reach out with any questions or for support in using this package!


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "bambara-normalizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "NLP, bambara, diacritic removal, natural language processing, text normalization, text preprocessing",
    "author": null,
    "author_email": "Yacouba Diarra <diarray@robotsmali.org>",
    "download_url": "https://files.pythonhosted.org/packages/97/a8/ca209063c52ee8270d65802305e6dfa76f7f679fd7b7b40766f934fc2325/bambara_normalizer-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A python package for normalizing Bambara text for NLP",
    "version": "0.0.1",
    "project_urls": {
        "Issues": "https://github.com/diarray-hub/bambara-normalizer/issues",
        "Source": "https://github.com/diarray-hub/bambara-normalizer"
    },
    "split_keywords": [
        "nlp",
        " bambara",
        " diacritic removal",
        " natural language processing",
        " text normalization",
        " text preprocessing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d9cfdae29017a86c7e8b7302942024dffcb0d659f5382bff94f0b583f27f1b2b",
                "md5": "ebbc5429418088ab62f2f054e90bb017",
                "sha256": "e7d9a7e1e7f6c844409d5f118309a0e6787157256953a14a81d5e60580879fbb"
            },
            "downloads": -1,
            "filename": "bambara_normalizer-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ebbc5429418088ab62f2f054e90bb017",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 8613,
            "upload_time": "2025-01-17T11:46:20",
            "upload_time_iso_8601": "2025-01-17T11:46:20.693331Z",
            "url": "https://files.pythonhosted.org/packages/d9/cf/dae29017a86c7e8b7302942024dffcb0d659f5382bff94f0b583f27f1b2b/bambara_normalizer-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "97a8ca209063c52ee8270d65802305e6dfa76f7f679fd7b7b40766f934fc2325",
                "md5": "b8c215d87cf23eab8e0aeb44f5caefc1",
                "sha256": "22c89525358e061eb64357368eeb3693e24dd120302a72606a5518cfe6109ba5"
            },
            "downloads": -1,
            "filename": "bambara_normalizer-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "b8c215d87cf23eab8e0aeb44f5caefc1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 5975,
            "upload_time": "2025-01-17T11:46:23",
            "upload_time_iso_8601": "2025-01-17T11:46:23.628848Z",
            "url": "https://files.pythonhosted.org/packages/97/a8/ca209063c52ee8270d65802305e6dfa76f7f679fd7b7b40766f934fc2325/bambara_normalizer-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-17 11:46:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "diarray-hub",
    "github_project": "bambara-normalizer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "bambara-normalizer"
}
        