slaviclean


- **Name**: slaviclean
- **Version**: 0.0.6
- **Summary**: Text filter designed to cleanse text of profanity and offensive language, specifically tailored for Ukrainian, Russian, and Surzhyk.
- **Upload time**: 2025-02-13 09:59:43
- **Author**: Tetiana Lytvynenko
- **Requires Python**: >=3.11
- **License**: MIT
- **Keywords**: nlp tools, natural language processing, text processing, linguistic tools, profanity filter, obscene filter, slavic languages, slavic profanity filter, slavic text cleaner, text sanitization
- **Requirements**: No requirements were recorded.

# slaviclean

[![Python Versions](https://img.shields.io/badge/Python%20Versions-%3E%3D3.11-informational)](https://pypi.org/project/nlp-flexi-tools/)
[![Version](https://img.shields.io/badge/Version-0.0.6-informational)](https://pypi.org/project/nlp-flexi-tools/)


**SlaviCleaner** is a profanity-filtering library that removes offensive language from text, tailored specifically to Ukrainian, Russian, and Surzhyk.
It detects, masks, and reports offensive words and offers several levels of filtering.

This module uses spaCy for natural language processing and detects profanity even when it is obfuscated,
written as a variant of a swear word, or used in an inflected morphological form.

### Features

- Detects and masks offensive words in Slavic languages (Ukrainian, Russian, Surzhyk).
- Handles obfuscated, substituted, and morphologically varied forms of profanity.
- Reports each detected profanity in detail, including its tags (e.g., euphemism, vulgar, loanword).
- Lets you choose among three filtering levels: `complete`, `basic`, `minimal`.
- Supports subtree-level profanity filtering.
- Recognizes profanity that is already partly masked (e.g., `г***м`).
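
To make "obfuscated" and "substituted" forms concrete, here is a minimal sketch of one way such text could be normalized before lookup. It is an illustration only and not SlaviCleaner's implementation: the `deobfuscate` helper and the tiny `LOOKALIKES` table are hypothetical, and a real filter would need far broader rules.

```python
import re

# Hypothetical look-alike table (an assumption for this sketch only).
LOOKALIKES = str.maketrans({"4": "ч", "k": "к", "@": "а"})

# Three or more single characters separated by single spaces, e.g. "к у р в а".
SPACED_LETTERS = re.compile(r"(?<!\w)\w(?: \w){2,}(?!\w)")
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def deobfuscate(text: str) -> str:
    """Join spaced-out letters and map look-alike characters to Cyrillic."""
    # Step 1: collapse runs of single letters into one word.
    text = SPACED_LETTERS.sub(lambda m: m.group().replace(" ", ""), text)

    def fix_token(match: re.Match) -> str:
        token = match.group()
        # Step 2: only rewrite tokens that already contain Cyrillic,
        # so plain Latin words and standalone numbers are left alone.
        return token.translate(LOOKALIKES) if CYRILLIC.search(token) else token

    return re.sub(r"\S+", fix_token, text)


print(deobfuscate("к у р в а, су4k@"))  # курва, сучка
```

The sketch is deliberately naive: it would also join a genuine sequence of three one-letter words, which is one reason a production filter relies on linguistic analysis rather than regular expressions alone.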

--- 

## Installation

To install **SlaviCleaner**, run:

```bash
pip install slaviclean
```

--- 

## Usage

### Initializing 

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner()
```
#### Initializing with preloads
You can preload the necessary language models for faster processing. 
The `preload` option loads the models for the supported languages (`uk`, `ru`, `surzhyk`).

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
```

### Core Functions

#### `get_available_languages()`
Retrieves a set of languages supported by the profanity filter.

- **Returns**:
  - A set of language codes (e.g., `{'uk', 'ru', 'surzhyk'}`).

- **Example**:

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
languages = scleaner.get_available_languages()

print(languages)
# Output (set order may vary): {'uk', 'ru', 'surzhyk'}
```


#### `sanitize(message, lang, min_subtree_size, mask_symbol, slevel, analyze_morph)`
Filters profanities from the given message and returns a detailed report.

- **Arguments**:
  - `message` (str): The input message to filter.
  - `lang` (str): The language of the message (supports `'uk'`, `'ru'`, and `'surzhyk'`, default is `'surzhyk'`).
  - `min_subtree_size` (float): Minimum size of the token subtree for dependency parsing (default is `3`).
  - `mask_symbol` (str): Symbol used to mask profanities (default is `'*'`).
  - `slevel` (str): Severity level of filtering (can be `'complete'`, `'basic'`, or `'minimal'`, default is `'complete'`).
  - `analyze_morph` (bool): Whether to analyze the morphology of words (default is `False`).

- **Returns**:
  - A `SanitizeReport` containing the masked message and list of detected profanities.

- **Example**:

```python
from slaviclean import SlaviCleaner

scleaner = SlaviCleaner(preload=True)
message = "От же ж, к у р в а, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
 
sanitize_report = scleaner.sanitize(message, lang='uk')

print(sanitize_report)  
# Output (the report's `message` shows the spaced-out word joined; spans index into that text):
#   SanitizeReport(
#      message='От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась', 
#      masked_message='От же ж, *****, страхуй, бо об’ївся ***** супом, облив себе соком, ще й сумка, *****, відірвалась', 
#      profanities=[
#           Profanity(span=(9, 14), nearest='курва', tags=['vulgar', 'euphemism', 'loanword']), 
#           Profanity(span=(36, 41), nearest='г***м', tags=['masked']), 
#           Profanity(span=(79, 84), nearest='сучка', tags=['insulting', 'slur', 'vulgar'])])

```
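
Judging by the example output, each `span` is a half-open `(start, end)` character range into the report's `message`: `(9, 14)` covers the five characters of `курва`. The sketch below applies such spans by hand, which can be useful when you want your own masking on top of the report. `mask_spans` is a hypothetical helper written for this illustration, not part of slaviclean, and it takes plain tuples so that it runs without the library installed.

```python
def mask_spans(message: str, spans: list[tuple[int, int]], mask_symbol: str = "*") -> str:
    """Replace every half-open (start, end) character range with mask_symbol."""
    if len(mask_symbol) != 1:
        raise ValueError("mask_symbol must be a single character")
    chars = list(message)
    for start, end in spans:
        if not 0 <= start <= end <= len(chars):
            raise ValueError(f"span {(start, end)} is outside the message")
        # Same-length replacement keeps all later offsets valid.
        chars[start:end] = mask_symbol * (end - start)
    return "".join(chars)


# Message and spans copied from the example report above.
message = "От же ж, курва, страхуй, бо об’ївся г***м супом, облив себе соком, ще й сумка, су4k@, відірвалась"
spans = [(9, 14), (36, 41), (79, 84)]

print(mask_spans(message, spans))
print(mask_spans(message, spans, mask_symbol="#"))
```

Because each range is replaced by a string of the same length, the spans can be applied in any order.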

### Available Severity Levels

- **`complete`**  
  Masks all profanities, including euphemisms, vulgarities, and loanwords.  
- **`basic`**  
  Masks the more aggressive profanities and leaves euphemisms untouched.  
- **`minimal`**  
  Masks only the most offensive words.
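
The README does not spell out which tags each level covers, so the sketch below does not try to reproduce the levels. It only shows how the `tags` on each reported profanity could be inspected on the caller's side, for example to see what a `complete` run caught. The `Profanity` dataclass is a stand-in that mirrors the fields printed in the example report, and `with_any_tag` is a hypothetical helper; neither is imported from slaviclean.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Profanity:
    """Stand-in mirroring the fields shown in the example report."""
    span: tuple[int, int]
    nearest: str
    tags: tuple[str, ...]


def with_any_tag(profanities, wanted):
    """Return the hits that carry at least one of the wanted tags."""
    wanted = set(wanted)
    return [p for p in profanities if wanted & set(p.tags)]


# Values copied from the example report in this README.
hits = [
    Profanity((9, 14), "курва", ("vulgar", "euphemism", "loanword")),
    Profanity((36, 41), "г***м", ("masked",)),
    Profanity((79, 84), "сучка", ("insulting", "slur", "vulgar")),
]

tag_counts = Counter(tag for p in hits for tag in p.tags)
print(tag_counts["vulgar"])  # 2
print([p.nearest for p in with_any_tag(hits, {"euphemism"})])  # ['курва']
```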


### Supported Languages

**SlaviCleaner** currently supports the following languages:
- **Ukrainian (`uk`)**
- **Russian (`ru`)**
- **Surzhyk (`surzhyk`)**
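
If language codes or levels come from user input, it can help to check them before calling `sanitize`. The guard below is a sketch under the assumption that the codes and levels listed in this README are complete. `check_sanitize_args` is a hypothetical helper, and whether the library performs its own validation is not documented here.

```python
# Values copied from this README; at runtime the language set could instead
# come from scleaner.get_available_languages().
SUPPORTED_LANGS = frozenset({"uk", "ru", "surzhyk"})
SEVERITY_LEVELS = ("complete", "basic", "minimal")


def check_sanitize_args(lang: str = "surzhyk", slevel: str = "complete",
                        mask_symbol: str = "*") -> dict:
    """Validate arguments and return them as keyword arguments."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported lang {lang!r}; expected one of {sorted(SUPPORTED_LANGS)}")
    if slevel not in SEVERITY_LEVELS:
        raise ValueError(f"unsupported slevel {slevel!r}; expected one of {SEVERITY_LEVELS}")
    if not mask_symbol:
        raise ValueError("mask_symbol must not be empty")
    return {"lang": lang, "slevel": slevel, "mask_symbol": mask_symbol}


kwargs = check_sanitize_args(lang="uk", slevel="basic")
print(kwargs)
# Intended use, assuming sanitize accepts these names as keywords:
#   report = scleaner.sanitize(message, **kwargs)
```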

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Acknowledgments

- The **spaCy** library is used for NLP tasks like tokenization, part-of-speech tagging, and dependency parsing.
- The **pymorphy3** library is used for morphological analysis.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "slaviclean",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "nlp tools, natural language processing, text processing, linguistic tools, profanity filter, obscene filter, slavic languages, slavic profanity filter, slavic text cleaner, text sanitization",
    "author": "Tetiana Lytvynenko",
    "author_email": "lytvynenkotv@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/48/f1/46d2319d1fca5a5eb9a1059e87e7b1bf265b51f7533848de4aa83a848fc4/slaviclean-0.0.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Text filter designed to cleanse text of profanity and offensive language, specifically tailored for Ukrainian, Russian, and Surzhik.",
    "version": "0.0.6",
    "project_urls": null,
    "split_keywords": [
        "nlp tools",
        " natural language processing",
        " text processing",
        " linguistic tools",
        " profanity filter",
        " obscene filter",
        " slavic languages",
        " slavic profanity filter",
        " slavic text cleaner",
        " text sanitization"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "df62acdf477f62e041013c33e8035c7d8ecc7d6ccad0013c144adafb9ed39d5a",
                "md5": "7aabf08f64e1c096e29b50447c490982",
                "sha256": "a953de8a75abd035ca82f24d1336181ca00c94d814357b570dc9c8ca802db37c"
            },
            "downloads": -1,
            "filename": "slaviclean-0.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7aabf08f64e1c096e29b50447c490982",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 296619,
            "upload_time": "2025-02-13T09:59:39",
            "upload_time_iso_8601": "2025-02-13T09:59:39.305516Z",
            "url": "https://files.pythonhosted.org/packages/df/62/acdf477f62e041013c33e8035c7d8ecc7d6ccad0013c144adafb9ed39d5a/slaviclean-0.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "48f146d2319d1fca5a5eb9a1059e87e7b1bf265b51f7533848de4aa83a848fc4",
                "md5": "5eec2d8472eb9dd08e2f6e450eb416f4",
                "sha256": "9f070d74d80aa9d4939379d4d5781a7d4c1aa93aa45bf1d65cf857b0fc1107bd"
            },
            "downloads": -1,
            "filename": "slaviclean-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "5eec2d8472eb9dd08e2f6e450eb416f4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 286619,
            "upload_time": "2025-02-13T09:59:43",
            "upload_time_iso_8601": "2025-02-13T09:59:43.753861Z",
            "url": "https://files.pythonhosted.org/packages/48/f1/46d2319d1fca5a5eb9a1059e87e7b1bf265b51f7533848de4aa83a848fc4/slaviclean-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-13 09:59:43",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "slaviclean"
}
        