textfab


* Name: textfab
* Version: 1.0.0
* Home page: https://github.com/Astromis/textfab
* Summary: A tiny library for text preprocessing in NLP
* Upload time: 2024-01-05 14:42:12
* Author: Igor Buyanov
* Requires Python: >=3.6
* Requirements: pymystem3, nltk, omegaconf

# Textfab


I got really tired of rewriting the same functions every time I need to preprocess some text. The ridiculous thing is that when I need to preprocess the same text in different ways, it sometimes becomes hard to track what I applied to a particular text and where. Recently, augmentations have started to cause the same problem.

This code is intended to put an end to this by organizing all kinds of preprocessing and augmentation functions in an understandable conveyer-like structure. The basic idea is to represent every function as a process unit, with a manager that guides the text through a user-defined sequence of units. Moreover, the manager has a string representation in which the user can see how the units are organized. This should be helpful when working in a Jupyter Notebook.
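
The conveyer idea can be sketched in a few lines of plain Python. This is not textfab's implementation: `Unit` and `Conveyer` are hypothetical stand-ins, and the sketch only illustrates how a manager might pass each text through an ordered sequence of units and print that sequence.

```python
from typing import Callable, List


class Unit:
    """Hypothetical stand-in for a process unit: a named text -> text step."""

    def __init__(self, name: str, func: Callable[[str], str]) -> None:
        self.name = name
        self.func = func

    def process(self, text: str) -> str:
        return self.func(text)

    def __str__(self) -> str:
        return self.name


class Conveyer:
    """Hypothetical stand-in for the manager: applies units in order."""

    def __init__(self, units: List[Unit]) -> None:
        self.units = units

    def __call__(self, texts: List[str]) -> List[str]:
        results = []
        for text in texts:
            # Each text passes through every unit, in the configured order
            for unit in self.units:
                text = unit.process(text)
            results.append(text)
        return results

    def __str__(self) -> str:
        # One unit per line, joined by arrows, so the order is visible at a glance
        return "Conveyer sequence:\n" + "->\n".join(str(u) for u in self.units)


pipeline = Conveyer([
    Unit("lowercase_text", str.lower),
    Unit("strip_edges", str.strip),
])
print(pipeline)
print(pipeline(["  Hello World  "]))
```

Because the sequence is data rather than scattered function calls, the same texts can be run through several differently configured pipelines without losing track of what was applied.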

# Usage

The basic building blocks of `textfab` are units, which do the work. The available units can be found in the `units.py` module or by calling the following function:
```python
from textfab.utils import show_available_units
show_available_units()
```

All these units are organized in the `Fabric` class, through which the text is processed. To create a `Fabric` object, you just need to define the sequence of units:
```python
from textfab.fabric import Fabric

config = ["swap_enter_to_space", "remove_punct", {"remove_custom_regex": {"regex" : "[A-Z]*"}}, "collapse_spaces"]
fab = Fabric(config)
print(fab)
# >>> Conveyer sequence:
# >>> swap_enter_to_space->
# >>> remove_punct->
# >>> remove_custom_regex:{'regex': '[A-Z]*'}->
# >>> collapse_spaces
```

The config can also be an `OmegaConf` config, which means `textfab` can work with `Hydra`, which in turn means you can log your preprocessing steps in tools like DVC or ClearML.
```python
from omegaconf import OmegaConf
config = OmegaConf.create(["swap_enter_to_space", "remove_punct", "collapse_spaces"])
fab = Fabric(config)
```

The fabric can also be instantiated directly from a config file:
```python
fab = Fabric.from_config("configs/simple_config.yaml")
```

When the object is ready, simply call it on a list of texts. You can also specify the pool size to enable multiprocess execution.
```python
# single process
fab(["This text, is\n\n for test"])
# multiprocess
fab(["This text, is\n\n for test"], pool_size=5)
```

By default, the fab checks amount integrity: the number of output texts must equal the number of input texts. This is important when each text has a label, since you don't want to suddenly lose or create an object. Keep in mind that, for now, this does not protect you when a text is unexpectedly removed in one place and another is added elsewhere, which can shift texts relative to their labels. Sometimes you don't need this check, for example when you create a corpus for language model training, so you can turn it off:
```python
fab(["This text, is\n\n for test"], ensure_amount_integrity=False)
```
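
To make the integrity check and its limitation concrete, here is a minimal sketch in plain Python. It does not reproduce textfab's code: `run_with_count_check`, `drop_empty`, and `drop_first_and_append` are hypothetical names, and the check is assumed to be a simple comparison of input and output lengths.

```python
from typing import Callable, List


def run_with_count_check(
    step: Callable[[List[str]], List[str]],
    texts: List[str],
    ensure_count: bool = True,
) -> List[str]:
    """Apply `step` and optionally verify that no text was lost or created."""
    result = step(texts)
    if ensure_count and len(result) != len(texts):
        raise ValueError(f"expected {len(texts)} texts, got {len(result)}")
    return result


def drop_empty(texts: List[str]) -> List[str]:
    """A step that changes the count: it removes empty strings."""
    return [t for t in texts if t]


def drop_first_and_append(texts: List[str]) -> List[str]:
    """A step that keeps the count but shifts every text by one position."""
    return texts[1:] + ["extra"]


texts = ["good", "", "bad"]
labels = [1, 0, 0]

try:
    run_with_count_check(drop_empty, texts)
except ValueError as err:
    print("caught:", err)  # the count changed from 3 to 2

# The count check passes here, yet the texts no longer line up with `labels`
shifted = run_with_count_check(drop_first_and_append, texts)
print(list(zip(shifted, labels)))
```

The second case is the situation the paragraph above warns about: a length comparison alone cannot detect a removal that is compensated by an addition.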

# Extension

The code is intended to be extensible so that it can collect as many functions as possible. To add a function, you need to subclass the abstract class of the appropriate unit type. The following requirements must also be met:

* The unit must perform only one step.
* The unit name must start with a verb.
* The unit name must be in snake case.
* The unit must have a docstring.
  * The docstring of a parametrized unit must include a description of the parameters.
* A test for the unit must be provided.
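
As an illustration of these requirements, the sketch below shows a unit that performs one step, has a verb-first snake-case name, carries a docstring, and comes with a test. The unit `remove_digits` is hypothetical and is deliberately written without textfab's abstract classes so that it runs on its own; a real contribution would subclass the appropriate one.

```python
import re


class remove_digits:
    """Remove every decimal digit from the text.

    Hypothetical unit used only for illustration; it does not inherit
    from textfab's abstract classes.
    """

    def process(self, text: str) -> str:
        # A single step: delete digits and touch nothing else
        return re.sub(r"\d", "", text)

    def __str__(self) -> str:
        return "remove_digits"


def test_remove_digits() -> None:
    unit = remove_digits()
    assert unit.process("room 101") == "room "
    assert unit.process("no digits") == "no digits"


test_remove_digits()
```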

There are four types of units with corresponding abstract classes:
* ProcessUnit - a unit that consumes text and produces text as a string.
* ChangingProcessUnit - a unit that can change the type of the object, e.g. `str -> List[str]`, `List[str] -> List[List[str]]`, `List[str] -> str`. Generally, it can consume anything and produce anything.
* ParamProcessUnit - a parametrized version of ProcessUnit. It is initialized with parameters passed as a dictionary.
* ParamChangingProcessUnit - a parametrized version of ChangingProcessUnit. It is initialized with parameters passed as a dictionary.
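
The difference between these types can be sketched with two stand-alone classes. Both are hypothetical and do not use textfab's abstract classes; in particular, the constructor that takes a single `params` dictionary is an assumption about how a parametrized unit might receive its parameters.

```python
from typing import Dict, List


class split_on_separator:
    """Split the text into parts: `str -> List[str]`.

    Parameters (passed as a dictionary):
        separator: the substring on which to split.
    """

    def __init__(self, params: Dict[str, str]) -> None:
        self.params = params

    def process(self, text: str) -> List[str]:
        # Parametrized and type-changing: a string goes in, a list comes out
        return text.split(self.params["separator"])

    def __str__(self) -> str:
        return f"split_on_separator:{self.params}"


class join_tokens:
    """Join tokens back into one string: `List[str] -> str`."""

    def process(self, tokens: List[str]) -> str:
        # Type-changing without parameters: a list goes in, a string comes out
        return " ".join(tokens)

    def __str__(self) -> str:
        return "join_tokens"


splitter = split_on_separator({"separator": ","})
tokens = splitter.process("a,b,c")
print(splitter)
print(tokens)
print(join_tokens().process(tokens))
```

Units that change the type have to be ordered so that each one receives the type produced by the unit before it.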

For ease of reading and writing, the unit class implementations are named in snake case (like_this_writing). For now, they don't play much of a role in functionality and serve more as orientation while writing the code.

When you work in a notebook, you can define a custom unit and use its object in the config. Note that it is important to implement the `process` and `__str__` methods:

```python
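# Assumes ProcessUnit and Fabric have already been imported from textfab.
class custom_unit(ProcessUnit):

    def process(self, text):
        # Identity step: returns the text unchanged
        return text

    def __str__(self):
        # Name used when the unit is printed
        return "test"

custom_u = custom_unit()
Fabric(config = ["swap_enter_to_space", "remove_punct", "collapse_spaces", custom_u])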
```
Support for loading units from custom scripts is in development.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Astromis/textfab",
    "name": "textfab",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Igor Buyanov",
    "author_email": "buyanov.igor.o@yandex.ru",
    "download_url": "https://files.pythonhosted.org/packages/84/3c/03aff4b35bd577947535da29792703e2070082b7c18beafd0160c638a970/textfab-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "A tiny library for text preprocessing in NLP",
    "version": "1.0.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/Astromis/textfab/issues",
        "Homepage": "https://github.com/Astromis/textfab"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6a87d20b54fa6830ad90ebdf40975419259658622ba00122e8a47032ee57c75b",
                "md5": "7606896051fadf95e03126f6e7924a44",
                "sha256": "3171224064d5c01954e9593feab93df78fe806aea5cc7ab80295f0a943806cdc"
            },
            "downloads": -1,
            "filename": "textfab-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "7606896051fadf95e03126f6e7924a44",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 11560,
            "upload_time": "2024-01-05T14:42:10",
            "upload_time_iso_8601": "2024-01-05T14:42:10.802900Z",
            "url": "https://files.pythonhosted.org/packages/6a/87/d20b54fa6830ad90ebdf40975419259658622ba00122e8a47032ee57c75b/textfab-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "843c03aff4b35bd577947535da29792703e2070082b7c18beafd0160c638a970",
                "md5": "e794904941e664af7b30424b4c4ad4b0",
                "sha256": "dabd73a20b4ecc217a317a21abd331b47512a3595b3a8cbe3b3b5903e041e626"
            },
            "downloads": -1,
            "filename": "textfab-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "e794904941e664af7b30424b4c4ad4b0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 14490,
            "upload_time": "2024-01-05T14:42:12",
            "upload_time_iso_8601": "2024-01-05T14:42:12.292563Z",
            "url": "https://files.pythonhosted.org/packages/84/3c/03aff4b35bd577947535da29792703e2070082b7c18beafd0160c638a970/textfab-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-05 14:42:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Astromis",
    "github_project": "textfab",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pymystem3",
            "specs": [
                [
                    ">=",
                    "0.2.0"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.6.7"
                ]
            ]
        },
        {
            "name": "omegaconf",
            "specs": [
                [
                    ">=",
                    "2.3.0"
                ]
            ]
        }
    ],
    "lcname": "textfab"
}
        