# data-preprocessors

- **Name**: data-preprocessors
- **Version**: 0.58.0
- **Homepage**: https://github.com/MusfiqDehan/data-preprocessors
- **Summary**: An easy-to-use tool for data preprocessing, especially text preprocessing
- **Upload time**: 2024-09-02 23:00:42
- **Author**: Md. Musfiqur Rahaman
- **Requires Python**: <4.0,>=3.7.1
- **License**: MIT
- **Keywords**: nlp, data-preprocessors, data-preprocessing, text-preprocessing, data-science, textfile, musfiqdehan
- **Requirements**: bnlp-toolkit, click, colorama, cython, gensim, importlib-metadata, joblib, nltk, numpy, pandas, python-crfsuite, python-dateutil, pytz, regex, scipy, sentencepiece, six, sklearn-crfsuite, smart-open, tabulate, tqdm, typing-extensions, wasabi, zipp
            <div align="center">
    
<img src="https://github.com/MusfiqDehan/data-preprocessors/raw/master/branding/logo.png">

<p>Data Preprocessors</p>

<sub>An easy-to-use tool for Data Preprocessing especially for Text Preprocessing</sub>

<!-- Badges -->

<!-- [<img src="https://deepnote.com/buttons/launch-in-deepnote-small.svg">](PROJECT_URL) -->
    
[![](https://img.shields.io/pypi/v/data-preprocessors.svg)](https://pypi.org/project/data-preprocessors/)
[![Downloads](https://img.shields.io/pypi/dm/data-preprocessors)](https://pepy.tech/project/data-preprocessors)
    
<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mJuRfIz__uS3xoFaBsFn5mkLE418RU19?usp=sharing)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/keras-team/keras-io/blob/master/examples/vision/ipynb/mnist_convnet.ipynb) -->

</div>

## **Table of Contents**

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Features](#features)
    - [Split Textfile](#split-textfile)
    - [Build Parallel Corpus](#build-parallel-corpus)
    - [Separate Parallel Corpus](#separate-parallel-corpus)
    - [Deconstruct Words of Sentence](#deconstruct-word-of-sentence)
    - [Remove Punctuation](#remove-punctuation)
    - [Space Punctuation](#space-punctuation)
    - [Text File to List](#text-file-to-list)
    - [Text File to Dataframe](#text-file-to-dataframe)
    - [List to Text File](#list-to-text-file)
    - [Remove File](#remove-file)
    - [Count Characters of a Sentence](#count-characters-of-a-sentence)
    - [Count Words of Sentence](#count-words-of-sentence)
    - [Count No of Lines in a Text File](#count-no-of-lines-in-a-text-file)
    - [Convert Excel to Multiple Text Files](#convert-excel-to-multiple-text-files)
    - [Merge Multiple Text Files](#merge-multiple-text-files)
    - **[Apply Any Function in a Full Text File](#apply-a-function-in-whole-text-file)**

    

## **Installation**
Install the latest stable release.<br>
**For Windows**<br>
```
pip install -U data-preprocessors
```

**For Linux/WSL2**<br>
```
pip3 install -U data-preprocessors
```

## **Quick Start**

```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
```

## **Features**

### Split Textfile

This function splits your text file into three separate files: train, test, and validation. By changing the `shuffle` and `seed` values, you can randomly shuffle the lines of your text file before splitting.

```python
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)

# Total lines:  500
# Train set size:  300
# Validation set size:  100
# Test set size:  100
```
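The splitting logic can be sketched in plain Python (an illustrative helper, not the library's actual implementation; `split_lines` is a hypothetical name):

```python
import random

def split_lines(lines, train_size=0.6, val_size=0.2, shuffle=True, seed=42):
    """Split a list of lines into train/val/test portions by ratio."""
    lines = list(lines)
    if shuffle:
        # A fixed seed makes the shuffle reproducible across runs.
        random.Random(seed).shuffle(lines)
    n = len(lines)
    n_train = int(n * train_size)
    n_val = int(n * val_size)
    return lines[:n_train], lines[n_train:n_train + n_val], lines[n_train + n_val:]

train, val, test = split_lines([f"line {i}" for i in range(500)])
print(len(train), len(val), len(test))  # 300 100 100
```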

### Separate Parallel Corpus

This function splits a combined `src_tgt_file` into separate `src_file` and `tgt_file`, breaking each line at the given `separator`.

```python
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
```
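The per-line separation can be sketched as follows (`split_parallel_line` is a hypothetical helper, not part of the library):

```python
def split_parallel_line(line, separator="|||"):
    """Split one 'source ||| target' line into (src, tgt), trimming whitespace."""
    src, _, tgt = line.partition(separator)
    return src.strip(), tgt.strip()

print(split_parallel_line("I like tea. ||| J'aime le the."))
# ('I like tea.', "J'aime le the.")
```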

### Decontracting Words from Sentence

This function expands contracted words in a sentence (for example, "don't" becomes "do not").

```python
tp.decontracting_words(sentence)
```
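Contraction expansion is commonly done with a lookup table; a minimal sketch (the mapping below is a tiny sample, and the library's actual table is assumed to be larger):

```python
import re

# Tiny sample mapping; a real implementation would cover many more forms.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am", "it's": "it is"}

def expand_contractions(sentence):
    """Replace known contractions with their expanded forms (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], sentence)

print(expand_contractions("I can't believe it's done"))  # I cannot believe it is done
```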

### Remove Punctuation

This function removes the punctuation from a single line of a text file.

```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)

# bla bla bla bla
```
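The effect can be sketched with the standard library (an illustration of the idea, not the library's code):

```python
import re
import string

def remove_punctuation(sentence):
    """Strip ASCII punctuation, then collapse the leftover whitespace."""
    no_punc = sentence.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punc).strip()

print(remove_punctuation("bla! bla- ?bla ?bla."))  # bla bla bla bla
```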

### Space Punctuation

This function adds a space on both sides of each punctuation mark so that the sentence is easier to tokenize. It operates on a single line, but it can also be applied to a whole text file.

```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)

# bla ! bla - ? bla ? bla .
```
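A regex-based sketch of the padding step (illustrative only; the exact whitespace handling of the library may differ):

```python
import re
import string

def space_punctuation(sentence):
    """Pad punctuation marks with spaces, then collapse runs of whitespace."""
    padded = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", sentence)
    return re.sub(r"\s+", " ", padded).strip()

print(space_punctuation("bla! bla- ?bla ?bla."))  # bla ! bla - ? bla ? bla .
```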

### Text File to List

Convert any text file into a list of its lines.

```python
mylist = tp.text2list(myfile_path="myfile.txt")
```

### List to Text File

Convert any list into a text file, one item per line.

```python
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
```
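The round trip between a list and a text file can be sketched with plain file I/O (`list_to_text` and `text_to_list` are hypothetical helpers standing in for `tp.list2text` and `tp.text2list`):

```python
import os
import tempfile

def list_to_text(mylist, path):
    """Write one list item per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(mylist) + "\n")

def text_to_list(path):
    """Read a text file back into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
list_to_text(["one", "two", "three"], path)
print(text_to_list(path))  # ['one', 'two', 'three']
```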

### Count Characters of a Sentence

This function counts the total characters of a sentence.

```python
tp.count_chars(myfile="file.txt")
```

### Convert Excel to Multiple Text Files

This function converts an Excel file's columns into multiple text files.

```python
tp.excel2multitext(excel_file_path="",
                    column_names=None,
                    src_file="",
                    tgt_file="",
                    aligns_file="",
                    separator="|||",
                    src_tgt_file="",
                    )
```
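The column-splitting idea can be sketched with the stdlib `csv` module (a hypothetical stand-in for the Excel case, since reading `.xlsx` requires an extra dependency; `columns_to_files` is not part of the library):

```python
import csv
import os
import tempfile

def columns_to_files(csv_path, out_dir):
    """Write each column of a headered CSV file to its own text file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fields = reader.fieldnames or []
    paths = {}
    for name in fields:
        path = os.path.join(out_dir, f"{name}.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(row[name] for row in rows) + "\n")
        paths[name] = path
    return paths
```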

### Apply a function in whole text file

In place of `function_name`, you can pass any function, and that function will be applied to the whole text file.

```python
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
    function_name, 
    myfile_path="myfile.txt", 
    modified_file_path="modified_file.txt"
)
```
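The line-by-line application can be sketched as follows (`apply_to_file` is a hypothetical helper illustrating the idea behind `tp.apply_whole`):

```python
import os
import tempfile

def apply_to_file(func, in_path, out_path):
    """Apply `func` to every line of in_path, writing the results to out_path."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(func(line.rstrip("\n")) + "\n")

# Example: upper-case every line of a small file.
folder = tempfile.mkdtemp()
in_path = os.path.join(folder, "myfile.txt")
out_path = os.path.join(folder, "modified_file.txt")
with open(in_path, "w", encoding="utf-8") as f:
    f.write("one\ntwo\n")
apply_to_file(str.upper, in_path, out_path)
print(open(out_path, encoding="utf-8").read())  # ONE / TWO, one per line
```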


            
