<div align="center">
<img src="https://github.com/MusfiqDehan/data-preprocessors/raw/master/branding/logo.png">
<p>Data Preprocessors</p>
<sub>An easy-to-use tool for Data Preprocessing, especially Text Preprocessing</sub>
<!-- Badges -->
<!-- [<img src="https://deepnote.com/buttons/launch-in-deepnote-small.svg">](PROJECT_URL) -->
[![](https://img.shields.io/pypi/v/data-preprocessors.svg)](https://pypi.org/project/data-preprocessors/)
[![Downloads](https://img.shields.io/pypi/dm/data-preprocessors)](https://pepy.tech/project/data-preprocessors)
<!-- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mJuRfIz__uS3xoFaBsFn5mkLE418RU19?usp=sharing)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/keras-team/keras-io/blob/master/examples/vision/ipynb/mnist_convnet.ipynb) -->
</div>
## **Table of Contents**
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Features](#features)
    - [Split Textfile](#split-textfile)
    - [Build Parallel Corpus](#build-parallel-corpus)
    - [Separate Parallel Corpus](#separate-parallel-corpus)
    - [Decontracting Words from Sentence](#decontracting-words-from-sentence)
    - [Remove Punctuation](#remove-punctuation)
    - [Space Punctuation](#space-punctuation)
    - [Text File to List](#text-file-to-list)
    - [Text File to Dataframe](#text-file-to-dataframe)
    - [List to Text File](#list-to-text-file)
    - [Remove File](#remove-file)
    - [Count Characters of a Sentence](#count-characters-of-a-sentence)
    - [Count Words of a Sentence](#count-words-of-a-sentence)
    - [Count No of Lines in a Text File](#count-no-of-lines-in-a-text-file)
    - [Convert Excel to Multiple Text Files](#convert-excel-to-multiple-text-files)
    - [Merge Multiple Text Files](#merge-multiple-text-files)
    - **[Apply Any Function in a Full Text File](#apply-a-function-in-whole-text-file)**
## **Installation**
Install the latest stable release.<br>
**For Windows**<br>
```bash
pip install -U data-preprocessors
```
**For Linux/WSL2**<br>
```bash
pip3 install -U data-preprocessors
```
## **Quick Start**
```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
# bla bla bla bla
```
## **Features**
### Split Textfile
This function splits your text file into three separate files: train, validation, and test. By changing the `shuffle` and `seed` values, you can randomly shuffle the lines of your text file.
```python
from data_preprocessors import text_preprocessor as tp
tp.split_textfile(
    main_file_path="example.txt",
    train_file_path="splitted/train.txt",
    val_file_path="splitted/val.txt",
    test_file_path="splitted/test.txt",
    train_size=0.6,
    val_size=0.2,
    test_size=0.2,
    shuffle=True,
    seed=42
)
# Total lines: 500
# Train set size: 300
# Validation set size: 100
# Test set size: 100
```
### Separate Parallel Corpus
By using this function, you can easily split a combined `src_tgt_file` into a separate `src_file` and `tgt_file`.
```python
from data_preprocessors import text_preprocessor as tp
tp.separate_parallel_corpus(src_tgt_file="", separator="|||", src_file="", tgt_file="")
```
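A hypothetical call might look like the following, assuming each line of the combined file contains a source and a target segment separated by `|||` (all file names here are placeholders):
```python
from data_preprocessors import text_preprocessor as tp

# corpus.txt, src.txt and tgt.txt are placeholder file names for illustration
tp.separate_parallel_corpus(
    src_tgt_file="corpus.txt",  # each line: "source ||| target"
    separator="|||",
    src_file="src.txt",         # receives the source side of every line
    tgt_file="tgt.txt"          # receives the target side of every line
)
```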
### Decontracting Words from Sentence
```python
tp.decontracting_words(sentence)
```
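A minimal usage sketch, assuming the function expands common English contractions (for example, "can't" to "can not"); the exact expansions depend on the library's internal mapping:
```python
from data_preprocessors import text_preprocessor as tp

sentence = "I can't believe it won't work."
# Assumed behaviour: contractions are expanded, e.g. "can't" -> "can not"
print(tp.decontracting_words(sentence))
```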
### Remove Punctuation
By using this function, you can remove the punctuation from a sentence (a single line of a text file).
```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.remove_punc(sentence)
print(sentence)
# bla bla bla bla
```
### Space Punctuation
By using this function, you can add a space on both sides of each punctuation mark, which makes the sentence easier to tokenize. It works on a single line of a text file, but if you want, you can apply it to a whole text file as well.
```python
from data_preprocessors import text_preprocessor as tp
sentence = "bla! bla- ?bla ?bla."
sentence = tp.space_punc(sentence)
print(sentence)
# bla ! bla - ? bla ? bla .  (punctuation separated by spaces; exact spacing may vary)
```
### Text File to List
Convert any text file into a list.
```python
mylist = tp.text2list(myfile_path="myfile.txt")
```
### List to Text File
Convert any list into a text file (e.g. `myfile.txt`).
```python
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
```
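As a quick illustration, the two helpers can be combined for a round trip (the list contents below are placeholders):
```python
from data_preprocessors import text_preprocessor as tp

# Placeholder data for illustration
mylist = ["first line", "second line", "third line"]

# Write the list to a text file, then read it back into a list
tp.list2text(mylist=mylist, myfile_path="myfile.txt")
recovered = tp.text2list(myfile_path="myfile.txt")
print(recovered)
```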
### Count Characters of a Sentence
This function counts the total characters of a sentence.
```python
tp.count_chars(myfile="file.txt")
```
### Convert Excel to Multiple Text Files
This function converts an Excel file's columns into multiple text files.
```python
tp.excel2multitext(excel_file_path="",
column_names=None,
src_file="",
tgt_file="",
aligns_file="",
separator="|||",
src_tgt_file="",
)
```
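A hypothetical, filled-in call is sketched below; all file names are placeholders, and the value passed to `column_names` is an assumption about the Excel file's column headers:
```python
from data_preprocessors import text_preprocessor as tp

# All file and column names below are placeholders for illustration
tp.excel2multitext(
    excel_file_path="dataset.xlsx",
    column_names=["source", "target"],  # assumed column headers in the Excel file
    src_file="src.txt",                 # source column written line by line
    tgt_file="tgt.txt",                 # target column written line by line
    aligns_file="aligns.txt",
    separator="|||",
    src_tgt_file="src_tgt.txt",         # combined "source ||| target" lines
)
```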
### Apply a function in whole text file
In place of `function_name`, you can use any function, and that function will be applied to the whole text file.
```python
from data_preprocessors import text_preprocessor as tp
tp.apply_whole(
function_name,
myfile_path="myfile.txt",
modified_file_path="modified_file.txt"
)
```
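For example, to remove punctuation from every line of a file, you could pass the library's own `remove_punc` function (the output file name is a placeholder):
```python
from data_preprocessors import text_preprocessor as tp

# Apply remove_punc to a whole file; file names are placeholders for illustration
tp.apply_whole(
    tp.remove_punc,
    myfile_path="myfile.txt",
    modified_file_path="myfile_no_punc.txt"
)
```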