# Text Cleaning Pipeline
[build-pkg](https://github.com/szymonrucinski/pippi-lang/actions/workflows/build-pkg.yml) [check-style](https://github.com/szymonrucinski/pippi-lang/actions/workflows/check-style.yml) [run-tests](https://github.com/szymonrucinski/pippi-lang/actions/workflows/run-tests.yml)
___
## Description
This package provides a pipeline for pre-processing text data for sentiment analysis. It includes steps for removing stop words, stripping HTML tags, changing letter case, and removing punctuation.
*Future versions will include text transformations such as word embedding and word vectorization.*
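
To make the four cleaning steps concrete, here is a minimal standard-library sketch of what each one does to a single string. It is an illustration only, not the package's implementation: the function names, the tiny `STOP_WORDS` set, and the order of the steps are all assumptions made for this example.

``` python
import html
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to"}


def remove_html_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def remove_stop_words(text: str) -> str:
    """Keep only whitespace-separated tokens that are not stop words."""
    return " ".join(t for t in text.split() if t.lower() not in STOP_WORDS)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def change_case(text: str, case_transform: str = "lower") -> str:
    """Upper-case the text if requested, otherwise lower-case it."""
    return text.upper() if case_transform == "upper" else text.lower()


def clean(text: str) -> str:
    """Apply the four steps in a fixed (assumed) order."""
    for step in (remove_html_tags, remove_stop_words, remove_punctuation, change_case):
        text = step(text)
    return text


print(clean("<p>The movie is GREAT, and fun!</p>"))  # movie great fun
```

Stop words are removed before punctuation here so that a token such as `and,` is still matched only when it is written without trailing punctuation; a real pipeline may prefer a different order.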
### Example
Elegant data pipelines are a key component of any data science project. They allow you to automate the process of cleaning, transforming, and analyzing data. The code below is a simple example of how to build a pipeline for text data using custom transformers and the scikit-learn `Pipeline` class.
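``` python
from pippi import (
    TransformLettersSize,
    RemoveStopWords,
    Lemmatize,  # exported by the package, not used in this pipeline
    RemovePunctuation,
    RemoveHTMLTags,
)
from sklearn.pipeline import Pipeline
import pandas as pd

# Placeholder input for illustration; replace it with your own DataFrame
# that has "review" and "sentiment" text columns.
df = pd.DataFrame(
    {
        "review": ["<br />An excellent movie, truly!", "Not the best film..."],
        "sentiment": ["positive", "negative"],
    }
)

# Each step is a (name, transformer) pair; steps run in the order listed.
pipeline = Pipeline(
    steps=[
        ("remove_stop_words", RemoveStopWords(columns=["review", "sentiment"])),
        ("remove_html_tags", RemoveHTMLTags(columns=df.columns.to_list())),
        ("uppercase_letters", TransformLettersSize(columns=["sentiment"], case_transform="upper")),
        ("remove_punctuation", RemovePunctuation(columns=["review"])),
    ]
)

# Run every step, then wrap the result in a DataFrame with named columns.
output = pipeline.fit_transform(df)
cleaned_df = pd.DataFrame(output, columns=["review", "sentiment"])
```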
Pipeline Visualization:
``` markdown
[RemoveStopWords] -> [RemoveHTMLTags] -> [TransformLettersSize] -> [RemovePunctuation]
```
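
The example above relies on custom transformers that act on selected DataFrame columns. As a rough sketch of how such a transformer can be written against scikit-learn's `BaseEstimator` and `TransformerMixin`, consider the two classes below. `StripPunctuation` and `ChangeCase` are hypothetical names invented for this illustration; they are not the package's classes and may behave differently from them (for instance, they return a DataFrame directly).

``` python
import string

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class StripPunctuation(BaseEstimator, TransformerMixin):
    """Hypothetical column-wise transformer that deletes ASCII punctuation."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: the transformation is stateless.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        table = str.maketrans("", "", string.punctuation)
        for col in self.columns:
            X[col] = X[col].astype(str).str.translate(table)
        return X


class ChangeCase(BaseEstimator, TransformerMixin):
    """Hypothetical column-wise transformer that upper- or lower-cases text."""

    def __init__(self, columns, case_transform="lower"):
        self.columns = columns
        self.case_transform = case_transform

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            text = X[col].astype(str)
            X[col] = text.str.upper() if self.case_transform == "upper" else text.str.lower()
        return X


# Small made-up input, used only to exercise the sketch.
df = pd.DataFrame(
    {
        "review": ["Great film!!", "Bad, really bad."],
        "sentiment": ["positive", "negative"],
    }
)

pipeline = Pipeline(
    steps=[
        ("strip_punctuation", StripPunctuation(columns=["review"])),
        ("uppercase_sentiment", ChangeCase(columns=["sentiment"], case_transform="upper")),
    ]
)
out = pipeline.fit_transform(df)
```

Storing the constructor arguments unchanged on `self` is what lets `BaseEstimator` provide `get_params` and `set_params`, and `TransformerMixin` supplies `fit_transform` from `fit` and `transform`.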
Raw data
{
"_id": null,
"home_page": "https://github.com/szymonrucinski/pippi-lang",
"name": "pippi-lang",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "python,stream,sockets",
"author": "Szymon Ruci\u0144ski",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/d9/d5/13e4af263ad2bd7b9f5d5e88ec0fcb9256176c3b8976b52f229007008ad9/pippi-lang-0.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A simple package to create elegant nlp pipelines using sklearn.",
"version": "0.0.2",
"split_keywords": [
"python",
"stream",
"sockets"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "279b916682ccd746db633150d8454d62e50d4d2096464afd52ab2535d94d3e87",
"md5": "60f5a0cb1d08db5922d3c0e4c24076cb",
"sha256": "a3f510d06413b43a8a8a5ec7d1702ce608593bbe8c355bfd018565352ca5e79b"
},
"downloads": -1,
"filename": "pippi_lang-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "60f5a0cb1d08db5922d3c0e4c24076cb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 9292,
"upload_time": "2023-02-10T23:56:57",
"upload_time_iso_8601": "2023-02-10T23:56:57.012647Z",
"url": "https://files.pythonhosted.org/packages/27/9b/916682ccd746db633150d8454d62e50d4d2096464afd52ab2535d94d3e87/pippi_lang-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "d9d513e4af263ad2bd7b9f5d5e88ec0fcb9256176c3b8976b52f229007008ad9",
"md5": "9cb87a17e16817d7a8234f3bebd4d5f3",
"sha256": "01548a042fad6770b8551647e6eead9ed85a520074afb96d758a6ba7b2529343"
},
"downloads": -1,
"filename": "pippi-lang-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "9cb87a17e16817d7a8234f3bebd4d5f3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6300,
"upload_time": "2023-02-10T23:56:59",
"upload_time_iso_8601": "2023-02-10T23:56:59.731157Z",
"url": "https://files.pythonhosted.org/packages/d9/d5/13e4af263ad2bd7b9f5d5e88ec0fcb9256176c3b8976b52f229007008ad9/pippi-lang-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-02-10 23:56:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "szymonrucinski",
"github_project": "pippi-lang",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "pippi-lang"
}