# NLPretext
<p align="center">
<img src="/references/logo_nlpretext.png" />
</p>
<div align="center">
[![CI status](https://github.com/artefactory/NLPretext/actions/workflows/ci.yml/badge.svg?branch%3Amain&event%3Apush)](https://github.com/artefactory/NLPretext/actions/workflows/ci.yml?query=branch%3Amain)
[![CD status](https://github.com/artefactory/NLPretext/actions/workflows/cd.yml/badge.svg?event%3Arelease)](https://github.com/artefactory/NLPretext/actions/workflows/cd.yml?query=event%3Arelease)
[![Python Version](https://img.shields.io/badge/Python-3.8-informational.svg)](#supported-python-versions)
[![Dependencies Status](https://img.shields.io/badge/dependabots-active-informational.svg)](https://github.com/artefactory/NLPretext/pulls?utf8=%E2%9C%93&q=is%3Apr%20author%3Aapp%2Fdependabot)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-informational.svg)](https://github.com/PyCQA/bandit)
[![Pre-commit](https://img.shields.io/badge/pre--commit-enabled-informational?logo=pre-commit&logoColor=white)](https://github.com/artefactory/NLPretext/blob/main/.pre-commit-config.yaml)
[![Semantic Versions](https://img.shields.io/badge/%F0%9F%9A%80-semantic%20versions-informational.svg)](https://github.com/artefactory/NLPretext/releases)
[![Documentation](https://img.shields.io/badge/doc-sphinx-informational.svg)](https://github.com/artefactory/NLPretext/tree/main/docs)
[![License](https://img.shields.io/badge/License-Apache%20Software%20License%202.0-informational.svg)](https://github.com/artefactory/NLPretext/blob/main/LICENSE)
All the goto functions you need to handle NLP use-cases, integrated in NLPretext
</div>
# TL;DR
> *Working on an NLP project and tired of always looking for the same silly preprocessing functions on the web?* :tired_face:
> *Need to efficiently extract email addresses from a document? Hashtags from tweets? Remove accents from a French post?* :disappointed_relieved:
**NLPretext has got you covered!** :rocket:
NLPretext packages into a **single** library all the text **preprocessing** functions you need to **ease** your NLP project.
:mag: Quickly explore our preprocessing pipelines and individual functions below.
* [Default preprocessing pipeline](#default_pipeline)
* [Custom preprocessing pipeline](#custom_pipeline)
* [Replacing phone numbers](#replace_phone_numbers)
* [Removing hashtags](#remove_hashtags)
* [Extracting emojis](#extract_emojis)
* [Data augmentation](#data_augmentation)
Cannot find what you were looking for? Feel free to open an [issue](https://github.com/artefactory/nlpretext/issues).
# Installation
### Supported Python Versions
- Main version supported: `3.8`
- Other supported versions: `3.9`, `3.10`
We strongly advise you to do the remaining steps in a virtual environment.
To install this library from PyPI, run the following command:
```bash
pip install nlpretext
```
or with `Poetry`
```bash
poetry add nlpretext
```
# Usage
## Default pipeline <a name="default_pipeline"></a>
Need to preprocess your text data but have no clue which functions to use or in which order? The default preprocessing pipeline has you covered:
```python
from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life!!! I recommend"
```
## Create your custom pipeline <a name="custom_pipeline"></a>
If you know exactly which functions to apply to your data, you can build your own pipeline instead. Here is an example:
```python
from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters,
remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"
```
Take a look at all the functions that are available [here](https://github.com/artefactory/NLPretext/tree/master/nlpretext) in the ```preprocess.py``` scripts in the different folders: basic, social, token.
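To make the `pipe` / `run` pattern above more concrete, here is a minimal, standard-library-only sketch of how such a chain of functions can work. `MiniPipeline`, `strip_hashtags`, `collapse_spaces` and `truncate` are hypothetical names made up for this illustration; this is not NLPretext's actual `Preprocessor` implementation, only a picture of the idea that each step receives the output of the previous one, with optional keyword arguments passed through `args`.

```python
import re
from typing import Any, Callable, Dict, List, Optional, Tuple


class MiniPipeline:
    """Hypothetical stand-in for illustration; not NLPretext's Preprocessor."""

    def __init__(self) -> None:
        self._steps: List[Tuple[Callable[..., str], Dict[str, Any]]] = []

    def pipe(self, func: Callable[..., str], args: Optional[Dict[str, Any]] = None) -> None:
        # Store the function with its keyword arguments, in call order.
        self._steps.append((func, args or {}))

    def run(self, text: str) -> str:
        # Feed the output of each step into the next one.
        for func, kwargs in self._steps:
            text = func(text, **kwargs)
        return text


def strip_hashtags(text: str) -> str:
    # Hypothetical helper: drop '#' followed by word characters.
    return re.sub(r"#\w+", "", text)


def collapse_spaces(text: str) -> str:
    # Hypothetical helper: squeeze whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def truncate(text: str, max_chars: int) -> str:
    # Hypothetical helper that takes an extra keyword argument.
    return text[:max_chars]


pipeline = MiniPipeline()
pipeline.pipe(str.lower)
pipeline.pipe(strip_hashtags)
pipeline.pipe(collapse_spaces)
pipeline.pipe(truncate, args={"max_chars": 12})
result = pipeline.run("Great   DINNER tonight #food #paris \n")
print(result)
# great dinner
```

Order matters in this kind of chain: in the sketch, whitespace is collapsed only after hashtags are removed, so the gaps they leave behind are cleaned up too.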
## Load text data
Preprocessing is only useful once you have text to process. Loading text as strings is simple when you have a few short texts in a local .txt file, but it quickly becomes difficult when you need to load many texts stored in several formats and split across multiple files. Fortunately, NLPretext's TextLoader class makes importing text data easy.
The TextLoader class relies on Dask, so make sure Dask is installed before using it.
```python
from nlpretext.textloader import TextLoader
files_path = "local_folder/texts/text.txt"
text_loader = TextLoader()
text_dataframe = text_loader.read_text(files_path)
print(text_dataframe.text.values.tolist())
# ["I just got the best dinner in my life!!!", "I recommend", "It was awesome"]
```
Because TextLoader uses Dask to load data, the file path can be given as a string or a list of strings, with or without wildcards. It also supports loading from cloud providers, provided your machine is authenticated on the relevant project.
```python
text_loader = TextLoader(text_column="name_of_text_column_in_your_data")
local_file_path = "local_folder/texts/text.csv" # File from local folder
local_corpus_path = ["local_folder/texts/text_1.csv", "local_folder/texts/text_2.csv", "local_folder/texts/text_3.csv"] # Multiple files from local folder
gcs_file_path = "gs://my-bucket/texts/text.json" # File from GCS
s3_file_path = "s3://my-bucket/texts/text.json" # File from S3
hdfs_file_path = "hdfs://folder/texts/text.txt" # File from HDFS
azure_file_path = "az://my-bucket/texts/text.parquet" # File from Azure
gcs_corpus_path = "gs://my-bucket/texts/text_*.json" # Multiple files from GCS with wildcard
text_dataframe_1 = text_loader.read_text(local_file_path)
text_dataframe_2 = text_loader.read_text(local_corpus_path)
text_dataframe_3 = text_loader.read_text(gcs_file_path)
text_dataframe_4 = text_loader.read_text(s3_file_path)
text_dataframe_5 = text_loader.read_text(hdfs_file_path)
text_dataframe_6 = text_loader.read_text(azure_file_path)
text_dataframe_7 = text_loader.read_text(gcs_corpus_path)
```
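Before handing a local wildcard pattern to the loader, it can help to check which files it actually matches. The sketch below uses only Python's standard `glob` module on a temporary folder; the file names are made up for the example, and Dask's own path resolution (especially for remote filesystems such as GCS or S3) may behave differently, so treat this as a local sanity check rather than a description of what TextLoader does internally.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create three small files standing in for a local corpus.
    for index in (1, 2, 3):
        Path(tmp_dir, f"text_{index}.csv").write_text("text\nhello\n", encoding="utf-8")
    # A file that should NOT be picked up by the pattern.
    Path(tmp_dir, "notes.md").write_text("not part of the corpus", encoding="utf-8")

    pattern = str(Path(tmp_dir) / "text_*.csv")
    matched_names = sorted(Path(p).name for p in glob.glob(pattern))

print(matched_names)
# ['text_1.csv', 'text_2.csv', 'text_3.csv']
```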
You can also specify a Preprocessor if you want your data to be directly pre-processed when loaded.
```python
from nlpretext import Preprocessor
from nlpretext.textloader import TextLoader
text_loader = TextLoader(text_column="text_col")
preprocessor = Preprocessor()
file_path = "local_folder/texts/text.csv" # File from local folder
raw_text_dataframe = text_loader.read_text(file_path)
preprocessed_text_dataframe = text_loader.read_text(file_path, preprocessor=preprocessor)
print(raw_text_dataframe.text_col.values.tolist())
# ["These texts are not preprocessed", "This is bad ## "]
print(preprocessed_text_dataframe.text_col.values.tolist())
# ["These texts are not preprocessed", "This is bad"]
```
## Individual Functions
### Replacing emails <a name="replace_emails"></a>
```python
from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to obama@whitehouse.gov"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"
```
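For intuition about what a replacement function of this kind does, here is a rough standard-library sketch. The `mask_emails` helper and its deliberately simple regular expression are assumptions made for this illustration; they are not the pattern used by `replace_emails`, and real-world email syntax is broader than what this pattern accepts.

```python
import re

# Deliberately simple pattern (an assumption for illustration only):
# a local part, '@', then a domain with at least one dot-separated label.
SIMPLE_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def mask_emails(text: str, replace_with: str = "*EMAIL*") -> str:
    # Substitute every match of the simplified pattern with the placeholder.
    return SIMPLE_EMAIL_RE.sub(replace_with, text)


masked = mask_emails("I have forwarded this email to obama@whitehouse.gov")
print(masked)
# I have forwarded this email to *EMAIL*
```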
### Replacing phone numbers <a name="replace_phone_numbers"></a>
```python
from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"
```
### Removing Hashtags <a name="remove_hashtags"></a>
```python
from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"
```
### Extracting emojis <a name="extract_emojis"></a>
```python
from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']
```
## Data augmentation <a name="data_augmentation"></a>
The augmentation module helps you **generate new texts** from your examples by modifying some of their words while **keeping associated entities unchanged**, if any, in the case of **NER tasks**. If you want words other than entities to remain unchanged, you can list them in the `stopwords` argument. The modifications depend on the chosen method. The methods currently supported are **substitutions with synonyms**, using either Wordnet or BERT from the [`nlpaug`](https://github.com/makcedward/nlpaug) library.
```python
from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
print(example)
# "I need to buy a small black pocketbook please."
```
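Since entities are described by character offsets, a misaligned `startCharIndex` or `endCharIndex` is easy to introduce by hand. The helper below is a small sketch, not part of NLPretext, that checks each entity's span against its `word`. It assumes `endCharIndex` is exclusive (Python slice style), which is inferred from the example above where `black` spans 22 to 27; `find_misaligned_entities` is a hypothetical name.

```python
from typing import Dict, List, Union

Entity = Dict[str, Union[str, int]]


def find_misaligned_entities(text: str, entities: List[Entity]) -> List[Entity]:
    """Return the entities whose character span does not spell out their 'word'."""
    misaligned = []
    for entity in entities:
        start = int(entity["startCharIndex"])
        end = int(entity["endCharIndex"])  # assumed exclusive, as in text[start:end]
        if text[start:end] != entity["word"]:
            misaligned.append(entity)
    return misaligned


example = "I want to buy a small black handbag please."
entities = [{"entity": "Color", "word": "black", "startCharIndex": 22, "endCharIndex": 27}]
problems = find_misaligned_entities(example, entities)
print(problems)
# []
```

An empty list means every entity lines up with the text, so the offsets can be passed on with more confidence.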
# 📈 Releases
You can see the list of available releases on the [GitHub Releases](https://github.com/artefactory/NLPretext/releases) page.
We follow the [Semantic Versions](https://semver.org/) specification.
We use [`Release Drafter`](https://github.com/marketplace/actions/release-drafter). As pull requests are merged, a draft release is kept up-to-date listing the changes, ready to publish when you're ready. With the categories option, you can categorize pull requests in release notes using labels.
For Pull Requests, these labels are configured, by default:
| **Label** | **Title in Releases** |
| :-----------------------------------: | :---------------------: |
|       `enhancement`, `feature`        |       🚀 Features       |
| `bug`, `refactoring`, `bugfix`, `fix` | 🔧 Fixes & Refactoring  |
|       `build`, `ci`, `testing`        | 📦 Build System & CI/CD |
|              `breaking`               |   💥 Breaking Changes   |
|            `documentation`            |    📝 Documentation     |
|            `dependencies`             | ⬆️ Dependencies updates |
GitHub creates the `bug`, `enhancement`, and `documentation` labels automatically. Dependabot creates the `dependencies` label. Create the remaining labels on the Issues tab of the GitHub repository, when needed.
## 🛡 License
[![License](https://img.shields.io/github/license/artefactory/NLPretext)](https://github.com/artefactory/NLPretext/blob/main/LICENSE)
This project is licensed under the terms of the `Apache Software License 2.0` license. See [LICENSE](https://github.com/artefactory/NLPretext/blob/main/LICENSE) for more details.
## 📃 Citation
```
@misc{nlpretext,
author = {artefactory},
title = {All the goto functions you need to handle NLP use-cases, integrated in NLPretext},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/artefactory/NLPretext}}
}
```
# Project Organization
------------

    ├── LICENSE
    ├── CONTRIBUTING.md     <- Contribution guidelines
    ├── CODE_OF_CONDUCT.md  <- Code of conduct guidelines
    ├── Makefile
    ├── README.md           <- The top-level README for developers using this project.
    ├── .github/workflows   <- Where the CI and CD live
    ├── datasets/external   <- Bash scripts to download external datasets
    ├── docker              <- All you need to build a Docker image from that package
    ├── docs                <- Sphinx HTML documentation
    ├── nlpretext           <- Main Package. This is where the code lives
    │   ├── preprocessor.py <- Main preprocessing script
    │   ├── text_loader.py  <- Main loading script
    │   ├── augmentation    <- Text augmentation script
    │   ├── basic           <- Basic text preprocessing
    │   ├── cli             <- Command lines that can be used
    │   ├── social          <- Social text preprocessing
    │   ├── token           <- Token text preprocessing
    │   ├── _config         <- Where the configuration and constants live
    │   └── _utils          <- Where preprocessing utils scripts live
    ├── tests               <- Where the tests live
    ├── pyproject.toml      <- Package configuration
    ├── poetry.lock
    └── setup.cfg           <- Configuration for plugins and other utils

# Credits
- [textacy](https://github.com/chartbeat-labs/textacy) for the following basic preprocessing functions:
  - `fix_bad_unicode`
  - `normalize_whitespace`
  - `unpack_english_contractions`
  - `replace_urls`
  - `replace_emails`
  - `replace_numbers`
  - `replace_currency_symbols`
  - `remove_punct`
  - `remove_accents`
  - `replace_phone_numbers` *(with some modifications of our own)*