# NLPUtilsBERT
Imagine you're a data scientist tasked with building a text classification model. You start by cleaning and exploring
the data, creating visualizations, training models, and evaluating performance. Each step requires writing boilerplate
code, debugging, and fine-tuning.
NLPUtilsBERT eliminates this pain by providing ready-to-use tools for:
- **Text EDA:** Quickly generate insights like word frequency plots and word clouds.
- **Text Classification:** Fine-tune pre-trained models with a few lines of code, saving hours of development.
- **Evaluation:** Automatically generate key metrics like accuracy, F1 score, and confusion matrices.

With NLPUtilsBERT, you focus on the results, not the repetitive code. Whether you're a beginner or an experienced data
scientist, this package accelerates your workflow and ensures high-quality outcomes.
---
## Package Description
**NLPUtilsBERT** is a Python package for **text analysis** and **classification**, combining:
1. **Text Exploratory Data Analysis (EDA):** Tools to explore, visualize, and understand text data, including
tokenization, word frequency analysis, and visualizations (word clouds, bar charts).
2. **Text Classification:** Simplifies text classification using **PyTorch** and **Hugging Face Transformers**, with
support for model training, evaluation, and predictions.
---
## Features
- **EDA:** Tokenization, word frequency analysis, word clouds, and visualizations.
- **Classification:** Fine-tune pre-trained models, early stopping, and checkpointing.
- **Evaluation Metrics:** Accuracy, F1 score, confusion matrix, and ROC curve.
- **Visualization:** Training loss, confusion matrix, and evaluation metric plots.
---
## Installation
Install the package via pip:
```bash
pip install NLPUtilsBERT
```
---
## Data Requirements
Prepare a CSV file with two columns:
- **category:** Contains class labels.
- **text:** Contains cleaned text data.
Example:
| category | text |
|--------------|------------------------------------|
| sports | I love playing football. |
| business | This is my new business venture. |
| technology | My Chrome browser is not working. |
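Before running EDA or training, it can help to confirm that a file actually matches this layout. The following is a minimal sketch using only the Python standard library; `validate_rows` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration and are not part of NLPUtilsBERT.

```python
import csv
import io

# Column names taken from the data requirements above.
REQUIRED_COLUMNS = ("category", "text")


def validate_rows(csv_file):
    """Return the rows of csv_file, raising ValueError on schema problems."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    rows = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        for col in REQUIRED_COLUMNS:
            # A short row yields None for the missing field.
            if not (row[col] or "").strip():
                raise ValueError(f"row {row_no}: empty '{col}' value")
        rows.append(row)
    return rows


# In-memory sample standing in for a real CSV file.
sample = io.StringIO(
    "category,text\n"
    "sports,I love playing football.\n"
    "business,This is my new business venture.\n"
)
rows = validate_rows(sample)
print(len(rows), "valid rows")
```

For a file on disk, the same function could be called as `validate_rows(open(path, newline="", encoding="utf-8"))`.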
---
## Directory Structure
```
/EDA                    # eda_folder passed to TextEDA
    /plots              # Stores category frequency and word cloud plots
/MODEL                  # model_folder passed to TextClassificationModel
    /saved_model        # Model files
    /saved_tokenizer    # Tokenizer files
    /checkpoints        # Training checkpoints
    /plots              # Evaluation result plots
```
---
## Evaluation Metrics
- **Accuracy:** Percentage of correct predictions.
- **F1 Score:** Harmonic mean of precision and recall.
- **Confusion Matrix:** Counts of predictions for each pair of true and predicted class, showing which classes are confused with each other.
- **ROC Curve:** Trade-off between true positive and false positive rates.
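To make these definitions concrete, here is a small worked sketch in plain Python. It is not the package's implementation (the function names are hypothetical, and the labels are made-up illustration data); it only shows how accuracy, a confusion matrix, and a per-class F1 score follow from lists of true and predicted labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Map (true_label, predicted_label) pairs to how often they occur."""
    return Counter(zip(y_true, y_pred))


def f1_for_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are 0 or undefined.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
y_true = ["sports", "sports", "business", "technology"]
y_pred = ["sports", "business", "business", "technology"]

print("accuracy:", accuracy(y_true, y_pred))
print("sports predicted as business:",
      confusion_counts(y_true, y_pred)[("sports", "business")])
print("F1 (sports):", f1_for_class(y_true, y_pred, "sports"))
```

In a multi-class setting, per-class F1 scores are usually combined into a single number by macro or weighted averaging; which averaging NLPUtilsBERT reports is not stated here.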
---
## Usage
### Text EDA
Perform Exploratory Data Analysis on text data:
```python
import pandas as pd
from NLPUtilsBERT.Utils_NLP_EDA import TextEDA
# Load dataset
dataset_path = "path/to/your/file.csv"
df = pd.read_csv(dataset_path)
# Perform EDA
eda = TextEDA(dataframe=df,
              text_column="text",
              label_column="category",
              eda_folder="EDA",
              show_plots=False)
eda.perform_eda()
```
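For readers curious about what a word frequency analysis involves, the sketch below counts tokens with the standard library. It is an illustration of the general idea only: `word_frequencies` is a hypothetical helper, the regex tokenizer is a simplifying assumption, and `TextEDA` may tokenize and count differently.

```python
import re
from collections import Counter


def word_frequencies(texts, top_n=3):
    """Count lowercase word tokens across texts and return the top_n."""
    counts = Counter()
    for text in texts:
        # Simplistic tokenizer: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


# Made-up sentences for illustration only.
texts = [
    "I love playing football.",
    "I love my new business venture.",
]
print(word_frequencies(texts, top_n=2))
```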
---
### Text Classification
Train, evaluate, and predict using a text classification model:
```python
import pandas as pd
from NLPUtilsBERT.Utils_TextClassification_BERT import TextClassificationModel
# Configuration
dataset_path = "path/to/your/file.csv"
pretrained_model_name = 'bert-base-uncased' # Options: 'bert-base-uncased', 'distilbert-base-uncased'
batch_size = 16
learning_rate = 1e-7  # Very low; BERT fine-tuning commonly uses rates around 2e-5 to 5e-5
num_train_epochs = 50
early_stopping_patience = 5
weight_decay = 0.01
test_size = 0.2
val_size = 0.3
resume_from_checkpoints = True
random_state = 73
MODEL_FOLDER = "MODEL"
# Load dataset
df = pd.read_csv(dataset_path)
# Initialize and train the model
text_classifier = TextClassificationModel(
    pretrained_model_name=pretrained_model_name,
    batch_size=batch_size,
    learning_rate=learning_rate,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    model_folder=MODEL_FOLDER,
    early_stopping_patience=early_stopping_patience,
    test_size=test_size,
    val_size=val_size,
    random_state=random_state,
    resume_from_checkpoints=resume_from_checkpoints,
)
ds_train, ds_val, ds_test = text_classifier.create_datasets(df, target_column="category")
text_classifier.train(ds_train, ds_val)
# Evaluate the model
eval_results = text_classifier.evaluate(ds_test)
print('Evaluation results:', eval_results)
# Make predictions
classifier = TextClassificationModel(model_folder=MODEL_FOLDER)
classifier.load_model()
for text in (
    "I love playing football.",
    "This is my business place.",
    "My Chrome browser is giving issues.",
):
    print(f"\n{text} : {classifier.predict(text)}")
```
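The `early_stopping_patience` setting above controls how long training continues without improvement. The sketch below shows the usual patience logic in plain Python. It is an assumption-laden illustration rather than the package's code: the `EarlyStopping` class is hypothetical, it assumes validation loss is the monitored metric (this README does not say which metric NLPUtilsBERT watches), and the loss values are made up.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Hypothetical per-epoch validation losses, for illustration only.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "with best loss", stopper.best_loss)
```

With a patience of 5 and up to 50 epochs, as in the configuration above, training under this logic would end early once five consecutive epochs fail to beat the best validation loss seen so far.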
---
## System Requirements
- **Python Version:** >= 3.11.9
- **Intended Audience:** Data Scientists
- **Operating System:** OS Independent
---
## Development and Contributions
Contributions are welcome!
- **Development Status:** TO BE UPDATED
- **How to Contribute:** Fork the repository, make changes, and submit a pull request.
---
## Version History
- **0.0.1:** Initial commit
---
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---
## Future Improvements
- Add support for more pre-trained models.
- Integrate additional visualization options.
- Enhance multi-label classification capabilities.
- Add class for Named Entity Recognition (NER) to extract entities like names, organizations, and locations.
---
## Acknowledgements
- Hugging Face Transformers [BERT]
- spaCy
- scikit-learn
---
## FAQ
**Coming Soon**
Raw data
{
"_id": null,
"home_page": "https://github.com/AeroVikas/NLPUtilsBERT.git",
"name": "NLPUtilsBERT",
"maintainer": "Vikas Goel",
"docs_url": null,
"requires_python": ">=3.11.9",
"maintainer_email": "vikas.aero@gmail.com",
"keywords": "Furuness, Login, login, terminal",
"author": "Vikas Goel",
"author_email": "vikas.aero@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/a0/8f/9b6fceedeef000bf246ea8edb7477093fa901b9038ca17a452f345e05662/nlputilsbert-0.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Common Utilities",
"version": "0.0.2",
"project_urls": {
"Bug Tracker": "https://github.com/pypa/sampleproject/issues",
"Download": "https://github.com/AeroVikas/NLPUtilsBERT.git",
"Homepage": "https://github.com/AeroVikas/NLPUtilsBERT.git"
},
"split_keywords": [
"furuness",
" login",
" login",
" terminal"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "998776f3d9fd21ba149a810f8d9e959eade1f1e3f6c7e34608c7c15b4e657c97",
"md5": "5d36c182b751f05681be0864439648d4",
"sha256": "12a7f966fd0f97aa11a980a9acc73375b0a7b8e605ef2b9747e295a4c4d7c56b"
},
"downloads": -1,
"filename": "NLPUtilsBERT-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5d36c182b751f05681be0864439648d4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11.9",
"size": 11582,
"upload_time": "2025-01-02T14:59:23",
"upload_time_iso_8601": "2025-01-02T14:59:23.001645Z",
"url": "https://files.pythonhosted.org/packages/99/87/76f3d9fd21ba149a810f8d9e959eade1f1e3f6c7e34608c7c15b4e657c97/NLPUtilsBERT-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a08f9b6fceedeef000bf246ea8edb7477093fa901b9038ca17a452f345e05662",
"md5": "57e89b4e6a95941b358a3c314dbc1084",
"sha256": "f0502f8b303d31e51693625fd92d769ad27d8881d9caa344b8b3abba33a90010"
},
"downloads": -1,
"filename": "nlputilsbert-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "57e89b4e6a95941b358a3c314dbc1084",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11.9",
"size": 11058,
"upload_time": "2025-01-02T14:59:25",
"upload_time_iso_8601": "2025-01-02T14:59:25.583450Z",
"url": "https://files.pythonhosted.org/packages/a0/8f/9b6fceedeef000bf246ea8edb7477093fa901b9038ca17a452f345e05662/nlputilsbert-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-02 14:59:25",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "AeroVikas",
"github_project": "NLPUtilsBERT",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pandas",
"specs": [
[
"==",
"2.2.3"
]
]
},
{
"name": "seaborn",
"specs": [
[
"==",
"0.13.2"
]
]
},
{
"name": "spacy",
"specs": [
[
"==",
"3.8.3"
]
]
},
{
"name": "wordcloud",
"specs": [
[
"==",
"1.9.4"
]
]
},
{
"name": "torch",
"specs": [
[
"==",
"2.5.1"
]
]
},
{
"name": "scikit_learn",
"specs": [
[
"==",
"1.6.0"
]
]
},
{
"name": "transformers",
"specs": [
[
"==",
"4.47.1"
]
]
},
{
"name": "torchvision",
"specs": [
[
"==",
"0.20.1"
]
]
},
{
"name": "accelerate",
"specs": [
[
"==",
"1.2.1"
]
]
}
],
"lcname": "nlputilsbert"
}