NLPUtilsBERT


Name: NLPUtilsBERT
Version: 0.0.2
Home page: https://github.com/AeroVikas/NLPUtilsBERT.git
Summary: Common Utilities
Upload time: 2025-01-02 14:59:25
Maintainer: Vikas Goel
Author: Vikas Goel
Requires Python: >=3.11.9
License: MIT
Keywords: furuness, login, terminal
Requirements: pandas, seaborn, spacy, wordcloud, torch, scikit_learn, transformers, torchvision, accelerate

# NLPUtilsBERT

Imagine you're a data scientist tasked with building a text classification model. You start by cleaning and exploring
the data, creating visualizations, training models, and evaluating performance. Each step requires writing boilerplate
code, debugging, and fine-tuning.

NLPUtilsBERT eliminates this pain by providing ready-to-use tools for:

- **Text EDA:** Quickly generate insights like word frequency plots and word clouds.
- **Text Classification:** Fine-tune pre-trained models with a few lines of code, saving hours of development.
- **Evaluation:** Automatically generate key metrics like accuracy, F1 score, and confusion matrices.

With NLPUtilsBERT, you focus on the results, not the repetitive code. Whether you're a beginner or an experienced data scientist, this package accelerates your workflow and helps ensure high-quality outcomes.

---

## Package Description

**NLPUtilsBERT** is a Python package for **text analysis** and **classification**, combining:

1. **Text Exploratory Data Analysis (EDA):** Tools to explore, visualize, and understand text data, including
   tokenization, word frequency analysis, and visualizations (word clouds, bar charts).
2. **Text Classification:** Simplifies text classification using **PyTorch** and **Hugging Face Transformers**, with
   support for model training, evaluation, and predictions.

---

## Features

- **EDA:** Tokenization, word frequency analysis, word clouds, and visualizations.
- **Classification:** Fine-tune pre-trained models, early stopping, and checkpointing.
- **Evaluation Metrics:** Accuracy, F1 score, confusion matrix, and ROC curve.
- **Visualization:** Training loss, confusion matrix, and evaluation metric plots.

---

## Installation

Install the package via pip:

```bash
pip install NLPUtilsBERT
```

---

## Data Requirements

Prepare a CSV file with two columns:

- **category:** Contains class labels.
- **text:** Contains cleaned text data.

Example:

| category   | text                              |
|------------|-----------------------------------|
| sports     | I love playing football.          |
| business   | This is my new business venture.  |
| technology | My Chrome browser is not working. |
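
A minimal sketch of preparing such a CSV with pandas (the rows are taken from the example table above; the file name `my_dataset.csv` is illustrative):

```python
import pandas as pd

# Build a small example dataset with the two required columns.
df = pd.DataFrame({
    "category": ["sports", "business", "technology"],
    "text": [
        "I love playing football.",
        "This is my new business venture.",
        "My Chrome browser is not working.",
    ],
})

# Save it in the CSV layout the package expects.
df.to_csv("my_dataset.csv", index=False)
```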

---

## Directory Structure

```
/EDA
    /plots              # Stores category frequency and word cloud plots
    
/MODELS
    /saved_model        # Model files
    /saved_tokenizer    # Tokenizer files
    /checkpoints        # Training checkpoints
    /plots              # Evaluation result plots
```
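
If you want to pre-create this layout yourself, a minimal sketch with `pathlib` (an assumption for convenience; the package may also create these folders on demand when it writes plots and checkpoints):

```python
from pathlib import Path

# Pre-create the expected output folders.
for folder in ["EDA/plots",
               "MODELS/saved_model",
               "MODELS/saved_tokenizer",
               "MODELS/checkpoints",
               "MODELS/plots"]:
    Path(folder).mkdir(parents=True, exist_ok=True)
```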

---

## Evaluation Metrics

- **Accuracy:** Percentage of correct predictions.
- **F1 Score:** Harmonic mean of precision and recall, reported as a weighted average across classes.
- **Confusion Matrix:** Counts of predicted versus actual labels for each class.
- **ROC Curve:** Trade-off between the true positive rate and the false positive rate.
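
NLPUtilsBERT computes these metrics for you during evaluation; for reference, the same quantities can be reproduced with scikit-learn (a minimal sketch using made-up labels for a three-class problem):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy ground-truth and predicted labels (illustrative only).
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
# ROC curves additionally require predicted class probabilities
# (see sklearn.metrics.roc_curve).
```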

---

## Usage

### Text EDA

Perform Exploratory Data Analysis on text data:

```python
import pandas as pd
from NLPUtilsBERT.Utils_NLP_EDA import TextEDA

# Load dataset
dataset_path = "path/to/your/file.csv"
df = pd.read_csv(dataset_path)

# Perform EDA
eda = TextEDA(dataframe=df,
              text_column="text",
              label_column="category",
              eda_folder="EDA",
              show_plots=False)
eda.perform_eda()
```
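
The generated plots are saved under the folder passed as `eda_folder` (here `EDA`, matching the `/EDA/plots` location in the directory structure above).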

---

### Text Classification

Train, evaluate, and predict using a text classification model:

```python
import pandas as pd
from NLPUtilsBERT.Utils_TextClassification_BERT import TextClassificationModel

# Configuration
dataset_path = "path/to/your/file.csv"
pretrained_model_name = 'bert-base-uncased'  # Options: 'bert-base-uncased', 'distilbert-base-uncased'
batch_size = 16
learning_rate = 1e-7
num_train_epochs = 50
early_stopping_patience = 5
weight_decay = 0.01
test_size = 0.2
val_size = 0.3
resume_from_checkpoints = True
random_state = 73
MODEL_FOLDER = "MODEL"

# Load dataset
df = pd.read_csv(dataset_path)

# Initialize and train the model
text_classifier = TextClassificationModel(pretrained_model_name=pretrained_model_name,
                                          batch_size=batch_size,
                                          learning_rate=learning_rate,
                                          num_train_epochs=num_train_epochs,
                                          weight_decay=weight_decay,
                                          model_folder=MODEL_FOLDER,
                                          early_stopping_patience=early_stopping_patience,
                                          test_size=test_size,
                                          val_size=val_size,
                                          random_state=random_state,
                                          resume_from_checkpoints=resume_from_checkpoints)

ds_train, ds_val, ds_test = text_classifier.create_datasets(df, target_column="category")
text_classifier.train(ds_train, ds_val)

# Evaluate the model
eval_results = text_classifier.evaluate(ds_test)
print('Evaluation results:', eval_results)

# Make predictions
classifier = TextClassificationModel(model_folder=MODEL_FOLDER)
classifier.load_model()

text = "I love playing football."               ; print(f"\n{text} : {classifier.predict(text)}")
text = "This is my business place."             ; print(f"\n{text} : {classifier.predict(text)}")
text = "My Chrome browser is giving issues."    ; print(f"\n{text} : {classifier.predict(text)}")
```
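
Continuing from the example above, the single-text `predict` call can also be looped over a labelled CSV as a quick sanity check. This is a sketch, not part of the package API, and it assumes `predict` returns the predicted category label:

```python
from sklearn.metrics import accuracy_score

# Assumes pandas is imported and `classifier` is loaded as in the example above.
eval_df = pd.read_csv("path/to/your/file.csv")
predictions = [classifier.predict(text) for text in eval_df["text"]]
print("Hold-out accuracy:", accuracy_score(eval_df["category"], predictions))
```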

---

## System Requirements

- **Python Version:** >= 3.11.9
- **Intended Audience:** Data Scientists
- **Operating System:** OS Independent
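
The published 0.0.2 release pins its dependencies in the package metadata:

- pandas 2.2.3
- seaborn 0.13.2
- spacy 3.8.3
- wordcloud 1.9.4
- torch 2.5.1
- scikit-learn 1.6.0
- transformers 4.47.1
- torchvision 0.20.1
- accelerate 1.2.1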

---

## Development and Contributions

Contributions are welcome!

- **Development Status:** TO BE UPDATED
- **How to Contribute:** Fork the repository, make changes, and submit a pull request.

---

## Version History

- **0.0.1:** Initial commit
- **0.0.2:** Current release on PyPI (January 2025)

---

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

---

## Future Improvements

- Add support for more pre-trained models.
- Integrate additional visualization options.
- Enhance multi-label classification capabilities.
- Add class for Named Entity Recognition (NER) to extract entities like names, organizations, and locations.

---

## Acknowledgements

- Hugging Face Transformers (BERT)
- spaCy
- scikit-learn

---

## FAQ

**Coming Soon**  

            
