![Logo](https://raw.githubusercontent.com/shanmukh05/scratch_nlp/main/assets/logo.png)
# Scratch NLP 🧠
A library of foundational NLP algorithms implemented from scratch using PyTorch.
## Table of Contents 📋
- [Documentation](#documentation-📝)
- [Installation](#installation-⬇️)
- [Features](#features-🛠️)
- [Examples](#examples-🌟)
- [Contributing](#contributing-🤝)
- [Acknowledgements](#acknowledgements-💡)
- [About Me](#about-me-👤)
- [Lessons Learned](#lessons-learned-📌)
- [License](#license-⚖️)
- [Feedback](#feedback-📣)
## Documentation 📝
[Documentation](https://shanmukh05.github.io/scratch_nlp/)
## Installation ⬇️
### Install using pip
```bash
pip install scratch-nlp
```
### Install Manually for development
Clone the repo
```bash
gh repo clone shanmukh05/scratch_nlp
```
Install dependencies
```bash
pip install -r requirements.txt
```
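After installing, a quick sanity check is to import the package (assuming the installed top-level package is importable as `scratch_nlp`, matching the import used in the Examples section below):
```python
# Verify the installation by importing the top-level package
import scratch_nlp

print(scratch_nlp.__name__)
```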
## Features 🛠️
- Algorithms
- Bag of Words (see the sketch after this list)
- Ngram
- TF-IDF
- Hidden Markov Model
- Word2Vec
- GloVe
- RNN (Many to One)
- LSTM (One to Many)
- GRU (Many to Many Synced)
- Seq2Seq + Attention (Many to Many)
- Transformer
- BERT
- GPT-2
- Tokenization
- Byte Pair Encoding
- WordPiece Tokenizer
- Metrics
- BLEU
- ROUGE (-N, -L, -S)
- Perplexity
- METEOR
- CIDEr
- Datasets
- IMDB Reviews Dataset
- Flickr Dataset
- NLTK POS Datasets (treebank, brown, conll2000)
- SQuAD QA Dataset
- Genius Lyrics Dataset
- LAMBADA Dataset
- Wiki en dataset
- English to Telugu Translation Dataset
- Tasks
- Sentiment Classification
- POS Tagging
- Image Captioning
- Machine Translation
- Question Answering
- Text Generation
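As a flavour of the from-scratch approach behind the algorithms listed above, here is a minimal Bag of Words sketch. It is an illustration only, not the library's API; the library's implementation adds preprocessing, frequency analysis, and the outputs listed in the table that follows.
```python
from collections import Counter


def bag_of_words(docs, vocab_size=5):
    """Encode each document as a vector of word counts over a
    corpus-level vocabulary (illustrative sketch, not the library API)."""
    tokenized = [doc.lower().split() for doc in docs]
    counts = Counter(word for tokens in tokenized for word in tokens)
    vocab = [word for word, _ in counts.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}

    vectors = []
    for tokens in tokenized:
        vec = [0] * len(vocab)
        for word in tokens:
            if word in index:
                vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors


docs = ["the movie was great", "the movie was awful", "great acting"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # most frequent words, e.g. ['the', 'movie', 'was', 'great', 'awful']
print(vectors)  # one count vector per document
```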
### Implementation Details
<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Task</th>
<th>Tokenization</th>
<th>Output</th>
<th>Dataset</th>
</tr>
</thead>
<tbody><tr>
<td><strong>BOW</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Text Label, Vector npy files</li><li>Top K Vocab Frequency Histogram png</li><li>Vocab frequency csv</li><li>Wordcloud png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>Ngram</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Text Label, Vector npy files</li><li>Top K Vocab Frequency Histogram png</li><li>Top K ngrams Piechart png</li><li>Vocab frequency csv</li><li>Wordcloud png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>TF-IDF</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Text Label, Vector npy files</li><li>TF PCA Pairplot png</li><li>TF-IDF PCA Pairplot png</li><li>IDF csv</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>HMM</strong></td>
<td>POS Tagging</td>
<td>Preprocessed words</td>
<td><ul><li>Data Analysis png (sentence length, POS tag counts)</li><li>Emission Matrix TSNE html</li><li>Emission Matrix csv</li><li>Test predictions confusion matrix, classification report png</li><li>Transition Matrix csv, png</li></ul></td>
<td>NLTK Treebank</td>
</tr>
<tr>
<td><strong>Word2Vec</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>GloVe</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Top K Co-occurrence Matrix png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>RNN</strong></td>
<td>Sentiment Classification</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Confusion Matrix png</li><li>Training History png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>LSTM</strong></td>
<td>Image Captioning</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Training History png</li></ul></td>
<td>Flickr 8k</td>
</tr>
<tr>
<td><strong>GRU</strong></td>
<td>POS Tagging</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Confusion Matrix png</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>NLTK Treebank, Brown, CoNLL2000</td>
</tr>
<tr>
<td><strong>Seq2Seq + Attention</strong></td>
<td>Machine Translation</td>
<td>Tokenization</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Source, Target Word Embeddings TSNE html</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>English to Telugu Translation</td>
</tr>
<tr>
<td><strong>Transformer</strong></td>
<td>Lyrics Generation</td>
<td>BytePairEncoding</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Token Embeddings TSNE html</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>Genius Lyrics</td>
</tr>
<tr>
<td><strong>BERT</strong></td>
<td>NSP Pretraining, QA Finetuning</td>
<td>WordPiece</td>
<td><ul><li>Best Model pt (pretrain, finetune)</li><li>Training History json (pretrain, finetune)</li><li>Token Embeddings TSNE html</li><li>Finetune Test predictions csv</li><li>Training History png (pretrain, finetune)</li></ul></td>
<td>Wiki en, SQuAD v1</td>
</tr>
<tr>
<td><strong>GPT-2</strong></td>
<td>Sentence Completion</td>
<td>BytePairEncoding</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Token Embeddings TSNE html</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>LAMBADA</td>
</tr>
</tbody></table>
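Each run writes its artifacts (model checkpoint, training history, predictions, plots) to the configured log folder, as listed in the Output column above. The sketch below shows one way to inspect such outputs; the file names here are hypothetical and depend on the algorithm and log folder.
```python
import json

import pandas as pd
import torch

log_folder = "output/gpt"  # the --log_folder passed to the run (hypothetical path)

# Model weights saved as a PyTorch checkpoint (.pt)
state = torch.load(f"{log_folder}/best_model.pt", map_location="cpu")

# Per-epoch losses/metrics saved as JSON
with open(f"{log_folder}/training_history.json") as f:
    history = json.load(f)

# Test predictions saved as CSV
predictions = pd.read_csv(f"{log_folder}/test_predictions.csv")

print(list(history.keys()))
print(predictions.head())
```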
## Examples 🌟
Run training and inference directly through an import:
```python
import yaml

from scratch_nlp.src.core.gpt import gpt

config_path = "<path to config.yaml>"  # YAML config for the GPT run

# Load the YAML config into a dictionary
with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Build the GPT pipeline and run training + inference
gpt_model = gpt.GPT(config_dict)
gpt_model.run()
```
Run through the CLI:
```bash
cd src
python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
```
## Contributing 🤝
Contributions are always welcome!
See [CONTRIBUTING.md](CONTRIBUTING.md) for ways to get started.
## Acknowledgements 💡
I referred to many online resources while creating this project; all of them are listed in [RESOURCES.md](RESOURCES.md). Thanks to everyone who created those blogs, code, and datasets 😊.
Thanks also to the [CS224N](https://web.stanford.edu/class/cs224n/) course, which gave me the motivation to start this project.
## About Me 👤
I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science and Engineering and a Micro in Artificial Intelligence and Applications.
### Connect with me
<a href="https://linktr.ee/shanmukh05" target="blank"><img src="https://raw.githubusercontent.com/shanmukh05/scratch_nlp/main/assets/connect.png" alt="@shanmukh05" width="200"/></a>
## Lessons Learned 📌
Most of the things in this project were new to me. These are my major learnings from creating it:
- NLP algorithms
- Implementing research papers
- Designing a project structure
- Documentation
- GitHub Pages
- pip packaging
## License ⚖️
[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://choosealicense.com/licenses/mit/)
## Feedback 📣
If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com.