![Logo](https://raw.githubusercontent.com/shanmukh05/scratch_nlp/main/assets/logo.png)
# Scratch NLP 🧠
A library of foundational NLP algorithms implemented from scratch using PyTorch.
## Table of Contents 📋
- [Documentation](#documentation-📝)
- [Installation](#installation-⬇️)
- [Features](#features-🛠️)
- [Examples](#examples-🌟)
- [Contributing](#contributing-🤝)
- [Acknowledgements](#acknowledgements-💡)
- [About Me](#about-me-👤)
- [Lessons Learned](#lessons-learned-📌)
- [License](#license-⚖️)
- [Feedback](#feedback-📣)
## Documentation 📝
[Documentation](https://shanmukh05.github.io/scratch_nlp/)
## Installation ⬇️
### Install using pip
```bash
pip install scratch-nlp
```
### Install Manually for development
Clone the repo
```bash
gh repo clone shanmukh05/scratch_nlp
```
Install dependencies
```bash
pip install -r requirements.txt
```
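After installing, a quick sanity check is to import the package (assuming the installed top-level package is importable as `scratch_nlp`, matching the import used in the Examples section below):
```python
# Verify the installation by importing the top-level package
import scratch_nlp

print(scratch_nlp.__name__)
```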
## Features 🛠️
- Algorithms
- Bag of Words (see the sketch after this list)
- Ngram
- TF-IDF
- Hidden Markov Model
- Word2Vec
- GloVe
- RNN (Many to One)
- LSTM (One to Many)
- GRU (Many to Many Synced)
- Seq2Seq + Attention (Many to Many)
- Transformer
- BERT
- GPT-2
- Tokenization
- Byte Pair Encoding
- WordPiece Tokenizer
- Metrics
- BLEU
- ROUGE (-N, -L, -S)
- Perplexity
- METEOR
- CIDEr
- Datasets
- IMDB Reviews Dataset
- Flickr Dataset
- NLTK POS Datasets (treebank, brown, conll2000)
- SQuAD QA Dataset
- Genius Lyrics Dataset
- LAMBADA Dataset
- Wiki en dataset
- English to Telugu Translation Dataset
- Tasks
- Sentiment Classification
- POS Tagging
- Image Captioning
- Machine Translation
- Question Answering
- Text Generation
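As a flavour of the from-scratch approach behind the algorithms listed above, here is a minimal Bag of Words sketch. It is an illustration only, not the library's API; the library's implementation adds preprocessing, frequency analysis, and the outputs listed in the table that follows.
```python
from collections import Counter


def bag_of_words(docs, vocab_size=5):
    """Encode each document as a vector of word counts over a
    corpus-level vocabulary (illustrative sketch, not the library API)."""
    tokenized = [doc.lower().split() for doc in docs]
    counts = Counter(word for tokens in tokenized for word in tokens)
    vocab = [word for word, _ in counts.most_common(vocab_size)]
    index = {word: i for i, word in enumerate(vocab)}

    vectors = []
    for tokens in tokenized:
        vec = [0] * len(vocab)
        for word in tokens:
            if word in index:
                vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors


docs = ["the movie was great", "the movie was awful", "great acting"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # most frequent words, e.g. ['the', 'movie', 'was', 'great', 'awful']
print(vectors)  # one count vector per document
```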
### Implementation Details
<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Task</th>
<th>Tokenization</th>
<th>Output</th>
<th>Dataset</th>
</tr>
</thead>
<tbody><tr>
<td><strong>BOW</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Text Label, Vector npy files</li><li>Top K Vocab Frequency Histogram png</li><li>Vocab frequency csv</li><li>Wordcloud png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>Ngram</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Text Label, Vector npy files</li><li>Top K Vocab Frequency Histogram png</li><li>Top K ngrams Piechart png</li><li>Vocab frequency csv</li><li>Wordcloud png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>TF-IDF</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Text Label, Vector npy files</li><li>TF PCA Pairplot png</li><li>TF-IDF PCA Pairplot png</li><li>IDF csv</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>HMM</strong></td>
<td>POS Tagging</td>
<td>Preprocessed words</td>
<td><ul><li>Data Analysis png (sentence length, POS tag counts)</li><li>Emission Matrix TSNE html</li><li>Emission Matrix csv</li><li>Test predictions confusion matrix, classification report png</li><li>Transition Matrix csv, png</li></ul></td>
<td>NLTK Treebank</td>
</tr>
<tr>
<td><strong>Word2Vec</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>GloVe</strong></td>
<td>Text Representation</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Top K Co-occurrence Matrix png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>RNN</strong></td>
<td>Sentiment Classification</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Confusion Matrix png</li><li>Training History png</li></ul></td>
<td>IMDB Reviews</td>
</tr>
<tr>
<td><strong>LSTM</strong></td>
<td>Image Captioning</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Training History png</li></ul></td>
<td>Flickr 8k</td>
</tr>
<tr>
<td><strong>GRU</strong></td>
<td>POS Tagging</td>
<td>Preprocessed words</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Word Embeddings TSNE html</li><li>Confusion Matrix png</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>NLTK Treebank, Brown, CoNLL2000</td>
</tr>
<tr>
<td><strong>Seq2Seq + Attention</strong></td>
<td>Machine Translation</td>
<td>Tokenization</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Source, Target Word Embeddings TSNE html</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>English to Telugu Translation</td>
</tr>
<tr>
<td><strong>Transformer</strong></td>
<td>Lyrics Generation</td>
<td>BytePairEncoding</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Token Embeddings TSNE html</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>Genius Lyrics</td>
</tr>
<tr>
<td><strong>BERT</strong></td>
<td>NSP Pretraining, QA Finetuning</td>
<td>WordPiece</td>
<td><ul><li>Best Model pt (pretrain, finetune)</li><li>Training History json (pretrain, finetune)</li><li>Token Embeddings TSNE html</li><li>Finetune Test predictions csv</li><li>Training History png (pretrain, finetune)</li></ul></td>
<td>Wiki en, SQuAD v1</td>
</tr>
<tr>
<td><strong>GPT-2</strong></td>
<td>Sentence Completion</td>
<td>BytePairEncoding</td>
<td><ul><li>Best Model pt</li><li>Training History json</li><li>Token Embeddings TSNE html</li><li>Test predictions csv</li><li>Training History png</li></ul></td>
<td>LAMBADA</td>
</tr>
</tbody></table>
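Each run writes its artifacts (model checkpoint, training history, predictions, plots) to the configured log folder, as listed in the Output column above. The sketch below shows one way to inspect such outputs; the file names here are hypothetical and depend on the algorithm and log folder.
```python
import json

import pandas as pd
import torch

log_folder = "output/gpt"  # the --log_folder passed to the run (hypothetical path)

# Model weights saved as a PyTorch checkpoint (.pt)
state = torch.load(f"{log_folder}/best_model.pt", map_location="cpu")

# Per-epoch losses/metrics saved as JSON
with open(f"{log_folder}/training_history.json") as f:
    history = json.load(f)

# Test predictions saved as CSV
predictions = pd.read_csv(f"{log_folder}/test_predictions.csv")

print(list(history.keys()))
print(predictions.head())
```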
## Examples 🌟
Run training and inference directly through an import:
```python
import yaml

from scratch_nlp.src.core.gpt import gpt

config_path = "<path to config.yaml>"  # YAML config for the GPT run

# Load the YAML config into a dictionary
with open(config_path, "r") as stream:
    config_dict = yaml.safe_load(stream)

# Build the GPT pipeline and run training + inference
gpt_model = gpt.GPT(config_dict)
gpt_model.run()
```
Run through the CLI:
```bash
cd src
python main.py --config_path '<config_path>' --algo '<algo name>' --log_folder '<output folder>'
```
## Contributing 🤝
Contributions are always welcome!
See [CONTRIBUTING.md](CONTRIBUTING.md) for ways to get started.
## Acknowledgements 💡
I referred to many online resources while creating this project; all of them are listed in [RESOURCES.md](RESOURCES.md). Thanks to everyone who created those blogs, code, and datasets 😊.
Thanks also to the [CS224N](https://web.stanford.edu/class/cs224n/) course, which gave me the motivation to start this project.
## About Me 👤
I am Shanmukha Sainath, working as an AI Engineer at KLA Corporation. I completed my Bachelor's in the Department of Electronics and Electrical Communication Engineering at IIT Kharagpur, with a Minor in Computer Science and Engineering and a Micro in Artificial Intelligence and Applications.
### Connect with me
<a href="https://linktr.ee/shanmukh05" target="blank"><img src="https://raw.githubusercontent.com/shanmukh05/scratch_nlp/main/assets/connect.png" alt="@shanmukh05" width="200"/></a>
## Lessons Learned 📌
Most of the things in this project were new to me. These are my major learnings from creating it:
- NLP algorithms
- Implementing research papers
- Designing a project structure
- Documentation
- GitHub Pages
- pip packaging
## License ⚖️
[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://choosealicense.com/licenses/mit/)
## Feedback 📣
If you have any feedback, please reach out to me at venkatashanmukhasainathg@gmail.com.