# torchFastText: Efficient text classification with PyTorch
A flexible PyTorch implementation of FastText for text classification with support for categorical features.
## Features
- Text classification with the FastText architecture
- Handles both text and categorical features
- N-gram tokenization
- Flexible optimizer and scheduler options
- GPU and CPU support
- Model checkpointing and early stopping
- Prediction and model explanation capabilities
## Installation
```bash
pip install torchFastText
```
## Key Components
- `build()`: Constructs the FastText model architecture
- `train()`: Trains the model with built-in callbacks and logging
- `predict()`: Generates class predictions
- `predict_and_explain()`: Provides predictions with feature attributions
## Subpackages
- `preprocess`: utilities to preprocess text input, built on the `nltk` and `unidecode` libraries.
- `explainability`: simple methods to visualize feature attributions at the word and letter levels, using the `captum` library.
Run `pip install torchFastText[preprocess]` or `pip install torchFastText[explainability]` to install these optional dependencies.
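As an illustration of the kind of cleaning the `preprocess` subpackage performs, here is a minimal, hypothetical sketch (not the subpackage's actual API; it only assumes the `unidecode` dependency listed above):

```python
import re

from unidecode import unidecode

def clean_text(text: str) -> str:
    # Hypothetical helper: strip accents, lowercase, collapse whitespace.
    # The actual `preprocess` subpackage may expose different functions.
    text = unidecode(text).lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Ingénieur   DÉVELOPPEMENT  "))  # "ingenieur developpement"
```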
## Quick Start
```python
from torchFastText import torchFastText
# Initialize the model
model = torchFastText(
    num_tokens=1000000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=True,
    sparse=True
)

# Train the model
model.train(
    X_train=train_data,
    y_train=train_labels,
    X_val=val_data,
    y_val=val_labels,
    num_epochs=10,
    batch_size=64
)
# Make predictions
predictions = model.predict(test_data)
```
where `train_data` is an array of shape $(N, d)$ whose first column contains the text as strings, the remaining columns containing the tokenized categorical variables in `int` format, as in the toy example below.
Please make sure `y_train` contains each possible label at least once.
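A minimal way to assemble such inputs with NumPy (the values are made up for illustration):

```python
import numpy as np

# Toy inputs: the first column holds raw text, the remaining columns hold
# integer-encoded categorical variables (all values are illustrative).
train_data = np.array(
    [
        ["fresh bread and pastries", 0, 2],
        ["car repair services", 1, 0],
        ["software consulting", 1, 1],
    ],
    dtype=object,  # mixed str/int columns require an object array
)
train_labels = np.array([0, 1, 1])  # each class appears at least once
```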
## Dependencies
- PyTorch Lightning
- NumPy
## Categorical features
If categorical features are provided, each categorical feature $i$ is associated with an embedding matrix of size (number of unique values, embedding dimension), where the embedding dimension is a user-chosen hyperparameter (`categorical_embedding_dims`) that can take three types of values:
- `None`: same embedding dimension as the token embedding matrix. The categorical embeddings are added to the sentence-level embedding (which is itself an average of the token embeddings). See [Figure 1](#figure-1).
- `int`: all categorical embeddings share this embedding dimension; they are averaged, and the resulting vector is concatenated to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 2](#figure-2).
- `list`: the categorical embeddings have different embedding dimensions; all of them are concatenated, without aggregation, to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 3](#figure-3).
The default is `None`. The three strategies are illustrated in the figures and the sketch below.
<a name="figure-1"></a>

*Figure 1: The 'sum' architecture*
<a name="figure-2"></a>

*Figure 2: The 'average and concatenate' architecture*
<a name="figure-3"></a>

*Figure 3: The 'concatenate all' architecture*
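The sketch below restates the three strategies in plain PyTorch. It is an illustrative reimplementation with made-up sizes (one categorical feature with 10 levels, a 100-dimensional sentence embedding), not the library's internal code:

```python
import torch
import torch.nn as nn

sentence_emb = torch.randn(1, 100)  # average of the token embeddings
cat_values = torch.tensor([3])      # integer code of the categorical feature

# categorical_embedding_dims=None: same dimension as the tokens,
# added to the sentence-level embedding (Figure 1)
emb_none = nn.Embedding(10, 100)
out_sum = sentence_emb + emb_none(cat_values)                    # (1, 100)

# categorical_embedding_dims=20 (int): features share one dimension and are
# averaged before concatenation (Figure 2); with a single feature, the
# average is the embedding itself
emb_int = nn.Embedding(10, 20)
out_avg = torch.cat([sentence_emb, emb_int(cat_values)], dim=1)  # (1, 120)

# categorical_embedding_dims=[20] (list): each feature keeps its own
# dimension and all embeddings are concatenated (Figure 3)
emb_list = nn.Embedding(10, 20)
out_cat = torch.cat([sentence_emb, emb_list(cat_values)], dim=1)  # (1, 120)
```

In the `int` and `list` cases, the final linear layer's input size grows to accommodate the concatenated vector, as noted above.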
## Documentation
For detailed usage and examples, please refer to the [example notebook](notebooks/example.ipynb). After cloning the repository, run `pip install -r requirements.txt` to install the necessary dependencies (some are specific to the notebook).
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT
## References
Inspired by the original FastText paper [1] and implementation.
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
```bibtex
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```