# torchFastText: Efficient text classification with PyTorch
A flexible PyTorch implementation of FastText for text classification with support for categorical features.
## Features
- Text classification with the FastText architecture
- Handles both text and categorical features
- N-gram tokenization
- Flexible optimizer and scheduler options
- GPU and CPU support
- Model checkpointing and early stopping
- Prediction and model explanation capabilities
## Installation
```bash
pip install torchFastText
```
## Key Components
- `build()`: Constructs the FastText model architecture
- `train()`: Trains the model with built-in callbacks and logging
- `predict()`: Generates class predictions
- `predict_and_explain()`: Provides predictions with feature attributions
## Subpackages
- `preprocess`: utilities to preprocess text input, built on the `nltk` and `unidecode` libraries.
- `explainability`: simple methods to visualize feature attributions at the word and letter levels, using the `captum` library.
Run `pip install torchFastText[preprocess]` or `pip install torchFastText[explainability]` to install these optional dependencies.
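As an illustration of the kind of cleaning the `preprocess` subpackage performs, here is a minimal, hypothetical sketch (not the subpackage's actual API; it only assumes the `unidecode` dependency listed above):

```python
import re

from unidecode import unidecode

def clean_text(text: str) -> str:
    # Hypothetical helper: strip accents, lowercase, collapse whitespace.
    # The actual `preprocess` subpackage may expose different functions.
    text = unidecode(text).lower()
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  Ingénieur   DÉVELOPPEMENT  "))  # "ingenieur developpement"
```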
## Quick Start
```python
from torchFastText import torchFastText
# Initialize the model
model = torchFastText(
    num_tokens=1000000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=True,
    sparse=True
)

# Train the model
model.train(
    X_train=train_data,
    y_train=train_labels,
    X_val=val_data,
    y_val=val_labels,
    num_epochs=10,
    batch_size=64
)
# Make predictions
predictions = model.predict(test_data)
```
where `train_data` is an array of shape $(N, d)$ whose first column contains the text as strings, the remaining columns containing the tokenized categorical variables in `int` format, as in the toy example below.
Please make sure `y_train` contains each possible label at least once.
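A minimal way to assemble such inputs with NumPy (the values are made up for illustration):

```python
import numpy as np

# Toy inputs: the first column holds raw text, the remaining columns hold
# integer-encoded categorical variables (all values are illustrative).
train_data = np.array(
    [
        ["fresh bread and pastries", 0, 2],
        ["car repair services", 1, 0],
        ["software consulting", 1, 1],
    ],
    dtype=object,  # mixed str/int columns require an object array
)
train_labels = np.array([0, 1, 1])  # each class appears at least once
```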
## Dependencies
- PyTorch Lightning
- NumPy
## Categorical features
If categorical features are provided, each categorical feature $i$ is associated with an embedding matrix of size (number of unique values, embedding dimension), where the embedding dimension is a user-chosen hyperparameter (`categorical_embedding_dims`) that can take three types of values:
- `None`: same embedding dimension as the token embedding matrix. The categorical embeddings are added to the sentence-level embedding (which is itself an average of the token embeddings). See [Figure 1](#figure-1).
- `int`: all categorical embeddings share this embedding dimension; they are averaged, and the resulting vector is concatenated to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 2](#figure-2).
- `list`: the categorical embeddings have different embedding dimensions; all of them are concatenated, without aggregation, to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 3](#figure-3).
The default is `None`. The three strategies are illustrated in the figures and the sketch below.
<a name="figure-1"></a>

*Figure 1: The 'sum' architecture*
<a name="figure-2"></a>

*Figure 2: The 'average and concatenate' architecture*
<a name="figure-3"></a>

*Figure 3: The 'concatenate all' architecture*
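The sketch below restates the three strategies in plain PyTorch. It is an illustrative reimplementation with made-up sizes (one categorical feature with 10 levels, a 100-dimensional sentence embedding), not the library's internal code:

```python
import torch
import torch.nn as nn

sentence_emb = torch.randn(1, 100)  # average of the token embeddings
cat_values = torch.tensor([3])      # integer code of the categorical feature

# categorical_embedding_dims=None: same dimension as the tokens,
# added to the sentence-level embedding (Figure 1)
emb_none = nn.Embedding(10, 100)
out_sum = sentence_emb + emb_none(cat_values)                    # (1, 100)

# categorical_embedding_dims=20 (int): features share one dimension and are
# averaged before concatenation (Figure 2); with a single feature, the
# average is the embedding itself
emb_int = nn.Embedding(10, 20)
out_avg = torch.cat([sentence_emb, emb_int(cat_values)], dim=1)  # (1, 120)

# categorical_embedding_dims=[20] (list): each feature keeps its own
# dimension and all embeddings are concatenated (Figure 3)
emb_list = nn.Embedding(10, 20)
out_cat = torch.cat([sentence_emb, emb_list(cat_values)], dim=1)  # (1, 120)
```

In the `int` and `list` cases, the final linear layer's input size grows to accommodate the concatenated vector, as noted above.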
## Documentation
For detailed usage and examples, please refer to the [example notebook](notebooks/example.ipynb). After cloning the repository, run `pip install -r requirements.txt` to install the necessary dependencies (some are specific to the notebook).
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT
## References
Inspired by the original FastText paper [1] and implementation.
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
```bibtex
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```