belt-nlp


- Name: belt-nlp
- Version: 1.1.0
- Summary: BELT (BERT For Longer Texts). BERT-based text classification model for processing texts longer than 512 tokens.
- Upload time: 2024-06-19 12:48:26
- Requires Python: >=3.9
- Keywords: nlp, natural-language-processing, text-classification, transformers, transfer-learning, bert, pytorch, machine-learning, deep-learning, roberta

# **BELT** (**BE**RT For **L**onger **T**exts)

🚀**New in version 1.1.0: support for multilabel and regression**. See [the examples](#examples)🚀

## Project description and motivation

### The BELT approach

The BERT model can process texts of at most 512 tokens (roughly speaking, tokens correspond to words or word pieces). This limit is a consequence of the model architecture and cannot be directly adjusted. A discussion of this issue can be found [here](https://github.com/google-research/bert/issues/27). A method to overcome it was proposed by Jacob Devlin (one of the authors of BERT) in that discussion: [comment](https://github.com/google-research/bert/issues/27#issuecomment-435265194). The main goal of our project is to implement this method and allow the BERT model to process longer texts during both prediction and fine-tuning. We dub this approach BELT (**BE**RT For **L**onger **T**exts).

More technical details are described in the [documentation](https://mim-solutions.github.io/bert_for_longer_texts/). We also prepared a comprehensive two-part blog post: [part 1](https://www.mim.ai/fine-tuning-bert-model-for-arbitrarily-long-texts-part-1/), [part 2](https://www.mim.ai/fine-tuning-bert-model-for-arbitrarily-long-texts-part-2/).
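
The idea behind the approach can be sketched in a few lines: split the token sequence into chunks that fit the model, score each chunk, and pool the chunk scores into one prediction. The sketch below is only an illustration under stated assumptions, not the library's implementation: the names `split_into_chunks`, `pooled_score`, `chunk_size` and `stride` are hypothetical, the default of 510 assumes that two of the 512 positions are reserved for the `[CLS]` and `[SEP]` special tokens, and `score_chunk` stands in for a fine-tuned BERT model.

```python
from statistics import mean
from typing import Callable, List, Sequence


def split_into_chunks(
    token_ids: Sequence[int], chunk_size: int = 510, stride: int = 510
) -> List[List[int]]:
    """Slide a window of `chunk_size` tokens over the text, moving `stride` tokens per step."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(list(token_ids[start:start + chunk_size]))
        if start + chunk_size >= len(token_ids):
            break  # the window has already reached the end of the text
    return chunks


def pooled_score(
    token_ids: Sequence[int],
    score_chunk: Callable[[List[int]], float],  # hypothetical stand-in for the model
    pool: Callable[[List[float]], float] = mean,
    chunk_size: int = 510,
    stride: int = 510,
) -> float:
    """Score every chunk separately and pool the results into a single number."""
    chunks = split_into_chunks(token_ids, chunk_size, stride)
    if not chunks:
        raise ValueError("cannot score an empty text")
    return pool([score_chunk(chunk) for chunk in chunks])


tokens = list(range(1200))  # a text of 1200 tokens, too long for a single pass
print([len(c) for c in split_into_chunks(tokens)])              # [510, 510, 180]
print([len(c) for c in split_into_chunks(tokens, stride=255)])  # [510, 510, 510, 435]
```

A stride smaller than the chunk size produces overlapping chunks, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk. Passing `pool=max` instead of the default mean lets a single strongly scored chunk decide the prediction.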

### Attention is all you need, but 512 words is all you have

The 512-token limit of the BERT model goes back to the very beginning of transformer models. Indeed, the self-attention mechanism of the transformer, introduced in the groundbreaking 2017 paper [Attention is all you need](https://arxiv.org/abs/1706.03762), scales quadratically with the sequence length. Unlike RNN or CNN models, which can process sequences of arbitrary length, transformers with full attention (like BERT) are infeasible (or very expensive) to run on long sequences.
To overcome this issue, alternative approaches with sparse attention mechanisms were proposed in 2020: [BigBird](https://arxiv.org/abs/2007.14062) and [Longformer](https://arxiv.org/abs/2004.05150).
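
A quick back-of-the-envelope calculation shows what quadratic scaling means in practice. The sketch below counts only the entries of one attention score matrix (per head, per layer) and ignores every other cost; the function names are hypothetical, and the chunked variant assumes non-overlapping chunks of 512 tokens.

```python
def attention_entries(sequence_length: int) -> int:
    """Entries in one attention score matrix: every token attends to every token."""
    return sequence_length * sequence_length


def chunked_attention_entries(total_tokens: int, chunk_size: int = 512) -> int:
    """Entries when the text is split into non-overlapping chunks processed separately."""
    full_chunks, remainder = divmod(total_tokens, chunk_size)
    return full_chunks * attention_entries(chunk_size) + attention_entries(remainder)


for n in (512, 2048, 4096):
    print(n, attention_entries(n), chunked_attention_entries(n))
# 512 262144 262144
# 2048 4194304 1048576
# 4096 16777216 2097152
```

An 8 times longer text costs 64 times more with full attention, but only 8 times more when it is processed chunk by chunk.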

### BELT vs. BigBird vs. Longformer

Let us now clarify the key differences between the BELT approach to fine-tuning and the sparse attention models BigBird and Longformer:
- The main difference is that BigBird and Longformer are not modified BERTs. They are models with different architectures. Hence, they need to be pre-trained from scratch or downloaded as dedicated pre-trained models.
- This leads to the main advantage of the BELT approach: it works with any pre-trained BERT or RoBERTa model. A quick look at the HuggingFace Hub shows about 100 times more resources for [BERT](https://huggingface.co/models?other=bert) than for [Longformer](https://huggingface.co/models?other=longformer), so it might be easier to find a model appropriate for a specific task or language.
- On the other hand, we have not run any benchmark tests yet. We believe that a comparison of the BELT approach with sparse-attention models would be very instructive. Some work in this direction was done in the 2022 paper [Extend and Explain: Interpreting Very Long Language Models](https://proceedings.mlr.press/v193/stremmel22a/stremmel22a.pdf). The authors cited our implementation under its former name `roberta_for_longer_texts`. We encourage more research in this direction.

## Installation and dependencies

The project requires Python 3.9+ to run. We recommend training the models on a GPU, so you need to install a `torch` build compatible with your machine. The right build depends on the GPU driver: first, check the driver version with the command `nvidia-smi`, then choose the newest CUDA version compatible with that driver according to [this table](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) (e.g. 11.1). Finally, install the `torch` build that matches this CUDA version; [this page](https://pytorch.org/get-started/previous-versions/) lists which torch versions are compatible with each CUDA version.

Another option is to use the CPU-only version of torch:
```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
Next, install the package via pip (the recommended way):
```
pip3 install belt-nlp
```

If you want to clone the repo in order to run tests or notebooks, you can use the `requirements.txt` file.
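
After installation, it can be worth checking which device `torch` actually sees before starting a long fine-tuning run. The helper below is a minimal sketch; the function name `detect_torch_device` is hypothetical, while `torch.cuda.is_available()` is the standard PyTorch call.

```python
def detect_torch_device() -> str:
    """Return "cuda" if torch sees a GPU, "cpu" if it does not, "missing" if torch is not installed."""
    try:
        import torch  # imported lazily so the check also works before torch is installed
    except ImportError:
        return "missing"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(detect_torch_device())
```

If this prints `cpu` on a machine with a GPU, the installed `torch` build most likely does not match the CUDA version supported by the driver.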

## Model classes

Two main classes are implemented:
- `BertClassifierTruncated` - base binary classification model, longer texts are truncated to 512 tokens
- `BertClassifierWithPooling` - extended model for longer texts (more details in the [documentation](https://mim-solutions.github.io/bert_for_longer_texts/))

## Interface

The main methods are:
- `fit` - fine-tune the model on the training set, given a list of raw texts and a list of labels.
- `predict_classes` - calculate the list of predicted classes for the given list of raw texts. The model must be fine-tuned first.
- `predict_scores` - calculate the list of probabilities for the given list of raw texts. The model must be fine-tuned first.
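
The calling pattern implied by these three methods can be illustrated with a toy stand-in. The class below is NOT part of belt-nlp and does not use BERT at all: `KeywordClassifier`, the `TextClassifier` protocol, the type annotations and the 0.5 threshold are assumptions made purely to show the order of calls (fit first, then predict) on binary labels.

```python
from typing import List, Optional, Protocol, Set


class TextClassifier(Protocol):
    """The three-method interface described above (argument types are assumptions)."""

    def fit(self, x_train: List[str], y_train: List[bool]) -> None: ...
    def predict_scores(self, x: List[str]) -> List[float]: ...
    def predict_classes(self, x: List[str]) -> List[bool]: ...


class KeywordClassifier:
    """Toy stand-in: scores a text by the fraction of words seen only in positive examples."""

    def __init__(self) -> None:
        self.positive_words: Optional[Set[str]] = None  # None means "not fitted yet"

    def fit(self, x_train: List[str], y_train: List[bool]) -> None:
        positive = {w for text, y in zip(x_train, y_train) if y for w in text.lower().split()}
        negative = {w for text, y in zip(x_train, y_train) if not y for w in text.lower().split()}
        self.positive_words = positive - negative

    def predict_scores(self, x: List[str]) -> List[float]:
        if self.positive_words is None:
            raise RuntimeError("The model must be fitted before prediction")
        scores = []
        for text in x:
            words = text.lower().split()
            hits = sum(word in self.positive_words for word in words)
            scores.append(hits / len(words) if words else 0.0)
        return scores

    def predict_classes(self, x: List[str]) -> List[bool]:
        # Classes are derived from scores with a fixed threshold (an assumption).
        return [score >= 0.5 for score in self.predict_scores(x)]


model: TextClassifier = KeywordClassifier()
model.fit(["great film", "awful film"], [True, False])
print(model.predict_scores(["great great", "awful plot"]))   # [1.0, 0.0]
print(model.predict_classes(["great great", "awful plot"]))  # [True, False]
```

With the real classes, the same sequence of calls applies; only the constructor arguments and the model behind the methods differ.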

## Loading the pre-trained model
 
By default, the standard English `bert-base-uncased` model is used as the pre-trained model. However, it is possible to use any BERT or RoBERTa model. To do this, use the parameter `pretrained_model_name_or_path`.
It can be either:
- a string with the name of a pre-trained model to download from the Hugging Face Hub, e.g.: `roberta-base`.
- a path to a directory with a downloaded model, e.g.: `./my_model_directory/`.

## Tests
To make sure everything works properly, run the command `pytest tests -rA`. By default, the tests train models on small samples on the CPU.

## Examples

All examples use public datasets from the Hugging Face Hub.

### Binary classification - prediction of sentiment of IMDB reviews
- [standard approach](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/binary_classification/base.ipynb)
- [belt](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/binary_classification/belt.ipynb)

### Multilabel classification - recognizing authors of Guardian articles
- [standard approach](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/multiclass/base.ipynb)
- [belt](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/multiclass/belt.ipynb)
- **Notice the effectiveness of the BELT approach here: the test accuracy increased by 10%.**

### Regression - prediction of a 1 to 5 rating based on reviews from the Polish e-commerce platform Allegro
- [standard approach](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/regression/base.ipynb)
- [belt](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/notebooks/regression/belt.ipynb)

## Contributors
The project was created at [MIM AI](https://www.mim.ai/) by:
- [Michał Brzozowski](https://github.com/MichalBrzozowski91) 
- [Marek Wachnicki](https://github.com/mwachnicki)

If you want to contribute to the library, see the [contributing info](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/CONTRIBUTING.md).

## Version history

See [CHANGELOG.md](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/CHANGELOG.md).

## License
See the [LICENSE](https://github.com/mim-solutions/bert_for_longer_texts/blob/main/LICENSE.txt) file for license rights and limitations (MIT).

## For Maintainers

The `requirements.txt` file can be updated using the command:
```
bash pip-freeze-without-torch.sh > requirements.txt
```
This script saves all dependencies of the currently active environment except `torch`.

To publish the next version of the package on PyPI, follow these steps:
- First, increment the package version in `pyproject.toml`.
- Then build the new version: run `python3.9 -m build` from the main folder.
- Finally, upload the two newly created files to PyPI: `twine upload dist/*`.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "belt-nlp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "nlp, natural-language-processing, text-classification, transformers, transfer-learning, bert, pytorch, machine-learning, deep-learning, roberta",
    "author": null,
    "author_email": "Micha\u0142 Brzozowski <michal.brzozowski@mim.ai>, Marek Wachnicki <marek.wachnicki@mim.ai>",
    "download_url": "https://files.pythonhosted.org/packages/08/47/e08f67fc1fcea44d6a7e2f664771b27bce9ceb10667438018b8be760b0b9/belt_nlp-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "BELT (BERT For Longer Texts). BERT-based text classification model for processing texts longer than 512 tokens.",
    "version": "1.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/mim-solutions/bert_for_longer_texts/issues",
        "Homepage": "https://github.com/mim-solutions/bert_for_longer_texts"
    },
    "split_keywords": [
        "nlp",
        " natural-language-processing",
        " text-classification",
        " transformers",
        " transfer-learning",
        " bert",
        " pytorch",
        " machine-learning",
        " deep-learning",
        " roberta"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7d1857368e304224b606db37a1113bb519a29750b369523a04cd7ffc51cf4802",
                "md5": "53d5713b75f977fb6f300f33180aeb89",
                "sha256": "970b1d64629448e28d0d0667b6c72866f96e0aaa88b9f78a54a9d6b95203853a"
            },
            "downloads": -1,
            "filename": "belt_nlp-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "53d5713b75f977fb6f300f33180aeb89",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 16293,
            "upload_time": "2024-06-19T12:48:21",
            "upload_time_iso_8601": "2024-06-19T12:48:21.520752Z",
            "url": "https://files.pythonhosted.org/packages/7d/18/57368e304224b606db37a1113bb519a29750b369523a04cd7ffc51cf4802/belt_nlp-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0847e08f67fc1fcea44d6a7e2f664771b27bce9ceb10667438018b8be760b0b9",
                "md5": "7d0157032728e432a54f21575af6fbe3",
                "sha256": "fbd4b5aed6987607f3a4982930dd8f7e3f2fc6b0f32a716d456be26f439d3887"
            },
            "downloads": -1,
            "filename": "belt_nlp-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "7d0157032728e432a54f21575af6fbe3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 20625,
            "upload_time": "2024-06-19T12:48:26",
            "upload_time_iso_8601": "2024-06-19T12:48:26.679802Z",
            "url": "https://files.pythonhosted.org/packages/08/47/e08f67fc1fcea44d6a7e2f664771b27bce9ceb10667438018b8be760b0b9/belt_nlp-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-19 12:48:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mim-solutions",
    "github_project": "bert_for_longer_texts",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "belt-nlp"
}
        