transformers-embedder

Name: transformers-embedder
Version: 3.0.10
Home page: https://github.com/Riccorl/transformers-embedder
Summary: Word level transformer based embeddings
Upload time: 2023-05-19 08:24:14
Author: Riccardo Orlando
Requires Python: >=3.6
License: Apache
Keywords: nlp, deep learning, transformer, pytorch, bert, google, subtoken, wordpieces, embeddings

<div align="center">

# Transformers Embedder

[![Open in Visual Studio Code](https://img.shields.io/badge/preview%20in-vscode.dev-blue)](https://github.dev/Riccorl/transformers-embedder)
[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/4.28-🤗%20Transformers-6670ff)](https://huggingface.co/transformers/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)

[![Upload to PyPi](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-pypi.yml)
[![Upload to Conda](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-conda.yml/badge.svg)](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-conda.yml)
[![PyPi Version](https://img.shields.io/github/v/release/Riccorl/transformers-embedder)](https://github.com/Riccorl/transformers-embedder/releases)
[![Anaconda-Server Badge](https://anaconda.org/riccorl/transformers-embedder/badges/version.svg)](https://anaconda.org/riccorl/transformers-embedder)
[![DeepSource](https://deepsource.io/gh/Riccorl/transformers-embedder.svg/?label=active+issues)](https://deepsource.io/gh/Riccorl/transformers-embedder/?ref=repository-badge)

</div>

A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

## How to use

Install the library from [PyPI](https://pypi.org/project/transformers-embedder):

```bash
pip install transformers-embedder
```

or from [Conda](https://anaconda.org/riccorl/transformers-embedder):

```bash
conda install -c riccorl transformers-embedder
```

It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face
[🤗 Transformers](https://huggingface.co/transformers/) library. Here is a quick example:

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")

model = tre.TransformersEmbedder(
    "bert-base-cased", subword_pooling_strategy="sparse", layer_pooling_strategy="mean"
)

example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
```

```text
{
   'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650,  102]]),
   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),
   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
   'scatter_offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),
   'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0],
                [0, 1, 2, 3, 4, 5, 6],
                [0, 1, 2, 3, 4, 5, 6]
            ]
        ), 
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]), 
        'sparse_size': torch.Size([1, 7, 7])
    },
   'sentence_length': 7  # with special tokens included
}
```

```python
outputs = model(**inputs)
```

```text
# outputs.word_embeddings[:, 1:-1].shape    # without [CLS] and [SEP]
torch.Size([1, 5, 768])
# len(example.split())                      # number of words in the example
5
```

## Info

One of the annoyances of working with transformer-based models is that it is not trivial to compute word embeddings 
from the sub-token embeddings they output. With this API it is as easy as using 🤗 Transformers to get 
word-level embeddings from virtually every transformer model the library supports.

### Model

#### Subword Pooling Strategy

The `TransformersEmbedder` class offers three ways to obtain the embeddings:

- `subword_pooling_strategy="sparse"`: computes the mean of the embeddings of the sub-tokens of each word 
  (i.e. the embeddings of the sub-tokens are pooled together) using a sparse matrix multiplication. This 
  strategy is the default one.
- `subword_pooling_strategy="scatter"`: computes the mean of the embeddings of the sub-tokens of each word
  using a scatter-gather operation. It is not deterministic, but it works with ONNX export.
- `subword_pooling_strategy="none"`: returns the raw output of the transformer model without sub-token pooling.

Here is a small feature table; a short usage sketch follows it:

|             |      Pooling       |   Deterministic    |        ONNX        |
|-------------|:------------------:|:------------------:|:------------------:|
| **Sparse**  | :white_check_mark: | :white_check_mark: |        :x:         |
| **Scatter** | :white_check_mark: |        :x:         | :white_check_mark: |
| **None**    |        :x:         | :white_check_mark: | :white_check_mark: |
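
To make the three strategies concrete, here is a minimal sketch that simply instantiates the layer with each documented option (the model name is the same one used in the quick example above):

```python
import transformers_embedder as tre

# "sparse" (default): deterministic word-level pooling via sparse matrix multiplication
sparse_model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="sparse")

# "scatter": word-level pooling via a scatter operation, compatible with ONNX export
scatter_model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="scatter")

# "none": raw sub-token embeddings, no word-level pooling
subtoken_model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="none")
```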

#### Layer Pooling Strategy

There are also multiple types of output you can get through the `layer_pooling_strategy` parameter:

- `layer_pooling_strategy="last"`: returns the last hidden state of the transformer model
- `layer_pooling_strategy="concat"`: returns the concatenation of the selected `output_layers` of the  
   transformer model
- `layer_pooling_strategy="sum"`: returns the sum of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="mean"`: returns the average of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="scalar_mix"`: returns the output of a parameterised scalar mixture layer of the 
   selected `output_layers` of the transformer model

If you also want all the outputs from the Hugging Face model, you can set `return_all=True` to get them (see the sketch after the constructor signature below).

```python
# Constructor signature; `tr` here is the Hugging Face transformers library.
from typing import Tuple, Union

import torch
import transformers as tr


class TransformersEmbedder(torch.nn.Module):
    def __init__(
        self,
        model: Union[str, tr.PreTrainedModel],
        subword_pooling_strategy: str = "sparse",
        layer_pooling_strategy: str = "last",
        output_layers: Tuple[int] = (-4, -3, -2, -1),
        fine_tune: bool = True,
        return_all: bool = True,
    ):
        ...
```
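
For instance, a minimal sketch that averages a selection of layers; the layer indices are simply the defaults from the signature above, and `return_all=True` additionally exposes the raw Hugging Face outputs:

```python
import transformers_embedder as tre

# Average the last four hidden layers and also return all the raw
# Hugging Face outputs alongside the pooled word embeddings.
model = tre.TransformersEmbedder(
    "bert-base-cased",
    layer_pooling_strategy="mean",
    output_layers=(-4, -3, -2, -1),
    return_all=True,
)
```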

### Tokenizer

The `Tokenizer` class provides the `tokenize` method to preprocess the input for the `TransformersEmbedder` 
layer. You can pass raw sentences, pre-tokenized sentences, and batches of either. It preprocesses them and 
returns a dictionary with the inputs for the model. By passing `return_tensors=True`, it returns the inputs 
as `torch.Tensor`s.

By default, if you pass the text (or batch) as strings, it uses the Hugging Face tokenizer to tokenize it.

```python
text = "This is a sample sentence"
tokenizer(text)

text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
```

You can pass a pre-tokenized sentence (or batch of sentences) by setting `is_split_into_words=True`:

```python
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)

text = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True)
```

#### Examples

First, initialize the tokenizer:

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```

- You can pass a single sentence as a string:

```python
text = "This is a sample sentence"
tokenizer(text)
```

```text
{
    'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 102]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
    'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6]],
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [0, 0, 0, 0, 0, 0, 0],
                [0, 1, 2, 3, 4, 5, 6],
                [0, 1, 2, 3, 4, 5, 6]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 7, 7])
    },
    'sentence_lengths': [7],
}
```

- A sentence pair:

```python
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
```

```text
{
    'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102]],
    'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
    'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0],
                [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
                [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
            ]
        ),
        'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
        'sparse_size': torch.Size([1, 15, 15])
    },
    'sentence_lengths': [15],
}
```

- A batch of sentences or sentence pairs. With `padding=True` and `return_tensors=True`, the tokenizer 
  returns the input ready for the model:

```python
batch = [
    ["This", "is", "a", "sample", "sentence", "1"],
    ["This", "is", "sample", "sentence", "2"],
    ["This", "is", "a", "sample", "sentence", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)

batch_pair = [
    ["This", "is", "a", "sample", "sentence", "pair", "1"],
    ["This", "is", "sample", "sentence", "pair", "2"],
    ["This", "is", "a", "sample", "sentence", "pair", "3"],
    # ...
    ["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
```

#### Custom fields

It is possible to add custom fields to the model input and tell the `tokenizer` how to pad them using 
`add_padding_ops`. Start by initializing the tokenizer with the model name:

```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
```

Then add the custom fields to it:

```python
custom_fields = {
  "custom_filed_1": [
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
  ]
}
```

Now we can add the padding logic for our custom field `custom_field_1`. The `add_padding_ops` method takes as 
input:

- `key`: the name of the field in the tokenizer input
- `value`: the value to use for padding
- `length`: the length to pad to. It can be an `int`, or one of two string values: `subword`, where the element 
  is padded to match the length of the sub-tokens, or `word`, where the element is padded to the word-level 
  length of the batch after the sub-tokens are merged. The options are illustrated below.

```python
tokenizer.add_padding_ops("custom_field_1", 0, "word")
```
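
For completeness, a hedged sketch of the other `length` options; the field names below are hypothetical and only illustrate the call:

```python
# Hypothetical fields, shown only to illustrate the other `length` options.
# Pad "custom_field_2" to the sub-token length of the batch:
tokenizer.add_padding_ops("custom_field_2", 0, "subword")
# Pad "custom_field_3" to a fixed length of 10:
tokenizer.add_padding_ops("custom_field_3", 0, 10)
```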

Finally, we can tokenize the input with the custom field:

```python
text = [
    "This is a sample sentence",
    "This is another example sentence just make it longer, with a comma too!"
]

tokenizer(text, padding=True, return_tensors=True, additional_inputs=custom_fields)
```

The inputs are now ready for the model, including the custom field.

```text
>>> inputs

{
    'input_ids': tensor(
        [
            [ 101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [ 101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 117, 1114, 170, 3254, 1918, 1315, 106, 102]
        ]
    ),
    'token_type_ids': tensor(
        [
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        ]
    ), 
    'attention_mask': tensor(
        [
            [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        ]
    ),
    'scatter_offsets': tensor(
        [
            [ 0, 1, 2, 3, 4, 5, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
            [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16]
        ]
    ),
    'sparse_offsets': {
        'sparse_indices': tensor(
            [
                [ 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  1],
                [ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16],
                [ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
            ]
        ),
        'sparse_values': tensor(
            [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
            1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
            1.0000, 1.0000, 0.5000, 0.5000, 1.0000, 1.0000, 1.0000]
        ), 
        'sparse_size': torch.Size([2, 17, 18])
    },
    'sentence_lengths': [7, 17],
}
```

## Acknowledgements

Some code in the `TransformersEmbedder` class is taken from the [PyTorch Scatter](https://github.com/rusty1s/pytorch_scatter/)
library. The pretrained models and the core of the tokenizer come from [🤗 Transformers](https://huggingface.co/transformers/).

            
