<div align="center">
# Transformers Embedder
[![Open in Visual Studio Code](https://img.shields.io/badge/preview%20in-vscode.dev-blue)](https://github.dev/Riccorl/transformers-embedder)
[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/4.28-🤗%20Transformers-6670ff)](https://huggingface.co/transformers/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)
[![Upload to PyPi](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-pypi.yml)
[![Upload to Conda](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-conda.yml/badge.svg)](https://github.com/Riccorl/transformers-embedder/actions/workflows/python-publish-conda.yml)
[![PyPi Version](https://img.shields.io/github/v/release/Riccorl/transformers-embedder)](https://github.com/Riccorl/transformers-embedder/releases)
[![Anaconda-Server Badge](https://anaconda.org/riccorl/transformers-embedder/badges/version.svg)](https://anaconda.org/riccorl/transformers-embedder)
[![DeepSource](https://deepsource.io/gh/Riccorl/transformers-embedder.svg/?label=active+issues)](https://deepsource.io/gh/Riccorl/transformers-embedder/?ref=repository-badge)
</div>
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.
## How to use
Install the library from [PyPI](https://pypi.org/project/transformers-embedder):
```bash
pip install transformers-embedder
```
or from [Conda](https://anaconda.org/riccorl/transformers-embedder):
```bash
conda install -c riccorl transformers-embedder
```
It offers a PyTorch layer and a tokenizer that support almost every pretrained model from the Hugging Face
[🤗 Transformers](https://huggingface.co/transformers/) library. Here is a quick example:
```python
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformersEmbedder(
"bert-base-cased", subword_pooling_strategy="sparse", layer_pooling_strategy="mean"
)
example = "This is a sample sentence"
inputs = tokenizer(example, return_tensors=True)
```
```text
{
'input_ids': tensor([[ 101, 1188, 1110, 170, 6876, 5650, 102]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]),
'scatter_offsets': tensor([[0, 1, 2, 3, 4, 5, 6]]),
'sparse_offsets': {
'sparse_indices': tensor(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 7, 7])
},
'sentence_lengths': [7],  # with special tokens included
}
```
```python
outputs = model(**inputs)
```
```text
# outputs.word_embeddings[:, 1:-1].shape  # remove [CLS] and [SEP]
torch.Size([1, 5, 768])
# len(example.split())
5
```
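The word-level output can be plugged directly into a downstream module. Below is a minimal sketch, assuming BERT-base's hidden size of 768 and an illustrative token-classification head (`WordTagger` and `num_labels` are not part of the library):
```python
import torch
import transformers_embedder as tre


class WordTagger(torch.nn.Module):
    """Illustrative linear tagger on top of word-level embeddings."""

    def __init__(self, num_labels: int = 5):
        super().__init__()
        self.embedder = tre.TransformersEmbedder(
            "bert-base-cased", subword_pooling_strategy="sparse"
        )
        # 768 is bert-base-cased's hidden size (assumed here for brevity)
        self.classifier = torch.nn.Linear(768, num_labels)

    def forward(self, **inputs):
        # word_embeddings has shape (batch, words, hidden), special tokens included
        word_embeddings = self.embedder(**inputs).word_embeddings
        return self.classifier(word_embeddings)
```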
## Info
One of the annoyances of using transformer-based models is that it is not trivial to compute word embeddings
from the sub-token embeddings they output. With this API it is as easy as using 🤗 Transformers to get
word-level embeddings from, in principle, every transformer model it supports.
### Model
#### Subword Pooling Strategy
The `TransformersEmbedder` class offers 3 ways to get the embeddings:
- `subword_pooling_strategy="sparse"`: computes the mean of the embeddings of the sub-tokens of each word
(i.e. the embeddings of the sub-tokens are pooled together) using a sparse matrix multiplication. This
strategy is the default one.
- `subword_pooling_strategy="scatter"`: computes the mean of the embeddings of the sub-tokens of each word
using a scatter-gather operation. It is not deterministic, but it works with ONNX export.
- `subword_pooling_strategy="none"`: returns the raw output of the transformer model without sub-token pooling.
Here is a quick feature comparison:
| | Pooling | Deterministic | ONNX |
|-------------|:------------------:|:------------------:|:------------------:|
| **Sparse** | :white_check_mark: | :white_check_mark: | :x: |
| **Scatter** | :white_check_mark: | :x: | :white_check_mark: |
| **None** | :x: | :white_check_mark: | :white_check_mark: |
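Selecting one of the strategies above is just a constructor argument; for example:
```python
import transformers_embedder as tre

# Default: deterministic sparse-matrix pooling (not ONNX-exportable)
sparse_model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="sparse")

# Scatter-based pooling: ONNX-friendly, but not deterministic
scatter_model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="scatter")

# No pooling: raw sub-token embeddings straight from the transformer
raw_model = tre.TransformersEmbedder("bert-base-cased", subword_pooling_strategy="none")
```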
#### Layer Pooling Strategy
There are also multiple types of output you can get through the `layer_pooling_strategy` parameter:
- `layer_pooling_strategy="last"`: returns the last hidden state of the transformer model
- `layer_pooling_strategy="concat"`: returns the concatenation of the selected `output_layers` of the
transformer model
- `layer_pooling_strategy="sum"`: returns the sum of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="mean"`: returns the average of the selected `output_layers` of the transformer model
- `layer_pooling_strategy="scalar_mix"`: returns the output of a parameterised scalar mixture layer of the
selected `output_layers` of the transformer model
If you also want all the outputs from the Hugging Face model, you can set `return_all=True` to get them (a usage sketch follows the constructor signature below).
```python
class TransformersEmbedder(torch.nn.Module):
def __init__(
self,
model: Union[str, tr.PreTrainedModel],
subword_pooling_strategy: str = "sparse",
layer_pooling_strategy: str = "last",
output_layers: Tuple[int] = (-4, -3, -2, -1),
fine_tune: bool = True,
return_all: bool = True,
)
```
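For instance, a minimal sketch that mean-pools the last four layers and keeps all the Hugging Face outputs:
```python
import transformers_embedder as tre

tokenizer = tre.Tokenizer("bert-base-cased")
model = tre.TransformersEmbedder(
    "bert-base-cased",
    subword_pooling_strategy="sparse",
    layer_pooling_strategy="mean",
    output_layers=(-4, -3, -2, -1),
    return_all=True,
)

inputs = tokenizer("This is a sample sentence", return_tensors=True)
outputs = model(**inputs)
word_embeddings = outputs.word_embeddings  # (batch, words, hidden), averaged over the last 4 layers
```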
### Tokenizer
The `Tokenizer` class preprocesses the input for the `TransformersEmbedder` layer, either by calling it
directly (as in the examples below) or through its `tokenize` method. You can pass raw sentences,
pre-tokenized sentences, and batches of either; it returns a dictionary with the inputs for the model.
By passing `return_tensors=True`, the inputs are returned as `torch.Tensor`s.
By default, if you pass the text (or batch) as strings, it uses the Hugging Face tokenizer to tokenize them.
```python
text = "This is a sample sentence"
tokenizer(text)
text = ["This is a sample sentence", "This is another sample sentence"]
tokenizer(text)
```
You can pass a pre-tokenized sentence (or batch of sentences) by setting `is_split_into_words=True`:
```python
text = ["This", "is", "a", "sample", "sentence"]
tokenizer(text, is_split_into_words=True)
text = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
]
tokenizer(text, is_split_into_words=True)
```
#### Examples
First, initialize the tokenizer:
```python
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
```
- You can pass a single sentence as a string:
```python
text = "This is a sample sentence"
tokenizer(text)
```
```text
{
'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1]],
'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6]],
'sparse_offsets': {
'sparse_indices': tensor(
[
[0, 0, 0, 0, 0, 0, 0],
[0, 1, 2, 3, 4, 5, 6],
[0, 1, 2, 3, 4, 5, 6]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 7, 7])
},
'sentence_lengths': [7],
}
```
- A sentence pair:
```python
text = "This is a sample sentence A"
text_pair = "This is a sample sentence B"
tokenizer(text, text_pair)
```
```text
{
'input_ids': [[101, 1188, 1110, 170, 6876, 5650, 138, 102, 1188, 1110, 170, 6876, 5650, 139, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'scatter_offsets': [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]],
'sparse_offsets': {
'sparse_indices': tensor(
[
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
]
),
'sparse_values': tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
'sparse_size': torch.Size([1, 15, 15])
},
'sentence_lengths': [15],
}
```
- A batch of sentences or sentence pairs. Using `padding=True` and `return_tensors=True`, the tokenizer
returns the inputs ready for the model:
```python
batch = [
["This", "is", "a", "sample", "sentence", "1"],
["This", "is", "sample", "sentence", "2"],
["This", "is", "a", "sample", "sentence", "3"],
# ...
["This", "is", "a", "sample", "sentence", "n", "for", "batch"],
]
tokenizer(batch, padding=True, return_tensors=True)
batch_pair = [
["This", "is", "a", "sample", "sentence", "pair", "1"],
["This", "is", "sample", "sentence", "pair", "2"],
["This", "is", "a", "sample", "sentence", "pair", "3"],
# ...
["This", "is", "a", "sample", "sentence", "pair", "n", "for", "batch"],
]
tokenizer(batch, batch_pair, padding=True, return_tensors=True)
```
#### Custom fields
It is possible to add custom fields to the model input and tell the `tokenizer` how to pad them using
`add_padding_ops`. Start by initializing the tokenizer with the model name:
```python
import transformers_embedder as tre
tokenizer = tre.Tokenizer("bert-base-cased")
```
Then define the custom fields:
```python
custom_fields = {
"custom_filed_1": [
[0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
]
}
```
Now we can add the padding logic for our custom field `custom_field_1`. The `add_padding_ops` method takes
as input:
- `key`: the name of the field in the tokenizer input
- `value`: the value to use for padding
- `length`: the length to pad to. It can be an `int`, or one of two string values: `subword`, where the
element is padded to match the length of the sub-words, and `word`, where the element is padded to the
word-level length of the batch after the sub-words are merged.
```python
tokenizer.add_padding_ops("custom_field_1", 0, "word")
```
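For reference, the other `length` options are passed the same way. A hedged sketch with hypothetical field names (`custom_field_2` and `custom_field_3` are illustrative and not defined above):
```python
# Pad a field to the sub-word length instead of the word length
tokenizer.add_padding_ops("custom_field_2", 0, "subword")

# Pad a field to a fixed length of 32, using -1 as the padding value
tokenizer.add_padding_ops("custom_field_3", -1, 32)
```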
Finally, we can tokenize the input with the custom field:
```python
text = [
"This is a sample sentence",
"This is another example sentence just make it longer, with a comma too!"
]
tokenizer(text, padding=True, return_tensors=True, additional_inputs=custom_fields)
```
The inputs are now ready for the model, including the custom field.
```text
>>> inputs
{
'input_ids': tensor(
[
[ 101, 1188, 1110, 170, 6876, 5650, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 101, 1188, 1110, 1330, 1859, 5650, 1198, 1294, 1122, 2039, 117, 1114, 170, 3254, 1918, 1315, 106, 102]
]
),
'token_type_ids': tensor(
[
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
),
'attention_mask': tensor(
[
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
]
),
'scatter_offsets': tensor(
[
[ 0, 1, 2, 3, 4, 5, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16]
]
),
'sparse_offsets': {
'sparse_indices': tensor(
[
[ 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16],
[ 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
]
),
'sparse_values': tensor(
[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
1.0000, 1.0000, 0.5000, 0.5000, 1.0000, 1.0000, 1.0000]
),
'sparse_size': torch.Size([2, 17, 18])
},
'sentence_lengths': [7, 17],
}
```
## Acknowledgements
Some code in the `TransformersEmbedder` class is taken from the [PyTorch Scatter](https://github.com/rusty1s/pytorch_scatter/)
library. The pretrained models and the core of the tokenizer are from [🤗 Transformers](https://huggingface.co/transformers/).