[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# Multi-Modal Pathway Transformer
![Diagram](diagram.png)
Implementation of M2PT in PyTorch from the paper "Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities" ([paper link](https://arxiv.org/abs/2401.14405)). The striking result is that simply merging the linear projections of models trained on other modalities into your base model can improve its performance. It is a small but effective technique that can be plugged into almost any Transformer with minimal changes.
## Install
`pip3 install -U m2pt`
## Usage
### `M2PT`
A ready-to-train implementation of the M2PT model into which you can plug the linear layers of any multi-modal model. It takes tokenized text (integer token IDs), embeds it, passes it through the stack of transformer blocks, and finally applies an output projection followed by a softmax.
```python
import torch
from torch import nn
from m2pt.main import M2PT
# Create an instance of the M2PT model class with the specified parameters
model = M2PT(
    dim=512,  # Dimension of the input and output tensors
    num_tokens=10000,  # Size of the token vocabulary
    depth=6,  # Number of transformer blocks
    dim_head=64,  # Dimension of each attention head
    heads=8,  # Number of attention heads
    dropout=0.1,  # Dropout rate
    ff_mult=4,  # Multiplier for the dimension of the feed-forward network
    original_linear=nn.Linear(512, 512),  # Linear layer for the original modality
    auxiliar_linear=nn.Linear(512, 512),  # Linear layer for the auxiliary modality
    ffn_original_linear=nn.Linear,  # Linear class for the original modality in the feed-forward network
    ffn_auxiliar_linear=nn.Linear,  # Linear class for the auxiliary modality in the feed-forward network
    ffn_original_last_linear=nn.Linear,  # Last linear class for the original modality in the feed-forward network
    ffn_aux_last_linear=nn.Linear,  # Last linear class for the auxiliary modality in the feed-forward network
)

# Create a 2D tensor of token IDs with shape B x S
x = torch.randint(0, 10000, (1, 512))
# Pass the input tensor through the model
out = model(x)
# Print the shape of the output tensor
print(out.shape)
```
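
Since the model is meant to be trained directly, here is a minimal, hypothetical next-token training step. It assumes the model's output is the post-softmax distribution over `num_tokens` described above; the optimizer, loss, and placeholder data are illustrative and not part of the library:

```python
import torch
from torch import nn
from m2pt.main import M2PT

model = M2PT(
    dim=512,
    num_tokens=10000,
    depth=6,
    dim_head=64,
    heads=8,
    dropout=0.1,
    ff_mult=4,
    original_linear=nn.Linear(512, 512),
    auxiliar_linear=nn.Linear(512, 512),
    ffn_original_linear=nn.Linear,
    ffn_auxiliar_linear=nn.Linear,
    ffn_original_last_linear=nn.Linear,
    ffn_aux_last_linear=nn.Linear,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.NLLLoss()  # NLL on log-probabilities, since the model already applies a softmax

tokens = torch.randint(0, 10000, (2, 512))   # random token IDs (placeholder data)
targets = torch.randint(0, 10000, (2, 512))  # next-token targets (placeholder data)

probs = model(tokens)  # assumed shape: (batch, seq_len, num_tokens)
# NLLLoss expects (batch, classes, seq_len) log-probabilities
loss = criterion(probs.clamp_min(1e-9).log().transpose(1, 2), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```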
### `MPTransformerBlock`
- Implementation of the Multimodal Pathway Transformer block from Figure 2, with a cross-modal FFN; plug in and play your own FFN linears
- Reusable and modular
- Combines linear projections from multiple models
```python
import torch
from torch import nn
from m2pt import MPTransformerBlock
# Create an instance of the MPTransformerBlock class with the specified parameters
model = MPTransformerBlock(
    dim=512,  # Dimension of the input and output tensors
    dim_head=64,  # Dimension of each attention head
    heads=8,  # Number of attention heads
    dropout=0.1,  # Dropout rate
    ff_mult=4,  # Multiplier for the dimension of the feed-forward network
    original_linear=nn.Linear(512, 512),  # Linear layer for the original modality
    auxiliar_linear=nn.Linear(512, 512),  # Linear layer for the auxiliary modality
    ffn_original_linear=nn.Linear,  # Linear class for the original modality in the feed-forward network
    ffn_auxiliar_linear=nn.Linear,  # Linear class for the auxiliary modality in the feed-forward network
    ffn_original_last_linear=nn.Linear,  # Last linear class for the original modality in the feed-forward network
    ffn_aux_last_linear=nn.Linear,  # Last linear class for the auxiliary modality in the feed-forward network
)
# Create a 3D tensor with shape B x S x D
x = torch.randn(1, 512, 512)
# Pass the input tensor through the model
out = model(x)
# Print the shape of the output tensor
print(out.shape)
```
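
Because `original_linear` and `auxiliar_linear` are plain `nn.Linear` instances, a natural use is to hand the block projection layers lifted from a pretrained model. A hedged sketch follows; the Hugging Face module path and the assumption that any shape-matching `nn.Linear` is accepted are ours, not guarantees from the library:

```python
import torch
from torch import nn
from transformers import ViTModel
from m2pt import MPTransformerBlock

# Borrow a 768x768 attention projection from a pretrained ViT as the
# auxiliary linear; the original linear is freshly initialized.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
aux_proj = vit.encoder.layer[0].attention.attention.query  # nn.Linear(768, 768)

block = MPTransformerBlock(
    dim=768,  # match the ViT hidden size
    dim_head=64,
    heads=12,
    dropout=0.1,
    ff_mult=4,
    original_linear=nn.Linear(768, 768),
    auxiliar_linear=aux_proj,
    ffn_original_linear=nn.Linear,
    ffn_auxiliar_linear=nn.Linear,
    ffn_original_last_linear=nn.Linear,
    ffn_aux_last_linear=nn.Linear,
)

x = torch.randn(1, 196, 768)  # B x S x D
print(block(x).shape)
```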
### `CrossModalReparameterization`
- Implementation of Cross-Modal Re-parameterization from Figure 2 and Section 3.2 of the paper
- It merges the linear layers of models from different modalities through addition, with the auxiliary weight scaled by a scalar λ called the Cross-Modal Scale (a minimal sketch follows this list)
- Modular and reusable: simply plug in linear layers from any models!
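
Conceptually, the re-parameterized layer computes `y = x(W + λW')`, where `W` is the target model's weight and `W'` the auxiliary model's; after training, `λW'` is folded into `W` so inference runs at the cost of a single ordinary linear layer. Below is a minimal, hypothetical sketch of that idea (class and parameter names are ours, and λ is fixed as a plain scalar here; this is not the library's exact implementation):

```python
import torch
from torch import nn
import torch.nn.functional as F

class CrossModalLinearSketch(nn.Module):
    """Sketch of cross-modal re-parameterization: y = x (W + lambda * W_aux)^T."""

    def __init__(self, original: nn.Linear, auxiliary: nn.Linear, cross_modal_scale: float = 0.5):
        super().__init__()
        assert original.weight.shape == auxiliary.weight.shape
        self.original = original
        self.auxiliary = auxiliary
        self.cross_modal_scale = cross_modal_scale  # the lambda from the paper

    def forward(self, x):
        # The effective weight is the original plus the scaled auxiliary weight
        weight = self.original.weight + self.cross_modal_scale * self.auxiliary.weight
        bias = self.original.bias
        if bias is not None and self.auxiliary.bias is not None:
            bias = bias + self.cross_modal_scale * self.auxiliary.bias
        return F.linear(x, weight, bias)

    @torch.no_grad()
    def merge_parameters(self):
        # Fold the auxiliary weights into the original layer so inference
        # uses a single plain linear layer at zero extra cost
        self.original.weight += self.cross_modal_scale * self.auxiliary.weight
        if self.original.bias is not None and self.auxiliary.bias is not None:
            self.original.bias += self.cross_modal_scale * self.auxiliary.bias
```

The library's `CrossModalReparameterization` is used the same way end to end, as in the example below.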
```python
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel
from m2pt import CrossModalReparameterization
# Define a simple Transformer model for text
class TextTransformerModel(nn.Module):
    def __init__(self, bert_model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)

        # Assume we're reparameterizing the first linear layer of the classifier
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        logits = self.classifier(pooled_output)
        return logits

# Define a simple Transformer model for images (using ViT as an example)
class ImageTransformerModel(nn.Module):
    def __init__(self, vit_model_name='google/vit-base-patch16-224'):
        super().__init__()
        self.vit = ViTModel.from_pretrained(vit_model_name)

        # Assume we're using the first linear layer of the classifier as the auxiliary layer
        self.classifier = nn.Linear(self.vit.config.hidden_size, 2)

    def forward(self, pixel_values):
        outputs = self.vit(pixel_values=pixel_values)
        pooled_output = outputs.pooler_output
        logits = self.classifier(pooled_output)
        return logits
# Example usage
# Initialize both models
text_model = TextTransformerModel()
image_model = ImageTransformerModel()
# Assume we want to reparameterize the classifier layer of the text model
# using the classifier layer of the image model
cross_modal_layer = CrossModalReparameterization(text_model.classifier, image_model.classifier)
# Replace the classifier in the text model with the cross-modal layer
text_model.classifier = cross_modal_layer
# Example input (batch_size, sequence_length)
input_ids = torch.randint(0, 1000, (8, 512))
attention_mask = torch.ones(8, 512)
# Forward pass through the reparameterized model
logits = text_model(input_ids, attention_mask)
print(logits)
# Train the text model as usual...
# After training, merge the parameters for inference
text_model.classifier.merge_parameters()
```
## Citation
```bibtex
@misc{zhang2024multimodal,
    title={Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities},
    author={Yiyuan Zhang and Xiaohan Ding and Kaixiong Gong and Yixiao Ge and Ying Shan and Xiangyu Yue},
    year={2024},
    eprint={2401.14405},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```
## License
MIT