# ViX
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-xformers-efficient-attention-for-image/image-classification-on-cifar-10)](https://paperswithcode.com/sota/image-classification-on-cifar-10?p=vision-xformers-efficient-attention-for-image)
## Vision Xformers: Efficient Attention for Image Classification
![image](https://user-images.githubusercontent.com/15833382/172207987-e07bb02b-4a1e-430c-a1bf-bc78af87976b.png)
We use linear attention mechanisms to replace quadratic attention in ViT for image classification. We show that models using linear attention and CNN embedding layers need fewer parameters and have lower GPU requirements to achieve good accuracy. These improvements can help democratize transformers for practitioners who are limited by data and GPU resources.
Hybrid ViX uses convolutional layers instead of a linear layer to generate embeddings.
Rotary Position Embedding (RoPE) is also used in our models instead of 1D learnable position embeddings.
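
To make the quadratic-versus-linear contrast concrete, the sketch below counts attention-score entries per head for a 32x32 image with one-pixel patches. The function name `score_matrix_entries` is hypothetical, and the landmark formula assumes a Nyström-style factorization into three smaller score matrices. It is a rough proxy for memory, not a measurement of this package.

```python
from typing import Optional


def score_matrix_entries(num_tokens: int, num_landmarks: Optional[int] = None) -> int:
    """Count attention-score entries per head (hypothetical helper, rough proxy only)."""
    if num_landmarks is None:
        # Full softmax attention: every token attends to every other token.
        return num_tokens * num_tokens
    # Assumed landmark-style approximation with three score matrices:
    # tokens-to-landmarks, landmarks-to-landmarks, landmarks-to-tokens.
    return 2 * num_tokens * num_landmarks + num_landmarks * num_landmarks


num_tokens = (32 // 1) ** 2                      # 32x32 image, patch_size = 1
full = score_matrix_entries(num_tokens)          # quadratic in num_tokens
approx = score_matrix_entries(num_tokens, 256)   # linear in num_tokens for fixed landmarks
print(num_tokens, full, approx)
```

With the landmark count held fixed, doubling the number of tokens roughly doubles the approximate count, while the full-attention count quadruples.
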
Nomenclature:
We replace the X in ViX with the first letter of the attention mechanism used.
E.g., when we use Performer in ViX, we replace the X with P, calling it ViP (Vision Performer).
The 'Hybrid' prefix is used for models that use convolutional layers instead of a linear embedding layer.
We add RoPE to the name of models that use Rotary Position Embedding.
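
The naming rule can be sketched as a small function. `vix_name` is a hypothetical helper written for illustration and is not part of the `vision_xformer` package; it covers only the first-letter rule and the 'Hybrid' prefix, and FNet-based models do not follow this pattern.

```python
def vix_name(attention: str, hybrid: bool = False) -> str:
    """Build a model name from the attention mechanism's name (hypothetical helper)."""
    # Replace the X in ViX with the first letter of the attention mechanism.
    name = "Vi" + attention[0].upper()
    # Models with convolutional embedding layers get the 'Hybrid' prefix.
    return "Hybrid" + name if hybrid else name


print(vix_name("Performer"))                   # ViP
print(vix_name("Nyströmformer", hybrid=True))  # HybridViN
```
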
Code for using all of these models to classify the CIFAR-10 and Tiny ImageNet datasets is provided.
### Models
- Vision Transformer (ViT)
- Vision Linformer (ViL)
- Vision Performer (ViP)
- Vision Nyströmformer (ViN)
- FNet
- Hybrid Vision Transformer (HybridViT)
- Hybrid Vision Linformer (HybridViL)
- Hybrid Vision Performer (HybridViP)
- Hybrid Vision Nyströmformer (HybridViN)
- Hybrid FNet
- LeViN (Replacing Transformer in LeViT with Nyströmformer)
- LeViP (Replacing Transformer in LeViT with Performer)
- CvN (Replacing Transformer in CvT with Nyströmformer)
- CvP (Replacing Transformer in CvT with Performer)
- CCN (Replacing Transformer in CCT with Nyströmformer)
- CCP (Replacing Transformer in CCT with Performer)
We have adapted the code for ViT and linear transformers from @lucidrains.
## Install
```bash
$ pip install vision-xformer
```
## Usage
### Image Classification
#### Vision Nyströmformer (ViN)
```python
import torch
from vision_xformer import ViN

model = ViN(
    image_size = 32,
    patch_size = 1,
    num_classes = 10,
    dim = 128,
    depth = 4,
    heads = 4,
    mlp_dim = 256,
    num_landmarks = 256,  # one-fourth of the 1024 patches
    pool = 'cls',
    channels = 3,
    dropout = 0.,
    emb_dropout = 0.,
    dim_head = 32
)

# One random 3-channel 32x32 image as a batch of size 1
img = torch.randn(1, 3, 32, 32)

preds = model(img) # (1, 10)
```
#### Vision Performer (ViP)
```python
import torch
from vision_xformer import ViP

model = ViP(
    image_size = 32,
    patch_size = 1,
    num_classes = 10,
    dim = 128,
    depth = 4,
    heads = 4,
    mlp_dim = 256,
    dropout = 0.25,
    dim_head = 32
)

img = torch.randn(1, 3, 32, 32)

preds = model(img) # (1, 10)
```
#### Vision Linformer (ViL)
```python
import torch
from vision_xformer import ViL

model = ViL(
    image_size = 32,
    patch_size = 1,
    num_classes = 10,
    dim = 128,
    depth = 4,
    heads = 4,
    mlp_dim = 256,
    dropout = 0.25,
    dim_head = 32
)

img = torch.randn(1, 3, 32, 32)

preds = model(img) # (1, 10)
```
## Parameters
- `image_size`: int.
Size of the input image. If you have rectangular images, make sure your image size is the maximum of the width and height.
- `patch_size`: int.
Size of each patch. `image_size` must be divisible by `patch_size`.
- `num_classes`: int.
Number of classes to classify.
- `dim`: int.
Final dimension of token embeddings after the linear layer.
- `depth`: int.
Number of layers.
- `heads`: int.
Number of heads in multi-head attention.
- `mlp_dim`: int.
Embedding dimension in the MLP (FeedForward) layer.
- `num_landmarks`: int.
Number of landmark points. Use one-fourth the number of patches.
- `pool`: str.
Pool type must be either `'cls'` (cls token) or `'mean'` (mean pooling).
- `dropout`: float between `[0, 1]`, default `0.`.
Dropout rate.
- `dim_head`: int.
Embedding dimension of each token in each head of multi-head attention.
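
The constraints above can be checked before building a model. The sketch below is a minimal illustration, not part of the package: the helper names are hypothetical, and the patch count assumes square patches tiled over a square image, as in a standard ViT.

```python
def square_image_size(height: int, width: int) -> int:
    """Pick `image_size` for a rectangular image (hypothetical helper)."""
    # The README asks for the maximum of the width and height.
    return max(height, width)


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of patches, assuming square patches on a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def suggested_landmarks(image_size: int, patch_size: int) -> int:
    """One-fourth the number of patches, as recommended for `num_landmarks`."""
    return num_patches(image_size, patch_size) // 4


print(num_patches(32, 1))          # 1024 patches for the CIFAR-10 settings above
print(suggested_landmarks(32, 1))  # 256, matching the ViN example
```
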
More information about these models can be found in our papers: [ArXiv Paper](https://arxiv.org/abs/2107.02239), [WACV 2022 Paper](https://openaccess.thecvf.com/content/WACV2022/html/Jeevan_Resource-Efficient_Hybrid_X-Formers_for_Vision_WACV_2022_paper.html)
If you wish to cite this, please use:
```
@misc{jeevan2021vision,
title={Vision Xformers: Efficient Attention for Image Classification},
author={Pranav Jeevan and Amit Sethi},
year={2021},
eprint={2107.02239},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@InProceedings{Jeevan_2022_WACV,
author = {Jeevan, Pranav and Sethi, Amit},
title = {Resource-Efficient Hybrid X-Formers for Vision},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {January},
year = {2022},
pages = {2982-2990}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/pranavphoenix/VisionXformer",
"name": "vision-xformer",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "artificial intelligence,training,optimizer,machine learning,attention,transformers,computer vision",
"author": "Pranav Jeevan",
"author_email": "pranav13phoenix@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/c3/71/83055714b73da06dff2f6b82d389407eba72c24cdded7681c0adc529eccf/vision_xformer-0.2.0.tar.gz",
"platform": null,
"description": "# ViX \r\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-xformers-efficient-attention-for-image/image-classification-on-cifar-10)](https://paperswithcode.com/sota/image-classification-on-cifar-10?p=vision-xformers-efficient-attention-for-image)\r\n## Vision Xformers: Efficient Attention for Image Classification\r\n\r\n![image](https://user-images.githubusercontent.com/15833382/172207987-e07bb02b-4a1e-430c-a1bf-bc78af87976b.png)\r\n\r\n\r\nWe use Linear Attention mechanisms to replace quadratic attention in ViT for image classification. We show that models using linear attention and CNN embedding layers need less parameters and low GPU requirements for achieving good accuracy. These improvements can be used to democratize the use of transformers by practitioners who are limited by data and GPU.\r\n\r\nHybrid ViX uses convolutional layers instead of linear layer for generating embeddings\r\n\r\nRotary Postion Embedding (RoPE) is also used in our models instead of 1D learnable position embeddings\r\n\r\nNomenclature:\r\nWe replace the X in ViX with the starting alphabet of the attention mechanism used\r\nEg. When we use Performer in ViX, we replace the X with P, calling it ViP (Vision Performer)\r\n\r\n'Hybrid' prefix is used in models which uses convolutional layers instead of linear embeddding layer. 
\r\n\r\nWe have added RoPE in the title of models which used Rotary Postion Embedding\r\n\r\nThe code for using all for these models for classification of CIFAR 10/Tiny ImageNet dataset is provided\r\n\r\n### Models\r\n\r\n- Vision Transformer (ViT)\r\n- Vision Linformer (ViL)\r\n- Vision Performer (ViP)\r\n- Vision Nystr\u00c3\u00b6mformer (ViN)\r\n- FNet\r\n- Hybrid Vision Transformer (HybridViT)\r\n- Hybrid Vision Linformer (HybridViL)\r\n- Hybrid Vision Performer (HybridViP)\r\n- Hybrid Vision Nystr\u00c3\u00b6mformer (HybridViN)\r\n- Hybrid FNet\r\n- LeViN (Replacing Transformer in LeViT with Nystr\u00c3\u00b6mformer)\r\n- LeViP (Replacing Transformer in LeViT with Performer)\r\n- CvN (Replacing Transformer in CvT with Nystr\u00c3\u00b6mformer)\r\n- CvP (Replacing Transformer in CvT with Performer)\r\n- CCN (Replacing Transformer in CCT with Nystr\u00c3\u00b6mformer)\r\n- CCP(Replacing Transformer in CCT with Performer)\r\n\r\nWe have adapted the codes for ViT and linear transformers from @lucidrains \r\n\r\n## Install\r\n```bash\r\n$ pip install vision-xformer\r\n```\r\n## Usage\r\n### Image Classification\r\n#### Vision Nystr\u00c3\u00b6mformer (ViN)\r\n\r\n```python\r\nimport torch, vision_xformer\r\nfrom vision_xformer import ViN\r\n\r\nmodel = ViN(\r\n image_size = 32,\r\n patch_size = 1,\r\n num_classes = 10, \r\n dim = 128, \r\n depth = 4, \r\n heads = 4, \r\n mlp_dim = 256,\r\n num_landmarks = 256,\r\n pool = 'cls',\r\n channels = 3,\r\n dropout = 0.,\r\n emb_dropout = 0.\r\n dim_head = 32\r\n)\r\n\r\nimg = torch.randn(1, 3, 32, 32)\r\n\r\npreds = model(img) # (1, 10)\r\n```\r\n\r\n#### Vision Performer (ViP)\r\n\r\n```python\r\nimport torch, vision_xformer\r\nfrom vision_xformer import ViP\r\n\r\nmodel = ViP(\r\n image_size = 32,\r\n patch_size = 1,\r\n num_classes = 10, \r\n dim = 128, \r\n depth = 4, \r\n heads = 4, \r\n mlp_dim = 256,\r\n dropout = 0.25,\r\n dim_head = 32\r\n)\r\n\r\nimg = torch.randn(1, 3, 32, 32)\r\n\r\npreds = model(img) # (1, 
10)\r\n```\r\n\r\n#### Vision Linformer (ViL)\r\n\r\n```python\r\nimport torch, vision_xformer\r\nfrom vision_xformer import ViL\r\n\r\nmodel = ViL(\r\n image_size = 32,\r\n patch_size = 1,\r\n num_classes = 10, \r\n dim = 128, \r\n depth = 4, \r\n heads = 4, \r\n mlp_dim = 256,\r\n dropout = 0.25,\r\n dim_head = 32\r\n)\r\n\r\nimg = torch.randn(1, 3, 32, 32)\r\n\r\npreds = model(img) # (1, 10)\r\n```\r\n## Parameters\r\n\r\n- `image_size`: int. \r\nSize of input image. If you have rectangular images, make sure your image size is the maximum of the width and height\r\n- `patch_size`: int. \r\nNumber of patches. `image_size` must be divisible by `patch_size`.\r\n- `num_classes`: int. \r\nNumber of classes to classify.\r\n- `dim`: int. \r\nFinal dimension of token emeddings after linear layer. \r\n- `depth`: int. \r\nNumber of layers.\r\n- `heads`: int. \r\nNumber of heads in multi-head attention\r\n- `mlp_dim`: int. \r\nEmbedding dimension in the MLP (FeedForward) layer. \r\n- `num_landmarks`: int.\r\nNumber of landmark points. Use one-fourth the number of patches.\r\n- `pool`: str.\r\nPool type must be either `'cls'` (cls token) or `'mean'` (mean pooling)\r\n- `dropout`: float between `[0, 1]`, default `0.`. \r\nDropout rate. \r\n- `dim_head`: int. 
\r\nEmbedding dimension of token in each head of mulit-head attention.\r\n\r\n\r\nMore information about these models can be obtained from our paper : [ArXiv Paper](https://arxiv.org/abs/2107.02239), [WACV 2022 Paper](https://openaccess.thecvf.com/content/WACV2022/html/Jeevan_Resource-Efficient_Hybrid_X-Formers_for_Vision_WACV_2022_paper.html)\r\n\r\nIf you wish to cite this, please use:\r\n```\r\n@misc{jeevan2021vision,\r\n title={Vision Xformers: Efficient Attention for Image Classification}, \r\n author={Pranav Jeevan and Amit Sethi},\r\n year={2021},\r\n eprint={2107.02239},\r\n archivePrefix={arXiv},\r\n primaryClass={cs.CV}\r\n}\r\n@InProceedings{Jeevan_2022_WACV,\r\n author = {Jeevan, Pranav and Sethi, Amit},\r\n title = {Resource-Efficient Hybrid X-Formers for Vision},\r\n booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},\r\n month = {January},\r\n year = {2022},\r\n pages = {2982-2990}\r\n}\r\n```\r\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Vision Xformers",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/pranavphoenix/VisionXformer"
},
"split_keywords": [
"artificial intelligence",
"training",
"optimizer",
"machine learning",
"attention",
"transformers",
"computer vision"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0c429688c73344eb1b0ddafb21410667e2c480c5c48d35860ef209b167c602ac",
"md5": "58a8ec2635349c15a62aeba2f8ddc3ba",
"sha256": "6b4ebfef0a7331ff845a289dbc05556cf88ada592d6b7057031d2eec299bd725"
},
"downloads": -1,
"filename": "vision_xformer-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "58a8ec2635349c15a62aeba2f8ddc3ba",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 15063,
"upload_time": "2023-05-10T22:25:38",
"upload_time_iso_8601": "2023-05-10T22:25:38.098051Z",
"url": "https://files.pythonhosted.org/packages/0c/42/9688c73344eb1b0ddafb21410667e2c480c5c48d35860ef209b167c602ac/vision_xformer-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c37183055714b73da06dff2f6b82d389407eba72c24cdded7681c0adc529eccf",
"md5": "97fb7b3d99cb6dde2614c284f4f43d00",
"sha256": "dd43ee2e6a68d9674a1d360d9677dfa9bf0d510e0b39b6b24d156218a29020ad"
},
"downloads": -1,
"filename": "vision_xformer-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "97fb7b3d99cb6dde2614c284f4f43d00",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 13263,
"upload_time": "2023-05-10T22:25:40",
"upload_time_iso_8601": "2023-05-10T22:25:40.015467Z",
"url": "https://files.pythonhosted.org/packages/c3/71/83055714b73da06dff2f6b82d389407eba72c24cdded7681c0adc529eccf/vision_xformer-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-10 22:25:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pranavphoenix",
"github_project": "VisionXformer",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "vision-xformer"
}