# FasterViT: Fast Vision Transformers with Hierarchical Attention
FasterViT achieves a new SOTA Pareto front in
terms of accuracy vs. image throughput without extra training data!
<p align="center">
<img src="https://github.com/NVlabs/FasterViT/assets/26806394/b0fc1056-3df8-4573-b3cb-3032eca15343" width=62% height=62%
class="center">
</p>
Note: Please use the [**latest NVIDIA TensorRT release**](https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/index.html) to enjoy the benefits of optimized FasterViT ops.
## Quick Start
Pre-trained FasterViT models can be imported with **1 line of code**. First, install FasterViT:
```bash
pip install fastervit
```
A pretrained FasterViT model with default hyper-parameters can then be created as follows:
```python
>>> from fastervit import create_model
# Define fastervit-0 model with 224 x 224 resolution
>>> model = create_model('faster_vit_0_224',
pretrained=True,
model_path="/tmp/faster_vit_0.pth.tar")
```
`model_path` sets the location where the pretrained weights are downloaded (a checkpoint file path, as in the example above).
We can test the model by passing a dummy input image; the output is the classification logits:
```python
>>> import torch
>>> image = torch.rand(1, 3, 224, 224)
>>> output = model(image) # torch.Size([1, 1000])
```
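To turn the logits into predictions, we can apply a softmax and take the top-5 classes. The snippet below is a small illustrative sketch that reuses `output` from the example above (mapping class indices to class names is not shown):
```python
>>> import torch
>>> probs = torch.softmax(output, dim=-1)       # probabilities over the 1000 ImageNet classes
>>> top5_prob, top5_idx = probs.topk(5, dim=-1)
>>> print(top5_idx[0].tolist())                 # indices of the 5 most likely classes
```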
We can also use the any-resolution FasterViT model to accommodate arbitrary image resolutions. In the following, we define an any-resolution FasterViT-0
model with an input resolution of 576 x 960, window sizes of 12 and 6 in the 3rd and 4th stages, a carrier token size of 2, and an embedding dimension of
64:
```python
>>> from fastervit import create_model
# Define any-resolution FasterViT-0 model with 576 x 960 resolution
>>> model = create_model('faster_vit_0_any_res',
resolution=[576, 960],
window_size=[7, 7, 12, 6],
ct_size=2,
dim=64,
pretrained=True)
```
Note that the above model is initialized from the original ImageNet pre-trained FasterViT with its original resolution of 224 x 224. As a result, missing keys and shape mismatches can be expected, since new layers are added (e.g., new carrier tokens).
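The snippet below is a purely illustrative sketch of how such a checkpoint could be loaded manually while tolerating these mismatches; `create_model(pretrained=True)` already handles this for you, and the checkpoint path and the wrapping `state_dict` key are assumptions:
```python
import torch
from fastervit import create_model

model = create_model('faster_vit_0_any_res',
                     resolution=[576, 960],
                     window_size=[7, 7, 12, 6],
                     ct_size=2,
                     dim=64,
                     pretrained=False)

# Hypothetical path to a previously downloaded 224 x 224 checkpoint
checkpoint = torch.load("/tmp/faster_vit_0.pth.tar", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)   # unwrap if nested (assumption)

# Keep only tensors whose shapes still match the any-resolution model
reference = model.state_dict()
filtered = {k: v for k, v in state_dict.items()
            if k in reference and v.shape == reference[k].shape}
missing, unexpected = model.load_state_dict(filtered, strict=False)
print(len(missing), "missing keys,", len(unexpected), "unexpected keys")
```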
Again, the model can be tested with a dummy input image; the output is the classification logits:
```python
>>> import torch
>>> image = torch.rand(1, 3, 576, 960)
>>> output = model(image) # torch.Size([1, 1000])
```
---
## Results + Pretrained Models
### ImageNet-1K
**FasterViT ImageNet-1K Pretrained Models**
<table>
<tr>
<th>Name</th>
<th>Acc@1(%)</th>
<th>Acc@5(%)</th>
<th>Throughput(Img/Sec)</th>
<th>Resolution</th>
<th>#Params(M)</th>
<th>FLOPs(G)</th>
<th>Download</th>
</tr>
<tr>
<td>FasterViT-0</td>
<td>82.1</td>
<td>95.9</td>
<td>5802</td>
<td>224x224</td>
<td>31.4</td>
<td>3.3</td>
<td><a href="https://drive.google.com/uc?export=download&id=1twI2LFJs391Yrj8MR4Ui9PfrvWqjE1iB">model</a></td>
</tr>
<tr>
<td>FasterViT-1</td>
<td>83.2</td>
<td>96.5</td>
<td>4188</td>
<td>224x224</td>
<td>53.4</td>
<td>5.3</td>
<td><a href="https://drive.google.com/uc?export=download&id=1r7W10n5-bFtM3sz4bmaLrowN2gYPkLGT">model</a></td>
</tr>
<tr>
<td>FasterViT-2</td>
<td>84.2</td>
<td>96.8</td>
<td>3161</td>
<td>224x224</td>
<td>75.9</td>
<td>8.7</td>
<td><a href="https://drive.google.com/uc?export=download&id=1n_a6s0pgi0jVZOGmDei2vXHU5E6RH5wU">model</a></td>
</tr>
<tr>
<td>FasterViT-3</td>
<td>84.9</td>
<td>97.2</td>
<td>1780</td>
<td>224x224</td>
<td>159.5</td>
<td>18.2</td>
<td><a href="https://drive.google.com/uc?export=download&id=1tvWElZ91Sia2SsXYXFMNYQwfipCxtI7X">model</a></td>
</tr>
<tr>
<td>FasterViT-4</td>
<td>85.4</td>
<td>97.3</td>
<td>849</td>
<td>224x224</td>
<td>424.6</td>
<td>36.6</td>
<td><a href="https://drive.google.com/uc?export=download&id=1gYhXA32Q-_9C5DXel17avV_ZLoaHwdgz">model</a></td>
</tr>
<tr>
<td>FasterViT-5</td>
<td>85.6</td>
<td>97.4</td>
<td>449</td>
<td>224x224</td>
<td>975.5</td>
<td>113.0</td>
<td><a href="https://drive.google.com/uc?export=download&id=1mqpai7XiHLr_n1tjxjzT8q369xTCq_z-">model</a></td>
</tr>
<tr>
<td>FasterViT-6</td>
<td>85.8</td>
<td>97.4</td>
<td>352</td>
<td>224x224</td>
<td>1360.0</td>
<td>142.0</td>
<td><a href="https://drive.google.com/uc?export=download&id=12jtavR2QxmMzcKwPzWe7kw-oy34IYi59">model</a></td>
</tr>
</table>
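Throughput depends heavily on hardware, batch size, and precision, so the numbers above will not be reproduced exactly on other setups. The sketch below only illustrates one way to measure image throughput with PyTorch; the batch size of 64, FP32 precision, and GPU execution are assumptions:
```python
import time
import torch
from fastervit import create_model

model = create_model('faster_vit_0_224').cuda().eval()
images = torch.rand(64, 3, 224, 224, device="cuda")    # assumed batch size

with torch.no_grad():
    for _ in range(10):                                 # warm-up iterations
        model(images)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):                                 # timed iterations
        model(images)
    torch.cuda.synchronize()
    elapsed = time.time() - start

print(f"throughput: {50 * images.shape[0] / elapsed:.1f} img/sec")
```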
### ImageNet-21K
**FasterViT ImageNet-21K Pretrained Models (ImageNet-1K Fine-tuned)**
<table>
<tr>
<th>Name</th>
<th>Acc@1(%)</th>
<th>Acc@5(%)</th>
<th>Resolution</th>
<th>#Params(M)</th>
<th>FLOPs(G)</th>
<th>Download</th>
</tr>
<tr>
<td>FasterViT-4-21K-224</td>
<td>86.6</td>
<td>97.8</td>
<td>224x224</td>
<td>271.9</td>
<td>40.8</td>
<td><a href="https://huggingface.co/ahatamiz/FasterViT/resolve/main/fastervit_4_21k_224_w14.pth.tar">model</a></td>
</tr>
<tr>
<td>FasterViT-4-21K-384</td>
<td>87.6</td>
<td>98.3</td>
<td>384x384</td>
<td>271.9</td>
<td>120.1</td>
<td><a href="https://huggingface.co/ahatamiz/FasterViT/resolve/main/fastervit_4_21k_384_w24.pth.tar">model</a></td>
</tr>
<tr>
<td>FasterViT-4-21K-512</td>
<td>87.8</td>
<td>98.4</td>
<td>512x512</td>
<td>271.9</td>
<td>213.5</td>
<td><a href="https://huggingface.co/ahatamiz/FasterViT/resolve/main/fastervit_4_21k_512_w32.pth.tar">model</a></td>
</tr>
<tr>
<td>FasterViT-4-21K-768</td>
<td>87.9</td>
<td>98.5</td>
<td>768x768</td>
<td>271.9</td>
<td>480.4</td>
<td><a href="https://huggingface.co/ahatamiz/FasterViT/resolve/main/fastervit_4_21k_768_w48.pth.tar">model</a></td>
</tr>
</table>
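The checkpoints above are hosted on Hugging Face Hub and can also be fetched programmatically. Below is a minimal sketch that downloads the 224 x 224 ImageNet-21K checkpoint and inspects it; the repository id and filename come from the table links, while the wrapping `state_dict` key is an assumption about the checkpoint layout:
```python
import torch
from huggingface_hub import hf_hub_download

# Repository id and filename taken from the download link in the table above
ckpt_path = hf_hub_download(repo_id="ahatamiz/FasterViT",
                            filename="fastervit_4_21k_224_w14.pth.tar")

checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)   # unwrap if nested (assumption)
num_params = sum(v.numel() for v in state_dict.values())
print(f"{len(state_dict)} tensors, {num_params / 1e6:.1f}M parameters")
```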
### Robustness (ImageNet-A, ImageNet-R, ImageNet-V2)
All models use `crop_pct=0.875`. Results are obtained by running inference on ImageNet-1K pretrained models without finetuning (a sketch of this preprocessing is given after the table).
<table>
<tr>
<th>Name</th>
<th>A-Acc@1(%)</th>
<th>A-Acc@5(%)</th>
<th>R-Acc@1(%)</th>
<th>R-Acc@5(%)</th>
<th>V2-Acc@1(%)</th>
<th>V2-Acc@5(%)</th>
</tr>
<tr>
<td>FasterViT-0</td>
<td>23.9</td>
<td>57.6</td>
<td>45.9</td>
<td>60.4</td>
<td>70.9</td>
<td>90.0</td>
</tr>
<tr>
<td>FasterViT-1</td>
<td>31.2</td>
<td>63.3</td>
<td>47.5</td>
<td>61.9</td>
<td>72.6</td>
<td>91.0</td>
</tr>
<tr>
<td>FasterViT-2</td>
<td>38.2</td>
<td>68.9</td>
<td>49.6</td>
<td>63.4</td>
<td>73.7</td>
<td>91.6</td>
</tr>
<tr>
<td>FasterViT-3</td>
<td>44.2</td>
<td>73.0</td>
<td>51.9</td>
<td>65.6</td>
<td>75.0</td>
<td>92.2</td>
</tr>
<tr>
<td>FasterViT-4</td>
<td>49.0</td>
<td>75.4</td>
<td>56.0</td>
<td>69.6</td>
<td>75.7</td>
<td>92.7</td>
</tr>
<tr>
<td>FasterViT-5</td>
<td>52.7</td>
<td>77.6</td>
<td>56.9</td>
<td>70.0</td>
<td>76.0</td>
<td>93.0</td>
</tr>
<tr>
<td>FasterViT-6</td>
<td>53.7</td>
<td>78.4</td>
<td>57.1</td>
<td>70.1</td>
<td>76.1</td>
<td>93.0</td>
</tr>
</table>
A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2 respectively.
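For reference, the sketch below shows the evaluation preprocessing implied by `crop_pct=0.875` at 224 x 224: resize the shorter side to 224 / 0.875 = 256, then take a 224 x 224 center crop. The torchvision transforms and the ImageNet normalization statistics are assumptions based on the common timm-style evaluation pipeline:
```python
from torchvision import transforms

crop_pct, img_size = 0.875, 224
eval_transform = transforms.Compose([
    transforms.Resize(int(img_size / crop_pct)),      # shorter side -> 256
    transforms.CenterCrop(img_size),                   # 224 x 224 center crop
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),   # assumed ImageNet statistics
                         std=(0.229, 0.224, 0.225)),
])
```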
## Citation
Please consider citing FasterViT if this repository is useful for your work.
```
@article{hatamizadeh2023fastervit,
title={FasterViT: Fast Vision Transformers with Hierarchical Attention},
author={Hatamizadeh, Ali and Heinrich, Greg and Yin, Hongxu and Tao, Andrew and Alvarez, Jose M and Kautz, Jan and Molchanov, Pavlo},
journal={arXiv preprint arXiv:2306.06189},
year={2023}
}
```
## Licenses
Copyright © 2023, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click [here](LICENSE) to view a copy of this license.
For license information regarding the timm repository, please refer to its [repository](https://github.com/rwightman/pytorch-image-models).
For license information regarding the ImageNet dataset, please see the [ImageNet official website](https://www.image-net.org/).
## Acknowledgement
This repository is built on top of the [timm](https://github.com/huggingface/pytorch-image-models) repository. We thank [Ross Wightman](https://rwightman.com/) for creating and maintaining this high-quality library.