# KVMM: Keras Vision Models 🚀
[Apache 2.0 License](https://opensource.org/licenses/Apache-2.0)
[Keras 3](https://github.com/keras-team/keras)

## 📌 Table of Contents
- 📖 [Introduction](#introduction)
- ⚡ [Installation](#installation)
- 🛠️ [Usage](#usage)
- 📑 [Models](#models)
- 📜 [License](#license)
- 🌟 [Credits](#credits)
## 📖 Introduction
Keras Vision Models (KVMM) is a collection of vision models with pretrained weights, built entirely with Keras 3. It supports a range of tasks, including segmentation, object detection, vision-language models (VLMs), and classification. KVMM includes custom layers and backbone support, providing flexibility and efficiency across various vision applications. Backbones come with multiple pretrained weight variants, such as `in1k`, `in21k`, `fb_dist_in1k`, `ms_in22k`, `fb_in22k_ft_in1k`, `ns_jft_in1k`, `aa_in1k`, `cvnets_in1k`, `augreg_in21k_ft_in1k`, `augreg_in21k`, and many more.
## ⚡ Installation
From PyPI (recommended)
```shell
pip install -U kvmm
```
From Source
```shell
pip install -U git+https://github.com/IMvision12/keras-vision-models
```
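
A quick post-install check (using only the standard-library `importlib.metadata`, so it makes no assumptions about kvmm's API) confirms the package imports cleanly and reports the installed version:

```python
from importlib.metadata import version

import kvmm  # verifies the package imports without errors

print(version("kvmm"))  # prints the installed kvmm release
```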
## 🛠️ Usage
<h3><b>🔎 Listing Available Models</b></h3>
`kvmm.list_models()` shows all available models, including backbones, segmentation models, object detection models, and vision-language models (VLMs), along with the names of the pretrained weights available for each model variant.
```python
import kvmm
print(kvmm.list_models())
## Output:
"""
CaiTM36 : fb_dist_in1k_384
CaiTM48 : fb_dist_in1k_448
CaiTS24 : fb_dist_in1k_224, fb_dist_in1k_384
...
ConvMixer1024D20 : in1k
ConvMixer1536D20 : in1k
...
ConvNeXtAtto : d2_in1k
ConvNeXtBase : fb_in1k, fb_in22k, fb_in22k_ft_in1k, fb_in22k_ft_in1k_384
...
"""
```
<h3><b>🔎 Listing Specific Model Variants</b></h3>
```python
import kvmm
print(kvmm.list_models("swin"))
# Output:
"""
SwinBaseP4W12 : ms_in1k, ms_in22k, ms_in22k_ft_in1k
SwinBaseP4W7 : ms_in1k, ms_in22k, ms_in22k_ft_in1k
SwinLargeP4W12 : ms_in22k, ms_in22k_ft_in1k
SwinLargeP4W7 : ms_in22k, ms_in22k_ft_in1k
SwinSmallP4W7 : ms_in1k, ms_in22k, ms_in22k_ft_in1k
SwinTinyP4W7 : ms_in1k, ms_in22k
"""
```
<h3><b>⚙️ Layers</b></h3>
KVMM provides various custom layers such as `StochasticDepth`, `LayerScale`, `EfficientMultiheadSelfAttention`, and more. These layers can be seamlessly integrated into your custom models and workflows 🚀
```python
import kvmm
import numpy as np

# Example 1: StochasticDepth randomly drops the input during training (drop-path regularization)
input_tensor = np.random.rand(1, 197, 192).astype(np.float32)  # dummy feature tensor
layer = kvmm.layers.StochasticDepth(drop_path_rate=0.1)
output = layer(input_tensor, training=True)

# Example 2: WindowPartition splits feature maps into fixed-size windows
# (dummy features in an assumed (batch, height*width, channels) layout)
features = np.random.rand(1, 28 * 28, 96).astype(np.float32)
window_partition = kvmm.layers.WindowPartition(window_size=7)
windowed_features = window_partition(features, height=28, width=28)
```
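
As a sketch of that kind of integration (only the `StochasticDepth` constructor shown above is assumed; `residual_block` is a hypothetical helper), a custom residual block could drop its branch during training like this:

```python
import keras
import kvmm

def residual_block(x, units, drop_path_rate=0.1):
    # Hypothetical helper: the dense branch is randomly dropped during training,
    # which regularizes deep residual stacks.
    branch = keras.layers.Dense(units, activation="gelu")(x)
    branch = kvmm.layers.StochasticDepth(drop_path_rate=drop_path_rate)(branch)
    return keras.layers.Add()([x, branch])

inputs = keras.Input(shape=(197, 192))
outputs = residual_block(inputs, units=192)
model = keras.Model(inputs, outputs)
model.summary()
```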
<h3><b>🏗️ Backbone Usage (Classification)</b></h3>
#### 🛠️ Basic Usage
```python
import kvmm
import numpy as np
# Default configuration
model = kvmm.models.vit.ViTTiny16()

# For fine-tuning (default pretrained weights)
model = kvmm.models.vit.ViTTiny16(include_top=False, input_shape=(224,224,3))
# Specific weight variant
model = kvmm.models.vit.ViTTiny16(include_top=False, input_shape=(224,224,3), weights="augreg_in21k_224")

# Backbone support (returns intermediate feature maps)
model = kvmm.models.vit.ViTTiny16(include_top=False, as_backbone=True, input_shape=(224,224,3), weights="augreg_in21k_224")
random_input = np.random.rand(1, 224, 224, 3).astype(np.float32)
features = model(random_input)
print(f"Number of feature maps: {len(features)}")
for i, feature in enumerate(features):
    print(f"Feature {i} shape: {feature.shape}")
"""
Output:
Number of feature maps: 13
Feature 0 shape: (1, 197, 192)
Feature 1 shape: (1, 197, 192)
Feature 2 shape: (1, 197, 192)
...
"""
```
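
Building on the headless (`include_top=False`) usage above, a fine-tuning sketch might attach a fresh classifier head as follows; the 10-class head, the freezing strategy, and the pooling step are illustrative assumptions rather than part of the kvmm API:

```python
import keras
import kvmm

backbone = kvmm.models.vit.ViTTiny16(include_top=False, input_shape=(224, 224, 3))
backbone.trainable = False  # freeze pretrained weights for the first training phase

inputs = keras.Input(shape=(224, 224, 3))
features = backbone(inputs)
# If the headless output is still a token sequence (e.g. (batch, 197, 192)),
# pool it before attaching the new head.
if len(features.shape) == 3:
    features = keras.layers.GlobalAveragePooling1D()(features)
outputs = keras.layers.Dense(10, activation="softmax")(features)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```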
#### 🚀 Example Inference
```python
from keras import ops
from keras.applications.imagenet_utils import decode_predictions
import kvmm
from PIL import Image
model = kvmm.models.swin.SwinTinyP4W7(input_shape=[224, 224, 3])
image = Image.open("bird.png").resize((224, 224))
x = ops.convert_to_tensor(image)
x = ops.expand_dims(x, axis=0)
# Predict
preds = model.predict(x)
print("Predicted:", decode_predictions(preds, top=3)[0])
# Output:
# Predicted: [('n01537544', 'indigo_bunting', np.float32(0.9135666)), ('n01806143', 'peacock', np.float32(0.0003379386)), ('n02017213', 'European_gallinule', np.float32(0.00027174334))]
```
<h3><b>🧩 Segmentation</b></h3>
#### 🛠️ Basic Usage
```python
import kvmm
# Pretrained weights: "cityscapes", "ade20k", or "mit" (ImageNet-1k)
# The ade20k and cityscapes weights can also be used for fine-tuning by passing a custom `num_classes`
# If `num_classes` is not specified, it defaults to 150 for ade20k and 19 for cityscapes
model = kvmm.models.segformer.SegFormerB0(weights="ade20k", input_shape=(512,512,3))
model = kvmm.models.segformer.SegFormerB0(weights="cityscapes", input_shape=(512,512,3))
# Fine-tune using the `MiT` backbone (this loads `in1k` weights)
model = kvmm.models.segformer.SegFormerB0(weights="mit", input_shape=(512,512,3))
```
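
As noted in the comments above, the ADE20K and Cityscapes weights can serve as a starting point for a custom dataset. A minimal fine-tuning sketch follows; the 5-class count, learning rate, and loss configuration are assumptions, and `from_logits` should be adjusted to match the model's final activation:

```python
import keras
import kvmm

# Start from ADE20K weights but attach a head for 5 custom classes
model = kvmm.models.segformer.SegFormerB0(
    weights="ade20k", num_classes=5, input_shape=(512, 512, 3)
)
model.compile(
    optimizer=keras.optimizers.Adam(1e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```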
#### 🚀 Custom Backbone Support
```python
import kvmm
# With no backbone weights
backbone = kvmm.models.resnet.ResNet50(as_backbone=True, weights=None, include_top=False, input_shape=(224,224,3))
segformer = kvmm.models.segformer.SegFormerB0(weights=None, backbone=backbone, num_classes=10, input_shape=(224,224,3))
# With backbone weights
backbone = kvmm.models.resnet.ResNet50(as_backbone=True, weights="tv_in1k", include_top=False, input_shape=(224,224,3))
segformer = kvmm.models.segformer.SegFormerB0(weights=None, backbone=backbone, num_classes=10, input_shape=(224,224,3))
```
#### 🚀 Example Inference
```python
import kvmm
from PIL import Image
import numpy as np
model = kvmm.models.segformer.SegFormerB0(weights="ade20k_512")

image = Image.open("ADE_train_00000586.jpg")
processed_img = kvmm.models.segformer.SegFormerImageProcessor(
    image=image,
    do_resize=True,
    size={"height": 512, "width": 512},
    do_rescale=True,
    do_normalize=True,
)

outs = model.predict(processed_img)
outs = np.argmax(outs[0], axis=-1)
visualize_segmentation(outs, image)  # user-supplied helper; see the sketch below
```
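
`visualize_segmentation` above is a user-supplied helper rather than part of kvmm; a minimal matplotlib-based sketch might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def visualize_segmentation(mask, image, alpha=0.5):
    """Overlay a predicted class-index mask on the original image (hypothetical helper)."""
    mask = Image.fromarray(mask.astype(np.uint8)).resize(image.size, Image.NEAREST)
    plt.figure(figsize=(8, 8))
    plt.imshow(image)
    plt.imshow(np.array(mask), cmap="tab20", alpha=alpha)
    plt.axis("off")
    plt.show()
```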

<h3><b>VLMs</b></h3>
#### 🛠️ Basic Usage
```python
import keras
import kvmm
processor = kvmm.models.clip.CLIPProcessor()
model = kvmm.models.clip.ClipVitBase16(
    weights="openai_224",
    input_shape=(224, 224, 3),  # You can fine-tune or infer with a different size
)
inputs = processor(text=["mountains", "tortoise", "cat"], image_paths="cat1.jpg")
output = model(
    {
        "images": inputs["images"],
        "token_ids": inputs["input_ids"],
        "padding_mask": inputs["attention_mask"],
    }
)
print("Raw Model Output:")
print(output)
preds = keras.ops.softmax(output["image_logits"]).numpy().squeeze()
result = dict(zip(["mountains", "tortoise", "cat"], preds))
print("\nPrediction probabilities:")
print(result)
# Output:
"""
{'image_logits': <tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[11.042501, 10.388493, 18.414747]], dtype=float32)>,
 'text_logits': <tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[11.042501],
       [10.388493],
       [18.414747]], dtype=float32)>}

Prediction probabilities:
{'mountains': np.float32(0.0006278555), 'tortoise': np.float32(0.000326458), 'cat': np.float32(0.99904567)}
"""
```
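
As a small follow-up (plain Python, no kvmm-specific API), the top label can be read straight from `result`:

```python
best_label = max(result, key=result.get)
print(f"Best match: {best_label} ({result[best_label]:.2%})")
```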
## 📑 Models
- Backbones:

| 🏷️ Model Name | 📜 Reference Paper | 📦 Source of Weights |
|---------------|-------------------|---------------------|
| CaiT | [Going deeper with Image Transformers](https://arxiv.org/abs/2103.17239) | `timm` |
| ConvMixer | [Patches Are All You Need?](https://arxiv.org/abs/2201.09792) | `timm` |
| ConvNeXt | [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) | `timm` |
| ConvNeXt V2 | [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) | `timm` |
| DeiT | [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) | `timm` |
| DenseNet | [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993) | `timm` |
| EfficientNet | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) | `timm` |
| EfficientNet-Lite | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) | `timm` |
| EfficientNetV2 | [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/abs/2104.00298) | `timm` |
| FlexiViT | [FlexiViT: One Model for All Patch Sizes](https://arxiv.org/abs/2212.08013) | `timm` |
| InceptionNeXt | [InceptionNeXt: When Inception Meets ConvNeXt](https://arxiv.org/abs/2303.16900) | `timm` |
| Inception-ResNet-v2 | [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/abs/1602.07261) | `timm` |
| Inception-v3 | [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/abs/1512.00567) | `timm` |
| Inception-v4 | [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/abs/1602.07261) | `timm` |
| MiT | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | `transformers` |
| MLP-Mixer | [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601) | `timm` |
| MobileNetV2 | [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) | `timm` |
| MobileNetV3 | [Searching for MobileNetV3](https://arxiv.org/abs/1905.02244) | `keras` |
| MobileViT | [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) | `timm` |
| MobileViTV2 | [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) | `timm` |
| PiT | [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/abs/2103.16302) | `timm` |
| PoolFormer | [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) | `timm` |
| Res2Net | [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169) | `timm` |
| ResMLP | [ResMLP: Feedforward networks for image classification with data-efficient training](https://arxiv.org/abs/2105.03404) | `timm` |
| ResNet | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) | `timm` |
| ResNetV2 | [Identity Mappings in Deep Residual Networks](https://arxiv.org/abs/1603.05027) | `timm` |
| ResNeXt | [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/abs/1611.05431) | `timm` |
| SENet | [Squeeze-and-Excitation Networks](https://arxiv.org/abs/1709.01507) | `timm` |
| Swin Transformer | [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) | `timm` |
| VGG | [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556) | `timm` |
| ViT | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) | `timm` |
| Xception | [Xception: Deep Learning with Depthwise Separable Convolutions](https://arxiv.org/abs/1610.02357) | `keras` |
<br>
- Segmentation

| 🏷️ Model Name | 📜 Reference Paper | 📦 Source of Weights |
|---------------|-------------------|---------------------|
| SegFormer | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | `transformers`|
<br>
- Vision-Language Models (VLMs)

| 🏷️ Model Name | 📜 Reference Paper | 📦 Source of Weights |
|---------------|-------------------|---------------------|
| CLIP | [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) | `transformers`|
| SigLIP | [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) | `transformers`|
| SigLIP2 | [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://arxiv.org/abs/2502.14786) | `transformers`|
## 📜 License
This project leverages [timm](https://github.com/huggingface/pytorch-image-models#licenses) and [transformers](https://github.com/huggingface/transformers#license) for converting pretrained weights from PyTorch to Keras. For licensing details, please refer to the respective repositories.
- 🔖 **kvmm Code**: This repository is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
## 🌟 Credits
- The [Keras](https://github.com/keras-team/keras) team for their powerful and user-friendly deep learning framework
- The [Transformers](https://github.com/huggingface/transformers) library for its robust tools for loading and adapting pretrained models
- The [pytorch-image-models (timm)](https://github.com/huggingface/pytorch-image-models) project for pioneering many computer vision model implementations
- All contributors to the original papers and architectures implemented in this library
## Citing
### BibTeX
```bibtex
@misc{gc2025kvmm,
  author       = {Gitesh Chawda},
  title        = {Keras Vision Models},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/IMvision12/keras-vision-models}}
}
```