kvmm


Name: kvmm
Version: 0.1.8
Summary: Pretrained Keras 3 vision models
Author email: Gitesh Chawda <gitesh.ch.0912@gmail.com>
Homepage / repository: https://github.com/IMvision12/keras-vision-models
Upload time: 2025-08-04 20:30:58
Requires Python: >=3.11
License: Apache License 2.0
Keywords: machine-learning, jax, computer-vision, neural-networks, tensorflow, torch, deep-learning, keras, imagenet, pretrained-weights, convolutional-neural-networks, transfer-learning, python-ml, data-science, ai-research, vision-transformer, image-classification, model-training, pytorch
Requirements: tensorflow-cpu (~=2.18.1), tensorflow (~=2.18.1), tf_keras, tf2onnx, torch (==2.6.0), torch-xla (==2.6.0), jax (==0.5.0), flax, keras (>=3.10.0), pillow, sentencepiece
# KVMM: Keras Vision Models πŸš€

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Keras](https://img.shields.io/badge/keras-v3.5.0+-success.svg)](https://github.com/keras-team/keras)
![Python](https://img.shields.io/badge/python-v3.11.0+-success.svg)

## πŸ“Œ Table of Contents

- πŸ“– [Introduction](#introduction)
- ⚑ [Installation](#installation)
- πŸ› οΈ [Usage](#usage)
- πŸ“‘ [Models](#models)
- πŸ“œ [License](#license)
- 🌟 [Credits](#Credits)

## πŸ“– Introduction

Keras Vision Models (KVMM) is a collection of vision models with pretrained weights, built entirely with Keras 3. It supports a range of tasks, including segmentation, object detection, vision-language models (VLMs), and classification. KVMM includes custom layers and backbone support, providing flexibility and efficiency across various vision applications. For backbones, there are various weight variants such as `in1k`, `in21k`, `fb_dist_in1k`, `ms_in22k`, `fb_in22k_ft_in1k`, `ns_jft_in1k`, `aa_in1k`, `cvnets_in1k`, `augreg_in21k_ft_in1k`, `augreg_in21k`, and many more.

## ⚑ Installation

From PyPI (recommended)

```shell
pip install -U kvmm
```

From Source

```shell
pip install -U git+https://github.com/IMvision12/keras-vision-models
```
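
Since KVMM is built on Keras 3, it should run on any Keras backend (JAX, TensorFlow, or PyTorch). A minimal sketch, assuming the standard Keras `KERAS_BACKEND` environment variable rather than any kvmm-specific switch:

```python
import os

# Pick the Keras 3 backend before keras/kvmm are imported;
# valid values are "jax", "tensorflow", and "torch".
os.environ["KERAS_BACKEND"] = "jax"

import kvmm  # imported after setting the backend on purpose

print(kvmm.list_models("resnet"))
```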

## πŸ› οΈ Usage

<h3><b>πŸ”Ž Listing Available Models</b></h3>

`kvmm.list_models()` shows all available models, including backbones, segmentation models, object detection models, and vision-language models (VLMs), along with the names of the weights available for each model variant.
    
```python
import kvmm
print(kvmm.list_models())

# Output:
"""
CaiTM36 : fb_dist_in1k_384
CaiTM48 : fb_dist_in1k_448
CaiTS24 : fb_dist_in1k_224, fb_dist_in1k_384
...
ConvMixer1024D20 : in1k
ConvMixer1536D20 : in1k
...
ConvNeXtAtto : d2_in1k
ConvNeXtBase : fb_in1k, fb_in22k, fb_in22k_ft_in1k, fb_in22k_ft_in1k_384
...
"""
```
<h3><b>πŸ”Ž List Specific Model Variant</b></h3>

```python
import kvmm
print(kvmm.list_models("swin"))

# Output:
"""
SwinBaseP4W12 : ms_in1k, ms_in22k, ms_in22k_ft_in1k
SwinBaseP4W7 : ms_in1k, ms_in22k, ms_in22k_ft_in1k
SwinLargeP4W12 : ms_in22k, ms_in22k_ft_in1k
SwinLargeP4W7 : ms_in22k, ms_in22k_ft_in1k
SwinSmallP4W7 : ms_in1k, ms_in22k, ms_in22k_ft_in1k
SwinTinyP4W7 : ms_in1k, ms_in22k
"""
```

<h3><b>βš™οΈ Layers </b></h3>

KVMM provides various custom layers such as `StochasticDepth`, `LayerScale`, `EfficientMultiheadSelfAttention`, and more. These layers can be seamlessly integrated into your custom models and workflows πŸš€

```python
import keras
import kvmm

# Example 1: StochasticDepth (drop-path) applied to a dummy tensor
input_tensor = keras.random.normal((1, 197, 192))  # illustrative shape
layer = kvmm.layers.StochasticDepth(drop_path_rate=0.1)
output = layer(input_tensor, training=True)

# Example 2: partition a flattened feature map into 7x7 windows
features = keras.random.normal((1, 28 * 28, 96))  # (batch, height*width, channels)
window_partition = kvmm.layers.WindowPartition(window_size=7)
windowed_features = window_partition(features, height=28, width=28)
```
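
As a quick illustration of dropping one of these layers into your own architecture, here is a minimal sketch of a toy residual block built around `kvmm.layers.StochasticDepth` (the layer widths and drop rate are arbitrary choices, not kvmm defaults):

```python
import keras
import kvmm

def residual_block(x, units, drop_path_rate=0.1):
    """Toy residual block whose entire branch can be dropped by StochasticDepth."""
    shortcut = x
    h = keras.layers.Dense(units, activation="gelu")(x)
    h = keras.layers.Dense(units)(h)
    # During training, the whole residual branch is randomly dropped
    h = kvmm.layers.StochasticDepth(drop_path_rate=drop_path_rate)(h)
    return keras.layers.Add()([shortcut, h])

inputs = keras.Input((197, 192))
outputs = residual_block(inputs, 192)
model = keras.Model(inputs, outputs)
model.summary()
```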

<h3><b>πŸ—οΈ Backbone Usage (Classification) </b></h3>

#### πŸ› οΈ Basic Usage
```python
import kvmm
import numpy as np

# default configuration
model = kvmm.models.vit.ViTTiny16()

# For Fine-Tuning (default weight)
model = kvmm.models.vit.ViTTiny16(include_top=False, input_shape=(224,224,3))
# Custom Weight
model = kvmm.models.vit.ViTTiny16(include_top=False, input_shape=(224,224,3), weights="augreg_in21k_224")

# Backbone Support
model = kvmm.models.vit.ViTTiny16(include_top=False, as_backbone=True, input_shape=(224,224,3), weights="augreg_in21k_224")
random_input = np.random.rand(1, 224, 224, 3).astype(np.float32)
features = model(random_input)
print(f"Number of feature maps: {len(features)}")
for i, feature in enumerate(features):
    print(f"Feature {i} shape: {feature.shape}")

"""
Output:

Number of feature maps: 13
Feature 0 shape: (1, 197, 192)
Feature 1 shape: (1, 197, 192)
Feature 2 shape: (1, 197, 192)
...
"""    
```
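
Building on the backbone output above, the following is a hedged fine-tuning sketch (not taken from the kvmm docs): it pools the last feature map and adds a new classification head. `NUM_CLASSES`, the pooling layer, and the training setup are illustrative assumptions.

```python
import keras
import kvmm

NUM_CLASSES = 10  # hypothetical target dataset size

backbone = kvmm.models.vit.ViTTiny16(
    include_top=False,
    as_backbone=True,
    input_shape=(224, 224, 3),
    weights="augreg_in21k_224",
)

inputs = keras.Input((224, 224, 3))
features = backbone(inputs)                               # list of feature maps (see output above)
x = keras.layers.GlobalAveragePooling1D()(features[-1])   # pool over the 197 tokens
outputs = keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```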

#### Example Inference

```python
from keras import ops
from keras.applications.imagenet_utils import decode_predictions
import kvmm
from PIL import Image

model = kvmm.models.swin.SwinTinyP4W7(input_shape=[224, 224, 3])

image = Image.open("bird.png").resize((224, 224))
x = ops.convert_to_tensor(image)
x = ops.expand_dims(x, axis=0)

# Predict
preds = model.predict(x)
print("Predicted:", decode_predictions(preds, top=3)[0])

# Output:
# Predicted: [('n01537544', 'indigo_bunting', np.float32(0.9135666)), ('n01806143', 'peacock', np.float32(0.0003379386)), ('n02017213', 'European_gallinule', np.float32(0.00027174334))]
```

<h3><b>🧩 Segmentation </b></h3>

#### πŸ› οΈ Basic Usage
 
```python
import kvmm

# Pre-trained weights: "cityscapes", "ade20k", or "mit" (in1k)
# The ade20k and cityscapes weights can be fine-tuned by passing a custom `num_classes`
# If `num_classes` is not specified, it defaults to 150 for ade20k and 19 for cityscapes
model = kvmm.models.segformer.SegFormerB0(weights="ade20k", input_shape=(512,512,3))
model = kvmm.models.segformer.SegFormerB0(weights="cityscapes", input_shape=(512,512,3))

# Fine-Tune using `MiT` backbone (This will load `in1k` weights)
model = kvmm.models.segformer.SegFormerB0(weights="mit", input_shape=(512,512,3))
```
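
For instance, a minimal fine-tuning sketch on a hypothetical 5-class dataset, reusing the `ade20k` weights with a custom `num_classes` (the training setup is an assumption, not part of kvmm):

```python
import kvmm

# Reuse ade20k pretrained weights but replace the 150-class head with 5 classes
model = kvmm.models.segformer.SegFormerB0(
    weights="ade20k",
    num_classes=5,
    input_shape=(512, 512, 3),
)

# Loss/optimizer are illustrative; adjust to your label format
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```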

#### πŸš€ Custom Backbone Support

```python
import kvmm

# With no backbone weights
backbone = kvmm.models.resnet.ResNet50(as_backbone=True, weights=None, include_top=False, input_shape=(224,224,3))
segformer = kvmm.models.segformer.SegFormerB0(weights=None, backbone=backbone, num_classes=10, input_shape=(224,224,3))

# With backbone weights
backbone = kvmm.models.resnet.ResNet50(as_backbone=True, weights="tv_in1k", include_top=False, input_shape=(224,224,3))
segformer = kvmm.models.segformer.SegFormerB0(weights=None, backbone=backbone, num_classes=10, input_shape=(224,224,3))
```

#### πŸš€ Example Inference

```python
import kvmm
from PIL import Image
import numpy as np

model = kvmm.models.segformer.SegFormerB0(weights="ade20k_512")

image = Image.open("ADE_train_00000586.jpg")
processed_img = kvmm.models.segformer.SegFormerImageProcessor(image=image,
    do_resize=True,
    size={"height": 512, "width": 512},
    do_rescale=True,
    do_normalize=True)
outs = model.predict(processed_img)
outs = np.argmax(outs[0], axis=-1)
visualize_segmentation(outs, image)  # user-defined helper; see the sketch below
```
![output](images/seg_output.png)
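
`visualize_segmentation` above is a user-defined plotting helper rather than part of kvmm; a minimal matplotlib-based sketch might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

def visualize_segmentation(mask, image, alpha=0.5):
    """Overlay a (height, width) class-index mask on the original image."""
    image = np.array(image.resize((mask.shape[1], mask.shape[0])))
    plt.imshow(image)
    plt.imshow(mask, cmap="tab20", alpha=alpha)
    plt.axis("off")
    plt.show()
```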


<h3><b>Vision-Language Models (VLMs)</b></h3>

#### πŸ› οΈ Basic Usage

```python
import keras

import kvmm

processor = kvmm.models.clip.CLIPProcessor()
model = kvmm.models.clip.ClipVitBase16(
    weights="openai_224",
    input_shape=(224, 224, 3), # You can fine-tune or infer with variable size 
)
inputs = processor(text=["mountains", "tortoise", "cat"], image_paths="cat1.jpg")
output = model(
    {
        "images": inputs["images"],
        "token_ids": inputs["input_ids"],
        "padding_mask": inputs["attention_mask"],
    }
)

print("Raw Model Output:")
print(output)

preds = keras.ops.softmax(output["image_logits"]).numpy().squeeze()
result = dict(zip(["mountains", "tortoise", "cat"], preds))
print("\nPrediction probabilities:")
print(result)

#output:
"""{'image_logits': <tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[11.042501, 10.388493, 18.414747]], dtype=float32)>, 'text_logits': <tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[11.042501],
       [10.388493],
       [18.414747]], dtype=float32)>}

Prediction probabilities:
{'mountains': np.float32(0.0006278555), 'tortoise': np.float32(0.000326458), 'cat': np.float32(0.99904567)}"""
```
## πŸ“‘ Models

- Backbones:

    | 🏷️ Model Name | πŸ“œ Reference Paper | πŸ“¦ Source of Weights |
    |---------------|-------------------|---------------------|
    | CaiT | [Going deeper with Image Transformers](https://arxiv.org/abs/2103.17239) | `timm` |
    | ConvMixer | [Patches Are All You Need?](https://arxiv.org/abs/2201.09792) | `timm` |
    | ConvNeXt | [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) | `timm` |
    | ConvNeXt V2 | [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) | `timm` |
    | DeiT | [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) | `timm` |
    | DenseNet | [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993) | `timm` |
    | EfficientNet | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) | `timm` |
    | EfficientNet-Lite | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) | `timm` |
    | EfficientNetV2 | [EfficientNetV2: Smaller Models and Faster Training](https://arxiv.org/abs/2104.00298) | `timm` |
    | FlexiViT | [FlexiViT: One Model for All Patch Sizes](https://arxiv.org/abs/2212.08013) | `timm` |
    | InceptionNeXt | [InceptionNeXt: When Inception Meets ConvNeXt](https://arxiv.org/abs/2303.16900) | `timm` |
    | Inception-ResNet-v2 | [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/abs/1602.07261) | `timm` |
    | Inception-v3 | [Rethinking the Inception Architecture for Computer Vision](https://arxiv.org/abs/1512.00567) | `timm` |
    | Inception-v4 | [Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](https://arxiv.org/abs/1602.07261) | `timm` |
    | MiT | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | `transformers` |
    | MLP-Mixer | [MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/abs/2105.01601) | `timm` |
    | MobileNetV2 | [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) | `timm` |
    | MobileNetV3 | [Searching for MobileNetV3](https://arxiv.org/abs/1905.02244) | `keras` |
    | MobileViT | [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) | `timm` |
    | MobileViTV2 | [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) | `timm` |
    | PiT | [Rethinking Spatial Dimensions of Vision Transformers](https://arxiv.org/abs/2103.16302) | `timm` |
    | PoolFormer | [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) | `timm` |
    | Res2Net | [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169) | `timm` |
    | ResMLP | [ResMLP: Feedforward networks for image classification with data-efficient training](https://arxiv.org/abs/2105.03404) | `timm` |
    | ResNet | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) | `timm` |
    | ResNetV2 | [Identity Mappings in Deep Residual Networks](https://arxiv.org/abs/1603.05027) | `timm` |
    | ResNeXt | [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/abs/1611.05431) | `timm` |
    | SENet | [Squeeze-and-Excitation Networks](https://arxiv.org/abs/1709.01507) | `timm` |
    | Swin Transformer | [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) | `timm` |
    | VGG | [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556) | `timm` |
    | ViT | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) | `timm` |
    | Xception | [Xception: Deep Learning with Depthwise Separable Convolutions](https://arxiv.org/abs/1610.02357) | `keras` |

<br>

- Segmentation

    | 🏷️ Model Name | πŸ“œ Reference Paper | πŸ“¦ Source of Weights |
    |---------------|-------------------|---------------------|
    | SegFormer | [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) | `transformers`|

<br>

- Vision-Language-Models (VLMs)

    | 🏷️ Model Name | πŸ“œ Reference Paper | πŸ“¦ Source of Weights |
    |---------------|-------------------|---------------------|
    | CLIP | [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) | `transformers`|
    | SigLIP | [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) | `transformers`|
    | SigLIP2 | [SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features](https://arxiv.org/abs/2502.14786) | `transformers`|
  
## πŸ“œ License

This project leverages [timm](https://github.com/huggingface/pytorch-image-models#licenses) and [transformers](https://github.com/huggingface/transformers#license) for converting pretrained weights from PyTorch to Keras. For licensing details, please refer to the respective repositories.

- πŸ”– **kvmm Code**: This repository is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).


## 🌟 Credits

- The [Keras](https://github.com/keras-team/keras) team for their powerful and user-friendly deep learning framework
- The [Transformers](https://github.com/huggingface/transformers) library for its robust tools for loading and adapting pretrained models  
- The [pytorch-image-models (timm)](https://github.com/huggingface/pytorch-image-models) project for pioneering many computer vision model implementations
- All contributors to the original papers and architectures implemented in this library

## Citing

### BibTeX

```bibtex
@misc{gc2025kvmm,
  author = {Gitesh Chawda},
  title = {Keras Vision Models},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/IMvision12/keras-vision-models}}
}
```

            
