[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# Qwen-VL
My personal implementation of the model from "Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities". The official model code has not been released yet, so this is a from-scratch reimplementation.
For more details, please refer to the [full paper](https://doi.org/10.48550/arXiv.2308.12966).
# Install
`pip3 install qwen`
---
# Usage
```python
import torch
from qwen import Qwen

# Create an instance of the Qwen model
model = Qwen()

# Random token IDs (vocabulary size 20000) and a random image tensor
text = torch.randint(0, 20000, (1, 1024))
img = torch.randn(1, 3, 256, 256)

# Forward pass: logits over the vocabulary for each text position
out = model(img, text)  # (1, 1024, 20000)
```
# Todo
- [ ] Position-aware vision-language adapter that compresses image features: a single-layer cross-attention module, initialized randomly, uses a group of trainable embeddings as query vectors and the image features from the visual encoder as keys for the cross-attention ops. 2D absolute positional encodings are integrated into the cross-attention's query-key pairs. The output is a visual feature sequence compressed to a fixed length of 256, which is fed into the decoder LLM.
- [ ] Bounding boxes: for any given accurate bounding box, a normalization process maps its coordinates into the range [0, 1000], and the result is transformed into the string format "(X_topleft, Y_topleft)(X_bottomright, Y_bottomright)". The string is tokenized as ordinary text and does not require a positional vocabulary. To distinguish detection strings from regular text strings, two special tokens, <box> and </box>, are added at the beginning and end of the bounding box string, and another set of special tokens (<ref> and </ref>) is introduced.
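The adapter described above could be sketched roughly as follows. This is my own minimal interpretation, not the released implementation: the class name, dimensions, and the choice to add positional encodings to the keys only are assumptions for illustration.

```python
import torch
from torch import nn


class PositionAwareAdapter(nn.Module):
    """Sketch of a position-aware vision-language adapter: a single
    cross-attention layer with a fixed set of trainable query embeddings
    that compresses a variable-length visual feature sequence to a
    fixed length (256 in the paper)."""

    def __init__(self, dim=1024, num_queries=256, heads=8, grid=16):
        super().__init__()
        # Trainable query embeddings, randomly initialized
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # 2D absolute positional encodings for the image feature grid,
        # flattened to (grid * grid, dim)
        self.pos_embed = nn.Parameter(torch.randn(grid * grid, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats):
        # img_feats: (batch, seq_len, dim) from the visual encoder
        b, n, _ = img_feats.shape
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Integrate positional encodings into the keys
        k = img_feats + self.pos_embed[:n]
        out, _ = self.attn(q, k, img_feats)
        return out  # (batch, num_queries, dim) -> fed to the decoder LLM
```

The fixed-length output means the LLM's context cost for the image is constant regardless of input resolution.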
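The bounding-box normalization could look something like the helper below. The function name and exact string punctuation are my assumptions; the [0, 1000] range and the <box>...</box> wrapping follow the description above.

```python
def box_to_string(box, width, height):
    """Normalize a bounding box (pixel coordinates) into [0, 1000] and
    render it as a detection string wrapped in <box>...</box> tokens.
    box is (x_topleft, y_topleft, x_bottomright, y_bottomright)."""
    x1, y1, x2, y2 = box
    nx1 = round(x1 / width * 1000)
    ny1 = round(y1 / height * 1000)
    nx2 = round(x2 / width * 1000)
    ny2 = round(y2 / height * 1000)
    # The string is tokenized as plain text; no extra positional vocabulary
    return f"<box>({nx1},{ny1})({nx2},{ny2})</box>"
```

For example, a box covering the top-left quarter of a 1024x1024 image becomes `<box>(0,0)(500,500)</box>`.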
# Citations
Please use the following to cite this work:
```bibtex
@article{bai2023qwen,
title={Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023},
url={https://doi.org/10.48550/arXiv.2308.12966}
}
```