[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# Kosmos-2
My personal implementation of Kosmos-2 from the paper *Kosmos-2: Grounding Multimodal Large Language Models to the World*, with a much simpler codebase.
# Install
`pip3 install qwen`
---
# Usage
```python
import torch
from qwen.model import QwenVL

# Random image tensor (batch, channels, height, width) and random
# caption token ids stand in for real inputs
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

model = QwenVL()
output = model(img, caption)
print(output.shape)
```
----
# Inference
```python
from qwen.inference import QwenVLChat
qwen_chat = QwenVLChat(model_name="Qwen/Qwen-VL-Chat", device_map="cuda")
response = qwen_chat.chat([
{"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
{"text": "这是什么?"}
])
print(response)
```
# Training
* [A table of all the datasets used in the paper is available here](docs/datasets.md)
```python
import os

import torch
import torch.distributed as dist
from qwen.train import Train

def train():
    # Rendezvous address/port of the rank-0 process
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '9994'
    # [CRITICAL] Adjust these when scaling to multiple GPUs and clusters
    os.environ['RANK'] = str(0)  # Rank of this node
    os.environ['WORLD_SIZE'] = str(torch.cuda.device_count())
    dist.init_process_group(backend='nccl', init_method='env://')
    Train()
if __name__ == '__main__':
train()
```
1. Set the environment variables:
    - `ENTITY_NAME`: Your wandb project name
    - `OUTPUT_DIR`: Directory to save the weights (e.g., `./weights`)
    - `MASTER_ADDR`: Address of the master node for distributed training
    - `MASTER_PORT`: Port used by the master node for distributed training
    - `RANK`: Rank of this node among the nodes (servers)
    - `WORLD_SIZE`: Total number of GPUs
2. Configure the training:
    - Run `accelerate config`
    - Enable DeepSpeed stage 3
    - Launch with `accelerate launch train_distributed_accelerate.py`
For more information, refer to the [Training SOP](DOCs/TRAINING.md).
----
# Todo
- [ ] Position-aware vision-language adapter that compresses image features (see the first sketch below). A single-layer cross-attention module, initialized randomly, uses a group of trainable embeddings as query vectors and the image features from the visual encoder as keys. It compresses the visual feature sequence to a fixed length of 256, with 2D absolute positional encodings integrated into the cross-attention's query-key pairs. The compressed feature sequence of length 256 is then fed into the decoder LLM.
- [ ] Bounding boxes (see the second sketch below). For any given bounding box, a normalization process maps its coordinates into the range [0, 1000], and the box is transformed into the string format `(X_topleft, Y_topleft)(X_bottomright, Y_bottomright)`. The string is tokenized as ordinary text and does not require a positional vocabulary. To distinguish detection strings from regular text strings, two special tokens `<box>` and `</box>` are added to the beginning and end of the bounding box string, and another set of special tokens (`<ref>` and `</ref>`) is introduced to mark the text the box refers to.
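Below are two hedged sketches of the items above. First, a minimal PyTorch sketch of the position-aware adapter; the class name, dimensions, and grid size are illustrative assumptions rather than this repo's actual API, and the 2D positional encodings are added to the keys only for brevity:

```python
import torch
import torch.nn as nn


class PositionAwareAdapter(nn.Module):
    """Hypothetical sketch: compress a variable-length visual feature
    sequence to a fixed length of 256 with single-layer cross-attention."""

    def __init__(self, dim=1024, num_queries=256, num_heads=8, grid_size=16):
        super().__init__()
        # Group of trainable embeddings used as the query vectors
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Randomly initialized single-layer cross-attention module
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 2D absolute positional encodings for the image feature grid,
        # flattened to (grid_size * grid_size, dim)
        self.pos = nn.Parameter(torch.randn(grid_size * grid_size, dim))

    def forward(self, image_feats):
        # image_feats: (batch, seq_len, dim) from the visual encoder
        b, n, _ = image_feats.shape
        keys = image_feats + self.pos[:n]         # inject 2D positions
        queries = self.queries.expand(b, -1, -1)  # (batch, 256, dim)
        compressed, _ = self.attn(queries, keys, keys)
        return compressed  # fixed-length sequence fed into the decoder LLM
```

Second, a sketch of the bounding-box string encoding; `box_to_string` is a hypothetical helper, not a function exported by this package:

```python
def box_to_string(box, img_w, img_h):
    """Hypothetical helper: normalize a pixel-space box into [0, 1000] and
    render it as (X_topleft, Y_topleft)(X_bottomright, Y_bottomright),
    wrapped in the <box>...</box> special tokens."""
    x0, y0, x1, y1 = box
    nx0, nx1 = round(1000 * x0 / img_w), round(1000 * x1 / img_w)
    ny0, ny1 = round(1000 * y0 / img_h), round(1000 * y1 / img_h)
    return f"<box>({nx0},{ny0})({nx1},{ny1})</box>"


# Example: box_to_string((128, 64, 384, 320), 512, 512)
# -> '<box>(250,125)(750,625)</box>'
```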
# Citations
Please use the following to cite this work:
```latex
@article{bai2023qwen,
title={Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023},
url={https://doi.org/10.48550/arXiv.2308.12966}
}
```
For more details, please refer to the [full paper](https://doi.org/10.48550/arXiv.2308.12966).