[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# Kosmos-2
My personal implementation of Kosmos-2 from the paper *Kosmos-2: Grounding Multimodal Large Language Models to the World*, with a much simpler codebase.
# Install
`pip3 install qwen`
---
# Usage
```python
import torch
from qwen.model import QwenVL

# Random image tensor (batch, channels, height, width) and random
# caption token ids stand in for real inputs
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

model = QwenVL()
output = model(img, caption)
print(output.shape)
```
----
# Inference
```python
from qwen.inference import QwenVLChat
qwen_chat = QwenVLChat(model_name="Qwen/Qwen-VL-Chat", device_map="cuda")
response = qwen_chat.chat([
{"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
{"text": "这是什么?"}
])
print(response)
```
# Training
* [A table of all the datasets used in the paper is available here](docs/datasets.md)
```python
import os

import torch
import torch.distributed as dist
from qwen.train import Train

def train():
    # Rendezvous address/port of the rank-0 process
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '9994'
    # [CRITICAL] Adjust these when scaling to multiple GPUs and clusters
    os.environ['RANK'] = str(0)  # Rank of this node
    os.environ['WORLD_SIZE'] = str(torch.cuda.device_count())
    dist.init_process_group(backend='nccl', init_method='env://')
    Train()
if __name__ == '__main__':
train()
```
1. Set the environment variables:
    - `ENTITY_NAME`: Your wandb project name
    - `OUTPUT_DIR`: Directory to save the weights (e.g., `./weights`)
    - `MASTER_ADDR`: Address of the master node for distributed training
    - `MASTER_PORT`: Port used by the master node for distributed training
    - `RANK`: Rank of this node among the nodes (servers)
    - `WORLD_SIZE`: Total number of GPUs
2. Configure the training:
    - Run `accelerate config`
    - Enable DeepSpeed stage 3
    - Launch with `accelerate launch train_distributed_accelerate.py`
For more information, refer to the [Training SOP](DOCs/TRAINING.md).
----
# Todo
- [ ] Position-aware vision-language adapter that compresses image features (see the first sketch below). A single-layer cross-attention module, initialized randomly, uses a group of trainable embeddings as query vectors and the image features from the visual encoder as keys. It compresses the visual feature sequence to a fixed length of 256, with 2D absolute positional encodings integrated into the cross-attention's query-key pairs. The compressed feature sequence of length 256 is then fed into the decoder LLM.
- [ ] Bounding boxes (see the second sketch below). For any given bounding box, a normalization process maps its coordinates into the range [0, 1000], and the box is transformed into the string format `(X_topleft, Y_topleft)(X_bottomright, Y_bottomright)`. The string is tokenized as ordinary text and does not require a positional vocabulary. To distinguish detection strings from regular text strings, two special tokens `<box>` and `</box>` are added to the beginning and end of the bounding box string, and another set of special tokens (`<ref>` and `</ref>`) is introduced to mark the text the box refers to.
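Below are two hedged sketches of the items above. First, a minimal PyTorch sketch of the position-aware adapter; the class name, dimensions, and grid size are illustrative assumptions rather than this repo's actual API, and the 2D positional encodings are added to the keys only for brevity:

```python
import torch
import torch.nn as nn


class PositionAwareAdapter(nn.Module):
    """Hypothetical sketch: compress a variable-length visual feature
    sequence to a fixed length of 256 with single-layer cross-attention."""

    def __init__(self, dim=1024, num_queries=256, num_heads=8, grid_size=16):
        super().__init__()
        # Group of trainable embeddings used as the query vectors
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Randomly initialized single-layer cross-attention module
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 2D absolute positional encodings for the image feature grid,
        # flattened to (grid_size * grid_size, dim)
        self.pos = nn.Parameter(torch.randn(grid_size * grid_size, dim))

    def forward(self, image_feats):
        # image_feats: (batch, seq_len, dim) from the visual encoder
        b, n, _ = image_feats.shape
        keys = image_feats + self.pos[:n]         # inject 2D positions
        queries = self.queries.expand(b, -1, -1)  # (batch, 256, dim)
        compressed, _ = self.attn(queries, keys, keys)
        return compressed  # fixed-length sequence fed into the decoder LLM
```

Second, a sketch of the bounding-box string encoding; `box_to_string` is a hypothetical helper, not a function exported by this package:

```python
def box_to_string(box, img_w, img_h):
    """Hypothetical helper: normalize a pixel-space box into [0, 1000] and
    render it as (X_topleft, Y_topleft)(X_bottomright, Y_bottomright),
    wrapped in the <box>...</box> special tokens."""
    x0, y0, x1, y1 = box
    nx0, nx1 = round(1000 * x0 / img_w), round(1000 * x1 / img_w)
    ny0, ny1 = round(1000 * y0 / img_h), round(1000 * y1 / img_h)
    return f"<box>({nx0},{ny0})({nx1},{ny1})</box>"


# Example: box_to_string((128, 64, 384, 320), 512, 512)
# -> '<box>(250,125)(750,625)</box>'
```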
# Citations
Please use the following to cite this work:
```latex
@article{bai2023qwen,
title={Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023},
url={https://doi.org/10.48550/arXiv.2308.12966}
}
```
For more details, please refer to the [full paper](https://doi.org/10.48550/arXiv.2308.12966).