[![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)
# Progen
Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation"

A GPT-style language model for protein sequences
[Paper Link](https://arxiv.org/pdf/2004.03497.pdf)
# Appreciation
* Lucidrains
* Agorians
# Install
`pip install progen-torch`
# Usage
```python
import torch
from progen.model import ProGen
# A batch of random token ids: (batch_size=1, seq_len=1024)
x = torch.randint(0, 100, (1, 1024))
# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,       # The size of the vocabulary
    dim=512,              # The dimension of the embeddings
    seq_len=1024,         # The length of the sequences
    depth=6,              # The number of layers in the model
    window_size=256,      # The size of the window for local attention
    global_mlp_depth=2,   # The depth of the MLP in the global attention mechanism
    heads=8,              # The number of attention heads
    dim_head=512,         # The dimension of each attention head
    ff_mult=4,            # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,          # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,        # The dimension of the attention mechanism (None defaults to `dim`)
    clamp_gate=True,      # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,    # Whether to shift tokens for the causal attention mechanism
    dropout=0.1,          # The dropout rate
)
# Forward pass through the model
logits = model(x)
# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape) # Should print: torch.Size([1, 1024, 100])
```
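
Since the package exposes the forward pass shown above, new sequences can be generated with an ordinary autoregressive sampling loop. The sketch below is a minimal example of that idea; it reuses only the `ProGen` constructor and forward pass from the usage example, and the `sample` helper is a hypothetical illustration, not part of the package API.

```python
import torch
import torch.nn.functional as F
from progen.model import ProGen

# Same configuration as the usage example above
model = ProGen(
    num_tokens=100, dim=512, seq_len=1024, depth=6, window_size=256,
    global_mlp_depth=2, heads=8, dim_head=512, ff_mult=4, ff_glu=True,
    attn_dim=None, clamp_gate=True, shift_tokens=True, dropout=0.1,
)

# Hypothetical helper (not part of the ProGen API): generate tokens one at a
# time by feeding the growing sequence back through the model and sampling
# from the softmax over the last position's logits.
@torch.no_grad()
def sample(model, prompt, steps=32, temperature=1.0):
    model.eval()
    seq = prompt                                   # (batch, prompt_len)
    for _ in range(steps):
        logits = model(seq)                        # (batch, len, num_tokens)
        next_logits = logits[:, -1, :] / temperature
        probs = F.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_token], dim=1)  # append the sampled token
    return seq

prompt = torch.randint(0, 100, (1, 16))  # a short random token prefix
generated = sample(model, prompt, steps=32)
print(generated.shape)  # torch.Size([1, 48])
```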
# Dataset Strategy
Here is a table of the datasets used in the paper, with descriptions and source links:
| Dataset | Description | Source |
|-|-|-|
| Uniparc | Contains protein sequences from various sources | https://www.uniprot.org/uniparc/ |
| UniprotKB | Contains protein sequences and annotations | https://www.uniprot.org/uniprot/ |
| SWISS-PROT | Curated protein sequence database | https://www.uniprot.org/swiss-prot/ |
| TrEMBL | Computer-annotated protein sequences | https://www.uniprot.org/trembl/ |
| Pfam | Database of protein families | https://pfam.xfam.org/ |
| NCBI taxonomy | Taxonomic classification of organisms | https://www.ncbi.nlm.nih.gov/taxonomy |
Here is a diagram showing the data preprocessing flow:
```mermaid
graph TD
A[Uniparc] --> B[Filter and merge]
C[UniprotKB] --> B
D[SWISS-PROT] --> B
E[TrEMBL] --> B
F[Pfam] --> B
G[NCBI taxonomy] --> B
B --> H[Train/test split]
H --> I[Train set]
H --> J[ID test set]
H --> K[OOD test set]
```
The Uniparc, UniprotKB, SWISS-PROT, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split into training, in-distribution test, and out-of-distribution test sets in step H.
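
To make the flow concrete, here is a small illustrative sketch of the filter/merge/split steps over toy in-memory records. The field names (`sequence`, `source`, `family`) and the family-level hold-out rule are assumptions chosen for illustration, not the paper's exact procedure.

```python
import random

# Toy records standing in for the merged databases above
raw_records = [
    {"sequence": "MKTAYIAKQR", "source": "SWISS-PROT", "family": "PF00001"},
    {"sequence": "MVLSPADKTN", "source": "TrEMBL",     "family": "PF00001"},
    {"sequence": "MALWMRLLPL", "source": "TrEMBL",     "family": "PF00002"},
    {"sequence": "GSHMRGS",    "source": "Uniparc",    "family": None},
]

# Step B: filter (here, drop records without a family label) and merge,
# de-duplicating on the sequence itself.
merged = {r["sequence"]: r for r in raw_records if r["family"] is not None}
records = list(merged.values())

# Step H: hold out whole families for the OOD test set, then split the
# remaining records into train and in-distribution (ID) test sets.
held_out_families = {"PF00002"}
ood_test = [r for r in records if r["family"] in held_out_families]
in_dist = [r for r in records if r["family"] not in held_out_families]

random.shuffle(in_dist)
split = int(0.9 * len(in_dist))
train, id_test = in_dist[:split], in_dist[split:]

print(len(train), len(id_test), len(ood_test))
```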
# Architecture
# Todo
# License
MIT
# Citations