uce-model

Name	uce-model
Version	0.1.4
Summary	Universal Cell Embedding model for single-cell RNA-seq data
upload_time	2024-02-19 22:59:36
maintainer
docs_url	None
author
requires_python	>=3.8
license	MIT License Copyright (c) 2023 Yanay Rosen, Yusuf Roohani, Jure Leskovec Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	single-cell rna-seq embedding pytorch uce

# Universal Cell Embeddings

This repo includes a PyTorch implementation of the UCE model, built on the [HuggingFace Accelerator](https://huggingface.co/docs/accelerate/package_reference/accelerator), for embedding individual AnnData datasets.

## Installation
UCE can be installed from PyPI using pip:

```sh
python -m pip install uce-model
```

## Embedding a new dataset

To generate an embedding for a new single-cell RNA sequencing dataset in the AnnData format, use the `uce-eval-single-anndata` command. Depending on your environment, you might need to run it as `python -m uce-eval-single-anndata`, so the examples below use that form.

```
python -m uce-eval-single-anndata --adata_path {path_to_anndata} --dir {output_dir} --species {species} --model_loc {model_loc} --batch_size {batch_size}
```

where
- `adata_path`: an h5ad file. The `.X` slot should hold scRNA-seq counts, and `.var_names` should hold gene names, *not Ensembl IDs*.
- `dir`: the working directory where intermediate and final output files are saved. Keeping these files lets later runs skip repeated processing of the same dataset.
- `species`: the species of the dataset you are embedding.
- `model_loc`: the location of the model weights `.torch` file.
- `batch_size`: the per-GPU batch size. For the 33-layer model on an 80 GB GPU, use 25. For the 4-layer model on the same GPU, you can use 100.
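
Because `.var_names` must contain gene names rather than Ensembl IDs, it can help to check a dataset before embedding it. The sketch below is not part of UCE: `fraction_ensembl_ids` is a hypothetical helper, and the regular expression assumes the usual Ensembl gene ID shape (`ENS`, an optional species prefix, `G`, eleven digits, and an optional version suffix).

```python
import re

# Assumed pattern for Ensembl gene IDs, e.g. ENSG00000198851 (human)
# or ENSMUSG00000017167.1 (mouse, with a version suffix).
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def fraction_ensembl_ids(var_names):
    """Return the fraction of names that look like Ensembl gene IDs."""
    names = [str(name) for name in var_names]
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_GENE_ID.match(name))
    return hits / len(names)


# With a loaded AnnData object, this could be called as
# fraction_ensembl_ids(adata.var_names).
gene_symbols = ["CD3E", "MS4A1", "LYZ"]
ensembl_ids = ["ENSG00000198851", "ENSG00000156738", "ENSMUSG00000017167.1"]
print(fraction_ensembl_ids(gene_symbols))
print(fraction_ensembl_ids(ensembl_ids))
```

A value close to 1 suggests the names should be mapped to gene symbols before running `uce-eval-single-anndata`.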

For a sample output on the 10k PBMC dataset, run:
```
python -m uce-eval-single-anndata
```
All necessary model files will be downloaded automatically.


Details on all the options for `uce-eval-single-anndata` can be found by running `python -m uce-eval-single-anndata --help`.

<details>
<summary> Output from <code>uce-eval-single-anndata --help</code> </summary>

```
usage: uce-eval-single-anndata [-h] [--adata_path ADATA_PATH] [--dir DIR] [--species SPECIES] [--filter FILTER]
                               [--skip SKIP] [--model_loc MODEL_LOC] [--batch_size BATCH_SIZE]
                               [--pad_length PAD_LENGTH] [--pad_token_idx PAD_TOKEN_IDX]
                               [--chrom_token_left_idx CHROM_TOKEN_LEFT_IDX]
                               [--chrom_token_right_idx CHROM_TOKEN_RIGHT_IDX] [--cls_token_idx CLS_TOKEN_IDX]
                               [--CHROM_TOKEN_OFFSET CHROM_TOKEN_OFFSET] [--sample_size SAMPLE_SIZE] [--CXG CXG]
                               [--nlayers NLAYERS] [--output_dim OUTPUT_DIM] [--d_hid D_HID] [--token_dim TOKEN_DIM]
                               [--multi_gpu MULTI_GPU] [--spec_chrom_csv_path SPEC_CHROM_CSV_PATH]
                               [--token_file TOKEN_FILE] [--protein_embeddings_dir PROTEIN_EMBEDDINGS_DIR]
                               [--offset_pkl_path OFFSET_PKL_PATH]

Embed a single anndata using UCE.

options:
  -h, --help            show this help message and exit
  --adata_path ADATA_PATH
                        Full path to the anndata you want to embed. (default: None)
  --dir DIR             Working folder where all files will be saved. (default: ./)
  --species SPECIES     Species of the anndata. (default: human)
  --filter FILTER       Additional gene/cell filtering on the anndata. (default: True)
  --skip SKIP           Skip datasets that appear to have already been created. (default: True)
  --model_loc MODEL_LOC
                        Location of the model. (default: None)
  --batch_size BATCH_SIZE
                        Batch size. (default: 25)
  --pad_length PAD_LENGTH
                        Batch size. (default: 1536)
  --pad_token_idx PAD_TOKEN_IDX
                        PAD token index (default: 0)
  --chrom_token_left_idx CHROM_TOKEN_LEFT_IDX
                        Chrom token left index (default: 1)
  --chrom_token_right_idx CHROM_TOKEN_RIGHT_IDX
                        Chrom token right index (default: 2)
  --cls_token_idx CLS_TOKEN_IDX
                        CLS token index (default: 3)
  --CHROM_TOKEN_OFFSET CHROM_TOKEN_OFFSET
                        Offset index, tokens after this mark are chromosome identifiers (default: 143574)
  --sample_size SAMPLE_SIZE
                        Number of genes sampled for cell sentence (default: 1024)
  --CXG CXG             Use CXG model. (default: True)
  --nlayers NLAYERS     Number of transformer layers. (default: 4)
  --output_dim OUTPUT_DIM
                        Output dimension. (default: 1280)
  --d_hid D_HID         Hidden dimension. (default: 5120)
  --token_dim TOKEN_DIM
                        Token dimension. (default: 5120)
  --multi_gpu MULTI_GPU
                        Use multiple GPUs (default: False)
  --spec_chrom_csv_path SPEC_CHROM_CSV_PATH
                        CSV Path for species genes to chromosomes and start locations. (default:
                        ./model_files/species_chrom.csv)
  --token_file TOKEN_FILE
                        Path for token embeddings. (default: ./model_files/all_tokens.torch)
  --protein_embeddings_dir PROTEIN_EMBEDDINGS_DIR
                        Directory where protein embedding .pt files are stored. (default:
                        ./model_files/protein_embeddings/)
  --offset_pkl_path OFFSET_PKL_PATH
                        PKL file which contains offsets for each species. (default:
                        ./model_files/species_offsets.pkl)
```

</details>
<br/>


**Note**: This script uses additional files, which are described in the code documentation. They are downloaded automatically unless already present in the working directory. The script defaults to the pretrained 4-layer model. To run the pretrained 33-layer model from the paper, download it from this [link](https://figshare.com/articles/dataset/Universal_Cell_Embedding_Model_Files/24320806?file=43423236) and set `--nlayers 33`.
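
When switching between the 4-layer and 33-layer models, it can be convenient to assemble the command from Python. The following is a minimal sketch, not part of UCE: `build_uce_command` is a hypothetical helper, the file paths are placeholders, and the batch sizes are the ones suggested above for an 80 GB GPU. Only flags listed in the `--help` output are used.

```python
import shlex

# Batch sizes suggested in this README for an 80 GB GPU, keyed by layer count.
SUGGESTED_BATCH_SIZE = {4: 100, 33: 25}


def build_uce_command(adata_path, out_dir, model_loc=None,
                      species="human", nlayers=4, batch_size=None):
    """Build the argument list for uce-eval-single-anndata (hypothetical helper)."""
    if batch_size is None:
        # Fall back to 25, the CLI default, for other layer counts.
        batch_size = SUGGESTED_BATCH_SIZE.get(nlayers, 25)
    cmd = [
        "python", "-m", "uce-eval-single-anndata",
        "--adata_path", str(adata_path),
        "--dir", str(out_dir),
        "--species", species,
        "--nlayers", str(nlayers),
        "--batch_size", str(batch_size),
    ]
    if model_loc is not None:
        cmd += ["--model_loc", str(model_loc)]
    return cmd


# Placeholder paths; replace them with your own files.
cmd = build_uce_command("data/pbmc.h5ad", "out/",
                        model_loc="weights/33l.torch", nlayers=33)
print(shlex.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to confirm that `--nlayers` and the weights file refer to the same model before starting a long run.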

## Output

Final evaluated AnnData: `dir/{dataset_name}.h5ad`. This AnnData is identical to the processed input AnnData, but has UCE embeddings added in the `.obsm["X_uce"]` slot.

Please see the documentation for information on additional output files. All outputs from `uce-eval-single-anndata` are stored in the `dir` directory.
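
Once a run has finished, the embeddings can be read back with `anndata`. The sketch below is illustrative only: both helpers are hypothetical, and it assumes that `{dataset_name}` is the input file name without its extension, which you should verify against the files actually written to `dir`.

```python
from pathlib import Path


def expected_output_path(adata_path, out_dir):
    """Guess the output file location (hypothetical helper).

    Assumption: {dataset_name} is the input file name without its extension.
    """
    return Path(out_dir) / f"{Path(adata_path).stem}.h5ad"


def load_uce_embeddings(adata_path, out_dir):
    """Read the evaluated AnnData and return the embedding matrix."""
    import anndata  # imported lazily so the path helper works without it

    adata = anndata.read_h5ad(expected_output_path(adata_path, out_dir))
    # One row per cell; the slot name comes from this README.
    return adata.obsm["X_uce"]


# Placeholder paths; replace them with your own input file and working directory.
print(expected_output_path("data/pbmc.h5ad", "out"))
```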

## Data

You can download the processed datasets used in the paper [here](https://drive.google.com/drive/folders/1f63fh0ykgEhCrkd_EVvIootBw7LYDVI7?usp=drive_link).

**Note:** These datasets were embedded using the 33-layer model. Embeddings from the 33-layer model are not compatible with embeddings from the 4-layer model.


## Using the model in Python
UCE is also a Python library, and the model can be loaded as a PyTorch module.

```python
import uce

model = uce.get_pretrained('small')  # 'small' gets a 4-layer model, 'large' gets a 33-layer model
```

You can also set up a dataset and a dataloader to embed data from Python.
See the documentation for `uce.get_processed_dataset` for more information.
```python
dataset, dataloader = uce.get_processed_dataset(batch_size=32)  # Get default dataset
```

## Citing

If you find our paper and code useful, please consider citing the [preprint](https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1):

```
@article{rosen2023universal,
  title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
  author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and Consortium, Tabula Sapiens and Quake, Stephen R and Leskovec, Jure},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "uce-model",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "single-cell,RNA-seq,embedding,pytorch,uce",
    "author": "",
    "author_email": "Yanay Rosen <yanay@stanford.edu>",
    "download_url": "https://files.pythonhosted.org/packages/21/02/9c0bf10a7b23ae2f229c00c60e9ecf366bed4c149364430666ad08ed7b46/uce-model-0.1.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2023 Yanay Rosen, Yusuf Roohani, Jure Leskovec  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Universal Cell Embedding model for single-cell RNA-seq data",
    "version": "0.1.4",
    "project_urls": {
        "Homepage": "https://github.com/snap-stanford/UCE",
        "Repository": "https://github.com/snap-stanford/UCE.git"
    },
    "split_keywords": [
        "single-cell",
        "rna-seq",
        "embedding",
        "pytorch",
        "uce"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "13f968d78453dd21421a5f6166545fd6bd66312ca251045bada4b634b166399e",
                "md5": "420ad72ca4e4d1472a9104b35aab052b",
                "sha256": "e3be550fa01087e03787a0e3c6e3323392046a11b90fd7b444b5d155ba38e393"
            },
            "downloads": -1,
            "filename": "uce_model-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "420ad72ca4e4d1472a9104b35aab052b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 32187,
            "upload_time": "2024-02-19T22:59:34",
            "upload_time_iso_8601": "2024-02-19T22:59:34.849444Z",
            "url": "https://files.pythonhosted.org/packages/13/f9/68d78453dd21421a5f6166545fd6bd66312ca251045bada4b634b166399e/uce_model-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "21029c0bf10a7b23ae2f229c00c60e9ecf366bed4c149364430666ad08ed7b46",
                "md5": "f7c4f87b5bb1c0c28b80c4bf734964a2",
                "sha256": "827b3853f73db9b952c7e99204b59ae64fa551dc10ca1b8b00808124ee68a06e"
            },
            "downloads": -1,
            "filename": "uce-model-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "f7c4f87b5bb1c0c28b80c4bf734964a2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 29830,
            "upload_time": "2024-02-19T22:59:36",
            "upload_time_iso_8601": "2024-02-19T22:59:36.110986Z",
            "url": "https://files.pythonhosted.org/packages/21/02/9c0bf10a7b23ae2f229c00c60e9ecf366bed4c149364430666ad08ed7b46/uce-model-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-19 22:59:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "snap-stanford",
    "github_project": "UCE",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "uce-model"
}