scprint


- Name: scprint
- Version: 1.6.4
- Summary: scPRINT is a Large Cell Model for Gene Network Inference, Denoising and more from scRNAseq data
- Upload time: 2024-11-27 15:37:32
- Requires Python: >=3.10
- License: MIT
- Keywords: grn, foundation model, gene regulatory network, large cell model, scprint, scrnaseq, transformer

> ℹ️ main place where scprint is built and maintained

# scPRINT: Large Cell Model for scRNAseq data

[![codecov](https://codecov.io/gh/cantinilab/scPRINT/branch/main/graph/badge.svg?token=GRnnData_token_here)](https://codecov.io/gh/cantinilab/scPRINT)
[![CI](https://github.com/cantinilab/scPRINT/actions/workflows/main.yml/badge.svg)](https://github.com/cantinilab/scPRINT/actions/workflows/main.yml)
[![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
[![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
[![GitHub issues](https://img.shields.io/github/issues/cantinilab/scPRINT)](https://img.shields.io/github/issues/cantinilab/scPRINT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/391909874.svg)](https://doi.org/10.1101/2024.07.29.605556)
[![hugging face](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-md.svg)](https://huggingface.co/jkobject/scPRINT)

![logo](docs/logo.png)

scPRINT is a large transformer model built for the inference of gene networks (connections between genes explaining the cell's expression profile) from scRNAseq data.

It uses novel encoding and decoding of the cell expression profile and new pre-training methodologies to learn a cell model.

scPRINT can be used to perform the following analyses:

- __expression denoising__: increase the resolution of your scRNAseq data
- __cell embedding__: generate a low-dimensional representation of your dataset
- __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
- __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset

[Read the manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1) if you would like to know more about scPRINT, or have a look at some of my [X-plainers](https://twitter.com/jkobject).

![figure1](docs/figure1.png)

## Table of Contents

- [scPRINT: Large Cell Model for scRNAseq data](#scprint-large-cell-model-for-scrnaseq-data)
  - [Table of Contents](#table-of-contents)
  - [Install `scPRINT`](#install-scprint)
    - [lamin.ai](#laminai)
    - [install](#install)
    - [pytorch and GPUs](#pytorch-and-gpus)
    - [dev install](#dev-install)
  - [Usage](#usage)
    - [scPRINT's basic commands](#scprints-basic-commands)
    - [Notes on GPU/CPU usage with triton](#notes-on-gpucpu-usage-with-triton)
    - [Simple tests:](#simple-tests)
  - [FAQ](#faq)
    - [I want to generate gene networks from scRNAseq data:](#i-want-to-generate-gene-networks-from-scrnaseq-data)
    - [I want to generate cell embeddings and cell label predictions from scRNAseq data:](#i-want-to-generate-cell-embeddings-and-cell-label-predictions-from-scrnaseq-data)
    - [I want to denoise my scRNAseq dataset:](#i-want-to-denoise-my-scrnaseq-dataset)
    - [I want to generate an atlas-level embedding](#i-want-to-generate-an-atlas-level-embedding)
    - [I need to generate gene tokens using pLLMs](#i-need-to-generate-gene-tokens-using-pllms)
    - [I want to pre-train scPRINT from scratch on my own data](#i-want-to-pre-train-scprint-from-scratch-on-my-own-data)
    - [how can I find if scPRINT was trained on my data?](#how-can-i-find-if-scprint-was-trained-on-my-data)
    - [can I use scPRINT on other organisms rather than human?](#can-i-use-scprint-on-other-organisms-rather-than-human)
    - [how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)](#how-long-does-scprint-takes-what-kind-of-resources-do-i-need-or-in-alternative-can-i-run-scprint-locally)
    - [I have different scRNASeq batches. Should I integrate my data before running scPRINT?](#i-have-different-scrnaseq-batches-should-i-integrate-my-data-before-running-scprint)
    - [where to find the gene embeddings?](#where-to-find-the-gene-embeddings)
  - [Documentation](#documentation)
  - [Model Weights](#model-weights)
  - [Docker](#docker)
    - [Building the Docker Image](#building-the-docker-image)
    - [Pulling the Docker Image from Docker Hub](#pulling-the-docker-image-from-docker-hub)
    - [Running the Docker Container](#running-the-docker-container)
  - [Development](#development)
  - [Work in progress (PR welcomed):](#work-in-progress-pr-welcomed)


## Install `scPRINT`

For the moment, scPRINT has been tested on macOS and Linux (Ubuntu 20.04) with Python 3.10. Its installation takes about 10 minutes on average.

If you want to use flashattention2, note that for now it only supports the MLIR version of triton 2.0 and torch==2.0.0.

### lamin.ai

To use scPRINT, you will need to use [lamin.ai](https://lamin.ai/). It is needed to load biological information such as genes, cell types, organisms, etc.

### install

To start you will need to do:

```bash
uv venv <env-name> --python 3.10 # scprint might work with python >3.10, but it is not tested
source <env-name>/bin/activate
# then one of:
uv pip install scprint # OR
uv pip install "scprint[dev]" # for the dev dependencies (building etc.) OR
uv pip install "scprint[flash]" # to use flashattention2 with triton: only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
# OR pip install "scPRINT[dev,flash]"

lamin init --storage ./testdb --name test --schema bionty
```
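After installing, a quick check can confirm that the interpreter and the package match what this README expects. The sketch below is not part of scPRINT: the helper name `check_install` is made up for illustration, and it relies only on the standard library (`sys` and `importlib.metadata`).

```python
import sys
from importlib import metadata


def check_install(package: str = "scprint", min_python: tuple = (3, 10)) -> dict:
    """Report the Python version and the installed version of `package`.

    `check_install` is a hypothetical helper, not part of scPRINT.
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None  # the package is not installed in this environment
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= min_python,
        "package": package,
        "installed_version": installed,
    }


if __name__ == "__main__":
    print(check_install())
```

If `installed_version` is `None`, the environment you activated is probably not the one you installed scPRINT into.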

If you are new to lamin and had to run `lamin init`, you will also need to populate your ontologies. This is because scPRINT uses ontologies to define its cell types, diseases, sexes, ethnicities, etc.

You can do it manually or with our function:

```python
from scdataloader.utils import populate_my_ontology

# Option 1: populate everything (recommended; can take 2-10 minutes)
populate_my_ontology()

# Option 2: the minimum for scprint to run some inferences (denoising, grn inference)
populate_my_ontology(
    organisms=["NCBITaxon:10090", "NCBITaxon:9606"],
    sex=["PATO:0000384", "PATO:0000383"],
    celltypes=None,
    ethnicities=None,
    assays=None,
    tissues=None,
    diseases=None,
    dev_stages=None,
)
```

We make use of some additional packages we developed alongside scPRINT.

Please refer to their documentation for more information:

- [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
- [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
- [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.

### pytorch and GPUs

scPRINT can run on machines without GPUs, but it will be slow. It is highly recommended to use a GPU for inference.

Once you have a GPU and have installed the required drivers, you might need to install a specific version of PyTorch that is compatible with your drivers. For example, NVIDIA 550 drivers will lead to an NVIDIA toolkit 11.7 or 11.8, which might mean you need to re-install a different flavor of PyTorch for things to work. In my case, on Linux, this was done with the command:
`pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118`

I was able to test it with NVIDIA toolkit versions 11.7, 11.8, and 12.2.
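To see which flavor of PyTorch is currently installed, you can inspect it from Python. This is a minimal sketch and not part of scPRINT: `describe_torch_environment` is a hypothetical helper name, and it uses only `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_environment() -> dict:
    """Summarize the installed PyTorch build, if any.

    Hypothetical helper for illustration; not part of scPRINT.
    """
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the helper also works without PyTorch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_environment().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is set but `cuda_available` is `False`, the installed wheel and your drivers likely do not match, which is the situation where re-installing a different flavor of PyTorch may help.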

### dev install

If you want to use the latest version of scPRINT and work on the code yourself, use `git clone` and `pip install -e` instead of `pip install`.

```bash
git clone https://github.com/cantinilab/scPRINT
git clone https://github.com/jkobject/scDataLoader
git clone https://github.com/cantinilab/GRnnData
git clone https://github.com/jkobject/benGRN
pip install -e scPRINT[dev]
pip install -e scDataLoader[dev]
pip install -e GRnnData[dev]
pip install -e benGRN[dev]
```

## Usage

### scPRINT's basic commands

This is the most minimal example of how scPRINT works:

```py
import scanpy as sc
from lightning.pytorch import Trainer
from scdataloader import DataModule
from scprint import scPrint
from scprint.tasks import Denoiser  # Embedder and GNInfer are the other task classes

datamodule = DataModule(...)
model = scPrint(...)
# to train / fit / test the model
trainer = Trainer(...)
trainer.fit(model, datamodule=datamodule)
# to run predictions, use one of the task classes: Denoiser, Embedder, GNInfer
denoiser = Denoiser(...)
adata = sc.read_h5ad(...)
denoiser(model, adata=adata)
...
```

Or, from a bash command line:

```bash
$ scprint fit/train/predict/test/denoise/embed/gninfer --config config/[medium|large|vlarge] ...
```

Find out more about the commands by running `scprint --help` or `scprint [command] --help`.

More examples of using the command line are available in the [docs](./docs/usage.md).

### Notes on GPU/CPU usage with triton

If you do not have [triton](https://triton-lang.org/main/python-api/triton.html) installed you will not be able to take advantage of GPU acceleration, but you can still use the model on the CPU.

In that case, if loading from a checkpoint that was trained with flashattention, you will need to specify `transformer="normal"` in the `load_from_checkpoint` function like so:

```python
model = scPrint.load_from_checkpoint(
    '../data/temp/last.ckpt', precpt_gene_emb=None,
    transformer="normal")
```
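The choice described above can also be made automatically. The sketch below is an assumption-laden illustration, not part of scPRINT: `checkpoint_load_kwargs` is a hypothetical helper that only adds `transformer="normal"` when triton cannot be imported, and otherwise leaves the checkpoint's own setting untouched.

```python
import importlib.util
from typing import Optional


def checkpoint_load_kwargs(triton_available: Optional[bool] = None) -> dict:
    """Build keyword arguments for `scPrint.load_from_checkpoint`.

    Hypothetical helper, not part of scPRINT. When triton is missing it adds
    transformer="normal"; otherwise it does not override the transformer.
    """
    if triton_available is None:
        # Detect triton without importing it
        triton_available = importlib.util.find_spec("triton") is not None
    kwargs = {"precpt_gene_emb": None}
    if not triton_available:
        kwargs["transformer"] = "normal"
    return kwargs


if __name__ == "__main__":
    print(checkpoint_load_kwargs())
    # Intended use (requires scPRINT and a checkpoint file):
    # from scprint import scPrint
    # model = scPrint.load_from_checkpoint(
    #     "../data/temp/last.ckpt", **checkpoint_load_kwargs()
    # )
```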

### Simple tests:

An installation of scPRINT and a simple test of the denoiser are performed on each commit to the main branch with a [GitHub action](https://github.com/cantinilab/scPRINT/actions) and [pytest workflow](.github/workflows/main.yml). It also provides an expected runtime for the installation and run of scPRINT.

We now explore the different usages of scPRINT:

## FAQ

### I want to generate gene networks from scRNAseq data:

-> Refer to the gene network inference section in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).

-> More examples in this notebook [./notebooks/assessments/bench_omni.ipynb](./notebooks/bench_omni.ipynb).

### I want to generate cell embeddings and cell label predictions from scRNAseq data:

-> Refer to the embeddings and cell annotations section in [this notebook](./docs/notebooks/cancer_usecase.ipynb#).

### I want to denoise my scRNAseq dataset:

-> Refer to the Denoising of B-cell section in [this notebook](./docs/notebooks/cancer_usecase.ipynb).

-> More examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](./notebooks/bench_denoising.ipynb).

### I want to generate an atlas-level embedding

-> Refer to the notebook [nice_umap.ipynb](./figures/nice_umap.ipynb).

### I need to generate gene tokens using pLLMs

To run scPRINT, you can optionally define the gene tokens using protein language model (pLLM) embeddings of genes. To do so, provide scPRINT with the path to a parquet file containing the precomputed embeddings for each gene name via the `precpt_gene_emb` argument.

-> To generate this file please refer to the notebook [generate_gene_embeddings](notebooks/generate_gene_embeddings.ipynb).

### I want to pre-train scPRINT from scratch on my own data

-> Refer to the documentation page [pretrain scprint](docs/pretrain.md).

### how can I find if scPRINT was trained on my data?

If your data is available in cellxgene, scPRINT was likely trained on it. However, some cells and datasets were dropped due to low-quality data, and some were randomly removed to be part of the validation / test sets.

### can I use scPRINT on other organisms rather than human?

scPRINT has been pretrained on both human and mouse data, and can be used on any organism with a similar gene set. If you want to use scPRINT on very different organisms, you will need to generate gene embeddings for that organism and re-train scPRINT.

### how long does scPRINT takes? what kind of resources do I need? (or in alternative: can i run scPRINT locally?)

Please look at the supplementary tables in the [manuscript](https://www.biorxiv.org/content/10.1101/2024.07.29.605556v1).

### I have different scRNASeq batches. Should I integrate my data before running scPRINT?

scPRINT takes raw counts as input, so please don't use integrated data. Just give the raw counts to scPRINT and it will take care of the rest.
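If you are unsure whether a matrix still holds raw counts, a rough check is to test that every value is a non-negative integer. The sketch below is a heuristic and not part of scPRINT: `looks_like_raw_counts` is a hypothetical helper that assumes the matrix is a NumPy array or a SciPy sparse matrix, such as the `X` attribute of an AnnData object.

```python
import numpy as np


def looks_like_raw_counts(matrix) -> bool:
    """Return True if every stored value is a non-negative integer.

    Hypothetical helper, not part of scPRINT. Accepts a NumPy array or a
    SciPy sparse matrix (for sparse input only the stored values are checked).
    """
    # Sparse matrices expose their stored values in `.data` as an ndarray;
    # for a dense ndarray `.data` is a memoryview, so use the array itself.
    stored = getattr(matrix, "data", None)
    values = stored if isinstance(stored, np.ndarray) else np.asarray(matrix)
    values = values.astype(float, copy=False)
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


if __name__ == "__main__":
    print(looks_like_raw_counts(np.array([[0, 1], [3, 0]])))  # integer counts
    print(looks_like_raw_counts(np.array([[0.5, 1.2]])))      # normalized values
```

With an AnnData object, this would be called as `looks_like_raw_counts(adata.X)`. Normalized, log-transformed or integrated values will usually fail the check.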

### where to find the gene embeddings?

If you think you need the gene embeddings file for loading the model from a checkpoint, you don't, as the embeddings are also stored in the model weights. You just need to load the weights like this:

```python
model = scPrint.load_from_checkpoint(
    '../../data/temp/last.ckpt',
    precpt_gene_emb=None,
)
```

You can also recreate the gene embedding file through [this notebook](notebooks/generate_gene_embeddings.ipynb). Just call the functions, and it should recreate the file itself.

The file itself is also available on [hugging face](https://huggingface.co/jkobject/scPRINT/tree/main).

## Documentation

For more information on usage please see the documentation in [https://www.jkobject.com/scPRINT/](https://www.jkobject.com/scPRINT/)

## Model Weights

Model weights are available on [hugging face](https://huggingface.co/jkobject/scPRINT/).

## Docker

By using the `scPRINT Docker image`, you can bypass the complexities of manual package installation, ensuring a consistent deployment environment. Included in this repository is a Dockerfile that lets you craft a container for the project; you have the choice to either build this image on your own or conveniently pull it from Docker Hub.

Make sure that you have the `docker` command line interface installed on your system.

A recommended way to install docker with the correct nvidia drivers on linux is to use this [script](https://gist.github.com/xueerchen1990/baad7baa545cb547e8633bc9e5b84786)

### Building the Docker Image

To build the Docker image from the provided `Dockerfile`, run the following command from the root directory of this repository:

```bash
docker build -t scprint:latest -f Dockerfile .
```

### Pulling the Docker Image from Docker Hub

If you don't want to build the image yourself, you can pull it directly from Docker Hub:

```bash
docker pull jkobject/scprint:1.1.3
docker tag jkobject/scprint:1.1.3 scprint:latest
```

### Running the Docker Container

Once you have the image (either by building it or pulling it), you can start a container with:

```bash
docker run --gpus all --rm -it scprint:latest bash
```

Please note: when running the Docker container, make sure you mount any necessary folders using the `-v` option so that you can access them inside the container.

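As an illustration of the `-v` option, the sketch below assembles a `docker run` command with one or more mounted folders. It is not part of scPRINT: `docker_run_command` is a hypothetical helper, and `./data` and `/data` are placeholder paths to adapt to your setup.

```python
import shlex
from pathlib import Path


def docker_run_command(mounts: dict, image: str = "scprint:latest") -> list:
    """Build a `docker run` command that mounts host folders with -v.

    `mounts` maps host directories to paths inside the container.
    Hypothetical helper, not part of scPRINT.
    """
    command = ["docker", "run", "--gpus", "all", "--rm", "-it"]
    for host_dir, container_dir in mounts.items():
        # Use an absolute host path so Docker treats it as a bind mount
        # rather than a named volume: HOST_PATH:CONTAINER_PATH
        command += ["-v", f"{Path(host_dir).resolve()}:{container_dir}"]
    return command + [image, "bash"]


if __name__ == "__main__":
    # "./data" and "/data" are placeholder paths; adapt them to your setup.
    print(shlex.join(docker_run_command({"./data": "/data"})))
```

The printed command can be pasted into a shell; it is the same `docker run` call as above with the `-v` mounts added.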
## Development

Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.

Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to learn more about how pre-training was performed and about its behavior.

Code coverage is not reported accurately, as I am using the command line interface for now. More than 50% of the code is covered by my current unit tests.

Acknowledgement:

- [python template](https://github.com/rochacbruno/python-project-template)
- [laminDB](https://lamin.ai/)
- [lightning](https://lightning.ai/)

## Work in progress (PR welcomed):

1. remove the triton dependencies
2. add version with additional labels (tissues, age) and organisms (mouse, zebrafish) and more datasets from cellxgene
3. version with separate transformer blocks for the encoding part of the bottleneck learning and for the cell embeddings
4. improve classifier to output uncertainties and topK predictions when unsure
5. setup latest lamindb version

Awesome Large Cell Model created by Jeremie Kalfon.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "scprint",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "GRN, foundation model, gene regulatory network, large cell model, scPRINT, scRNAseq, transformer",
    "author": null,
    "author_email": "jeremie kalfon <jkobject@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/66/07/59bc5d8226d6c7c55e2dcede1b2c83cb5b8e506834d503a90656b23e2be1/scprint-1.6.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "scPRINT is a Large Cell Model for Gene Network Inference, Denoising and more from scRNAseq data",
    "version": "1.6.4",
    "project_urls": {
        "repository": "https://github.com/jkobject/scPRINT"
    },
    "split_keywords": [
        "grn",
        " foundation model",
        " gene regulatory network",
        " large cell model",
        " scprint",
        " scrnaseq",
        " transformer"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3c3025ea40c98f937389e487ba1090143856750d9075e6b6afe24b6d2708be24",
                "md5": "1b4dd703c7ddfa5ad92694ade4c2a07e",
                "sha256": "130f7353ccce3bb492f93f8f0144a4fdbc96342d12f7547956fce1826bd174b3"
            },
            "downloads": -1,
            "filename": "scprint-1.6.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1b4dd703c7ddfa5ad92694ade4c2a07e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 128412,
            "upload_time": "2024-11-27T15:37:30",
            "upload_time_iso_8601": "2024-11-27T15:37:30.814439Z",
            "url": "https://files.pythonhosted.org/packages/3c/30/25ea40c98f937389e487ba1090143856750d9075e6b6afe24b6d2708be24/scprint-1.6.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "660759bc5d8226d6c7c55e2dcede1b2c83cb5b8e506834d503a90656b23e2be1",
                "md5": "c9b6210cf7bba45a7f0adf51e643e92c",
                "sha256": "a9f6888782d7dabd877d6f410bb4c04db8b8f7f92eff07185029436d5eff9569"
            },
            "downloads": -1,
            "filename": "scprint-1.6.4.tar.gz",
            "has_sig": false,
            "md5_digest": "c9b6210cf7bba45a7f0adf51e643e92c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 99634,
            "upload_time": "2024-11-27T15:37:32",
            "upload_time_iso_8601": "2024-11-27T15:37:32.832892Z",
            "url": "https://files.pythonhosted.org/packages/66/07/59bc5d8226d6c7c55e2dcede1b2c83cb5b8e506834d503a90656b23e2be1/scprint-1.6.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-27 15:37:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jkobject",
    "github_project": "scPRINT",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "scprint"
}
        