# cancer_data
This package provides unified methods for accessing popular datasets used in cancer research.
**[Full documentation](https://cancer_data.kevinhu.io)**
## Installation
```bash
pip install cancer_data
```
## System requirements
The raw downloaded files occupy approximately 15 GB, and the processed HDF5 files take up about 10 GB. On a relatively recent machine with a fast SSD, processing all of the files after download takes about 3 to 4 hours. At least 16 GB of RAM is recommended for handling the large splicing tables.
## Datasets
A complete description of the datasets may be found in [schema.csv](https://github.com/kevinhu/cancer-data/blob/master/cancer_data/schema.csv).
| Collection | Datasets | Portal |
| --------------------------------------------- | ------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| Cancer Cell Line Encyclopedia (CCLE) | Many (see portal) | https://portals.broadinstitute.org/ccle/data (registration required) |
| Cancer Dependency Map (DepMap) | Genome-wide CRISPR-Cas9 and RNAi screens, gene expression, mutations, and copy number | https://depmap.org/portal/download/ |
| The Cancer Genome Atlas (TCGA) | Mutations, RNAseq expression and splicing, and copy number | https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443 |
| The Genotype-Tissue Expression (GTEx) Project | RNAseq expression and splicing | https://gtexportal.org/home/datasets |
## Features
The goal of this package is to make statistical analysis and coordination of these datasets easier. To that end, it provides the following features:
1. Harmonization: within a collection, sample IDs are converted to a common format. For instance, all CCLE and DepMap datasets use Achilles/Arxspan IDs rather than cell line names.
2. Speed: processed datasets are all stored in high-performance [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), allowing large tables to be loaded orders of magnitude faster than with CSV or TSV formats.
3. Space: tables of purely numerical values (e.g. gene expression, methylation, drug sensitivities) are stored in half-precision format. All tables are compressed, which reduces size by a factor of more than 10 for sparse matrices such as mutation tables, and more than 50 for highly redundant tables such as gene-level copy number estimates.
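As a rough illustration of the space savings described above, the following standard-library sketch packs the same values in single and half precision and compresses a sparse binary vector. It does not use the package's own storage code: the synthetic data, the `pack_values` helper, and the choice of `zlib` as the compressor are assumptions made only for this example.

```python
import random
import struct
import zlib


def pack_values(values, fmt):
    """Pack floats as little-endian binary ("f" = float32, "e" = float16)."""
    return struct.pack(f"<{len(values)}{fmt}", *values)


random.seed(0)

# Synthetic "expression" values (hypothetical, not real data).
expression = [random.uniform(0, 15) for _ in range(10_000)]
single = pack_values(expression, "f")
half = pack_values(expression, "e")

# Half precision uses 2 bytes per value instead of 4.
print(len(single), len(half))

# The cost is rounding error, which stays small for values in this range.
restored = struct.unpack(f"<{len(expression)}e", half)
max_error = max(abs(a - b) for a, b in zip(expression, restored))
print(f"largest rounding error: {max_error:.4f}")

# Synthetic sparse binary "mutation" vector: about 1% of entries are 1.
mutations = bytes(1 if random.random() < 0.01 else 0 for _ in range(100_000))
compressed = zlib.compress(mutations)
ratio = len(mutations) / len(compressed)
print(f"compression ratio: {ratio:.1f}x")
```

The exact ratios depend on the data and the compressor, so the figures printed here should not be read as the package's actual numbers.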
## How it works
The [schema](https://github.com/kevinhu/cancer-data/blob/master/cancer_data/schema.csv) serves as the reference point for all datasets used. Each dataset is identified by a unique value in the `id` column, which also serves as its access identifier.
Datasets are downloaded from the location specified in `download_url`, after which they are checked against the provided `downloaded_md5` hash.
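The hash check can be sketched with Python's standard `hashlib` module. This is a minimal illustration rather than the package's own code: the `file_md5` and `verify_download` helpers are hypothetical names, and a temporary file stands in for a downloaded dataset.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Raise ValueError if the file's MD5 does not match the expected hash."""
    actual = file_md5(path)
    if actual != expected_md5.lower():
        raise ValueError(
            f"MD5 mismatch for {path}: expected {expected_md5}, got {actual}"
        )
    return True


# Demonstration: a temporary file stands in for a downloaded dataset, and the
# expected value plays the role of the `downloaded_md5` column.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

demo_ok = verify_download(demo_path, "5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte downloads.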
The next steps depend on the `type` of the dataset:
- `reference` datasets, such as the hg19 FASTA files, are left as-is.
- `primary_dataset` objects are preprocessed and converted into HDF5 format.
- `secondary_dataset` objects are defined as being made from `primary_dataset` objects. These are also processed and converted into HDF5 format.
The `dependencies` column lists the `id`s of the datasets that must be available before a given dataset can be built. For instance, the `ccle_proteomics` dataset depends on the `ccle_annotations` dataset for converting cell line names to Achilles IDs. When running the processing pipeline, the package automatically checks that dependencies are met and raises an error if any are missing.
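The dependency check described above could look roughly like the sketch below. It is not the package's implementation: the in-memory `schema` excerpt, the assumption that `dependencies` is a comma-separated string, and the helper names are all hypothetical. The ordering uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical schema excerpt. The `dependencies` column name comes from
# schema.csv; the comma-separated cell format is an assumption.
schema = {
    "ccle_annotations": {"dependencies": ""},
    "ccle_proteomics": {"dependencies": "ccle_annotations"},
}


def parse_dependencies(row):
    """Split the `dependencies` cell into a list of dataset ids."""
    return [dep.strip() for dep in row["dependencies"].split(",") if dep.strip()]


def check_dependencies(dataset_id, schema, processed):
    """Raise an error if any dependency of `dataset_id` is not yet processed."""
    missing = [
        dep for dep in parse_dependencies(schema[dataset_id]) if dep not in processed
    ]
    if missing:
        raise RuntimeError(
            f"{dataset_id} is missing dependencies: {', '.join(missing)}"
        )


def processing_order(schema):
    """Return dataset ids ordered so that dependencies come first."""
    graph = {
        dataset_id: parse_dependencies(row) for dataset_id, row in schema.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = processing_order(schema)
processed = set()
for dataset_id in order:
    check_dependencies(dataset_id, schema, processed)
    processed.add(dataset_id)
print(order)
```

Sorting first means the check only fails when a dependency is genuinely absent from the schema or was never processed, not merely because datasets were visited in an unlucky order.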
## Notes
Some datasets have filtering applied to reduce their size. These are listed below:
- CCLE, GTEx, and TCGA splicing datasets have been filtered to remove splicing events with many missing values as well as those with low standard deviations.
- When constructing binary mutation matrices (`depmap_damaging` and `depmap_hotspot`), a minimum mutation frequency is applied to remove especially rare mutations (those present in fewer than four samples).
- The TCGA MX splicing dataset is extremely large (approximately 10,000 rows by 900,000 columns), so it has been split column-wise into 8 chunks.
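Two of the steps above, dropping rare mutations and splitting a wide table column-wise, can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: the helper names, the toy data, and the choice of contiguous near-equal chunks are hypothetical.

```python
def filter_rare_mutations(matrix, min_samples=4):
    """Keep mutations present in at least `min_samples` samples.

    `matrix` maps a mutation name to a list of 0/1 calls, one per sample.
    """
    return {
        name: calls for name, calls in matrix.items() if sum(calls) >= min_samples
    }


def split_columns(columns, n_chunks=8):
    """Split column names into `n_chunks` contiguous, near-equal chunks."""
    size, extra = divmod(len(columns), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional column each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(columns[start:stop])
        start = stop
    return chunks


# Toy binary mutation matrix (hypothetical names and values).
toy_matrix = {
    "mut_a": [1, 1, 1, 1, 0, 0],  # present in four samples: kept
    "mut_b": [1, 0, 0, 0, 0, 0],  # present in one sample: removed
    "mut_c": [1, 1, 1, 0, 0, 0],  # present in three samples: removed
}
filtered = filter_rare_mutations(toy_matrix)

# Toy column names standing in for splicing events.
event_columns = [f"event_{i}" for i in range(20)]
chunks = split_columns(event_columns, n_chunks=8)
print(list(filtered), [len(chunk) for chunk in chunks])
```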