| Field | Value |
| --- | --- |
| Name | scidat |
| Version | 1.0.6 |
| Home page | https://github.com/ArianeMora/scidat |
| Summary | Download-Annotate-TCGA: Facilitates the download of data and annotation with metadata from TCGA |
| Upload time | 2022-12-20 00:41:54 |
| Author | Ariane Mora |
| Requires Python | >=3.6 |
| License | GPL3 |
| Keywords | annotation |
| Requirements | No requirements were recorded. |

# Sci-dat: Download Annotate TCGA
[![codecov.io](https://codecov.io/github/ArianeMora/scidat/coverage.svg?branch=master)](https://codecov.io/github/ArianeMora/scidat?branch=master)
[![PyPI](https://img.shields.io/pypi/v/scidat)](https://pypi.org/project/scidat/)

A package developed to enable the download and annotation of TCGA data from `https://portal.gdc.cancer.gov/`.

## Docs

https://arianemora.github.io/scidat/ 

## Install

```
pip install scidat
```

## Use
### API
The `API` class combines the functions in `Download` and `Annotate`. It gives up some control, such as setting specific directories for each step, but makes the common workflow easier to run.

See the example notebook for how to obtain the following from the TCGA site:

1. `manifest_file`
2. `gdc_client`
3. `clinical_file`
4. `sample_file`

```
# Create the API object from the four inputs listed above
api = API(manifest_file, gdc_client, clinical_file, sample_file, requires_lst=None, clin_cols=None,
          max_cnt=100, sciutil=None, split_manifest_dir='.', download_dir='.', meta_dir='.', sep='_')
```
Step 1. Download manifest data
```
# Downloads every file using default parameters in the manifest file
api.download_data_from_manifest()
# This will also unzip and copy the files all into one directory
```
Step 2. Annotation 
```
# Builds the annotation information
api.build_annotation()
```
Step 3. Download mutation data
```
# Downloads all the mutation data for all the cases in the clinical_file
api.download_mutation_data()
```
Step 4. Generate RNAseq dataframe
```
# Generates the RNA dataframe from the downloaded folder
api.build_rna_df()
```
Step 5. Get cases that have any mutations or specific mutations
```
# Returns a list of cases that have mutations (either in any gene if gene_list = None or in specific genes)
list_of_cases = api.get_cases_with_mutations(gene_list=None, id_type='symbol')

# Get genes with a small deletion
filter_col = 'ssm.consequence.0.transcript.gene.symbol'
genes = api.get_mutation_values_on_filter(filter_col, ['Small deletion'], 'ssm.mutation_subtype')

# Get cases with a specific genomic change: ssm.genomic_dna_change
filter_col = 'case_id'
cases =  api.get_mutation_values_on_filter(filter_col, ['chr13:g.45340134A>G'], 'ssm.genomic_dna_change')

```
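
The two filter calls above appear to follow one pattern: keep the mutation rows whose value in the column named by the last argument is in the given list, then collect the distinct values of the column named by the first argument. Below is a minimal pure-Python sketch of that pattern, assuming this reading of the arguments is correct. The helper `values_on_filter` and the sample rows are hypothetical and are not part of scidat.

```python
def values_on_filter(rows, return_col, match_values, match_col):
    """Return distinct values of return_col for rows whose match_col is in match_values."""
    wanted = set(match_values)
    found = []
    for row in rows:
        value = row.get(return_col)
        if row.get(match_col) in wanted and value is not None and value not in found:
            found.append(value)
    return found


GENE_COL = 'ssm.consequence.0.transcript.gene.symbol'

# Hypothetical mutation records that reuse the column names from the README.
rows = [
    {'case_id': 'case-1', GENE_COL: 'GENE_A',
     'ssm.mutation_subtype': 'Small deletion', 'ssm.genomic_dna_change': 'chr1:g.100delA'},
    {'case_id': 'case-2', GENE_COL: 'GENE_B',
     'ssm.mutation_subtype': 'Single base substitution', 'ssm.genomic_dna_change': 'chr13:g.45340134A>G'},
    {'case_id': 'case-3', GENE_COL: 'GENE_A',
     'ssm.mutation_subtype': 'Small deletion', 'ssm.genomic_dna_change': 'chr1:g.100delA'},
    {'case_id': 'case-3', GENE_COL: 'GENE_C',
     'ssm.mutation_subtype': 'Small deletion', 'ssm.genomic_dna_change': 'chr2:g.200delT'},
]

# Genes with a small deletion
genes = values_on_filter(rows, GENE_COL, ['Small deletion'], 'ssm.mutation_subtype')

# Cases with a specific genomic change
cases = values_on_filter(rows, 'case_id', ['chr13:g.45340134A>G'], 'ssm.genomic_dna_change')
```

Note that in this reading the variable named `filter_col` in the README holds the column whose values are returned, not the column being matched.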
Step 6. Get cases with specific metadata information

Metadata list:
```
submitter_id
project_id
age_at_index
gender
race
vital_status
tumor_stage
normal_samples
tumor_samples
case_files
tumor_stage_num
```
Example `meta` dict: `{'gender': ['female'], 'tumor_stage_num': [1, 2]}`

The `method` argument can be `any` (a case is returned if it satisfies at least one of the conditions) or `all` (a case has to satisfy every condition in the `meta` dict).

```
# Returns cases that have the chosen metadata information e.g. gender, race, tumor_stage_num
meta = {'gender': ['female'], 'tumor_stage_num': [1, 2]}
cases_list = api.get_cases_with_meta(meta, method="all")
```
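
To make the difference between `any` and `all` concrete, here is a small pure-Python sketch of the matching rule described above. It is an illustration under assumptions, not scidat's implementation: `cases_with_meta` and the sample cases are hypothetical, and a condition is assumed to hold when the case's value for a key is one of the listed values.

```python
def cases_with_meta(cases, meta, method="all"):
    """Return the submitter_id of every case that matches the meta conditions."""
    if method not in ("any", "all"):
        raise ValueError("method must be 'any' or 'all'")
    combine = all if method == "all" else any
    matched = []
    for case in cases:
        # One boolean per metadata key: is the case's value among the allowed ones?
        checks = [case.get(key) in allowed for key, allowed in meta.items()]
        if combine(checks):
            matched.append(case['submitter_id'])
    return matched


# Hypothetical cases using field names from the metadata list.
cases = [
    {'submitter_id': 'CASE-1', 'gender': 'female', 'tumor_stage_num': 1},
    {'submitter_id': 'CASE-2', 'gender': 'male', 'tumor_stage_num': 2},
    {'submitter_id': 'CASE-3', 'gender': 'female', 'tumor_stage_num': 4},
]
meta = {'gender': ['female'], 'tumor_stage_num': [1, 2]}

matched_all = cases_with_meta(cases, meta, method="all")
matched_any = cases_with_meta(cases, meta, method="any")
```

With `all`, only the case that is both female and stage 1 or 2 is kept; with `any`, every case that meets at least one condition is kept.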
Step 7. Get genes with mutations
```
# Returns a list of genes with mutations for specific cases
list_of_genes = api.get_genes_with_mutations(case_ids=None, id_type='symbol')
```
Step 8. Get values from the dataframe
```
# Returns the values, columns, dataframe of a subset of the RNAseq dataframe
# df is a pandas DataFrame and gene_id_column is the name (str) of its gene ID column
values, columns, dataframe = get_values_from_df(df, gene_id_column, case_ids=None, gene_ids=None,
                                                column_name_includes=None, column_name_method="all")

```
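
The README does not spell out how `column_name_includes` and `column_name_method` interact. One plausible reading is that a column is kept when its name contains all (or any) of the given substrings. The sketch below illustrates that reading only; `select_columns` and the column names are hypothetical and not taken from scidat.

```python
def select_columns(columns, column_name_includes=None, column_name_method="all"):
    """Return the column names that contain all (or any) of the given substrings."""
    if not column_name_includes:
        return list(columns)
    combine = all if column_name_method == "all" else any
    return [name for name in columns
            if combine(part in name for part in column_name_includes)]


# Hypothetical annotated column names, joined with the '_' separator.
columns = [
    'gene_id',
    'PROJ-A_RNA_Tumor_female_case1',
    'PROJ-A_RNA_Normal_female_case1',
    'PROJ-A_RNA_Tumor_male_case2',
]

tumor_female = select_columns(columns, ['Tumor', 'female'], column_name_method="all")
normal_or_case2 = select_columns(columns, ['Normal', 'case2'], column_name_method="any")
```

One caveat of plain substring matching: `'male'` is also contained in `'female'`, so a filter on `'male'` alone would match all three sample columns in this sketch.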

### Download

```
# Downloads data using a manifest file
download = Download(manifest_file, split_manifest_dir, download_dir, gdc_client, max_cnt=100)
download.download()
```
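
The `split_manifest_dir` and `max_cnt` parameters suggest that the manifest is processed in chunks of at most `max_cnt` files, although the README does not say how. The sketch below shows one way such a split could work for a tab-separated manifest with a header row. It is an assumption for illustration: `split_manifest`, the output file names, and the demo rows are hypothetical and not scidat's own code.

```python
import csv
import os
import tempfile


def split_manifest(manifest_file, split_manifest_dir, max_cnt=100):
    """Write the manifest rows into numbered files of at most max_cnt rows each."""
    with open(manifest_file, newline='') as handle:
        reader = csv.reader(handle, delimiter='\t')
        header = next(reader)
        rows = list(reader)
    paths = []
    for start in range(0, len(rows), max_cnt):
        path = os.path.join(split_manifest_dir, f'manifest_part_{start // max_cnt}.txt')
        with open(path, 'w', newline='') as out:
            writer = csv.writer(out, delimiter='\t', lineterminator='\n')
            writer.writerow(header)  # every chunk keeps the header row
            writer.writerows(rows[start:start + max_cnt])
        paths.append(path)
    return paths


# Demo with a small made-up manifest of five files, split into chunks of two.
with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, 'manifest.txt')
    with open(manifest, 'w') as handle:
        handle.write('id\tfilename\tmd5\tsize\tstate\n')
        for i in range(5):
            handle.write(f'id{i}\tfile{i}.tsv\tmd5{i}\t{i}\treleased\n')
    parts = split_manifest(manifest, tmp, max_cnt=2)
    part_names = [os.path.basename(p) for p in parts]
    with open(parts[-1]) as handle:
        last_part_lines = handle.read().splitlines()
```

Each chunk could then be handed to the GDC client separately (for example with `gdc-client download -m <chunk>`), which keeps individual download runs small.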

```
# Downloads data from the API to complement the data from the manifest file
# case_ids is a list of case IDs; data_type is a str
# 'mutation' is the only data type implemented for now
download.download_data_using_api(case_ids, 'mutation')
```

### Annotate

**Generate annotation using clinical information from TCGA**
```
# output_dir, clinical_file, sample_file and manifest_file are str; file_types is a list
annotator = Annotate(output_dir, clinical_file, sample_file, manifest_file, file_types,
                     sep='_', clin_cols=None)
# Generate the annotation dataframe
annotator.build_annotation()

# Save the dataframe to a csv file (output_directory and filename are str)
annotator.save_annotation(output_directory, filename)

# Save the clinical information to a csv file
annotator.save_annotated_clinical_df(output_directory, filename)
```

**Download mutation data for the cases of interest**

Note: the mutation data must first be downloaded with `download_data_using_api` (see the Download section above).
```
# Build the mutation dataframe from the files in mutation_dir
annotator.build_mutation_df(mutation_dir)

# Get that dataframe
mutation_df = annotator.get_mutation_df()

# Save the mutation dataframe to a csv (output_directory and filename are str)
annotator.save_mutation_df(output_directory, filename)
```

Raw data

{
    "_id": null,
    "home_page": "https://github.com/ArianeMora/scidat",
    "name": "scidat",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "annotation",
    "author": "Ariane Mora",
    "author_email": "ariane.n.mora@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/90/5e/7b9afef30b27e7a41d822a0edd6998ef018f980c1d6412ad6549f933d054/scidat-1.0.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL3",
    "summary": "Download-Annotate-TCGA: Facilitates the download of data and annotation with metadata from TCGA",
    "version": "1.0.6",
    "split_keywords": [
        "annotation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "8a1d866a9e6e119c950bb86969404319",
                "sha256": "5739545f7d34db0c5e865c29be0616c351f9f13a05d9edf42aea12d836911e6c"
            },
            "downloads": -1,
            "filename": "scidat-1.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8a1d866a9e6e119c950bb86969404319",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 19437629,
            "upload_time": "2022-12-20T00:41:50",
            "upload_time_iso_8601": "2022-12-20T00:41:50.986450Z",
            "url": "https://files.pythonhosted.org/packages/6e/03/042caace705dd4d2a5d6beb5b5069611608d47543f019576469c0df648b8/scidat-1.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "655155c462f0f2c02c7274724b013e56",
                "sha256": "fa581618007e933718a99633d4e734f21d5fb867dc2023c4548a81832392a676"
            },
            "downloads": -1,
            "filename": "scidat-1.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "655155c462f0f2c02c7274724b013e56",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 19414877,
            "upload_time": "2022-12-20T00:41:54",
            "upload_time_iso_8601": "2022-12-20T00:41:54.698458Z",
            "url": "https://files.pythonhosted.org/packages/90/5e/7b9afef30b27e7a41d822a0edd6998ef018f980c1d6412ad6549f933d054/scidat-1.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-20 00:41:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "ArianeMora",
    "github_project": "scidat",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "lcname": "scidat"
}
        