# hchacha - Human CHromosome Accession CHange
Translate among the different naming systems used for human chromosomes (of the same assembly)
## Background
A number of different groups participate in and/or provide reference human sequence
data from the [Genome Reference Consortium](http://genomereference.org). However, the same reference
sequence data for each chromosome get accessioned under different identifiers. This script converts
among these identifiers (only within a single assembly version; this is not a CrossMap or liftOver) for several
commonly used file formats, including VCF, SAM, FASTA, and chain files.
Why? Well, there are several conventions for the naming of human chromosomes. The "ensembl" style
numbers them 1-22 then X and Y. The "ucsc" style (named after the UCSC genome browser, also
used in GATK's reference bundles) prepends these with 'chr'. However, a downside of both of these
is that neither '11' nor 'chr11' uniquely identifies a sequence (although they may in the context
of a specific assembly version like GRCh38.p13). On the other hand, 'NC_000011.10' is a specific
accessioned sequence (which happens to be the chromosome 11 sequence version used in the
GRCh38 primary assembly). Likewise, the GenBank accession could be used rather than the RefSeq
accession.
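To make the four naming styles concrete, here is a minimal Python sketch for chromosome 11 of the GRCh38 primary assembly. The `CHR11_GRCH38` table and the `translate` helper are hypothetical names used only for illustration; they are not part of hchacha's API.

```python
# Illustrative only: the names under which chromosome 11 of the GRCh38
# primary assembly appears in each naming style.
CHR11_GRCH38 = {
    "ensembl": "11",
    "ucsc": "chr11",
    "refseq": "NC_000011.10",
    "genbank": "CM000673.2",
}


def translate(name, to_style, table=CHR11_GRCH38):
    """Return `name` in `to_style` if it is a known alias, else unchanged.

    Hypothetical helper for this example; not hchacha's implementation.
    """
    if name in table.values():
        return table[to_style]
    return name


print(translate("chr11", "refseq"))
```

A real table would cover every sequence in the assembly, which is what the NCBI assembly reports described under "Data used" provide.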
## Examples
```
hchacha --help
```
```
zcat input.vcf.gz | hchacha vcf -a 37 -t ensembl | bgzip -c > output.vcf.gz
```
```
samtools view -h input.bam | hchacha sam -a 38 -t refseq | samtools view -b > output.bam
```
## Smarter handling for BAM/CRAM files
Since all you are really doing is renaming the sequences in the header (individual BAM/CRAM records
refer back to those sequence names by an integer index), you can do things much more quickly and with
less CPU usage by using `samtools reheader`, if it is available on your system.
For example:
```bash
samtools reheader -P -c 'hchacha sam -a 38 -t ucsc -s' input.bam > output.bam
```
With some clever use of the `tee` command to write the new BAM file while keeping the shell pipeline
going, you can even make the new index at the same time:
```bash
samtools reheader -P -c 'hchacha sam -a 38 -t ucsc -s' input.bam | tee output.bam | samtools index - output.bam.bai
```
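To illustrate why rewriting only the header is enough, the sketch below renames the `SN:` field of `@SQ` lines in a SAM header and leaves every other line alone. It is an assumption-laden illustration, not hchacha's actual code: `rename_header_line` and the tiny `mapping` dict are hypothetical.

```python
def rename_header_line(line, mapping):
    """Rename the SN: field of an @SQ header line using `mapping`.

    `line` is one SAM header line without its trailing newline.
    Lines other than @SQ, and names missing from `mapping`, pass through
    unchanged. Hypothetical helper for illustration only.
    """
    if not line.startswith("@SQ"):
        return line
    fields = line.split("\t")
    renamed = [
        ("SN:" + mapping.get(f[3:], f[3:])) if f.startswith("SN:") else f
        for f in fields
    ]
    return "\t".join(renamed)


mapping = {"11": "chr11"}  # illustrative one-entry mapping
print(rename_header_line("@SQ\tSN:11\tLN:135086622", mapping))
```

Because alignment records in BAM/CRAM point at reference sequences by index, changing these header names is sufficient to rename them everywhere in the file.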
## Data used
NCBI provides a useful file (`*.assembly_report.txt`) for different GRCh reference versions and patch
levels, for instance [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_assembly_report.txt),
that maps among these names. To get the data included in the repository (for GRCh versions 37 and 38), I
ran the bash script `get-assembly-reports.bash` (requires curl), which writes files to `src/hchacha/data`.
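As a rough sketch of how such a report can be turned into a name mapping, the Python below parses the tab-separated layout these files use, where the column names sit on the last `#` comment line. The function names and the embedded two-row sample are assumptions made for illustration; this is not necessarily how hchacha reads its bundled data.

```python
import io

# A tiny, illustrative excerpt in the assembly-report layout (tab-separated).
SAMPLE_REPORT = (
    "# Assembly name:  GRCh38.p14\n"
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
    "11\tassembled-molecule\t11\tChromosome\tCM000673.2\t=\t"
    "NC_000011.10\tPrimary Assembly\t135086622\tchr11\n"
)


def read_assembly_report(handle):
    """Return one dict per sequence, keyed by the report's column names."""
    header = None
    rows = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            # Every comment line overwrites `header`, so the last one
            # (the column names) is what remains when data rows begin.
            header = line.lstrip("# ").split("\t")
            continue
        if line:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows


def build_mapping(rows, from_column, to_column):
    """Map names in one column to another, skipping 'na' placeholders."""
    return {
        row[from_column]: row[to_column]
        for row in rows
        if row[from_column] != "na" and row[to_column] != "na"
    }


rows = read_assembly_report(io.StringIO(SAMPLE_REPORT))
ucsc_to_refseq = build_mapping(rows, "UCSC-style-name", "RefSeq-Accn")
print(ucsc_to_refseq["chr11"])
```

With a downloaded report, passing an open file handle in place of the `io.StringIO` object should work the same way.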
## Mapping for ensembl names
The mapping to Ensembl names is not quite as straightforward. It looks
like they use the "short" names (like 1, 2, 3, ... X, Y) for the primary chromosomes, then RefSeq
accessions for the others, so that is what this script does.
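The rule described above can be sketched in a few lines of Python. This assumes the `Sequence-Role` values used in NCBI assembly reports (where primary chromosomes are marked `assembled-molecule`); the function `ensembl_style_name` is a hypothetical illustration rather than hchacha's actual implementation.

```python
def ensembl_style_name(sequence_role, sequence_name, refseq_accession):
    """Pick an ensembl-style name for one sequence.

    Assumption for this sketch: sequences whose role is
    'assembled-molecule' keep their short name (1, 2, ... X, Y);
    everything else falls back to its RefSeq accession.
    """
    if sequence_role == "assembled-molecule":
        return sequence_name
    return refseq_accession


print(ensembl_style_name("assembled-molecule", "X", "NC_000023.11"))
print(ensembl_style_name("unlocalized-scaffold", "HSCHR1_CTG1_UNLOCALIZED", "NT_187361.1"))
```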
## License
MIT license, but I am open to re-licensing this simple script some other way if you have a good reason.
It is my understanding that data derived from RefSeq/NCBI are in the public domain as the work
product of an institution of the government of the United States of America.
Samtools (included as part of the docker image) is MIT/Expat licensed and Copyright (C) 2008-2023 Genome Research Ltd.
Raw data
{
"_id": null,
"home_page": "https://bitbucket.org/bpow/hchacha",
"name": "hchacha",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7,<4.0",
"maintainer_email": "",
"keywords": "genetics,bam,sam,vcf,variants,bed,fasta,genome,exome",
"author": "Bradford Powell",
"author_email": "bpow@drpowell.org",
"download_url": "https://files.pythonhosted.org/packages/10/8c/f868668755db5a52a42622b562e28b6a860f43aab818f460f3cef28d6832/hchacha-1.0.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Human CHromosome Accession CHAnge - Convert between different human chromosome naming systems (of the same assembly/version)",
"version": "1.0.5",
"project_urls": {
"Homepage": "https://bitbucket.org/bpow/hchacha",
"Repository": "https://bitbucket.org/bpow/hchacha"
},
"split_keywords": [
"genetics",
"bam",
"sam",
"vcf",
"variants",
"bed",
"fasta",
"genome",
"exome"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e1b97f15b83c377eb08e9509ecfe5da23be5a6cd0dc7172dac422b21b4612351",
"md5": "1ae6d3c25c0ccd41ebc2ba22595ad947",
"sha256": "318464eb12d29c5dbe2138b5a2aca5d7c98ebeb74541098d3fcc3020057e94f8"
},
"downloads": -1,
"filename": "hchacha-1.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1ae6d3c25c0ccd41ebc2ba22595ad947",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7,<4.0",
"size": 22339,
"upload_time": "2023-07-22T18:37:13",
"upload_time_iso_8601": "2023-07-22T18:37:13.643008Z",
"url": "https://files.pythonhosted.org/packages/e1/b9/7f15b83c377eb08e9509ecfe5da23be5a6cd0dc7172dac422b21b4612351/hchacha-1.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "108cf868668755db5a52a42622b562e28b6a860f43aab818f460f3cef28d6832",
"md5": "d36db12743ed5bacb84e05629647e906",
"sha256": "907bf4c5b80c93a90374ad753b93c1ba6d14e750ebe216be0559fa33398b7005"
},
"downloads": -1,
"filename": "hchacha-1.0.5.tar.gz",
"has_sig": false,
"md5_digest": "d36db12743ed5bacb84e05629647e906",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7,<4.0",
"size": 23812,
"upload_time": "2023-07-22T18:37:15",
"upload_time_iso_8601": "2023-07-22T18:37:15.183220Z",
"url": "https://files.pythonhosted.org/packages/10/8c/f868668755db5a52a42622b562e28b6a860f43aab818f460f3cef28d6832/hchacha-1.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-22 18:37:15",
"github": false,
"gitlab": false,
"bitbucket": true,
"codeberg": false,
"bitbucket_user": "bpow",
"bitbucket_project": "hchacha",
"lcname": "hchacha"
}