# parseID: load and parse identifiers or accession numbers
## Introduction
parseID is a bioinformatics data structure library optimized for loading identifiers or
accession numbers into memory and mapping those identifiers and accession numbers to each other.
Identifiers and accession numbers are defined and referenced by various biological databases.
Their count can reach millions or even billions, and data operations on them, such as querying and parsing, are very common.
parseID employs the data structures "trie" and "ditrie". A trie can load a tremendous number of identifiers into memory at once,
and a ditrie can hold a large number of identifier-to-identifier mappings. Through the trie and ditrie,
operations on huge datasets, including insert, get, search, delete, and scan, can be executed quickly.
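To illustrate the idea behind the library (this is a minimal sketch, not parseID's actual implementation or API), a trie stores each identifier character by character, so shared prefixes are stored only once:

```python
class TrieNode:
    """One node per character; children are keyed by the next character."""
    def __init__(self):
        self.children = {}
        self.is_end = False

class Trie:
    """Minimal prefix tree: shared prefixes are stored once, which is
    what makes holding millions of accession numbers feasible."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def search(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

trie = Trie()
for acc in ("P12345", "P12346", "Q9Y6K9"):
    trie.insert(acc)
print(trie.search("P12345"))  # True
print(trie.search("P12347"))  # False
```

Since "P12345" and "P12346" share the prefix "P1234", they occupy mostly the same nodes.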
## testing
```
pytest -s tests
```
## quick start
The example below shows how a huge number of accession numbers are loaded into a trie.
The mapping file can be downloaded from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_refseq_uniprotkb_collab.gz and decompressed locally.
The code retrieves 176,513,729 (as of 03/25/2024) UniProt accession numbers from the file and feeds them into a trie.
As shown below, the accession numbers are stored in the object uniprotkb_acc_trie.
```
from parseid import ProcessID
infile = 'gene_refseq_uniprotkb_collab'
uniprotkb_acc_trie = ProcessID(infile).uniprotkb_protein_accession()
```
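For comparison, the same extraction can be sketched in plain Python. This assumes the decompressed file is tab-separated with a header line and that the UniProt accession sits in the second column; the actual column layout should be checked against the downloaded file:

```python
def read_uniprot_accessions(path):
    """Yield UniProt accession numbers from a tab-separated mapping file.
    Assumes a one-line header and the accession in the second column."""
    with open(path) as fh:
        next(fh)  # skip the header line
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[1]:
                yield fields[1]

# e.g. accessions = set(read_uniprot_accessions('gene_refseq_uniprotkb_collab'))
```

A plain set like this works for small files; the point of the trie is to keep memory manageable at the scale of hundreds of millions of entries.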
Next, retrieve pairs of NCBI protein accession numbers and UniProt accession numbers
from the same file and feed them into a ditrie. As shown in the example below,
the mapping between the two kinds of accession numbers is stored in the object ncbi_uniprotkb_ditrie,
which is ready for querying or parsing.
```
from parseid import ProcessID
infile = 'gene_refseq_uniprotkb_collab'
ncbi_uniprotkb_ditrie = ProcessID(infile).map_ncbi_uniprotkb()
```
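Conceptually, a ditrie behaves like a bidirectional one-to-many mapping between two identifier namespaces, queryable in either direction. A toy dictionary-based sketch (again, an illustration rather than parseID's internal implementation; the sample accession pair is only an example):

```python
from collections import defaultdict

class DiMap:
    """Toy bidirectional one-to-many mapping between two ID namespaces."""
    def __init__(self):
        self.forward = defaultdict(set)   # NCBI accession -> UniProt accessions
        self.reverse = defaultdict(set)   # UniProt accession -> NCBI accessions

    def insert(self, ncbi_acc, uniprot_acc):
        self.forward[ncbi_acc].add(uniprot_acc)
        self.reverse[uniprot_acc].add(ncbi_acc)

    def get_uniprot(self, ncbi_acc):
        return self.forward.get(ncbi_acc, set())

    def get_ncbi(self, uniprot_acc):
        return self.reverse.get(uniprot_acc, set())

m = DiMap()
m.insert("NP_000005.3", "P01023")
print(m.get_uniprot("NP_000005.3"))  # {'P01023'}
print(m.get_ncbi("P01023"))          # {'NP_000005.3'}
```

Replacing the two plain dictionaries with tries is what lets a ditrie hold hundreds of millions of pairs in memory.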