# blasttools
Commands for turning blast queries into pandas dataframes.
Blast a query against any blast databases you have already built:
```sh
blasttools blast --out=my.pkl query.fasta my_blastdbs_dir/*.pot
```
## Install
Install with
```sh
python -m pip install -U blasttools
# *OR*
python -m pip install -U 'git+https://github.com/arabidopsis/blasttools.git'
```
Once installed, you can update with `blasttools update`.
## Common usage
Build some blast databases from Ensembl Plants.
```sh
blasttools plants --release=40 build triticum_aestivum zea_mays
```
Find out what species are available:
```sh
blasttools plants --release=40 species
```
Blast `my.fasta` against the built databases and save the dataframe as a pickle file (the default is to
save as a csv file named `my.fasta.csv`).
```sh
blasttools plants blast --out=dataframe.pkl my.fasta triticum_aestivum zea_mays
```
Get your blast data!
```python
import pandas as pd
df = pd.read_pickle('dataframe.pkl')
```
## Parallelization
When blasting, you can specify `--num-threads`, which is passed directly to the
underlying blast command. If you want to parallelize over species, databases or fasta files,
I suggest you use [GNU Parallel](https://www.gnu.org/software/parallel/) [[Tutorial](https://blog.ronin.cloud/gnu-parallel/)].
`parallel` has a much richer set of options for controlling how the parallelization works
and is also straightforward for simple cases.
For example, build blast databases from a set of fasta files concurrently:
```sh
parallel blasttools build ::: *.fa.gz
```
Or blast _everything_!
```sh
species=$(blasttools plants species)
parallel blasttools plants build ::: $species
# must have different output files here...
parallel blasttools plants blast --out=my{}.pkl my.fasta ::: $species
# or in batches of 4 species at a time
parallel -N4 blasttools plants blast --out='my{#}.pkl' my.fasta ::: $species
```
Then gather them all together...
```sh
blasttools concat --out=alldone.xlsx my*.pkl && rm my*.pkl
```
or programmatically:
```python
from glob import glob
import pandas as pd
df = pd.concat([pd.read_pickle(f) for f in glob('my*.pkl')], ignore_index=True)
```
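If you also want to know which file each row came from, a small variation on the snippet above can tag rows as they are loaded. This is a minimal sketch and not part of blasttools: the helper name `concat_pickles` and the `source` column are made up for illustration.

```python
from glob import glob
from pathlib import Path

import pandas as pd


def concat_pickles(pattern: str, source_col: str = "source") -> pd.DataFrame:
    """Concatenate pickled dataframes matching `pattern`, tagging rows with the file stem."""
    frames = []
    for filename in sorted(glob(pattern)):  # sorted for a reproducible row order
        frame = pd.read_pickle(filename)
        # assign() returns a copy, so the loaded frame is not modified in place
        frames.append(frame.assign(**{source_col: Path(filename).stem}))
    if not frames:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return pd.concat(frames, ignore_index=True)
```

Calling `concat_pickles('my*.pkl')` would then give one dataframe with an extra `source` column holding values such as `my1` or `my2`.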
Remember: if you parallelize your blasts _and_ use `--num-threads` greater than 1,
your jobs will probably end up competing with each other for CPU time!
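One way to avoid oversubscribing the machine is to keep (number of parallel jobs) × (threads per job) at or below the number of CPUs. The arithmetic is sketched below; the helper `threads_per_job` is hypothetical and not something blasttools provides.

```python
import os
from typing import Optional


def threads_per_job(jobs: int, total_cpus: Optional[int] = None) -> int:
    """Return a --num-threads value such that jobs * threads <= total_cpus where possible."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1  # cpu_count() may return None
    # Integer division keeps the total at or below the CPU count;
    # each job still needs at least one thread.
    return max(1, total_cpus // jobs)


print(threads_per_job(jobs=4, total_cpus=16))  # 4
print(threads_per_job(jobs=8, total_cpus=4))   # 1 (more jobs than CPUs already)
```

With GNU Parallel the number of simultaneous jobs is set with `-j`, so on a 16-CPU machine `parallel -j4` combined with `--num-threads=4` would stay within budget.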
## Best matches
By default, `--best=3` selects the hits with the _lowest_ evalues for
each query sequence. If you would rather rank hits by, say, the longest query match,
add `--expr='qstart - qend'`. (Remember that the lowest values win, so a longer match gives a more negative value.)
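To make the selection rule concrete, here is a minimal pandas sketch of "keep the N rows with the lowest value of an expression for each query". It is an illustration only, not blasttools' own implementation; the function name `best_hits`, the query id column `qaccver` and the toy data are assumptions.

```python
import pandas as pd


def best_hits(df: pd.DataFrame, n: int = 3, expr: str = "evalue",
              query_col: str = "qaccver") -> pd.DataFrame:
    """Keep the n rows with the lowest value of `expr` for each query."""
    # Evaluate the ranking expression against the dataframe's columns
    ranked = df.assign(_rank=df.eval(expr))
    ranked = ranked.sort_values("_rank", kind="stable")
    # After sorting, head(n) keeps the n lowest-ranked rows in every group
    best = ranked.groupby(query_col, sort=False).head(n)
    return best.drop(columns="_rank").sort_index()


# Toy data (hypothetical values)
hits = pd.DataFrame({
    "qaccver": ["q1", "q1", "q1", "q2"],
    "evalue": [1e-30, 1e-5, 1e-50, 1e-3],
    "qstart": [1, 10, 5, 1],
    "qend": [100, 300, 50, 20],
})

print(best_hits(hits, n=1))                        # lowest evalue per query
print(best_hits(hits, n=1, expr="qstart - qend"))  # longest query match per query
```

For `q1`, ranking by evalue keeps the third row, while ranking by `qstart - qend` keeps the second row, since it spans the most query positions.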
## XML
Blast offers an xml output format (`--xml`) that adds `query`, `match` and `sbjct` strings to each hit. The other
fields it provides are equivalent to adding `--columns='+score gaps nident positive qlen slen'`.
blasttools also offers a way to display a blast match as a pairwise alignment.
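```python
import pandas as pd
from blasttools.blastxml import hsp_match

# results.csv is assumed to come from a blast run with --xml,
# so that the query, match and sbjct columns are present
df = pd.read_csv('results.csv')
# build one alignment string per row (hit)
df['alignment'] = df.apply(hsp_match, axis=1)
print(df.iloc[0].alignment)
```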
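For a sense of how the three strings fit together, the sketch below lays out `query`, `match` and `sbjct` as a wrapped pairwise alignment. It is a hypothetical stand-in written for illustration and makes no claim about what `hsp_match` actually returns; the function name `format_alignment`, the labels and the example sequences are all assumptions.

```python
def format_alignment(query: str, match: str, sbjct: str, width: int = 60) -> str:
    """Lay out three equal-length HSP strings as wrapped pairwise-alignment blocks."""
    if not (len(query) == len(match) == len(sbjct)):
        raise ValueError("query, match and sbjct must have the same length")
    blocks = []
    for start in range(0, len(query), width):
        end = start + width
        blocks.append(
            "\n".join([
                "Query  " + query[start:end],
                "       " + match[start:end],  # midline: identities and similarities
                "Sbjct  " + sbjct[start:end],
            ])
        )
    # A blank line separates consecutive blocks
    return "\n\n".join(blocks)


print(format_alignment("MKT-AYIAK", "MK  AY+AK", "MKSGAYLAK"))
```

The three strings have the same length because gaps are already written into `query` and `sbjct` as `-`, which is why simple slicing is enough to wrap them.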
Raw data
{
"_id": null,
"home_page": "https://github.com/arabidopsis/blasttools",
"name": "blasttools",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.9",
"maintainer_email": null,
"keywords": "genomics, bioinformatics",
"author": "Ian Castleden",
"author_email": "ian.castleden@uwa.edu.au",
"download_url": "https://files.pythonhosted.org/packages/5b/bf/048394b39a2dfdf6de45a406071083c10e6d829c8a22768340b7cf2beb36/blasttools-0.1.16.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Commands for turning blast queries into pandas dataframes.",
"version": "0.1.16",
"project_urls": {
"Homepage": "https://github.com/arabidopsis/blasttools",
"Repository": "https://github.com/arabidopsis/blasttools"
},
"split_keywords": [
"genomics",
" bioinformatics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "554ad3f88eea959293e873036cda60c72ef2ed9a783e9fccaba7404031bf1974",
"md5": "109f8e10746858b84582ca7ef4aeabbd",
"sha256": "7baddd3b3ff7db8f8781514ab17a24b180c429e518a30bfa611b3c087069e8ac"
},
"downloads": -1,
"filename": "blasttools-0.1.16-py3-none-any.whl",
"has_sig": false,
"md5_digest": "109f8e10746858b84582ca7ef4aeabbd",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.9",
"size": 24597,
"upload_time": "2024-05-14T10:29:30",
"upload_time_iso_8601": "2024-05-14T10:29:30.403718Z",
"url": "https://files.pythonhosted.org/packages/55/4a/d3f88eea959293e873036cda60c72ef2ed9a783e9fccaba7404031bf1974/blasttools-0.1.16-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5bbf048394b39a2dfdf6de45a406071083c10e6d829c8a22768340b7cf2beb36",
"md5": "cf51260b03b80e51baf744a4e7040e92",
"sha256": "6db50477ee48c37b2e72201add54bc7f7187961230593b6cf132643a3ff72e5f"
},
"downloads": -1,
"filename": "blasttools-0.1.16.tar.gz",
"has_sig": false,
"md5_digest": "cf51260b03b80e51baf744a4e7040e92",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.9",
"size": 20759,
"upload_time": "2024-05-14T10:29:32",
"upload_time_iso_8601": "2024-05-14T10:29:32.348905Z",
"url": "https://files.pythonhosted.org/packages/5b/bf/048394b39a2dfdf6de45a406071083c10e6d829c8a22768340b7cf2beb36/blasttools-0.1.16.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-14 10:29:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "arabidopsis",
"github_project": "blasttools",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"tox": true,
"lcname": "blasttools"
}