blasttools

Name	blasttools
Version	0.1.16
home_page	https://github.com/arabidopsis/blasttools
Summary	Commands for turning blast queries into pandas dataframes.
upload_time	2024-05-14 10:29:32
maintainer	None
docs_url	None
author	Ian Castleden
requires_python	<3.13,>=3.9
license	MIT
keywords	genomics bioinformatics

# blasttools

Commands for turning blast queries into pandas dataframes.

Blast against any previously built blast databases:

```sh
blasttools blast --out=my.pkl query.fasta my_blastdbs_dir/*.pot
```

## Install

Install with

```sh
python -m pip install -U blasttools
# *OR*
python -m pip install -U 'git+https://github.com/arabidopsis/blasttools.git'
```

Once installed, you can update with `blasttools update`.

## Common usage

Build some blast databases from Ensembl Plants.

```sh
blasttools plants --release=40 build triticum_aestivum zea_mays
```

Find out what species are available:

```sh
blasttools plants --release=40 species
```

Blast `my.fasta` against the built databases and save the dataframe as a pickle file
(the default is to save as a CSV file named `my.fasta.csv`):

```sh
blasttools plants blast --out=dataframe.pkl my.fasta triticum_aestivum zea_mays
```

Get your blast data!

```python
import pandas as pd
df = pd.read_pickle('dataframe.pkl')
```
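Once loaded, the result is an ordinary pandas dataframe, so the usual filtering and sorting tools apply. A minimal sketch, assuming the frame carries the standard BLAST tabular column names (`qaccver`, `saccver`, `pident`, `evalue`, `bitscore`); the small frame below is made-up stand-in data, so check `df.columns` on your own output first:

```python
import pandas as pd

# Stand-in for pd.read_pickle('dataframe.pkl'); the column names and values
# are illustrative assumptions, not real blasttools output.
df = pd.DataFrame(
    {
        "qaccver": ["q1", "q1", "q2", "q2"],
        "saccver": ["s1", "s2", "s3", "s4"],
        "pident": [98.5, 71.0, 88.2, 45.3],
        "evalue": [1e-50, 1e-3, 1e-20, 0.5],
        "bitscore": [200.0, 40.0, 120.0, 25.0],
    }
)

# Keep only confident hits, strongest first.
hits = df[df["evalue"] < 1e-10].sort_values("bitscore", ascending=False)

# Count the surviving hits per query sequence.
hits_per_query = hits.groupby("qaccver").size()

print(hits)
print(hits_per_query)
```

The `1e-10` cutoff is arbitrary; pick whatever threshold suits your data.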

## Parallelization

When blasting, you can specify `--num-threads`, which is passed directly to the
underlying blast command. If you want to parallelize over species, databases, or fasta files,
I suggest you use [GNU Parallel](https://www.gnu.org/software/parallel/) [[Tutorial](https://blog.ronin.cloud/gnu-parallel/)].

`parallel` gives you much finer control over how the parallelization works
and is still quite simple for simple things.

For example, build blast databases from a set of fasta files concurrently:

```sh
parallel blasttools build ::: *.fa.gz
```

Or blast _everything_!

```sh
species=$(blasttools plants species)
parallel blasttools plants build ::: $species
# must have different output files here...
parallel blasttools plants blast --out=my{}.pkl my.fasta ::: $species
# or in batches of 4 species at a time
parallel -N4 blasttools plants blast --out='my{#}.pkl' my.fasta ::: $species
```
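If you would rather drive the batches from Python than from `parallel`, the same idea can be sketched with the standard library. This is only an illustration: `batched_commands` and `run_all` are hypothetical helpers, not part of blasttools, and the command line is copied from the shell example above. `species_c` is a placeholder name.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def batched_commands(species, batch_size=4, fasta="my.fasta"):
    """Build one `blasttools plants blast` command per batch of species."""
    commands = []
    for start in range(0, len(species), batch_size):
        batch = species[start:start + batch_size]
        job = start // batch_size + 1  # 1-based job number, like parallel's {#}
        commands.append(
            ["blasttools", "plants", "blast", f"--out=my{job}.pkl", fasta, *batch]
        )
    return commands


def run_all(commands, max_workers=2):
    """Run the commands concurrently; each one is a separate process."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))


# Show the commands instead of running them; call run_all(...) to execute.
for cmd in batched_commands(["triticum_aestivum", "zea_mays", "species_c"], batch_size=2):
    print(" ".join(cmd))
```

Threads are enough here because the real work happens in the child processes; each batch still needs its own output file.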

Then gather them all together...

```sh
blasttools concat --out=alldone.xlsx my*.pkl && rm my*.pkl
```

or programmatically:

```python
from glob import glob
import pandas as pd
df = pd.concat([pd.read_pickle(f) for f in glob('my*.pkl')], ignore_index=True)
```

Remember: if you parallelize your blasts _and_ use `--num-threads > 1`,
then your jobs are probably going to be fighting each other
for CPU time!
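One way to avoid oversubscribing the machine is to divide the available cores by the threads each blast uses. A rough sketch; `plan_jobs` is a hypothetical helper, not something blasttools provides, and the result is the kind of number you might pass to `parallel -j`:

```python
import os


def plan_jobs(threads_per_blast, total_cpus=None):
    """Return how many blast jobs fit on the machine without oversubscribing."""
    if threads_per_blast < 1:
        raise ValueError("threads_per_blast must be at least 1")
    if total_cpus is None:
        total_cpus = os.cpu_count() or 1  # cpu_count() may return None
    # Always allow at least one job, even if it wants more threads than cores.
    return max(1, total_cpus // threads_per_blast)


# With 16 cores and --num-threads=4, four concurrent jobs use each core once.
print(plan_jobs(4, total_cpus=16))  # prints 4
```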

## Best matches

By default, `--best=3` selects the three hits with the _lowest_ evalues for
each query sequence. If you want "best" to mean something else, say the longest query match,
add `--expr='qstart - qend'`. (Remember that we are looking for the lowest values, so the
longest match gives the most negative number.)
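The selection idea, lowest value of an expression per query, can be mimicked in plain pandas. This is a sketch of the concept only, not how blasttools implements `--best` or `--expr`; `best_by_expr` is a hypothetical helper and the column names (`qaccver`, `qstart`, `qend`) are assumed to follow the standard BLAST tabular names:

```python
import pandas as pd

# Made-up hits: three for query q1 and two for query q2.
df = pd.DataFrame(
    {
        "qaccver": ["q1", "q1", "q1", "q2", "q2"],
        "saccver": ["s1", "s2", "s3", "s4", "s5"],
        "qstart": [1, 10, 5, 1, 20],
        "qend": [100, 300, 50, 40, 400],
    }
)


def best_by_expr(frame, expr, best, query_col="qaccver"):
    """Keep the `best` rows with the lowest value of `expr` for each query."""
    ranked = frame.assign(_rank=frame.eval(expr))
    ranked = ranked.sort_values([query_col, "_rank"])
    # head() keeps the first `best` rows of each group in sorted order.
    return ranked.groupby(query_col).head(best).drop(columns="_rank")


# 'qstart - qend' is most negative for the longest query match.
top = best_by_expr(df, "qstart - qend", best=2)
print(top)
```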

## XML

Blast offers an XML output format (`--xml`) that adds `query`, `match` and `sbjct` strings. The other
fields are equivalent to adding `--columns='+score gaps nident positive qlen slen'`.

The `blasttools.blastxml` module also offers a way to display a blast match as a pairwise alignment:

```python
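import pandas as pd
from blasttools.blastxml import hsp_match
# results.csv is assumed to be the output of a blast run with --xml
df = pd.read_csv('results.csv')
# build one alignment string per row
df['alignment'] = df.apply(hsp_match, axis=1)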
print(df.iloc[0].alignment)
```
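To see what the three XML strings encode, here is a hand-rolled layout of `query`, `match` and `sbjct` as wrapped pairwise blocks. It is a sketch for illustration only: `format_alignment` is a hypothetical function, the sequences are invented, and the output is not claimed to match what `hsp_match` produces.

```python
def format_alignment(query, match, sbjct, width=60):
    """Lay out three equal-length HSP strings as wrapped pairwise blocks."""
    if not (len(query) == len(match) == len(sbjct)):
        raise ValueError("query, match and sbjct must have the same length")
    blocks = []
    for start in range(0, len(query), width):
        end = start + width
        blocks.append(
            "\n".join(
                [
                    "Query  " + query[start:end],
                    "       " + match[start:end],  # midline between the two
                    "Sbjct  " + sbjct[start:end],
                ]
            )
        )
    return "\n\n".join(blocks)


# Invented 9-residue example; '-' is a gap, '+' a conservative substitution.
print(format_alignment("MKT-AYIAK", "MK+ AY+AK", "MKSQAYVAK"))
```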

Raw data

{
    "_id": null,
    "home_page": "https://github.com/arabidopsis/blasttools",
    "name": "blasttools",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.9",
    "maintainer_email": null,
    "keywords": "genomics, bioinformatics",
    "author": "Ian Castleden",
    "author_email": "ian.castleden@uwa.edu.au",
    "download_url": "https://files.pythonhosted.org/packages/5b/bf/048394b39a2dfdf6de45a406071083c10e6d829c8a22768340b7cf2beb36/blasttools-0.1.16.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Commands for turning blast queries into pandas dataframes.",
    "version": "0.1.16",
    "project_urls": {
        "Homepage": "https://github.com/arabidopsis/blasttools",
        "Repository": "https://github.com/arabidopsis/blasttools"
    },
    "split_keywords": [
        "genomics",
        " bioinformatics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "554ad3f88eea959293e873036cda60c72ef2ed9a783e9fccaba7404031bf1974",
                "md5": "109f8e10746858b84582ca7ef4aeabbd",
                "sha256": "7baddd3b3ff7db8f8781514ab17a24b180c429e518a30bfa611b3c087069e8ac"
            },
            "downloads": -1,
            "filename": "blasttools-0.1.16-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "109f8e10746858b84582ca7ef4aeabbd",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.9",
            "size": 24597,
            "upload_time": "2024-05-14T10:29:30",
            "upload_time_iso_8601": "2024-05-14T10:29:30.403718Z",
            "url": "https://files.pythonhosted.org/packages/55/4a/d3f88eea959293e873036cda60c72ef2ed9a783e9fccaba7404031bf1974/blasttools-0.1.16-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5bbf048394b39a2dfdf6de45a406071083c10e6d829c8a22768340b7cf2beb36",
                "md5": "cf51260b03b80e51baf744a4e7040e92",
                "sha256": "6db50477ee48c37b2e72201add54bc7f7187961230593b6cf132643a3ff72e5f"
            },
            "downloads": -1,
            "filename": "blasttools-0.1.16.tar.gz",
            "has_sig": false,
            "md5_digest": "cf51260b03b80e51baf744a4e7040e92",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.9",
            "size": 20759,
            "upload_time": "2024-05-14T10:29:32",
            "upload_time_iso_8601": "2024-05-14T10:29:32.348905Z",
            "url": "https://files.pythonhosted.org/packages/5b/bf/048394b39a2dfdf6de45a406071083c10e6d829c8a22768340b7cf2beb36/blasttools-0.1.16.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-14 10:29:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "arabidopsis",
    "github_project": "blasttools",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "tox": true,
    "lcname": "blasttools"
}
