# PyMart
Python interface towards ENSEMBL's BioMart.
# Installation and requirements
The only runtime requirements are the `requests` and `pandas` libraries, which you likely already have.
PyMart makes use of `dataclass`, which requires Python 3.7 or newer; the minimum supported Python version is 3.8.
Additional packages used for development are listed in "requirements_dev", for example pytest for testing and flake8 for linting.
Simply clone the repository and install via `pip install .`
(Not yet added to PyPi)
# Usage
## Listing available databases
The first drop-down menu of [ENSEMBL BioMart](https://www.ensembl.org/info/data/biomart/index.html)'s data mining tool lists all databases
found on the BioMart servers. To list those databases, use the `list_databases` function of the module.
For example:
```
import pymart as pm
database_df = pm.list_databases()
```
This prints the following:
```
Display name 'Ensembl Genes 108' corresponds to 'ENSEMBL_MART_ENSEMBL'
Display name 'Mouse strains 108' corresponds to 'ENSEMBL_MART_MOUSE'
Display name 'Sequence' corresponds to 'ENSEMBL_MART_SEQUENCE'
Display name 'Ontology' corresponds to 'ENSEMBL_MART_ONTOLOGY'
Display name 'Genomic features 108' corresponds to 'ENSEMBL_MART_GENOMIC'
Display name 'Ensembl Variation 108' corresponds to 'ENSEMBL_MART_SNP'
Display name 'Ensembl Regulation 108' corresponds to 'ENSEMBL_MART_FUNCGEN'
```
In the above example, the entry shown as 'Ensembl Genes 108' in the drop-down menu is the database with BioMart's internal name 'ENSEMBL_MART_ENSEMBL'.
This database consists of many datasets (usually one per species). `database_df` is a pandas DataFrame with two columns: the internal name
and the name displayed in the drop-down menu.
## Finding desired dataset
To find the desired dataset, use the `find_dataset` function of the module.
The function takes two arguments: `database_name`, which must correspond to a valid BioMart database, and `species`, which must correspond to a valid species.
`database_name` can be either the display name (e.g. *Ensembl Genes 108*) or the internal name (e.g. *ENSEMBL_MART_ENSEMBL*). Matching is case insensitive,
and spaces and underscores are interchangeable.
Likewise, the `species` argument needs to be a string contained in either the internal name (e.g. *mmusculus*) or the display name (e.g. *Zebrafish*).
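The matching rule for `database_name` can be pictured with a small sketch. This is not PyMart's own code: `normalize_name` and `match_database` are hypothetical helpers, the name table is a short excerpt copied from the `list_databases` output above, and exact equality after normalization is an assumption.

```python
# Hypothetical helpers illustrating the matching rule; not PyMart's own code.
DATABASES = {
    "ENSEMBL_MART_ENSEMBL": "Ensembl Genes 108",
    "ENSEMBL_MART_MOUSE": "Mouse strains 108",
    "ENSEMBL_MART_SNP": "Ensembl Variation 108",
}


def normalize_name(name):
    """Lower-case the name and treat underscores as spaces."""
    return " ".join(name.lower().replace("_", " ").split())


def match_database(query):
    """Return the internal name whose internal or display name equals the query."""
    wanted = normalize_name(query)
    for internal, display in DATABASES.items():
        if wanted in (normalize_name(internal), normalize_name(display)):
            return internal
    return None


print(match_database("ensembl mart ensembl"))  # ENSEMBL_MART_ENSEMBL
print(match_database("ensembl_genes_108"))     # ENSEMBL_MART_ENSEMBL
```

Normalizing both sides before comparing is what makes "ensembl mart ensembl" and "ENSEMBL_MART_ENSEMBL" equivalent.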
For example:
```
import pymart as pm
datasets = pm.find_dataset("ensembl mart ensembl","mouse")
```
This prints the following:
```
Query database name 'ensembl mart ensembl' corresponds to 'ENSEMBL_MART_ENSEMBL'
For queried species 'mouse', the database contains the following datasets:
Display name 'Ryukyu mouse genes (CAROLI_EIJ_v1.1)' corresponds to 'mcaroli_gene_ensembl'
Display name 'Northern American deer mouse genes (HU_Pman_2.1)' corresponds to 'pmbairdii_gene_ensembl'
Display name 'Mouse genes (GRCm39)' corresponds to 'mmusculus_gene_ensembl'
Display name 'Algerian mouse genes (SPRET_EiJ_v1)' corresponds to 'mspretus_gene_ensembl'
Display name 'Mouse Lemur genes (Mmur_3.0)' corresponds to 'mmurinus_gene_ensembl'
Display name 'Steppe mouse genes (MUSP714)' corresponds to 'mspicilegus_gene_ensembl'
Display name 'Shrew mouse genes (PAHARI_EIJ_v1.1)' corresponds to 'mpahari_gene_ensembl'
```
To narrow the selection, use the more precise "mmus" instead of "mouse":
```
import pymart as pm
datasets = pm.find_dataset("ensembl mart ensembl","mmus")
```
The output is now:
```
Query database name 'ensembl mart ensembl' corresponds to 'ENSEMBL_MART_ENSEMBL'
For queried species 'mmus', the database contains the following datasets:
Display name 'Mouse genes (GRCm39)' corresponds to 'mmusculus_gene_ensembl'
```
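Why "mmus" narrows the result while "mouse" does not can be reproduced with a plain substring check. The sketch below is illustrative only: `match_species` is a hypothetical helper, the dataset list is copied from the output above, and a case-insensitive "contained in display or internal name" test is assumed.

```python
# Illustrative only: (display name, internal name) pairs copied from the output above.
DATASETS = [
    ("Ryukyu mouse genes (CAROLI_EIJ_v1.1)", "mcaroli_gene_ensembl"),
    ("Northern American deer mouse genes (HU_Pman_2.1)", "pmbairdii_gene_ensembl"),
    ("Mouse genes (GRCm39)", "mmusculus_gene_ensembl"),
    ("Algerian mouse genes (SPRET_EiJ_v1)", "mspretus_gene_ensembl"),
    ("Mouse Lemur genes (Mmur_3.0)", "mmurinus_gene_ensembl"),
    ("Steppe mouse genes (MUSP714)", "mspicilegus_gene_ensembl"),
    ("Shrew mouse genes (PAHARI_EIJ_v1.1)", "mpahari_gene_ensembl"),
]


def match_species(query):
    """Return internal names whose display or internal name contains the query."""
    q = query.lower()
    return [internal for display, internal in DATASETS
            if q in display.lower() or q in internal.lower()]


print(len(match_species("mouse")))  # 7
print(match_species("mmus"))        # ['mmusculus_gene_ensembl']
```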
## Fetching data from a given dataset
The main purpose of the module is fetching large amounts of data from a given BioMart dataset. In the above example,
we established that the information about genes for *Mus musculus* is found in the `mmusculus_gene_ensembl` dataset. The next step is to fetch
the genetic information we want. To do that, use the `fetch_data` function. There are four main components to using it properly.
### 1. Specifying datasets
You can specify the dataset you want in two ways. The first is to pass the dataset directly as the `dataset_name` parameter. In our example, this would be 'mmusculus_gene_ensembl'.
The other way is to specify the database and the species via `database_name` and `species_name`, skipping the `find_dataset` step.
However, an error will occur if more than one dataset corresponds to the species query. For example, using "mouse" as `species_name` will trigger an error, as there are
multiple species with "mouse" in their name.
Example:
```
import pymart as pm
mouse_data_1 = pm.fetch_data(dataset_name="mmusculus_gene_ensembl")
mouse_data_2 = pm.fetch_data(database_name="ensembl mart ensembl",species_name="mmus")
```
Both calls fetch the same dataset, so the two results are identical.
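The "exactly one match" rule for `species_name` can be sketched as follows. `resolve_dataset` is a hypothetical helper, and raising `ValueError` is an assumption; the README only says that an error occurs, not which exception PyMart raises.

```python
# Hypothetical sketch of the "exactly one match" rule; the exception type is an assumption.
def resolve_dataset(species_name, candidates):
    """Return the single dataset matching species_name, or raise if ambiguous."""
    query = species_name.lower()
    matches = [name for name in candidates if query in name.lower()]
    if len(matches) != 1:
        raise ValueError(
            f"{species_name!r} matched {len(matches)} datasets: {matches}"
        )
    return matches[0]


candidates = ["mmusculus_gene_ensembl", "mmurinus_gene_ensembl", "mspretus_gene_ensembl"]

print(resolve_dataset("mmus", candidates))  # mmusculus_gene_ensembl

try:
    resolve_dataset("mm", candidates)  # matches both mmusculus and mmurinus
except ValueError as err:
    print("ambiguous:", err)
```

Failing loudly on an ambiguous query avoids silently fetching data for the wrong species.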
### 2. Finding out information about given dataset
Once the desired dataset is found, you need to know which elements it offers. Every BioMart dataset has *attributes*, which are the columns of the dataset and each correspond to a feature of the database (e.g. the attribute *Gene stable ID* holds the ENSEMBL Gene ID in the Ensembl Genes 108 database), and *filters*, which restrict the rows of the dataset (e.g. filtering for a particular chromosome via the "Chromosome/scaffold" option). To inspect the attributes and filters of a given dataset, use the functions `get_attributes` and `get_filters` respectively.
These two functions specify a dataset via `dataset_name`, or via the `database_name` and `species` parameters, much like the function `fetch_data`. An additional parameter, `display`, prints out all rows of attributes or filters when set to True.
Example:
```
import pymart as pm
dataset_name = "amexicanus_gene_ensembl"
attributes = pm.get_attributes(dataset_name,display=True)
filters = pm.get_filters(dataset_name,display=True)
```
The above code prints all attributes and filters related to genes of the Mexican tetra and returns them as pandas DataFrames. These can then be inspected and used to decide what will be fetched and filtered.
### 3. Specifying columns ("attributes") of data and filtering
Attributes are the columns of the selected dataset and control the dimensionality of the data. Passing N attributes results in an M x N pandas DataFrame, where the M rows correspond to elements of the selected dataset (e.g. a particular gene or transcript in the Ensembl Genes database) and the N columns correspond to the requested attributes. Attributes are specified with the `attributes` parameter. Every attribute has an internal name (how it is parsed internally) and a display name (how it is shown to the user). For example, `ensembl_gene_id` corresponds to the display name `Gene stable ID`. You can specify either one. Attributes that are not found in the dataset are quietly ignored.
Example:
```
import pymart as pm
attributes = ["ensembl_gene_id","Chromosome/scaffold name","manbearpig_homology_perc",]
mouse_data_1 = pm.fetch_data(dataset_name="mmusculus_gene_ensembl",attributes = attributes)
```
In the above example, we fetch the dataset corresponding to genes of *Mus musculus*. We request the ENSEMBL Gene Stable ID ("ensembl_gene_id") and the chromosome each gene is located on; the last element ("manbearpig_homology_perc") is simply ignored. If no attributes are passed, default attributes are fetched instead.
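The lookup described above (accept either name, drop anything unknown) might look roughly like this. The helper `resolve_attributes` and the small lookup table are hypothetical and hand-written for illustration; PyMart obtains the real table from BioMart and may resolve names differently.

```python
# Hypothetical sketch: resolve user-supplied attribute names to internal names.
# The lookup table is a tiny hand-written excerpt, not fetched from BioMart.
KNOWN_ATTRIBUTES = {
    "ensembl_gene_id": "Gene stable ID",
    "chromosome_name": "Chromosome/scaffold name",
    "external_gene_name": "Gene name",
}


def resolve_attributes(requested):
    """Map internal or display names to internal names, dropping unknown ones."""
    by_display = {display: internal for internal, display in KNOWN_ATTRIBUTES.items()}
    resolved = []
    for name in requested:
        if name in KNOWN_ATTRIBUTES:
            resolved.append(name)
        elif name in by_display:
            resolved.append(by_display[name])
        # anything else is skipped without raising
    return resolved


requested = ["ensembl_gene_id", "Chromosome/scaffold name", "manbearpig_homology_perc"]
print(resolve_attributes(requested))  # ['ensembl_gene_id', 'chromosome_name']
```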
An additional parameter that can be passed is `filters`. Filters are largely similar to attributes, but instead of a simple iterable, a Python dictionary should be passed. The keys of that dictionary should be either the display name or the internal name of a filter, and the values should be the desired filter values. Filters come in different types, e.g. boolean filters (set to `True` or `False`) or text filters (set to one or more values).
Example:
```
import pymart as pm
filters = {"Type": ["pseudogene", "protein_coding"],
           "chromosome_name": ["1", "2"],
           "transcript_tsl": False,
           "manbearpig_gene": True,
           }
mouse_data = pm.fetch_data(dataset_name="mmusculus_gene_ensembl",filters=filters)
```
The above example fetches only pseudogenes and protein-coding genes found on chromosomes 1 and 2 that have no Transcript Support Level. There is no such filter as "manbearpig_gene"; it is included only to show a key that does not exist in the dataset.
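To make the difference between boolean and text filters concrete, here is a sketch of how such a dictionary could be turned into a BioMart XML query, where text filters carry a comma-separated `value` and boolean filters carry an `excluded` flag. `build_query` is a hypothetical helper; whether PyMart builds its requests this way is an assumption, and the sketch only handles internal filter names.

```python
import xml.etree.ElementTree as ET


def build_query(dataset_name, attributes, filters):
    """Build a BioMart-style XML query string (illustrative sketch only)."""
    query = ET.Element("Query", virtualSchemaName="default", formatter="TSV",
                       header="1", uniqueRows="1", datasetConfigVersion="0.6")
    dataset = ET.SubElement(query, "Dataset", name=dataset_name, interface="default")
    for name, value in filters.items():
        if isinstance(value, bool):
            # Boolean filter: excluded="0" keeps rows that have the property,
            # excluded="1" keeps rows that lack it.
            ET.SubElement(dataset, "Filter", name=name, excluded="0" if value else "1")
        else:
            # Text filter: several values are joined with commas.
            values = [value] if isinstance(value, str) else list(value)
            ET.SubElement(dataset, "Filter", name=name, value=",".join(values))
    for name in attributes:
        ET.SubElement(dataset, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


print(build_query(
    "mmusculus_gene_ensembl",
    attributes=["ensembl_gene_id"],
    filters={"chromosome_name": ["1", "2"], "transcript_tsl": False},
))
```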
### 4. Specifying homologies
An additional feature is the ability to request [gene homology](https://en.wikipedia.org/wiki/Sequence_homology) information. Every gene dataset contains information about homologies in other species, e.g. Human to Mouse orthologs. Two parameters of the `fetch_data` function deal exclusively with homologies: `hom_species` and `hom_query`.
Example:
```
import pymart as pm
dataset_name = "amexicanus_gene_ensembl"
hom_species = ["human","mmusculus","ZebraFish"]
hom_query = ["ensembl_gene","associated_gene_name","orthology_type","orthology_confidence","perc_id"]
data = pm.fetch_data(dataset_name=dataset_name,hom_species=hom_species,hom_query=hom_query)
```
The above example fetches gene data for the Mexican tetra (*Astyanax mexicanus*) and looks for homologs in three species:
1. Human ("human")
2. Mouse ("mmusculus")
3. Zebrafish ("ZebraFish")

The selected queries are the homolog's ENSEMBL Gene ID, its name, the orthology type, the orthology confidence, and the percentage identity of the target gene to the query gene (in our case, how similar the human/mouse/zebrafish gene is to its Mexican tetra equivalent).
There will be a total of 15 (3 species x 5 queries) additional homology columns.
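The count of 15 follows from pairing every species with every query. The sketch below enumerates those pairs. It assumes that the species queries resolve to the internal prefixes `hsapiens`, `mmusculus` and `drerio`, and that columns follow a `<species>_homolog_<query>` naming pattern; the column names PyMart actually returns may differ.

```python
from itertools import product

# Internal species prefixes as they appear in dataset names (e.g. "mmusculus_gene_ensembl").
hom_species = ["hsapiens", "mmusculus", "drerio"]
hom_query = ["ensembl_gene", "associated_gene_name", "orthology_type",
             "orthology_confidence", "perc_id"]

# Assumed naming pattern: "<species>_homolog_<query>".
homology_columns = [f"{species}_homolog_{query}"
                    for species, query in product(hom_species, hom_query)]

print(len(homology_columns))   # 15
print(homology_columns[0])     # hsapiens_homolog_ensembl_gene
```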
Raw data
{
"_id": null,
"home_page": "https://github.com/ivanp1994/PyMart.git",
"name": "PyMart",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "",
"author": "Ivan Pokrovac",
"author_email": "ivan.pokrovac.fbf@gmail.com",
"download_url": "",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python based API wrapper around Ensembl's BioMart",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/ivanp1994/PyMart.git"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bb7314deacf51e3da8f8a10852cd936d5a3b0baf2028bb9881ad12fc3119e06c",
"md5": "e441689372b937cbc64b9db389130175",
"sha256": "6e8236708fcbf99d33d6b48d76c57cd286c6465605ce297619a6a3f822d5f0e0"
},
"downloads": -1,
"filename": "PyMart-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e441689372b937cbc64b9db389130175",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 16132,
"upload_time": "2023-08-02T07:59:19",
"upload_time_iso_8601": "2023-08-02T07:59:19.072344Z",
"url": "https://files.pythonhosted.org/packages/bb/73/14deacf51e3da8f8a10852cd936d5a3b0baf2028bb9881ad12fc3119e06c/PyMart-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-02 07:59:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ivanp1994",
"github_project": "PyMart",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "requests",
"specs": [
[
">=",
"2.27.1"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.4.1"
]
]
}
],
"lcname": "pymart"
}