# GEO subsampler
Geo_subsampler subsamples a given phylogenetic tree to rebalance the samples at different locations
according to user-specified proportions. Moreover, for each location the kept samples are chosen
in a balanced way over the sampling intervals (e.g. months).
With these constraints in mind, the script uses phylogenetic diversity [[Faith 1992]](https://www.sciencedirect.com/science/article/pii/0006320792912013)
to pick the samples to be removed.
Additional options make it possible to keep all the samples collected before a certain date,
and to guarantee a minimum number of samples per location, regardless of the other criteria.
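
To illustrate diversity-guided removal, here is a minimal sketch. It is not geo_subsampler's actual code: it assumes a rooted tree stored in two hypothetical dictionaries (`parent` and `length`), measures diversity as the total branch length connecting the kept tips to the root, and greedily drops the tip whose removal loses the least diversity.

```python
def phylogenetic_diversity(parent, length, tips):
    """Total branch length on the paths linking the given tips to the root."""
    visited = set()
    for tip in tips:
        node = tip
        # Climb towards the root, stopping at branches already counted.
        while node in parent and node not in visited:
            visited.add(node)
            node = parent[node]
    return sum(length[node] for node in visited)


def greedy_prune(parent, length, tips, n_keep):
    """Drop tips one at a time, each time losing as little diversity as possible."""
    kept = set(tips)
    while len(kept) > n_keep:
        total = phylogenetic_diversity(parent, length, kept)
        # Sorting the candidates makes tie-breaking deterministic.
        cheapest = min(
            sorted(kept),
            key=lambda tip: total - phylogenetic_diversity(parent, length, kept - {tip}),
        )
        kept.remove(cheapest)
    return kept


# Toy tree ((A:1,B:2)X:1,(C:3,D:0.5)Y:1)root, made up for illustration.
parent = {"A": "X", "B": "X", "C": "Y", "D": "Y", "X": "root", "Y": "root"}
length = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 0.5, "X": 1.0, "Y": 1.0}
kept = greedy_prune(parent, length, ["A", "B", "C", "D"], n_keep=2)
print(sorted(kept))
```

In this toy tree the short tips D and A are removed first, because they contribute the least branch length. The location and time constraints described above are deliberately left out of the sketch.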
### Article
If you find geo_subsampler useful, please cite:
A Zhukova, L Blassel, F Lemoine, M Morel, J Voznica, O Gascuel (2021) __Origin, evolution and global spread of SARS-CoV-2__
CRAS 344(1): 57-75 doi:[10.5802/crbiol.29](https://doi.org/10.5802/crbiol.29).
## Installation
To install geo_subsampler, first install Python 3, then run:
```bash
pip3 install geo_subsampler
```
## Input data
As input, one needs to provide a **NON**-dated phylogenetic tree in [newick](https://en.wikipedia.org/wiki/Newick_format) format
and a metadata table containing tip names, locations and dates,
in tab-delimited (by default) or csv format (to be specified with the *'--sep ,'* option).
To subsample according to user-specified proportions, one should also provide a table of case counts per location:
a tab- (or comma-, see above) separated table whose first column contains locations and whose second column contains case counts.
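
The shape of the two tables can be sketched with Python's standard `csv` module. This is only an illustration of the expected layout, not the tool's own loader: the helper `read_table`, the inline file contents, and the assumption that both tables carry a header row are hypothetical.

```python
import csv
import io


def read_table(handle, sep="\t"):
    """Read a delimited table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=sep))


# Inline stand-ins for the two input files (a few illustrative rows).
metadata_file = io.StringIO(
    "id\tcountry\tsampling date\n"
    "EPI_ISL_402119\tChina\t30/12/2019\n"
    "EPI_ISL_403962\tThailand\t08/01/2020\n"
)
cases_file = io.StringIO("country\tcases\nChina\t84024\nThailand\t3017\n")

# Both tables share one separator; pass sep="," for csv input.
metadata = read_table(metadata_file)
cases = read_table(cases_file)

tip_to_location = {row["id"]: row["country"] for row in metadata}
case_counts = {row["country"]: int(row["cases"]) for row in cases}
print(tip_to_location)
print(case_counts)
```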
### Example
The folder [example_data](example_data) contains an example of an input tree ([covid.nwk](example_data/covid.nwk))
representing an early SARS-CoV-2 epidemic,
the corresponding metadata table ([metadata.tab](example_data/metadata.tab)), and a case count table ([cases.tab](example_data/cases.tab)).
The input tree contains 11 167 sampled tips.
The metadata table is a tab-separated file, containing tip ids in the first column,
their countries of sampling in the second column, and the sampling dates in the third column:
id | country | sampling date
----- | ----- | -----
EPI_ISL_402119 | China | 30/12/2019
EPI_ISL_402123 | China | 24/12/2019
EPI_ISL_403962 | Thailand | 08/01/2020
... | ... | ...
The case count table contains numbers of declared cases for each country:
country | cases
----- | -----
China | 84024
Thailand | 3017
... | ...
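
As a rough illustration of what subsampling according to case proportions means, the sketch below splits a target size across locations in proportion to their case counts. The function `proportional_targets` and its largest-remainder rounding are assumptions made for this example; the tool may round differently, and it is also limited by the number of samples actually available per location.

```python
def proportional_targets(case_counts, size):
    """Split `size` samples across locations in proportion to their case counts.

    Largest-remainder rounding makes the targets sum exactly to `size`.
    """
    total = sum(case_counts.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    targets = {loc: size * n // total for loc, n in case_counts.items()}
    remainders = {loc: size * n % total for loc, n in case_counts.items()}
    leftover = size - sum(targets.values())
    # Hand the remaining slots to the locations with the largest remainders.
    ranked = sorted(remainders, key=lambda loc: (-remainders[loc], loc))
    for loc in ranked[:leftover]:
        targets[loc] += 1
    return targets


# Only the two rows shown in the table; the real file lists more countries.
targets = proportional_targets({"China": 84024, "Thailand": 3017}, size=1000)
print(targets)
```

With only these two rows, China would receive 965 of the 1000 slots and Thailand 35; the numbers for the full example table will differ.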
The following geo_subsampler command subsamples the input tree according to the case proportions and (as much as possible) equally between the months,
in order to keep 1000 tips:
```bash
geo_subsampler --tree example_data/covid.nwk --metadata example_data/metadata.tab \
--location_column country --date_column "sampling date" --cases example_data/cases.tab \
--output_dir example_data/results --size 1000
```
The resulting tree is written to the [example_data/results](example_data/results) folder
as [covid.subsampled.0.nwk](example_data/results/covid.subsampled.0.nwk). This folder also contains the ids of the tips retained in the subsampled tree
([covid.subsampled.0.ids](example_data/results/covid.subsampled.0.ids)) and two tables with statistics on the subsampling:
[case_counts.tab](example_data/results/case_counts.tab) and [case_counts_per_time.tab](example_data/results/case_counts_per_time.tab).
## Detailed options
- **--tree TREE** Path to the input phylogeny (NOT time-scaled) in newick format.
- **--metadata METADATA** Path to the metadata table containing location and date annotations, in a tab-delimited format.
- **--sep SEP** Separator used in the metadata and case tables. By default a tab-separated table is assumed.
- **--index_column INDEX_COLUMN**
Number (starting from zero) of the index column (containing tree tip names) in the metadata table. By default, the first column (number 0).
- **--location_column LOCATION_COLUMN**
Name of the column containing location annotations in the metadata table.
- **--date_column DATE_COLUMN**
Name of the column containing date annotations in the metadata table.
- **--cases CASES** A tab-separated (by default) file with two columns. The first column lists the locations, while the second column contains the numbers of declared cases or the proportions for the
corresponding locations.
- **--start_date START_DATE**
If specified, all the samples dated before this date will be included in all the sub-sampled data sets.
- **--size SIZE** Target size of the sub-sampled data set (in number of samples). By default, it is set to half of the data set represented by the input tree.
- **--repetitions REPETITIONS** Number of sub-sampled trees to produce. By default 1.
- **--output_dir OUTPUT_DIR**
Path to the directory where the sub-sampled results should be saved.
- **--min_cases MIN_CASES**
Minimum number of samples to retain for each location.
- **--date_precision {year,month,day}**
Precision for homogeneous subsampling over time within each location. By default (month), the samples selected for each location are distributed as equally as possible over months.
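
The effect of `--date_precision` can be pictured with the following sketch, which is an assumption-laden illustration rather than the tool's implementation. The helpers `time_bin` and `even_quota` are hypothetical, the `dd/mm/YYYY` date format is taken from the example metadata table, and the dates are treated as samples from a single made-up location.

```python
from collections import defaultdict
from datetime import datetime


def time_bin(date_string, precision="month", fmt="%d/%m/%Y"):
    """Truncate a date to the requested precision (year, month or day)."""
    date = datetime.strptime(date_string, fmt)
    if precision == "year":
        return (date.year,)
    if precision == "month":
        return (date.year, date.month)
    if precision == "day":
        return (date.year, date.month, date.day)
    raise ValueError(f"unknown precision: {precision}")


def even_quota(bin_sizes, quota):
    """Spread `quota` over time bins as evenly as their sizes allow."""
    picked = {b: 0 for b in bin_sizes}
    remaining = min(quota, sum(bin_sizes.values()))
    while remaining > 0:
        # Give one slot at a time to each bin that still has spare samples.
        for b in sorted(bin_sizes):
            if remaining > 0 and picked[b] < bin_sizes[b]:
                picked[b] += 1
                remaining -= 1
    return picked


# Sampling dates of one hypothetical location (made up for illustration).
dates = ["30/12/2019", "24/12/2019", "08/01/2020", "15/01/2020", "20/01/2020"]
bins = defaultdict(int)
for d in dates:
    bins[time_bin(d)] += 1
quota = even_quota(dict(bins), quota=4)
print(quota)
```

Here the two months hold 2 and 3 samples, so a quota of 4 is split 2 and 2. With `precision="day"` every date would fall into its own bin, and the same quota would be spread over individual days instead.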
Raw data
{
"_id": null,
"home_page": "https://github.com/evolbioinfo/geo_subsampler",
"name": "geo-subsampler",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "phylogenetics, subsampling, phylogenetic diversity, phylogeography",
"author": "Anna Zhukova",
"author_email": "anna.zhukova@pasteur.fr",
"download_url": "https://files.pythonhosted.org/packages/42/bc/24c831821b6787b817c331780870176b0db1c143bfdc47f1764314f1514b/geo_subsampler-0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Subsampling of rooted phylogenetic trees using phylogenetic diversity and location proportions.",
"version": "0.2",
"project_urls": {
"Homepage": "https://github.com/evolbioinfo/geo_subsampler"
},
"split_keywords": [
"phylogenetics",
" subsampling",
" phylogenetic diversity",
" phylogeography"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7b168b5dc05c60fbace79a6fb2a8fea66af7d3a51308a6b4d1c963cd59dd8dc3",
"md5": "6c4fc4dd27956e3f6396e106a415910b",
"sha256": "561fc9207043c1d46532cf5f5bb3189819efbfbf38c832b0a361c382b6c67ba3"
},
"downloads": -1,
"filename": "geo_subsampler-0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6c4fc4dd27956e3f6396e106a415910b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 23277,
"upload_time": "2024-08-10T00:44:34",
"upload_time_iso_8601": "2024-08-10T00:44:34.859913Z",
"url": "https://files.pythonhosted.org/packages/7b/16/8b5dc05c60fbace79a6fb2a8fea66af7d3a51308a6b4d1c963cd59dd8dc3/geo_subsampler-0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "42bc24c831821b6787b817c331780870176b0db1c143bfdc47f1764314f1514b",
"md5": "e26b2b2ae5483568bc5faa25d2aff15d",
"sha256": "8a0c023301e24d49ff078fababb7cccda979ff97cd01754b91d414cd7e3883df"
},
"downloads": -1,
"filename": "geo_subsampler-0.2.tar.gz",
"has_sig": false,
"md5_digest": "e26b2b2ae5483568bc5faa25d2aff15d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 22037,
"upload_time": "2024-08-10T00:44:36",
"upload_time_iso_8601": "2024-08-10T00:44:36.163549Z",
"url": "https://files.pythonhosted.org/packages/42/bc/24c831821b6787b817c331780870176b0db1c143bfdc47f1764314f1514b/geo_subsampler-0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-10 00:44:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "evolbioinfo",
"github_project": "geo_subsampler",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "geo-subsampler"
}