geo-subsampler


- **Name:** geo-subsampler
- **Version:** 0.2
- **Home page:** https://github.com/evolbioinfo/geo_subsampler
- **Summary:** Subsampling of rooted phylogenetic trees using phylogenetic diversity and location proportions.
- **Upload time:** 2024-08-10 00:44:36
- **Author:** Anna Zhukova
- **Keywords:** phylogenetics, subsampling, phylogenetic diversity, phylogeography

# GEO subsampler

Geo_subsampler subsamples a given phylogenetic tree to rebalance the samples from different locations 
according to user-specified proportions. Moreover, for each location, the retained samples are chosen 
in a balanced way over the sampling intervals (e.g. months).
Under these constraints, the script uses phylogenetic diversity [[Faith 1992]](https://www.sciencedirect.com/science/article/pii/0006320792912013) 
to pick the samples to be removed.
Additional options make it possible to keep all the samples before a certain date 
and to ensure that a minimum number of samples is retained per location, regardless of the other criteria.

### Article

If you find geo_subsampler useful, please cite: 

A Zhukova, L Blassel, F Lemoine, M Morel, J Voznica, O Gascuel (2021) __Origin, evolution and global spread of SARS-CoV-2__
CRAS 344(1): 57-75 doi:[10.5802/crbiol.29](https://doi.org/10.5802/crbiol.29).


## Installation
To install geo_subsampler, first install Python 3, then run:

```bash
pip3 install geo_subsampler
```



## Input data
As input, one needs to provide a **NON**-dated phylogenetic tree in [newick](https://en.wikipedia.org/wiki/Newick_format) format
and a metadata table containing tip names, locations and dates, 
in tab-delimited (by default) or csv format (to be specified with the *'--sep ,'* option).
To subsample according to user-specified proportions, one should also provide the case counts per location, 
as a tab-separated (or comma-separated, see above) table whose first column contains the locations and whose second column contains the case counts.

### Example
The folder [example_data](example_data) contains an example of an input tree ([covid.nwk](example_data/covid.nwk)) 
representing an early SARS-CoV-2 epidemic,
the corresponding metadata table ([metadata.tab](example_data/metadata.tab)), and a case count table ([cases.tab](example_data/cases.tab)).

The input tree contains 11 167 sampled tips.


The metadata table is a tab-separated file, containing tip ids in the first column, 
their countries of sampling in the second column, and the sampling dates in the third column:

id	| country	| sampling date
----- |  ----- | -----
EPI_ISL_402119	| China	| 30/12/2019
EPI_ISL_402123	| China	| 24/12/2019
EPI_ISL_403962	| Thailand	| 08/01/2020
... | ... | ...
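
As a rough illustration of how such a table maps onto locations and monthly sampling intervals, the sketch below groups the three example rows by country and month. It is not part of geo_subsampler: the function name is hypothetical, and the day/month/year date format is an assumption based on the rows shown above.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# The three example rows, as tab-separated text (header included).
METADATA = (
    "id\tcountry\tsampling date\n"
    "EPI_ISL_402119\tChina\t30/12/2019\n"
    "EPI_ISL_402123\tChina\t24/12/2019\n"
    "EPI_ISL_403962\tThailand\t08/01/2020\n"
)


def tips_by_location_and_month(text, sep="\t"):
    """Group tip ids into (location, 'YYYY-MM') bins."""
    bins = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter=sep):
        # Assumption: dates are written as day/month/year, as in the example rows.
        date = datetime.strptime(row["sampling date"], "%d/%m/%Y")
        bins[(row["country"], date.strftime("%Y-%m"))].append(row["id"])
    return dict(bins)


bins = tips_by_location_and_month(METADATA)
for (country, month), ids in sorted(bins.items()):
    print(country, month, len(ids))
```

Here the two Chinese samples fall into the December 2019 bin and the Thai sample into the January 2020 bin.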

The case count table contains numbers of declared cases for each country:

country	| cases
----- |  ----- 
China |	84024
Thailand |	3017
... | ...
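
To make the idea of case proportions concrete, the following sketch turns case counts into per-location target sample numbers for a given total size. It uses only the two rows shown above (the real table contains more countries), the function name is hypothetical, and the largest-remainder rounding rule is an assumption made for this example rather than a description of what geo_subsampler does internally.

```python
from math import floor


def target_counts(cases, size):
    """Split `size` samples between locations proportionally to their case counts."""
    total = sum(cases.values())
    exact = {loc: size * n / total for loc, n in cases.items()}
    counts = {loc: floor(x) for loc, x in exact.items()}
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = size - sum(counts.values())
    by_remainder = sorted(cases, key=lambda loc: exact[loc] - counts[loc], reverse=True)
    for loc in by_remainder[:leftover]:
        counts[loc] += 1
    return counts


print(target_counts({"China": 84024, "Thailand": 3017}, 1000))
# {'China': 965, 'Thailand': 35}
```

With these two rows alone, China accounts for about 96.5% of the cases, hence 965 of the 1000 samples.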

The following geo_subsampler command subsamples the input tree according to the case proportions and (as much as possible) equally between the months,
in order to keep 1000 tips:

```bash
geo_subsampler --tree example_data/covid.nwk --metadata example_data/metadata.tab \
--location_column country --date_column "sampling date" --cases example_data/cases.tab \
--output_dir example_data/results --size 1000
```

The resulting tree is saved in the [example_data/results](example_data/results) folder as
[covid.subsampled.0.nwk](example_data/results/covid.subsampled.0.nwk). This folder also contains the ids of the tips retained in the subsampled tree
([covid.subsampled.0.ids](example_data/results/covid.subsampled.0.ids)) and two tables with statistics on the subsampling:
[case_counts.tab](example_data/results/case_counts.tab) and [case_counts_per_time.tab](example_data/results/case_counts_per_time.tab).


## Detailed options
- **--tree TREE** Path to the input phylogeny (NOT time-scaled) in newick format.
- **--metadata METADATA** Path to the metadata table containing location and date annotations, tab-delimited by default (see --sep).
- **--sep SEP** Separator used in the metadata and case tables. By default, tab-separated tables are assumed.
- **--index_column INDEX_COLUMN** Number (starting from zero) of the index column (containing tree tip names) in the metadata table. Defaults to the first column (number 0).
- **--location_column LOCATION_COLUMN** Name of the column containing location annotations in the metadata table.
- **--date_column DATE_COLUMN** Name of the column containing date annotations in the metadata table.
- **--cases CASES** Path to a two-column table (tab-separated by default, see --sep). The first column lists the locations, and the second column contains the numbers of declared cases, or the proportions, for the corresponding locations.
- **--start_date START_DATE** If specified, all the samples before this date will be included in all the sub-sampled data sets.
- **--size SIZE** Target size of the sub-sampled data set (in number of samples). By default, it is set to half of the data set represented by the input tree.
- **--repetitions REPETITIONS** Number of sub-sampled trees to produce. By default 1.
- **--output_dir OUTPUT_DIR** Path to the directory where the sub-sampled results should be saved.
- **--min_cases MIN_CASES** Minimum number of samples to retain for each location.
- **--date_precision {year,month,day}** Precision for homogeneous subsampling over time within each location. The default (month) aims at distributing the selected samples of each location equally over months.
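
The idea of distributing a location's samples as equally as possible over time intervals can be sketched as follows. This is an illustration only, not the geo_subsampler code: the function name and the example numbers are made up, and filling the bins one sample at a time in chronological order is an assumption about how ties might be resolved.

```python
def spread_over_bins(available, target):
    """Choose per-bin sample counts that are as equal as possible.

    available: dict mapping a time bin (e.g. '2020-01') to the number of samples in it.
    target: total number of samples to keep for the location.
    """
    kept = {b: 0 for b in available}
    remaining = min(target, sum(available.values()))
    while remaining > 0:
        # Bins that still have unpicked samples, in chronological order.
        open_bins = sorted(b for b in available if kept[b] < available[b])
        for b in open_bins[:remaining]:
            kept[b] += 1
        remaining -= min(remaining, len(open_bins))
    return kept


# Hypothetical numbers of samples per month for one location.
print(spread_over_bins({"2019-12": 2, "2020-01": 10, "2020-02": 5}, 9))
# {'2019-12': 2, '2020-01': 4, '2020-02': 3}
```

A month with few samples (here December 2019) contributes everything it has, and the remainder is shared between the other months.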


            
