| Field | Value |
| --- | --- |
| Name | snailz |
| Version | 0.1.15 |
| home_page | None |
| Summary | Synthetic data generator for snail mutation survey |
| upload_time | 2024-12-14 11:35:27 |
| maintainer | None |
| docs_url | None |
| author | None |
| author_email | Greg Wilson <gvwilson@third-bit.com> |
| requires_python | None |
| license | None |
| keywords | open science, synthetic data |
| requirements | No requirements were recorded. |

# Snailz

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/snail-logo.svg" alt="snail logo" width="200px">

These data generators model genomic analysis of snails in the Pacific Northwest
that are growing to unusual size as a result of exposure to pollution.

- One or more *surveys* are conducted at one or more *sites*.
- Each survey collects *genomes* and *sizes* of snails.
- A *grid* at each site is marked out to show the presence or absence of pollution.
- *Laboratory staff* perform *assays* of the snails' genetic material.
- Each assay plate has a *design* showing the material applied and *readings* showing the measured response.
- A plate may be *invalidated* after the fact if a staff member believes it is contaminated.

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/survey.png" alt="survey sites">

## Usage

1. Create a fresh Python environment: `uv venv`
2. Activate that environment: `source .venv/bin/activate`
3. Install a development version of the package: `uv pip install -e .`
4. View available commands: `snailz --help`
5. Copy default parameter files: `snailz params --outdir ./params`
6. See how to regenerate datasets: `python -c 'import snailz; help(snailz)'`

To regenerate all data using the default parameters provided, run:

```
snailz everything --paramsdir ./params --datadir ./data --verbose
```
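
The same steps can be scripted from Python. The sketch below is not part of snailz: it only shells out to the `snailz` command with the flags shown above, and the helper names `build_commands` and `regenerate` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(params_dir, data_dir):
    """Return the CLI invocations from the steps above as argument lists."""
    params_dir, data_dir = Path(params_dir), Path(data_dir)
    # Step 5 above: copy the default parameter files.
    copy_params = ["snailz", "params", "--outdir", str(params_dir)]
    # Regenerate every dataset from those parameters.
    everything = [
        "snailz", "everything",
        "--paramsdir", str(params_dir),
        "--datadir", str(data_dir),
        "--verbose",
    ]
    return [copy_params, everything]


def regenerate(params_dir="params", data_dir="data"):
    """Run each command in order, stopping at the first failure."""
    for cmd in build_commands(params_dir, data_dir):
        subprocess.run(cmd, check=True)
```

Calling `regenerate()` requires the `snailz` command to be on the path, that is, the environment from steps 1 to 3 must be active.
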
## Database
The final database `data/lab.db` is structured as shown below.
Note that the data from the file `assays.json` is split between several tables.
Note also that the SQLite database file is *not* included in this repository
because its binary representation changes each time it is regenerated
(even though the values it contains stay the same).
The map of survey locations in `data/survey.png` is not included in the repository for the same reason,
but a duplicate is manually saved in `img/survey.png`.

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/db-schema.svg" alt="database schema">

- `site`: survey site
    - `site_id`: primary key (text)
    - `lon`: longitude of site reference marker (float deg)
    - `lat`: latitude of site reference marker (float deg)
- `survey`
    - `survey_id`: primary key (text)
    - `site_id`: foreign key of site where survey was conducted (text)
    - `date`: date that survey was conducted (date, YYYY-MM-DD)
- `sample`: sample taken from survey
    - `sample_id`: primary key (int, 1-1 with `experiment.sample_id`)
    - `survey_id`: foreign key of survey (int)
    - `lon`: longitude of sample site (float deg)
    - `lat`: latitude of sample site (float deg)
    - `sequence`: genome sequence of sample (text)
    - `size`: snail size (float)
- `experiment`: experiment done on sample
    - `sample_id`: primary key (int, 1-1 with `sample.sample_id`)
    - `kind`: kind of experiment (text, either 'ELISA' or 'JESS')
    - `start`: start date (date, YYYY-MM-DD)
    - `end`: end date (date, YYYY-MM-DD, null if experiment is ongoing)
- `staff`
    - `staff_id`: primary key (int)
    - `personal`: personal name (text)
    - `family`: family name (text)
- `performed`: join table showing which staff members performed which experiments
    - `staff_id`: foreign key of staff member
    - `sample_id`: foreign key of sample/experiment
- `plate`: information about single assay plate
    - `plate_id`: primary key (int)
    - `sample_id`: foreign key of sample/experiment (int)
    - `date`: date that plate was run (date, YYYY-MM-DD)
    - `filename`: filename of design/results file (text)
- `invalidated`: invalidated plates
    - `plate_id`: foreign key of plate (int)
    - `staff_id`: foreign key of staff member who did invalidation (int)
    - `date`: when plate was invalidated
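
As an illustration of how these tables relate, the sketch below lists the plates that have no invalidation record. The `create table` statements are an assumption reconstructed from the column list above, not the actual DDL of `lab.db`, and the inserted rows are made up.

```python
import sqlite3

# Assumed DDL reconstructed from the column list above; the real lab.db may differ.
SCHEMA = """
create table plate (
    plate_id integer primary key,
    sample_id integer not null,
    date text not null,
    filename text not null
);
create table invalidated (
    plate_id integer not null references plate(plate_id),
    staff_id integer not null,
    date text not null
);
"""


def valid_plates(conn):
    """Return (plate_id, filename) for plates with no invalidation record."""
    query = """
        select plate.plate_id, plate.filename
        from plate left join invalidated
            on plate.plate_id = invalidated.plate_id
        where invalidated.plate_id is null
        order by plate.plate_id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")  # point this at data/lab.db for generated data
conn.executescript(SCHEMA)
conn.executemany(
    "insert into plate values (?, ?, ?, ?)",
    [(1, 10, "2023-01-05", "abc.csv"), (2, 10, "2023-01-06", "def.csv")],  # made-up rows
)
conn.execute("insert into invalidated values (2, 7, '2023-01-09')")  # made-up row
print(valid_plates(conn))  # [(1, 'abc.csv')]
```

The left join keeps every plate and the `is null` filter drops those that appear in `invalidated`, so the query does not depend on how many times a plate was invalidated.
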
## Data Files
`./data` contains a generated dataset for reference.
As noted above,
it does *not* contain the SQLite database file `lab.db`;
run `snailz db` to regenerate it.
(See `help(snailz)` for an example invocation.)
- Staff: `staff.csv`
    - `staff_id`: unique staff member identifier (int > 0)
    - `personal`: personal name (text)
    - `family`: family name (text)
- Genomes: `genomes.json`
    - `length`: number of base pairs (int > 0)
    - `reference`: the unmutated reference genome (text)
    - `individuals`: sequences for individuals (list of text)
    - `locations`: locations of mutations (list of int)
    - `susceptible_loc`: location of mutation of interest (int >= 0)
    - `susceptible_base`: mutated base responsible for size change (char)
- Grids: `grids/*.csv` (one file per site)
    - values are contamination levels at sample points (0 means no contamination)
- Samples: `grids/samples.csv`
    - `sample_id`: unique ID for genetic sample (text)
    - `survey_id`: which survey it was taken in (text)
    - `lon`: longitude of sample site (float)
    - `lat`: latitude of sample site (float)
    - `sequence`: sampled gene sequence (text)
    - `size`: snail weight (float, grams)
- Assays: `assays.json`
    - `experiment`: experiment details
        - `sample_id`: sample that experiment used (int > 0)
        - `kind`: "ELISA" or "JESS" (text)
        - `start`: start date (date, YYYY-MM-DD)
        - `end`: end date (date, YYYY-MM-DD or None if experiment incomplete)
    - `performed`: join table showing who performed which experiments
        - `staff_id`: foreign key to `staff`
        - `sample_id`: foreign key to `experiment`
    - `plate`: details of assay plates used in experiments
        - `plate_id`: unique plate identifier (int > 0)
        - `sample_id`: foreign key to `sample` (text)
        - `date`: date plate was run (date, YYYY-MM-DD)
        - `filename`: name of design and results files (text)
    - `invalidated`: which plates have been invalidated
        - `plate_id`: foreign key to plate (text)
        - `staff_id`: foreign key to staff member responsible (text)
        - `date`: invalidation date (date, YYYY-MM-DD)
- Plates are represented by matching files in the `designs` and `readings` directories
    - `designs/*.csv`: assay plate designs
        - header: machine type, file type ("design" or "readings"), staff ID
        - blank line
        - table with column and row titles showing material in each well
    - `readings/*.csv`: assay plate readings
        - header: machine type, file type ("design" or "readings"), staff ID
        - blank line
        - table with column and row titles showing reading from each well
- To simulate the messiness of real experimental data,
  the tidy assay plate files in `readings/*.csv` are copied to `mangled/*.csv`
  with random changes:
    - Some files have a staff member's name added in the first row.
    - Some have an extra header row containing the experiment date.
    - Some have a footer with the staff member's ID.
    - In some, the values are offset one column to the right.
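
The tidy plate layout described above (header row, blank line, labelled table) can be read with the standard `csv` module. This is a minimal sketch under assumptions: the exact cell layout, the machine name `Example-Reader`, and the well labels in `SAMPLE` are invented for illustration, and `read_plate` is a hypothetical helper rather than a snailz function. It is not meant to cope with the files in `mangled/`.

```python
import csv
import io

# Invented example of a tidy readings file; real files may be laid out differently.
SAMPLE = "Example-Reader,readings,7\n\n,A,B\n1,0.5,2.25\n2,0.75,3.0\n"


def read_plate(text):
    """Parse one tidy plate file into (header, table)."""
    rows = list(csv.reader(io.StringIO(text)))
    # First row: machine type, file type, staff ID.
    machine, kind, staff_id = rows[0][:3]
    header = {"machine": machine, "kind": kind, "staff_id": int(staff_id)}
    # Drop the blank separator line(s) that follow the header.
    body = [row for row in rows[1:] if any(cell.strip() for cell in row)]
    columns = body[0][1:]  # column titles, skipping the empty corner cell
    table = {}
    for row in body[1:]:
        label, values = row[0], row[1:]
        for column, value in zip(columns, values):
            table[(label, column)] = float(value)
    return header, table


header, table = read_plate(SAMPLE)
print(header["kind"], table[("2", "A")])  # readings 0.75
```

Design files would need `float(value)` replaced by the raw text, since their wells hold material labels rather than numbers.
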
## Workflow
The workflow used to generate the database and data files is shown below:
- `snailz` or `snailz --help`: show available commands
- `snailz clean`: remove all datasets
- `snailz everything`: make all datasets
- `snailz grids`: synthesize pollution grids
- `snailz genomes`: synthesize genomes
- `snailz samples`: sample snails from survey sites
- `snailz staff`: synthesize staff
- `snailz assays`: generate assay files
- `snailz plates`: generate plate files
- `snailz mangle`: create mangled plate reading files
- `snailz db`: generate database
- `snailz map`: generate SVG map of sample locations (in progress)

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/workflow.svg" alt="data generation workflow">

## Parameters

`./snailz/params` contains the parameter files used to control generation of the reference dataset.
These are included in the package and can be copied into the current directory using `snailz params --outdir .`
(replace `.` with another directory name as desired).
`snailz params` also copies a Makefile that can re-run commands with appropriate parameters;
see the list of commands in the Workflow section above for options.

- Sites: `sites.csv`
    - `site_id`: unique label for site (text)
    - `lon`: longitude of site reference marker (deg)
    - `lat`: latitude of site reference marker (deg)
- Grids: `grids.json`
    - `depth`: range of random values per cell (int > 0)
    - `height`: number of cells on Y axis (int > 0)
    - `seed`: RNG seed (int > 0)
    - `width`: number of cells on X axis (int > 0)
- Surveys: `surveys.csv`
    - `survey_id`: unique label for survey (text)
    - `site_id`: ID of site where survey was conducted (text)
    - `date`: date that survey was conducted (date, YYYY-MM-DD)
    - `spacing`: spacing between measurement points (float, meters)
- Genomes: `genomes.json`
    - `length`: number of base pairs in sequences (int > 0)
    - `num_genomes`: how many individuals to generate (int > 0)
    - `num_snp`: number of single nucleotide polymorphisms (int > 0)
    - `prob_other`: probability of non-significant mutations (float in 0..1)
    - `seed`: RNG seed (int > 0)
    - `snp_probs`: probability of selecting various bases (list of 4 float summing to 1.0)
- Staff: `staff.json`
    - `locale`: locale to use when generating staff names (text)
    - `num`: number of staff (int > 0)
    - `seed`: RNG seed (int > 0)
- Assays: `assays.json`
    - `assay_duration`: range of days for each assay (ordered pair of int >= 0)
    - `assay_plates`: range of plates per assay (ordered pair of int >= 1)
    - `assay_staff`: range of staff in each assay (ordered pair of int > 0)
    - `assay_types`: types of assays (list of text)
    - `control_val`: nominal reading value for control wells (float > 0)
    - `controls`: labels to use for control wells (list of text)
    - `enddate`: end of all experiments
    - `filename_length`: length of stem of design/readings filenames (int > 0)
    - `fraction`: fraction of samples that have been used in experiments
    - `invalid`: probability of plate being invalidated (float in 0..1)
    - `seed`: RNG seed (int > 0)
    - `startdate`: start of all experiments
    - `stdev`: standard deviation on readings (float > 0)
    - `treated_val`: nominal reading value for treated well (float > 0)
    - `treatment`: label to use for treated wells (text)
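
Before editing a parameter file it can help to check it against the constraints listed above. The sketch below does this for the genome parameters only. `check_genome_params` is a hypothetical helper, not part of snailz, and the parameter values are made up; they are not the package defaults.

```python
import json
import math


def check_genome_params(params):
    """Return a list of problems found in a genomes parameter dict."""
    problems = []
    for key in ("length", "num_genomes", "num_snp", "seed"):
        value = params.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be an int > 0")
    prob_other = params.get("prob_other")
    if not isinstance(prob_other, (int, float)) or not 0.0 <= prob_other <= 1.0:
        problems.append("prob_other must be a float in 0..1")
    snp_probs = params.get("snp_probs")
    if not isinstance(snp_probs, list) or len(snp_probs) != 4:
        problems.append("snp_probs must be a list of 4 floats")
    elif not all(isinstance(p, (int, float)) for p in snp_probs):
        problems.append("snp_probs must contain only numbers")
    elif not math.isclose(sum(snp_probs), 1.0, abs_tol=1e-9):
        problems.append("snp_probs must sum to 1.0")
    return problems


# Made-up values for illustration; in practice load them with
# json.load(open("params/genomes.json")).
good = json.loads(
    '{"length": 20, "num_genomes": 5, "num_snp": 3,'
    ' "prob_other": 0.01, "seed": 1234, "snp_probs": [0.5, 0.25, 0.125, 0.125]}'
)
bad = dict(good, snp_probs=[0.5, 0.5, 0.5, 0.5])
print(check_genome_params(good))  # []
print(check_genome_params(bad))   # ['snp_probs must sum to 1.0']
```
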