-   Package: snailz 0.1.15 (PyPI)
-   Summary: Synthetic data generator for snail mutation survey
-   Keywords: open science, synthetic data
-   Uploaded: 2024-12-14 11:35:27
-   Contact: Greg Wilson <gvwilson@third-bit.com>

# Snailz

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/snail-logo.svg" alt="snail logo" width="200px">

These data generators model genomic analysis of snails in the Pacific Northwest
that are growing to unusual size as a result of exposure to pollution.

-   One or more *surveys* are conducted at one or more *sites*.
-   Each survey collects *genomes* and *sizes* of snails.
-   A *grid* at each site is marked out to show the presence or absence of pollution.
-   *Laboratory staff* perform *assays* of the snails' genetic material.
-   Each assay plate has a *design* showing the material applied and *readings* showing the measured response.
-   A plate may be *invalidated* after the fact if a staff member believes it is contaminated.

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/survey.png" alt="survey sites">

## Usage

1.  Create a fresh Python environment: `uv venv`
2.  Activate that environment: `source .venv/bin/activate`
3.  Install the package in development (editable) mode: `uv pip install -e .`
4.  View available commands: `snailz --help`
5.  Copy default parameter files: `snailz params --outdir ./params`
6.  See how to regenerate datasets: `python -c 'import snailz; help(snailz)'`

To regenerate all data using the default parameters provided, run:

```
snailz everything --paramsdir ./params --datadir ./data --verbose
```
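
The same invocation can be scripted from Python. The following is a minimal sketch rather than part of the package: `build_command` and `regenerate` are hypothetical helpers, and they use only the sub-command and flags shown above.

```python
import shutil
import subprocess


def build_command(paramsdir="./params", datadir="./data", verbose=True):
    """Assemble the argument list for `snailz everything`."""
    cmd = ["snailz", "everything", "--paramsdir", paramsdir, "--datadir", datadir]
    if verbose:
        cmd.append("--verbose")
    return cmd


def regenerate(**kwargs):
    """Run the command, failing early if `snailz` is not on the PATH."""
    if shutil.which("snailz") is None:
        raise RuntimeError("snailz is not installed in this environment")
    subprocess.run(build_command(**kwargs), check=True)


# Show the command without running it.
print(" ".join(build_command()))
```

Calling `regenerate()` inside the activated environment should then behave like the shell command.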

## Database

The final database `data/lab.db` is structured as shown below.
Note that the data from the file `assays.json` is split between several tables.
Note also that the SQLite database file is *not* included in this repository
because its binary representation changes each time it is regenerated
(even though the values it contains stay the same).
The map of survey locations in `data/survey.png` is not included in the repository for the same reason,
but a duplicate is manually saved in `img/survey.png`.

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/db-schema.svg" alt="database schema">

-   `site`: survey site
    -   `site_id`: primary key (text)
    -   `lon`: longitude of site reference marker (float deg)
    -   `lat`: latitude of site reference marker (float deg)
-   `survey`
    -   `survey_id`: primary key (text)
    -   `site_id`: foreign key of site where survey was conducted (text)
    -   `date`: date that survey was conducted (date, YYYY-MM-DD)
-   `sample`: sample taken from survey
    -   `sample_id`: primary key (int, 1-1 with `experiment.sample_id`)
    -   `survey_id`: foreign key of survey (text)
    -   `lon`: longitude of sample site (float deg)
    -   `lat`: latitude of sample site (float deg)
    -   `sequence`: genome sequence of sample (text)
    -   `size`: snail size (float)
-   `experiment`: experiment done on sample
    -   `sample_id`: primary key (int, 1-1 with `sample.sample_id`)
    -   `kind`: kind of experiment (text, either 'ELISA' or 'JESS')
    -   `start`: start date (date, YYYY-MM-DD)
    -   `end`: end date (date, YYYY-MM-DD, null if experiment is ongoing)
-   `staff`
    -   `staff_id`: primary key (int)
    -   `personal`: personal name (text)
    -   `family`: family name (text)
-   `performed`: join table showing which staff members performed which experiments
    -   `staff_id`: foreign key of staff member
    -   `sample_id`: foreign key of sample/experiment
-   `plate`: information about a single assay plate
    -   `plate_id`: primary key (int)
    -   `sample_id`: foreign key of sample/experiment (int)
    -   `date`: date that plate was run (date, YYYY-MM-DD)
    -   `filename`: filename of design/results file (text)
-   `invalidated`: invalidated plates
    -   `plate_id`: foreign key of plate (int)
    -   `staff_id`: foreign key of staff member who did invalidation (int)
    -   `date`: date that plate was invalidated (date, YYYY-MM-DD)
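
As an illustration of how these tables relate, here is a minimal sketch that uses Python's standard `sqlite3` module with an in-memory database. The table and column names come from the list above, but the SQL types, the subset of tables, and every row of data are assumptions made for the example; they are not taken from the generated `lab.db`.

```python
import sqlite3

# Assumed DDL: names follow the schema above, types are guesses,
# and only three of the tables are reproduced.
SCHEMA = """
create table staff (staff_id integer primary key, personal text, family text);
create table plate (plate_id integer primary key, sample_id integer, date text, filename text);
create table invalidated (plate_id integer, staff_id integer, date text);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows for illustration only.
conn.execute("insert into staff values (1, 'Ada', 'Example')")
conn.executemany(
    "insert into plate values (?, ?, ?, ?)",
    [(1, 10, "2024-01-05", "aaaa.csv"), (2, 10, "2024-01-06", "bbbb.csv")],
)
conn.execute("insert into invalidated values (2, 1, '2024-01-07')")

# Plates that have not been invalidated.
valid = conn.execute(
    """
    select plate.plate_id, plate.filename
    from plate left join invalidated on plate.plate_id = invalidated.plate_id
    where invalidated.plate_id is null
    order by plate.plate_id
    """
).fetchall()

# Which staff member invalidated which plate, and when.
who = conn.execute(
    """
    select invalidated.plate_id, staff.personal, staff.family, invalidated.date
    from invalidated join staff on invalidated.staff_id = staff.staff_id
    """
).fetchall()

print(valid)
print(who)
```

To run the same queries against generated data, replace `":memory:"` with the path to `data/lab.db` and drop the `create table` and `insert` statements.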

## Data Files

`./data` contains a generated dataset for reference.
As noted above,
it does *not* contain the SQLite database file `lab.db`;
run `snailz db` to regenerate it.
(See `help(snailz)` for an example invocation.)

-   Staff: `staff.csv`
    -   `staff_id`: unique staff member identifier (int > 0)
    -   `personal`: personal name (text)
    -   `family`: family name (text)
-   Genomes: `genomes.json`
    -   `length`: number of base pairs (int > 0)
    -   `reference`: the unmutated reference genome (text)
    -   `individuals`: sequences for individuals (list of text)
    -   `locations`: locations of mutations (list of int)
    -   `susceptible_loc`: location of mutation of interest (int >= 0)
    -   `susceptible_base`: mutated base responsible for size change (char)
-   Grids: `grids/*.csv` (one file per site)
    -   values are contamination levels at sample points (0 means no contamination)
-   Samples: `grids/samples.csv`
    -   `sample_id`: unique ID for genetic sample (text)
    -   `survey_id`: which survey it was taken in (text)
    -   `lon`: longitude of sample site (float)
    -   `lat`: latitude of sample site (float)
    -   `sequence`: sampled gene sequence (text)
    -   `size`: snail weight (float, grams)
-   Assays: `assays.json`
    -   `experiment`: experiment details
        -   `sample_id`: sample that experiment used (int > 0)
        -   `kind`: "ELISA" or "JESS" (text)
        -   `start`: start date (date, YYYY-MM-DD)
        -   `end`: end date (date, YYYY-MM-DD or None if experiment incomplete)
    -   `performed`: join table showing who performed which experiments
        -   `staff_id`: foreign key to `staff`
        -   `sample_id`: foreign key to `experiment`
    -   `plate`: details of assay plates used in experiments
        -   `plate_id`: unique plate identifier (int > 0)
        -   `sample_id`: foreign key to `sample` (text)
        -   `date`: date plate was run (date, YYYY-MM-DD)
        -   `filename`: name of design and results files (text)
    -   `invalidated`: which plates have been invalidated
        -   `plate_id`: foreign key to plate (text)
        -   `staff_id`: foreign key to staff member responsible (text)
        -   `date`: invalidation date (date, YYYY-MM-DD)
-   Plates are represented by matching files in the `designs` and `readings` directories
    -   `designs/*.csv`: assay plate designs
        -   header: machine type, file type ("design" or "readings"), staff ID
        -   blank line
        -   table with column and row titles showing material in each well
    -   `readings/*.csv`: assay plate readings
        -   header: machine type, file type ("design" or "readings"), staff ID
        -   blank line
        -   table with column and row titles showing reading from each well
-   To simulate the messiness of real experimental data,
    the tidy assay plate files in `readings/*.csv` are copied to `mangled/*.csv`
    with random changes:
    -   Some files have a staff member's name added in the first row.
    -   Some have an extra header row containing the experiment date.
    -   Some have a footer with the staff member's ID.
    -   In some, the values are offset one column to the right.
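
The tidy plate layout described above (header, blank line, table) can be read with the standard `csv` module. This is a sketch under stated assumptions: the sample text, the machine name `ExampleReader`, and the well-naming convention (column title followed by row title) are invented for illustration, and `read_plate` is a hypothetical helper. It does not handle the mangled variants.

```python
import csv
import io

# Hypothetical file content following the header / blank line / table layout.
SAMPLE = """\
ExampleReader,readings,3

,A,B,C
1,1.52,0.31,1.48
2,0.29,1.55,0.33
"""


def read_plate(stream):
    """Parse one tidy plate file into a header and a dict of well readings."""
    rows = list(csv.reader(stream))
    header, blank, titles, *body = rows
    if blank:
        raise ValueError("expected a blank line after the header")
    machine, kind, staff_id = header[:3]
    columns = titles[1:]  # first cell of the title row is empty
    wells = {}
    for row in body:
        if not row:  # tolerate trailing blank lines
            continue
        label, *values = row
        for column, value in zip(columns, values):
            wells[f"{column}{label}"] = float(value)
    return {"machine": machine, "kind": kind, "staff_id": int(staff_id), "wells": wells}


plate = read_plate(io.StringIO(SAMPLE))
print(plate["kind"], plate["staff_id"], plate["wells"]["B2"])
```

Design files have the same shape but hold text labels in the wells, so the `float` conversion would need to be dropped for them.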

## Workflow

The workflow used to generate the database and data files is shown below:

-   `snailz` or `snailz --help`: show available commands
-   `snailz clean`: remove all datasets
-   `snailz everything`: make all datasets
-   `snailz grids`: synthesize pollution grids
-   `snailz genomes`: synthesize genomes
-   `snailz samples`: sample snails from survey sites
-   `snailz staff`: synthesize staff
-   `snailz assays`: generate assay files
-   `snailz plates`: generate plate files
-   `snailz mangle`: create mangled plate reading files
-   `snailz db`: generate database
-   `snailz map`: generate SVG map of sample locations (in progress)

<img src="https://raw.githubusercontent.com/gvwilson/snailz/main/img/workflow.svg" alt="data generation workflow">

## Parameters

`./snailz/params` contains the parameter files used to control generation of the reference dataset.
These are included in the package and can be copied into the current directory using `snailz params --outdir .`
(replace `.` with another directory name as desired).
`snailz params` also copies a Makefile that can re-run commands with appropriate parameters;
see the list of commands in the Workflow section for options.

-   Sites: `sites.csv`
    -   `site_id`: unique label for site (text)
    -   `lon`: longitude of site reference marker (deg)
    -   `lat`: latitude of site reference marker (deg)
-   Grids: `grids.json`
    -   `depth`: range of random values per cell (int > 0)
    -   `height`: number of cells on Y axis (int > 0)
    -   `seed`: RNG seed (int > 0)
    -   `width`: number of cells on X axis (int > 0)
-   Surveys: `surveys.csv`
    -   `survey_id`: unique label for survey (text)
    -   `site_id`: ID of site where survey was conducted (text)
    -   `date`: date that survey was conducted (date, YYYY-MM-DD)
    -   `spacing`: spacing between measurement points (float, meters)
-   Genomes: `genomes.json`
    -   `length`: number of base pairs in sequences (int > 0)
    -   `num_genomes`: how many individuals to generate (int > 0)
    -   `num_snp`: number of single nucleotide polymorphisms (int > 0)
    -   `prob_other`: probability of non-significant mutations (float in 0..1)
    -   `seed`: RNG seed (int > 0)
    -   `snp_probs`: probability of selecting various bases (list of 4 float summing to 1.0)
-   Staff: `staff.json`
    -   `locale`: locale to use when generating staff names (text)
    -   `num`: number of staff (int > 0)
    -   `seed`: RNG seed (int > 0)
-   Assays: `assays.json`
    -   `assay_duration`: range of days for each assay (ordered pair of int >= 0)
    -   `assay_plates`: range of plates per assay (ordered pair of int >= 1)
    -   `assay_staff`: range of staff in each assay (ordered pair of int > 0)
    -   `assay_types`: types of assays (list of text)
    -   `control_val`: nominal reading value for control wells (float > 0)
    -   `controls`: labels to use for control wells (list of text)
    -   `enddate`: end of all experiments
    -   `filename_length`: length of stem of design/readings filenames (int > 0)
    -   `fraction`: fraction of samples that have been used in experiments
    -   `invalid`: probability of plate being invalidated (float in 0..1)
    -   `seed`: RNG seed (int > 0)
    -   `startdate`: start of all experiments
    -   `stdev`: standard deviation on readings (float > 0)
    -   `treated_val`: nominal reading value for treated wells (float > 0)
    -   `treatment`: label to use for treated wells (text)
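
The constraints listed for each parameter can be checked before running a generator. The sketch below does this for the `genomes.json` parameters only. The parameter values are made up rather than copied from the package defaults, and `check_genome_params` is a hypothetical helper, not part of `snailz`.

```python
import json
import math

# Hypothetical parameter values; the keys follow the list above.
GENOMES_PARAMS = json.loads("""
{
    "length": 40,
    "num_genomes": 20,
    "num_snp": 5,
    "prob_other": 0.01,
    "seed": 12345,
    "snp_probs": [0.1, 0.2, 0.3, 0.4]
}
""")


def is_number(value):
    """True for int or float, but not bool."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_genome_params(params):
    """Return a list of constraint violations (empty if none)."""
    problems = []
    for key in ("length", "num_genomes", "num_snp", "seed"):
        value = params.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be an int > 0")
    prob_other = params.get("prob_other")
    if not is_number(prob_other) or not 0.0 <= prob_other <= 1.0:
        problems.append("prob_other must be a float in 0..1")
    probs = params.get("snp_probs")
    probs_ok = (
        isinstance(probs, list)
        and len(probs) == 4
        and all(is_number(p) for p in probs)
        and math.isclose(sum(probs), 1.0)
    )
    if not probs_ok:
        problems.append("snp_probs must be 4 floats summing to 1.0")
    return problems


print(check_genome_params(GENOMES_PARAMS))
```

Using `math.isclose` rather than `==` avoids rejecting probabilities whose sum differs from 1.0 only by floating-point rounding.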
