masterfile

Name	masterfile
Version	0.6.2
home_page	https://github.com/uwmadison-chm/masterfile
Summary	Tools to organize, document, and validate the variables of interest in scientific studies
upload_time	2024-05-24 17:13:05
maintainer	Nate Vack
docs_url	None
author	Nate Vack
requires_python	None
license	MIT License
keywords	science research data library
requirements	No requirements were recorded.

# masterfile

[![DOI](https://zenodo.org/badge/100069618.svg)](https://zenodo.org/badge/latestdoi/100069618)

## Tools to organize, document, and validate the variables of interest in scientific studies

## Command line usage

`masterfile --help` will list all the subcommands.

### Create

    masterfile create masterfile_path out_file

### Join

    masterfile join masterfile_path out_file

### Extract

    masterfile extract [-s|--skip ROWS] [--index_column COL]
                          masterfile_path csv_file out_file

### Validate

    masterfile validate masterfile_path [file [file ...]]


## Draft API usage example

```python
import masterfile
# Load all of the .csv files from /path and the dictionary files in
# /path/dictionaries. Settings are read from a 'settings.json' file in /path.
# The .csv files are joined on 'participant_id', which is used as the index.
# There will be warnings if the data look bad in some way.
mf = masterfile.load('/path')
# Get the associated pandas dataframe
df = mf.dataframe  # aliased as mf.df

# All the variable stuff is less important, people can go look in data dicts
# So we'll write that stuff later.
v = mf.lookup('sr_t1_panas_pa')
v.contacts # list_of_names
v.measure.contact  # Someone
v.modality # Component("self-report")
```

## CSV file format

CSV files should be comma-separated (no surprise there) and have DOS line endings (CRLF). They should not have the UTF-8 signature (byte order mark) at the start. UTF-8 characters are fine. Missing data is indicated by an empty cell. Quoting should follow Excel's conventions.

Basically, you want Excel-for-Windows-style CSV files with no UTF-8 signature.
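
As a rough illustration of that format, here is a minimal sketch that writes such a file using only Python's standard `csv` module. The function name `write_masterfile_csv` and the sample values are made up for this example; they are not part of masterfile's API.

```python
import csv


def write_masterfile_csv(path, header, rows):
    """Write an Excel-style CSV: commas, CRLF line endings, no UTF-8 signature.

    Hypothetical helper for illustration; not part of masterfile.
    """
    # encoding="utf-8" (rather than "utf-8-sig") keeps the signature out of the file.
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        # The "excel" dialect ends rows with \r\n and quotes only when needed.
        writer = csv.writer(f, dialect="excel")
        writer.writerow(header)
        for row in rows:
            # Missing data (None) becomes an empty cell.
            writer.writerow(["" if value is None else value for value in row])
```

Calling `write_masterfile_csv("data.csv", ["participant_id", "sr_t1_panas_pa"], [["p001", 3.5], ["p002", None]])` would produce a file whose second data row ends in an empty cell.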

## Dictionaries

* CSV format
* Has AT LEAST two columns: component, short_name
* Those two columns are the index
* There shouldn't be any repeats in the index
* The settings.json file should contain a "components" entry that says which values may appear in the component column
* Rows with a blank component are ignored (TODO: Maybe?)
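
The rules above can be sketched as a small standalone check. This is only an illustration built on the standard library, not masterfile's own validator: the function name `check_dictionary` is hypothetical, and the allowed components are passed in directly instead of being read from settings.json.

```python
import csv
import io


def check_dictionary(csv_text, allowed_components):
    """Return a list of problems found in one dictionary file's text.

    Hypothetical helper for illustration; not part of masterfile.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    fieldnames = reader.fieldnames or []
    missing = [c for c in ("component", "short_name") if c not in fieldnames]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    problems = []
    seen = set()
    for row_number, row in enumerate(reader, start=1):
        component, short_name = row["component"], row["short_name"]
        if not component:
            continue  # rows with a blank component are ignored
        key = (component, short_name)
        if key in seen:
            problems.append(f"data row {row_number}: repeated index {key}")
        seen.add(key)
        if component not in allowed_components:
            problems.append(f"data row {row_number}: unknown component {component!r}")
    return problems
```

An empty list means the file passed these particular checks; any other column in the file is ignored here.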


## Exclusion files

* CSV format
* Live in exclusions/
* One row per participant, one column per value
* Has index column, same as data file
* Blanks mean "Use this value," nonblanks mean "exclude this value"
* Cell contents may be codes; these codes may be defined in settings.json
* If data is excluded for more than one reason, separate codes with ","
* Not all rows / columns in masterfiles need to be included in exclusion files. Missing rows / columns are treated like blank values.
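
To make those semantics concrete, here is a minimal sketch of how a cell could be interpreted. Both function names and the example codes are hypothetical, and treating a whitespace-only cell as blank is an assumption of this sketch.

```python
def exclusion_codes(cell):
    """Split one exclusion cell into its codes; a blank cell yields no codes."""
    if cell is None:
        return []
    # Multiple reasons are separated with ","; surrounding whitespace is dropped.
    return [code.strip() for code in cell.split(",") if code.strip()]


def is_excluded(exclusions, participant_id, column):
    """Decide whether a value is excluded.

    `exclusions` maps participant_id -> {column name: cell text}.
    """
    # Rows or columns missing from the exclusion file act like blank cells,
    # which means "use this value".
    cell = exclusions.get(participant_id, {}).get(column, "")
    return bool(exclusion_codes(cell))
```

For example, with `exclusions = {"p001": {"sr_t1_panas_pa": "motion,incomplete"}}`, the value for `p001` is excluded for two reasons, while any participant or column not listed is kept.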


## Data checks

Here are some (all?) of the things to do to verify you have semantically reasonable data:

* Variable parts not in dictionaries
* Missing participant_id column
* Repeated participant_id column
* Blanks in participant_id column
* Duplicate columns
* Column names not matching format
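
A few of these checks are simple enough to sketch with the standard library. The snippet below is an illustration only, not masterfile's validator: `check_data_csv` is a hypothetical name, and the dictionary lookup and column-name format checks are left out because they depend on project settings.

```python
import csv
import io
from collections import Counter


def check_data_csv(csv_text, index_column="participant_id"):
    """Return a list of structural problems found in one data file's text."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]

    problems = []
    counts = Counter(header)
    for name, count in counts.items():
        if count > 1 and name != index_column:
            problems.append(f"duplicate column: {name}")

    if counts[index_column] == 0:
        problems.append(f"missing {index_column} column")
        return problems
    if counts[index_column] > 1:
        problems.append(f"repeated {index_column} column")

    # Look for blank index values, using the first index column found.
    position = header.index(index_column)
    for row_number, row in enumerate(data, start=1):
        value = row[position] if position < len(row) else ""
        if not value.strip():
            problems.append(f"blank {index_column} in data row {row_number}")
    return problems
```

An empty result means none of these structural problems were found; it says nothing about whether the variable names match the dictionaries.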

## Getting started for development

Create a virtualenv:

    virtualenv ~/env/masterfile
    source ~/env/masterfile/bin/activate

Install the requirements and this module for development:

    pip install -r requirements_dev.txt
    pip install -e .

Run tests:

    pytest

Run tests across all supported Python versions:

    tox

To run tests under a specific Python version:

    tox -e py37

## Credits

Written by Nate Vack <njvack@wisc.edu> with help from Dan Fitch <dfitch@wisc.edu>

masterfile packages some wonderful tools: [schema](https://github.com/halst/schema) and [attrs](https://github.com/python-attrs/attrs).

schema is copyright (c) 2012 Vladimir Keleshev, vladimir@keleshev.com

attrs is copyright (c) 2015 Hynek Schlawack
