# masterfile
[![DOI](https://zenodo.org/badge/100069618.svg)](https://zenodo.org/badge/latestdoi/100069618)
## Tools to organize, document, and validate the variables of interest in scientific studies
## Command line usage
`masterfile --help` will list all the subcommands.
### Create
    masterfile create masterfile_path out_file
### Join
    masterfile join masterfile_path out_file
### Extract
    masterfile extract [-s|--skip ROWS] [--index_column COL]
                       masterfile_path csv_file out_file
### Validate
    masterfile validate masterfile_path [file [file ...]]
## Draft API usage example
```python
import masterfile
# Load all of the .csv files from /path, and the dictionary files in
# /path/dictionaries. Settings come from a 'settings.json' file in /path.
# The .csv files are joined on 'participant_id', which becomes the index.
# Warnings are issued if the data look bad in some way.
mf = masterfile.load('/path')
# Get the associated pandas DataFrame
df = mf.dataframe # aliased as mf.df
# Variable lookup is less important (people can go look in the data
# dictionaries), so that part will be written later.
v = mf.lookup('sr_t1_panas_pa')
v.contacts # list_of_names
v.measure.contact # Someone
v.modality # Component("self-report")
```
## CSV file format
CSV files should be comma-separated (no surprise there) and have DOS line endings (CRLF). They should not start with a UTF-8 signature (byte order mark, BOM). UTF-8 characters are fine. Missing data is indicated by an empty cell. Quoting should follow Excel's conventions.
Basically, you want Excel-for-Windows-style CSV files with no UTF-8 signature.
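
As a minimal sketch (not part of masterfile itself), Python's standard `csv` module can produce a file in this shape. The function name `write_masterfile_csv` is hypothetical, and the sketch assumes `None` stands for missing data.

```python
import csv


def write_masterfile_csv(path, header, rows):
    """Write rows as an Excel-style CSV: CRLF line endings, UTF-8, no BOM."""
    # encoding="utf-8" (not "utf-8-sig") leaves out the UTF-8 signature.
    # newline="" keeps Python from translating the "\r\n" the csv module writes.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, dialect="excel")  # Excel quoting, CRLF endings
        writer.writerow(header)
        for row in rows:
            # Missing data (None, by assumption) becomes an empty cell.
            writer.writerow(["" if value is None else value for value in row])
```

When reading such a file back, opening it with `encoding="utf-8"` and `newline=""` keeps the same conventions.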
## Dictionaries
* CSV format
* Has AT LEAST two columns: component, short_name
* Those are the indexes
* There shouldn't be any repeats in the index
* The settings.json file should contain a "components" entry that says which values may appear in the component column
* Rows with a blank component are ignored (TODO: Maybe?)
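
The rules above could be checked with something like the following sketch. It is not masterfile's own validation code: `check_dictionary` and its messages are hypothetical, the allowed components are passed in directly rather than read from settings.json, and the index is assumed to be the (component, short_name) pair.

```python
import csv
import io


def check_dictionary(csv_text, allowed_components):
    """Return a list of problems found in one dictionary CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [c for c in ("component", "short_name") if c not in fieldnames]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    seen = set()
    for row in reader:
        component = row["component"] or ""
        short_name = row["short_name"] or ""
        if component == "":
            continue  # rows with a blank component are ignored
        if component not in allowed_components:
            problems.append("unknown component: " + component)
        key = (component, short_name)  # assumed: the pair is the index
        if key in seen:
            problems.append("repeated index entry: %s, %s" % key)
        seen.add(key)
    return problems
```

An empty list means the dictionary passed every check in the sketch.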
## Exclusion files
* CSV format
* Live in exclusions/
* One row per participant (ppt), one column per value
* Has index column, same as data file
* Blanks mean "Use this value," nonblanks mean "exclude this value"
* Things in the cells may be codes; these codes may be defined in settings.json
* If data is excluded for more than one reason, separate codes with ","
* Not all rows / columns in masterfiles need to be included in exclusion files. Missing rows / columns are treated like blank values.
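
To make the blank/nonblank rule concrete, here is a small sketch using plain dictionaries instead of CSV files. The names `parse_codes` and `apply_exclusions` are hypothetical, as are the codes in the usage note; replacing an excluded value with `None` and treating whitespace-only cells as blank are assumptions of this sketch, not documented masterfile behavior.

```python
def parse_codes(cell):
    """Split an exclusion cell into its codes; a blank cell yields no codes."""
    return [code.strip() for code in (cell or "").split(",") if code.strip()]


def apply_exclusions(data, exclusions):
    """Return a copy of data with excluded values replaced by None.

    Both arguments map participant ID -> {column name: cell text}.
    """
    result = {}
    for ppt, row in data.items():
        # A participant missing from the exclusion file has all-blank cells.
        excluded_row = exclusions.get(ppt, {})
        result[ppt] = {}
        for column, value in row.items():
            # A missing column is treated like a blank cell as well.
            cell = excluded_row.get(column, "") or ""
            result[ppt][column] = None if cell.strip() else value
    return result
```

For example, a cell containing `motion,sleep` excludes that value and `parse_codes` reports both reasons, while rows and columns absent from the exclusion data pass through unchanged.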
## Data checks
Here are some (all?) of the problems to check for to verify you have semantically reasonable data:
* Variable parts not in dictionaries
* Missing participant_id column
* Repeated participant_id column
* Blanks in participant_id column
* Duplicate columns
* Column names not matching format
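
A few of these checks can be sketched with the standard library alone. This is an illustration, not what `masterfile validate` runs: `check_data_file` and its messages are hypothetical, "repeated" is read here as covering both a duplicated column and repeated ID values, and the dictionary and column-name-format checks are left out because their rules are not spelled out in this README.

```python
import csv
import io
from collections import Counter


def check_data_file(csv_text, index_column="participant_id"):
    """Return a list of problems found in one data CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    # Duplicate columns (this also catches a repeated index column).
    for name, count in Counter(header).items():
        if count > 1:
            problems.append("duplicate column: " + name)
    if index_column not in header:
        problems.append("missing %s column" % index_column)
        return problems
    position = header.index(index_column)
    ids = [row[position] if position < len(row) else "" for row in body]
    if any(value == "" for value in ids):
        problems.append("blank value in %s column" % index_column)
    for value, count in Counter(ids).items():
        if value != "" and count > 1:
            problems.append("repeated %s: %s" % (index_column, value))
    return problems
```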
## Getting started for development
Create a virtualenv:

    virtualenv ~/env/masterfile
    source ~/env/masterfile/bin/activate

Install the requirements and this module for development:

    pip install -r requirements_dev.txt
    pip install -e .

Run tests:

    pytest

Run tests across all supported Python versions:

    tox

To run under a specific Python version:

    tox -e py37

## Credits
Written by Nate Vack <njvack@wisc.edu> with help from Dan Fitch <dfitch@wisc.edu>
masterfile packages some wonderful tools: [schema](https://github.com/halst/schema) and [attrs](https://github.com/python-attrs/attrs).
schema is copyright (c) 2012 Vladimir Keleshev, vladimir@keleshev.com
attrs is copyright (c) 2015 Hynek Schlawack