# Sample Metadata
[![codecov](https://codecov.io/gh/populationgenomics/sample-metadata/branch/main/graph/badge.svg)](https://codecov.io/gh/populationgenomics/sample-metadata)
The sample-metadata system is a database that stores **de-identified** metadata.
There are three components to the sample-metadata system:
- A system-versioned MariaDB database,
- A Python web API that manages permissions and stores frequently used queries, and
- An installable Python library that wraps the web API (generated with OpenAPI Generator).
Every resource in sample-metadata belongs to a project. Access to all resources is
controlled through membership of the Google Groups
`$dataset-sample-metadata-main-{read,write}`. Note that group members
are cached in a secret, as group-membership identity checks are slow.
## API
There are two ways to query metamist in Python:
1. Use the REST interface with the predefined requests
2. Use the GraphQL interface.
To use the GraphQL interface in Python with the `sample_metadata` library, you can do the following:
```python
from sample_metadata.graphql import query
_query = """
query YourQueryNameHere($sampleId: String!) {
sample(id: $sampleId) {
id
externalId
}
}
"""
print(query(_query, {"sampleId": "CPG18"}))
```
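For the REST interface, the generated `sample_metadata` client wraps each route in a per-resource API class. Below is a minimal sketch; the class and method names (`SampleApi`, `get_sample_by_external_id`) are assumptions based on a typical OpenAPI-generated client, so verify them against the generated package or the Swagger UI:

```python
# A hedged sketch of the REST interface. SampleApi and
# get_sample_by_external_id are assumed names; confirm them against the
# generated client before relying on this.
from sample_metadata.apis import SampleApi

sample_api = SampleApi()
sample = sample_api.get_sample_by_external_id(
    external_id='EXT123',  # hypothetical external sample ID
    project='dev',
)
print(sample)
```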
## Structure

### Sample IDs
In an effort to reduce our dependency on potentially mutable external sample IDs with inconsistent formats,
the sample-metadata server generates an internal sample ID for every sample. Internally these are
incrementing integers, but externally they are presented with a prefix and a checksum. This makes
sample IDs more durable to transcription, reduces mistypes, and allows a quick check of whether a sample ID is valid.
> NB: The prefix and checksum are modified per environment (production, development, local) to avoid duplicate IDs across environments.
For example, let's consider the production environment, which uses the prefix `CPG` and a checksum offset of 0:
> A sample is given the internal ID `12345`; we calculate the Luhn checksum to be `5` (with no offset applied).
> We then concatenate the parts, giving the final sample ID `CPG123455`.
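As a worked sketch of this scheme, the snippet below reproduces the example above. How the per-environment offset enters the checksum is an assumption here (it is simply added into the Luhn sum); with an offset of 0 it matches the production example as described:

```python
# Minimal sketch of the internal -> external sample ID scheme. The offset
# handling is an assumption; with offset=0 this matches the example above.
def luhn_check_digit(payload: int, offset: int = 0) -> int:
    total = offset
    # walk the digits right to left, doubling every second digit
    for i, ch in enumerate(reversed(str(payload))):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def external_sample_id(internal_id: int, prefix: str = 'CPG', offset: int = 0) -> str:
    return f'{prefix}{internal_id}{luhn_check_digit(internal_id, offset)}'


assert external_sample_id(12345) == 'CPG123455'
```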
### Reporting sex
To avoid ambiguity in reporting gender, sex and karyotype, the sample-metadata system
stores these values separately on the `participant` as:
- `reported_gender` (string, expected `male` | `female` | _other values_)
- `reported_sex` (follows pedigree convention: `unknown=0 | null`, `male=1`, `female=2`)
- `inferred_karyotype` (string, eg: `XX` | `XY` | _other karyotypes_)
If you import a pedigree, the sex value is written to the `reported_sex` attribute.
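For illustration only (this is not the server's actual model), the pedigree convention above can be expressed as a small mapping:

```python
from typing import Optional

# Illustrative sketch of the pedigree sex convention described above.
PED_SEX = {0: 'unknown', 1: 'male', 2: 'female'}

def reported_sex_from_ped(ped_sex: int) -> Optional[int]:
    # unknown may be stored as 0 or null; male (1) and female (2) as-is
    return ped_sex if ped_sex in (1, 2) else None
```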
## Local development of SM
The recommended way to develop the sample-metadata system is to run a local copy of SM.
> There have been some reported issues running a local SM environment on an M1 Mac.
You can run MariaDB either locally or from within a Docker container,
and configure the MariaDB connection with environment variables.
### Creating the environment
Dependencies for the `sample-metadata` API package are listed in `setup.py`.
Additional dev requirements are listed in `requirements-dev.txt`, and packages for
the server-side code are listed in `requirements.txt`.
To create the full dev environment, run:
```shell
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install --editable .
```
### Default DB set-up
These are the default values for the SM database connection.
Please alter them if you use any different values when setting up the database.
```shell
export SM_DEV_DB_USER=root
export SM_DEV_DB_PASSWORD= # empty password
export SM_DEV_DB_HOST=127.0.0.1
export SM_DEV_DB_PORT=3306 # default mariadb port
```
Create the database in MariaDB (by default, we call it `sm_dev`).
If you use a different database name, also set the following:
```shell
export SM_DEV_DB_NAME=sm_database_name
```
> Sample-metadata stores all metadata in one database (_previously: one database per project_).
```shell
mysql -u root --execute 'CREATE DATABASE sm_dev'
```
Download the `mariadb-java-client` and create the schema using liquibase:
```shell
pushd db/
wget https://repo1.maven.org/maven2/org/mariadb/jdbc/mariadb-java-client/3.0.3/mariadb-java-client-3.0.3.jar
liquibase \
--changeLogFile project.xml \
--url jdbc:mariadb://localhost/sm_dev \
--driver org.mariadb.jdbc.Driver \
--classpath mariadb-java-client-3.0.3.jar \
--username root \
update
popd
```
#### Using the MariaDB Docker image
Pull the MariaDB image:
```bash
docker pull mariadb
```
Run a MariaDB container to serve your database. `-p 3307:3306` remaps the port to 3307, in case your local MySQL is already using 3306:
```bash
docker stop mysql-p3307 # stop and remove if the container already exists
docker rm mysql-p3307
# run with an empty root password
docker run -p 3307:3306 --name mysql-p3307 -e MYSQL_ALLOW_EMPTY_PASSWORD=true -d mariadb
```
```bash
mysql --host=127.0.0.1 --port=3307 -u root -e 'CREATE DATABASE sm_dev;'
mysql --host=127.0.0.1 --port=3307 -u root -e 'show databases;'
```
Go into the `db/` subdirectory, download the `mariadb-java-client` and create the schema using liquibase:
```bash
pushd db/
wget https://repo1.maven.org/maven2/org/mariadb/jdbc/mariadb-java-client/3.0.3/mariadb-java-client-3.0.3.jar
liquibase \
--changeLogFile project.xml \
--url jdbc:mariadb://127.0.0.1:3307/sm_dev \
--driver org.mariadb.jdbc.Driver \
--classpath mariadb-java-client-3.0.3.jar \
--username root \
update
popd
```
Finally, configure the server (via the environment variables) to point at your local MariaDB server:
```bash
export SM_DEV_DB_PORT=3307
```
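To sanity-check the connection before starting the server, here is a quick hedged sketch using PyMySQL (not one of the project's listed dependencies; `pip install pymysql` first):

```python
import pymysql

# Connect using the same values as the SM_DEV_DB_* variables above.
conn = pymysql.connect(host='127.0.0.1', port=3307, user='root', password='')
with conn.cursor() as cur:
    cur.execute('SHOW DATABASES')
    print([row[0] for row in cur.fetchall()])  # expect 'sm_dev' in the list
conn.close()
```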
### Running the server
You'll want to set the following environment variables (permanently) in your
local development environment.
```shell
# ensures the SWAGGER page (localhost:8000/docs) points to your local environment
export SM_ENVIRONMENT=LOCAL
# skips permission checks in your local environment
export SM_ALLOWALLACCESS=true
# start the server
python3 -m api.server
# OR
# uvicorn --port 8000 --host 0.0.0.0 api.server:app
```
In a different terminal, execute the following request to create a new project called `dev`:
```shell
curl -X 'PUT' \
'http://localhost:8000/api/v1/project/?name=dev&dataset=dev&gcp_id=dev&create_test_project=false' \
-H 'accept: application/json' \
-H "Authorization: Bearer $(gcloud auth print-identity-token)"
```
#### Quickstart: Generate and install the installable API
It's best to do this with an already running server:
```shell
python3 regenerate_api.py \
&& pip install .
```
#### Debugging the server in VSCode
VSCode allows you to debug Python modules; you can debug the web API at `api/server.py` with the following `launch.json`:
```json
{
"version": "0.2.0",
"configurations": [
{
"name": "API server",
"type": "python",
"request": "launch",
"module": "api.server"
}
]
}
```
You can now place breakpoints in the sample routes (e.g. `api/routes/sample.py`) and debug requests as they come in.
#### Developing the UI
```shell
# Ensure the SM server is already running locally, then open the UI
# in another tab; requests are automatically proxied to the server.
cd web
npm install
npm start
```
#### Unauthenticated access
You'll want to set the `SM_LOCALONLY_DEFAULTUSER` environment variable along with `SM_ALLOWALLACCESS` to allow access to a local sample-metadata server without providing a bearer token. This lets you test the front-end components that access data. On the production instance this happens automatically through the Google Identity-Aware Proxy.
```shell
export SM_ALLOWALLACCESS=1
export SM_LOCALONLY_DEFAULTUSER=$(whoami)
```
### OpenAPI and Swagger
The web API uses `apispec` with OpenAPI3 annotations on each route to describe interactions with the server. We can generate a Swagger UI and an installable
Python module based on these annotations.
Some handy links:
- [OpenAPI specification](https://swagger.io/specification/)
- [Describing parameters](https://swagger.io/docs/specification/describing-parameters/)
- [Describing request body](https://swagger.io/docs/specification/describing-request-body/)
- [Media types](https://swagger.io/docs/specification/media-types/)
The web API exposes this schema in two ways:
- Swagger UI: `http://localhost:8000/docs`
  - You can use this to construct requests to the server
  - Make sure you fill in the Bearer token (at the top right)
- OpenAPI schema: `http://localhost:8000/schema.json`
  - Returns a JSON document with the full OpenAPI 3 compliant schema
  - You could paste this into the [Swagger editor](https://editor.swagger.io/) to see the same Swagger UI that `/docs` exposes
  - We generate the `sample_metadata` installable Python API based on this schema
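As a quick sanity check that a locally running server (on port 8000, as above) is serving the schema:

```python
import requests

# Fetch the OpenAPI schema from the local server and print its metadata.
schema = requests.get('http://localhost:8000/schema.json').json()
print(schema['info']['title'], schema['openapi'])
```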
#### Generating the installable API
The installable API is automatically generated through the `package.yml` GitHub action and uploaded to PyPI.
To generate the Python API you'll need openapi-generator v5.x.x.
To install a specific version of the openapi-generator, do the following:
```bash
npm install @openapitools/openapi-generator-cli -g
openapi-generator-cli version-manager set 5.3.0
```
Then set the `OPENAPI_COMMAND` environment variable as follows. You can also add an
alias to your `~/.bash_profile` (or equivalent) for running it in the terminal:
```bash
export OPENAPI_COMMAND="npx @openapitools/openapi-generator-cli"
alias openapi-generator="npx @openapitools/openapi-generator-cli"
```
You can generate the installable API and install it with pip by running:
```bash
# this will start the api.server, so make sure you have the dependencies installed,
python regenerate_api.py \
&& pip install .
```
Or you can build the Docker image and point `regenerate_api.py` at it:
```bash
# SM_DOCKER is a known env variable to regenerate_api.py
export SM_DOCKER="cpg/sample-metadata-server:dev"
docker build --build-arg SM_ENVIRONMENT=local -t $SM_DOCKER -f deploy/api/Dockerfile .
python regenerate_api.py
```
## Deployment
To deploy the sample-metadata server, you'll want to complete the following steps:
- Ensure there is a database created for each project (with the database name being the project),
- Ensure there are secrets in `projects/sample_metadata/secrets/databases/versions/latest`; this is an array of objects with keys `dbname, host, port, username, password`, and
- Ensure `google-cloud` is installed.
```bash
export SM_ENVIRONMENT='PRODUCTION'
# OR, point to the dev instance with
export SM_ENVIRONMENT='DEVELOPMENT'
```