# fw-meta
Extract [Flywheel](https://flywheel.io/) upload metadata from
[fw_file](https://gitlab.com/flywheel-io/tools/lib/fw-file) `File` objects or
any mapping that has a dict-like interface.
The most common use case is scraping Flywheel group and project information from
DICOM tags where it was entered by a researcher at scan time through a scanner's
UI.
The group and project are required for placing (a.k.a. routing) uploaded files
correctly within the Flywheel hierarchy.
## Installation
Add as a `poetry` dependency to your project:
```bash
poetry add fw-meta
```
## Usage
Given
- `DICOM` context
- `PatientID` being an available and unused field on the scanner's UI
- `"neuro/Amnesia"` being entered in `PatientID`
- using the recommended extraction pattern
`"[fw://]{group}[/{project}[/{subject}[/{session}[/{acquisition}]]]]"`
The extracted metadata should be `{"group._id": "neuro", "project.label": "Amnesia"}`:
```python
from fw_meta import extract_meta
pattern = "[fw://]{group}[/{project}[/{subject}[/{session}[/{acquisition}]]]]"
data = dict(PatientID="neuro/Amnesia")
meta = extract_meta(data, mappings={"PatientID": pattern})
meta == {"group._id": "neuro", "project.label": "Amnesia"}
```
### Source fields
Metadata can be extracted from any source field, such as the tag values in the
case of DICOMs. An appropriate DICOM tag is one that is:
- available as a field on the scanner UI
- able to hold the routing string (i.e. long / versatile enough)
- not currently used by researchers (or repurposable)
Some recommended tags that have worked well previously:
- `PatientID`
- `PatientComments`
- `StudyComments`
- `ReferringPhysicianName`
### Extraction pattern mappings
Extraction patterns are simplified Python regexes tailored for scraping Flywheel
metadata [fields](fw_meta/imports.py) like `group._id` and `project.label` from
a string using capture groups.
The pattern syntax is shown through a series of examples below. All cases assume
the following context:
```python
from fw_meta import extract_meta
data = dict(PatientID="neuro_amnesia")
```
**Extracting a whole string as-is** is the simplest use case. For example, to
put `"neuro_amnesia"`, the value of `PatientID`, into a single Flywheel field
such as `group._id`, the pattern is simply the target field name, `group._id`:
```python
meta = extract_meta(data, mappings={"PatientID": "group._id"})
meta == {"group._id": "neuro_amnesia"}
```
The **simplified capture group notation using {curly braces}** gives more
flexibility to the patterns, for example by allowing substrings to be ignored:
```python
meta = extract_meta(data, mappings={"PatientID": "{group}_*"})
meta == {"group._id": "neuro"} # "_amnesia" was not captured in the group
```
Note how the capture group `group` resulted in the extraction of `group._id`.
This is because Flywheel groups are most commonly routed to by their `_id`
field, and two [**aliases**](fw_meta/aliases.py), `group` and `group.id`, are
configured to allow for simpler and more legible capture patterns.
The **simplified optional notation using [square brackets]** allows patterns
to match with or without an optional part:
```python
# the PatientID doesn't contain 2 underscores - the pattern matches w/o subject
pattern = "{group}_{project}[_{subject}]"
meta = extract_meta(data, mappings={"PatientID": pattern})
meta == {"group._id": "neuro", "project.label": "amnesia"}
# the PatientID contains the optional part thus the subject also gets extracted
data = dict(PatientID="neuro_amnesia_subject")
meta = extract_meta(data, mappings={"PatientID": pattern})
meta == {"group._id": "neuro", "project.label": "amnesia", "subject.label": "subject"}
```
The **recommended extraction pattern** has both capture curlies and optional
brackets: `"[fw://]{group}[/{project}[/{subject}[/{session}[/{acquisition}]]]]"`.
This pattern is:
- prefix-consistent with the `fw://group/Project` as displayed on the UI
- usable as an opt-in filter only including data if the value starts with `fw://`
- flexible enough to route to the correct group without the project
- flexible enough to specify custom subject/session/acquisition labels
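To make the bracket and curly notation more concrete, here is a minimal sketch of how such a simplified pattern could be translated into a standard Python regex. It is not fw-meta's actual implementation: `to_regex` and `parse` are hypothetical helpers, alias resolution (`group` to `group._id`) is left out, and each capture is assumed to match any run of non-slash characters.

```python
import re


def to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified pattern into a compiled regex (illustrative only)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            out.append("(?:")  # optional part opens a non-capturing group
        elif ch == "]":
            out.append(")?")  # ...which may match zero or one time
        elif ch == "{":
            end = pattern.index("}", i)
            name = pattern[i + 1 : end]
            out.append(f"(?P<{name}>[^/]+)")  # assumed: capture up to the next "/"
            i = end
        else:
            out.append(re.escape(ch))  # everything else is a literal
        i += 1
    return re.compile("".join(out))


RECOMMENDED = "[fw://]{group}[/{project}[/{subject}[/{session}[/{acquisition}]]]]"
REGEX = to_regex(RECOMMENDED)


def parse(value: str) -> dict:
    """Return the captured parts, dropping optional groups that did not match."""
    match = REGEX.fullmatch(value)
    if match is None:
        return {}
    return {key: val for key, val in match.groupdict().items() if val is not None}
```

With this sketch, `parse("neuro")` yields only the group, while `parse("fw://neuro/Amnesia/sub-01")` also yields the project and the subject.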
**Extracting multiple meta fields from a single value** can be done by adding
multiple groups with curly braces in the pattern. The following example captures
the group _and_ the project separated by an underscore:
```python
meta = extract_meta(data, mappings={"PatientID": "{group}_{project}"})
meta == {"group._id": "neuro", "project.label": "amnesia"}
```
**Extracting a single meta field from multiple values** is also possible by
treating the left-hand side as an f-string-style template to be formatted. This
example extracts `acquisition.label` as the concatenation of `SeriesNumber` and
`SeriesDescription`:
```python
data = dict(SeriesNumber="3", SeriesDescription="foo")
meta = extract_meta(data, mappings={"{SeriesNumber} - {SeriesDescription}": "acquisition"})
meta == {"acquisition.label": "3 - foo"}
```
Note that if any of the values appearing in the template are missing, then the
whole pattern is considered non-matching and will be skipped.
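The skip-on-missing behavior can be pictured with plain `str.format`. This is only a sketch of the idea, assuming the template is rendered roughly like a format string; `render_template` is a hypothetical helper, not part of fw-meta.

```python
from typing import Optional


def render_template(template: str, data: dict) -> Optional[str]:
    """Render the template from data, or return None if any field is missing."""
    try:
        return template.format(**data)
    except KeyError:
        # A referenced field is absent, so the whole template is non-matching.
        return None


template = "{SeriesNumber} - {SeriesDescription}"
complete = render_template(template, {"SeriesNumber": "3", "SeriesDescription": "foo"})
partial = render_template(template, {"SeriesNumber": "3"})
```

Here `complete` is `"3 - foo"`, while `partial` is `None` because `SeriesDescription` is missing.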
The **same capture group may appear in multiple patterns, providing a fallback**
mechanism where the first non-empty match wins. For example, to extract
`session.label` from `StudyComments` when it's available, but fall back to using
`StudyDate` if it isn't:
```python
data = dict(StudyDate="20001231", StudyComments="foo")
meta = extract_meta(data, mappings=[("StudyComments", "session"), ("StudyDate", "session")])
meta == {"session.label": "foo"}
data = dict(StudyDate="20001231") # no StudyComments
meta = extract_meta(data, mappings=[("StudyComments", "session"), ("StudyDate", "session")])
meta == {"session.label": "20001231"} # fall back to StudyDate
```
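The "first non-empty match wins" rule can be sketched without fw-meta as a simple ordered lookup. `first_non_empty` is a hypothetical helper used only to illustrate the ordering; the real library also applies the pattern to each value.

```python
from typing import Optional, Sequence


def first_non_empty(data: dict, fields: Sequence[str]) -> Optional[str]:
    """Return the first truthy value among the given fields, in order."""
    for field in fields:
        value = data.get(field)
        if value:  # skips missing keys, None, and empty strings
            return value
    return None


order = ["StudyComments", "StudyDate"]
with_comments = first_non_empty({"StudyDate": "20001231", "StudyComments": "foo"}, order)
without_comments = first_non_empty({"StudyDate": "20001231"}, order)
```

This is also why the fallback example passes `mappings` as a list of tuples: the order of the entries determines which source is tried first.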
**Capture groups may have a regex** defining what substrings the group should
match on:
```python
# match whole string into subject IF it starts with an "s" and is digits after
# (raw string so that "\d" is not treated as an invalid escape sequence)
pattern = r"{subject:s\d+}"
data = dict(PatientID="s123") # should match
meta = extract_meta(data, mappings={"PatientID": pattern})
meta == {"subject.label": "s123"}
data = dict(PatientID="foobar") # should not match
meta = extract_meta(data, mappings={"PatientID": pattern})
meta == {}
```
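For comparison, the same constraint written as a plain Python regex might look like the sketch below. It assumes the pattern has to match the whole value, as the comment in the example above suggests; `match_subject` is a hypothetical helper and the key is the bare capture name rather than the resolved `subject.label`.

```python
import re

# Named group equivalent of the simplified "{subject:s\d+}" capture.
SUBJECT = re.compile(r"(?P<subject>s\d+)")


def match_subject(value: str) -> dict:
    """Return the captured subject, or an empty dict if the value does not match."""
    match = SUBJECT.fullmatch(value)
    return match.groupdict() if match else {}
```

`match_subject("s123")` returns the subject, while `match_subject("foobar")` returns an empty dict.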
**Timestamps are parsed with
[`dateutil.parser`](https://dateutil.readthedocs.io/en/stable/parser.html)**.
This allows extracting the `session.timestamp` and `acquisition.timestamp`
metadata fields with minimal configuration:
```python
data = dict(path="/data/20001231133742/file.txt")
pattern = "/data/{acquisition.timestamp}/*"
meta = extract_meta(data, mappings={"path": pattern})
meta == {
"acquisition.timestamp": "2000-12-31T13:37:42+01:00",
"acquisition.timezone": "Europe/Budapest",
}
```
Note that the timezone was auto-populated and the timestamp got localized. See
the Configuration section below for more details and options.
**Timestamps may be parsed using an
[`strptime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)
pattern** to enable loading any formats that might not be handled via
`dateutil.parser`:
```python
data = dict(path="/data/20001231_133742_12345/file.txt")
pattern = "/data/{acquisition.timestamp:%Y%m%d_%H%M%S_%f}/*"
meta = extract_meta(data, mappings={"path": pattern})
meta == {
"acquisition.timestamp": "2000-12-31T13:37:42.123450+01:00",
"acquisition.timezone": "Europe/Budapest",
}
```
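The fractional part in the expected output follows from how `%f` works in the standard library: it accepts one to six digits and pads on the right, so `12345` becomes `123450` microseconds. The sketch below reproduces the timestamp with `datetime` only; the fixed `+01:00` offset stands in for the `Europe/Budapest` winter offset and is an assumption made to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Parse the zone-naive value with the same strptime format as the pattern above.
naive = datetime.strptime("20001231_133742_12345", "%Y%m%d_%H%M%S_%f")

# Attach a fixed UTC+1 offset (assumed stand-in for the local timezone).
aware = naive.replace(tzinfo=timezone(timedelta(hours=1)))

iso = aware.isoformat()
```

Here `iso` is `"2000-12-31T13:37:42.123450+01:00"`, matching the `acquisition.timestamp` shown above.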
### Defaults
Some scenarios benefit from **setting a default metadata value as a fallback**
even if one could not be extracted via a pattern. An example is routing any
DICOM from scanner "A" that doesn't have a routing string to a group/project
pre-created and designated for the data instead of the `Unknown` group and/or
`Unsorted` project.
```python
meta = extract_meta({}, mappings={"PatientID": "group"})
meta == {} # PatientID is missing - no group._id extracted
meta = extract_meta({}, mappings={"PatientID": "group"}, defaults={"group": "default"})
meta == {"group._id": "default"} # group._id defaulted
```
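Conceptually, defaults only fill in fields that extraction left empty and never override an extracted value. A minimal sketch of that precedence, assuming already-resolved field names and using a hypothetical `with_defaults` helper rather than fw-meta itself:

```python
def with_defaults(extracted: dict, defaults: dict) -> dict:
    """Merge defaults under extracted values; extracted non-empty values win."""
    merged = dict(defaults)
    merged.update({k: v for k, v in extracted.items() if v not in (None, "")})
    return merged


defaults = {"group._id": "default"}
unrouted = with_defaults({}, defaults)
routed = with_defaults({"group._id": "neuro"}, defaults)
```

`unrouted` falls back to the default group, whereas `routed` keeps the extracted `"neuro"`.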
### Configuration
Timestamp metadata fields `session.timestamp` and `acquisition.timestamp` are
always accompanied by a timezone (`session.timezone` / `acquisition.timezone`).
When dealing with zone-naive timestamps, `fw-meta` assumes they belong to the
currently configured local timezone, which is common practice with DICOMs and
other medical data. The local timezone is retrieved using `tzlocal` and defaults
to `UTC` if it's not available.
Setting the environment variable `TZ` to a timezone name from the
[tz database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones)
explicitly overrides the timezone used to localize any tz-naive timestamps.
## Development
Install the package and its dependencies using `poetry` and enable `pre-commit`:
```bash
poetry install
pre-commit install
```
## License
[![MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
Raw data
{
"_id": null,
"home_page": "https://gitlab.com/flywheel-io/tools/lib/fw-meta",
"name": "fw-meta",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.8",
"maintainer_email": null,
"keywords": "Flywheel, DICOM, metadata, extract",
"author": "Flywheel",
"author_email": "support@flywheel.io",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Flywheel metadata extraction.",
"version": "4.2.2",
"project_urls": {
"Documentation": "https://gitlab.com/flywheel-io/tools/lib/fw-meta",
"Homepage": "https://gitlab.com/flywheel-io/tools/lib/fw-meta",
"Repository": "https://gitlab.com/flywheel-io/tools/lib/fw-meta"
},
"split_keywords": [
"flywheel",
" dicom",
" metadata",
" extract"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fee495d018d473c0d15deae5cf84a910a1b9c3762a6b9d283a4eebb2c60ebd39",
"md5": "25ca5002987ae692f704372546cc0ef6",
"sha256": "1f646b36a9d382746701fd74b206a84ba6ae451fc64064f4c9f263de236f83ab"
},
"downloads": -1,
"filename": "fw_meta-4.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "25ca5002987ae692f704372546cc0ef6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.8",
"size": 15673,
"upload_time": "2024-08-29T22:30:46",
"upload_time_iso_8601": "2024-08-29T22:30:46.411464Z",
"url": "https://files.pythonhosted.org/packages/fe/e4/95d018d473c0d15deae5cf84a910a1b9c3762a6b9d283a4eebb2c60ebd39/fw_meta-4.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-29 22:30:46",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "flywheel-io",
"gitlab_project": "tools",
"lcname": "fw-meta"
}