# metDataModel, data models for mass spectrometry based metabolomics
Our goal is to define a minimal set of data models to promote interoperability in computational metabolomics.
This package contains the basic concepts and data structures,
which can then be imported into other projects and extended into more specialized classes.
## Core data Structure
![Core data Structure](docs/datastru2024.png)
- Metabolic model:
  - Compound (metabolite is compound)
  - Reaction
  - Pathway
  - Network
  - Enzyme
  - Gene
- Metabolomic measurement:
  - Spectrum
  - Array of Spectra
  - Mass track (EIC)
  - Elution peak
  - Feature
  - Empirical compound
- Meta data:
  - Study
  - Experiment
  - Sample
  - Method
We try to keep the core models minimal.
The index and query functions can be left to applications.
A mass spectrum is a list of m/z values with corresponding intensity values. The "Spectrum" here is generic enough for LC-MS, GC-MS, LC-IMS-MS, etc. It can be used for NMR and other technologies with minor modifications. The "Array of Spectra" is built by linking separation parameters (such as retention time) with spectra.
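To make the two concepts concrete, here is a minimal sketch in Python. The class names `SpectrumSketch` and `SpectraArraySketch`, their fields, and the helper `base_peak` are hypothetical and chosen for illustration; they are not the classes shipped in this package.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SpectrumSketch:
    """Hypothetical illustration: parallel lists of m/z and intensity."""
    mz: List[float] = field(default_factory=list)
    intensity: List[float] = field(default_factory=list)


@dataclass
class SpectraArraySketch:
    """Hypothetical illustration: spectra linked to a separation parameter
    (here retention time; the unit is whatever the application uses)."""
    rtimes: List[float] = field(default_factory=list)
    spectra: List[SpectrumSketch] = field(default_factory=list)


def base_peak(spectrum: SpectrumSketch) -> Tuple[float, float]:
    """Query function kept outside the model: (m/z, intensity) of the tallest point."""
    i = max(range(len(spectrum.intensity)), key=spectrum.intensity.__getitem__)
    return spectrum.mz[i], spectrum.intensity[i]


scan1 = SpectrumSketch(mz=[269.0878, 291.0697], intensity=[1.0e6, 2.5e5])
scan2 = SpectrumSketch(mz=[269.0879, 291.0698], intensity=[1.2e6, 3.0e5])
run = SpectraArraySketch(rtimes=[99.5, 99.9], spectra=[scan1, scan2])

print(base_peak(run.spectra[1]))
```

The query helper is a plain function rather than a method, in keeping with the idea that index and query functions belong to applications.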
A mass track is a concept introduced in Asari (https://www.nature.com/articles/s41467-023-39889-1): an extracted ion chromatogram (EIC) that spans the full retention time range.
A peak is specific to a sample, while a feature is specific to an experiment.
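One simple way to picture this relation is sketched below with plain dictionaries. The field names (`sample`, `area`, `peak_area_by_sample`) and the averaging of m/z and retention time are assumptions made for illustration; actual preprocessing tools may define features differently.

```python
from statistics import mean

# Hypothetical elution peaks: each record belongs to exactly one sample.
peaks = [
    {"sample": "S1", "mz": 269.0878, "rtime": 99.90, "area": 1.0e6},
    {"sample": "S2", "mz": 269.0880, "rtime": 100.10, "area": 8.0e5},
    {"sample": "S3", "mz": 269.0877, "rtime": 99.70, "area": 1.3e6},
]

# A feature is defined once per experiment and ties the per-sample peaks together.
feature = {
    "feature_id": "FT1705",
    "mz": mean(p["mz"] for p in peaks),
    "rtime": mean(p["rtime"] for p in peaks),
    "peak_area_by_sample": {p["sample"]: p["area"] for p in peaks},
}

print(feature["feature_id"], sorted(feature["peak_area_by_sample"]))
```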
An empirical compound (empCpd) is a computational unit for a tentative metabolite, based on the experimental measurement. If the experiment does not resolve isomers, the empirical compound can be any one of the isomers or a mixture of them. The step of assigning features to an empirical compound is called pre-annotation (https://pubs.acs.org/doi/10.1021/acs.analchem.2c05810); annotation then assigns compound identities to the empirical compound, and can accommodate different confidence levels or probability scores. The structure of empCpd is JSON compatible and chainable (see below).
"Compound" is preferred over "metabolite", because the experimental data often contain molecules other than metabolites (contaminants, xenobiotics, etc.).
Internal structures of each class are not meant to be final.
As long as a workflow adheres to these core concepts, interoperability is easy to achieve.
**An alternative presentation of these concepts:**
![Alternative presentation](docs/datastru.png)
## Serialized empCpd format (in JSON and can be implemented in any language)
    {
        "neutral_formula_mass": 268.08077,
        "neutral_formula": "C10H12N4O5",
        "alternative_formulas": [],
        "interim_id": "C10H12N4O5_268.08077",
        "identity": [
            {"compounds": ["HMDB0000195"], "names": ["Inosine"],
             "score": 0.6, "probability": null},
            {"compounds": ["HMDB0000195", "HMDB0000481"], "names": ["Inosine", "Allopurinol riboside"],
             "score": 0.1, "probability": null},
            {"compounds": ["HMDB0000481"], "names": ["Allopurinol riboside"],
             "score": 0.1, "probability": null},
            {"compounds": ["HMDB0003040"], "names": ["Arabinosylhypoxanthine"],
             "score": 0.05, "probability": null}
        ],
        "MS1_pseudo_Spectra": [
            {"feature_id": "FT1705", "mz": 269.0878, "rtime": 99.90,
             "isotope": "M0", "modification": "+H", "charged_formula": "", "ion_relation": "M+H[1+]"},
            {"feature_id": "FT1876", "mz": 291.0697, "rtime": 99.53,
             "isotope": "M0", "modification": "+Na", "charged_formula": "", "ion_relation": "M+Na[1+]"},
            {"feature_id": "FT1721", "mz": 270.0912, "rtime": 99.91,
             "isotope": "13C", "modification": "+H", "charged_formula": "", "ion_relation": "M(C13)+H[1+]"},
            {"feature_id": "FT1993", "mz": 307.0436, "rtime": 99.79,
             "isotope": "M0", "modification": "+K", "charged_formula": "", "ion_relation": "M+K[1+]"}
        ],
        "MS2_Spectra": [
            "AZ0000711", "AZ0002101"
        ],
        "Database_referred": ["Azimuth", "HMDB", "MONA"]
    }
An empCpd can be constructed without knowing the formula, by grouping features based on mass differences.
The "identity" can be a single compound or a mixture of compounds.
How the score or probability is computed depends on external algorithms that combine information from different annotation approaches.
Additional fields can be added as needed.
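As a rough illustration of grouping features by mass differences without a formula, the sketch below compares each feature to a chosen anchor feature and keeps those whose m/z difference matches a known isotope or adduct spacing. This is a deliberately simplified stand-in, not the algorithm used in khipu; the function name, the tolerances, the `relation` labels and the `interim_id` format are all assumptions.

```python
PROTON = 1.007276  # mass of a proton, in Da

# Expected m/z differences relative to a singly charged [M+H]+ ion.
KNOWN_DELTAS = {
    "13C isotope": 1.003355,
    "Na replacing H": 22.989218 - PROTON,
    "K replacing H": 38.963158 - PROTON,
}

features = [
    {"feature_id": "FT1705", "mz": 269.0878, "rtime": 99.90},
    {"feature_id": "FT1876", "mz": 291.0697, "rtime": 99.53},
    {"feature_id": "FT1721", "mz": 270.0912, "rtime": 99.91},
    {"feature_id": "FT1993", "mz": 307.0436, "rtime": 99.79},
]


def group_by_mass_difference(features, anchor_id, mz_tol=0.002, rt_tol=2.0):
    """Collect features that co-elute with the anchor and differ from it by a known mass delta."""
    anchor = next(f for f in features if f["feature_id"] == anchor_id)
    members = [dict(anchor, relation="anchor")]
    for f in features:
        if f is anchor or abs(f["rtime"] - anchor["rtime"]) > rt_tol:
            continue
        diff = f["mz"] - anchor["mz"]
        for label, delta in KNOWN_DELTAS.items():
            if abs(diff - delta) <= mz_tol:
                members.append(dict(f, relation=label))
                break
    # No formula is known at this point, so the id is only a placeholder.
    return {"interim_id": "empCpd_" + anchor_id, "MS1_pseudo_Spectra": members}


emp_cpd = group_by_mass_difference(features, "FT1705")
for member in emp_cpd["MS1_pseudo_Spectra"]:
    print(member["feature_id"], member["relation"])
```

With the four features from the example above, all of them fall into one group around FT1705; formula and identity can then be filled in later by annotation.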
## Scientific background
There's been extensive software development in related areas.
The XCMS ecosystem (https://www.bioconductor.org/packages/release/bioc/html/xcms.html) is a leading example of data preprocessing in the R language.
A Python ecosystem is now viable with the Asari package for preprocessing (https://www.nature.com/articles/s41467-023-39889-1).
The modeling of metabolism is exemplified by the Escher project (https://github.com/zakandrewking/escher).
The advancement of science relies on close interaction between experimental measurements and theoretical modeling, and the two should feed on each other. However, a clear gap exists between the two in metabolomics. E.g., the elemental mass table in Escher (as of version 1.7.3) uses average masses, whereas mass spectrometers measure isotopic masses.
Many software programs already have excellent data models and data structures. But reusing data models is much easier when starting from the basics; hence this project, where complexity is optional.
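The gap between average and isotopic mass can be shown with a short calculation for inosine (C10H12N4O5), the compound used in the empCpd example. The mass tables below are rounded values for four elements only, and the simple formula parser is an assumption that handles plain formulas without brackets or charges.

```python
import re

# Monoisotopic masses (most abundant isotope) and standard average atomic masses, in Da.
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Turn 'C10H12N4O5' into {'C': 10, 'H': 12, 'N': 4, 'O': 5}."""
    counts = {}
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(n) if n else 1)
    return counts


def formula_mass(formula, table):
    """Sum the element masses from the given table over the formula."""
    return sum(table[el] * n for el, n in parse_formula(formula).items())


mono = formula_mass("C10H12N4O5", MONOISOTOPIC)
avg = formula_mass("C10H12N4O5", AVERAGE)
print(f"monoisotopic: {mono:.5f}  average: {avg:.3f}  gap: {avg - mono:.3f}")
```

The monoisotopic value agrees with the `neutral_formula_mass` of 268.08077 in the empCpd example, while the average mass is about 0.15 Da higher, far outside the mass accuracy of a high-resolution instrument.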
This metDataModel package is used in:
* asari: Trackable and scalable metabolomics data preprocessing - https://github.com/shuzhao-li/asari, https://www.nature.com/articles/s41467-023-39889-1
* khipu: generalized tree structure for pre-annotation - https://github.com/shuzhao-li-lab/khipu, https://pubs.acs.org/doi/10.1021/acs.analchem.2c05810
* pcpfm: Python Centric Pipeline For Metabolomics - https://github.com/shuzhao-li-lab/PythonCentricPipelineForMetabolomics
Our mass2chem, khipu and JMS packages house the annotation functions.
## For developers
The data structures should be language neutral.
We edit primarily in the Python code, as JSON and YAML can be exported automatically.
Each Python class has a serialization function to export JSON.
That is, the export contains the concise information users need, not every class detail.
Adaptation/update/extension is encouraged in other languages.
We strive for the right level of abstraction.
For the core classes, it's more important to have transparent, extensible structure.
Therefore, it's a design decision not to have getter or setter functions.
Shallow data structures are more portable.
MetDataModel provides a template, and application projects can extend it to fit their specific needs.
There are two flavors of metDataModel provided. The default version uses Python dataclasses with
type hints. These types are recommendations and are not enforced at runtime, but they should give
users the details needed to implement methods that use them in statically typed languages. The other is
dictionary-based and has no suggested types. This is the minimal model.
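The difference between the two flavors can be sketched as follows. `FeatureSketch` and its fields are hypothetical and stand in for any core class; the conversion uses `dataclasses.asdict` from the standard library rather than the package's own serialization function.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FeatureSketch:
    """Hypothetical typed flavor: the hints document intent but are not checked at runtime."""
    feature_id: str = ""
    mz: float = 0.0
    rtime: float = 0.0
    annotations: List[str] = field(default_factory=list)


typed = FeatureSketch(feature_id="FT1705", mz=269.0878, rtime=99.90)
typed.mz = 269.0879  # plain attribute access, no getter or setter

# Python accepts a string where a float is hinted; a static checker would flag it.
loose = FeatureSketch(feature_id="FT1876", mz="291.0697")

# Dictionary flavor: the same shallow structure without any suggested types.
minimal = {"feature_id": "FT1705", "mz": 269.0879, "rtime": 99.90, "annotations": []}

print(asdict(typed) == minimal)
print(json.dumps(asdict(typed)))
```

Because the structure is shallow, moving between the typed and the dictionary flavor is a one-line conversion, and the dictionary form maps directly onto JSON.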
Note that while metDataModel objects can hold other metDataModel objects as data members, this can cause
issues if there are circular references, for example during serialization.
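A small sketch of the problem and one possible workaround, using plain dictionaries and the standard `json` module. The `sample` and `experiment` records are hypothetical; replacing the back-reference with an id is one option among several, not a rule imposed by this package.

```python
import json

sample = {"id": "S1", "name": "plasma_01"}
experiment = {"id": "E1", "samples": [sample]}
sample["experiment"] = experiment  # back-reference creates a cycle

try:
    json.dumps(experiment)
    circular_ok = True
except ValueError as err:
    # json.dumps detects the cycle and refuses to serialize it.
    circular_ok = False
    print("serialization failed:", err)

# Workaround: refer to the parent by id instead of by object.
sample["experiment"] = experiment["id"]
text = json.dumps(experiment)
print(text)
```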
Please feel free to submit issues, and write Wiki pages for discussions.
PyPI install: https://pypi.org/project/metDataModel/
### Related community resources
While we focus on the application of mass spectrometry data,
many mass spectrometry data structures are defined in various software projects that focus on "pre-processing", e.g.
- OpenMS (https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/nightly/html/index.html)
- MSnbase (used by XCMS, https://github.com/lgatto/MSnbase)
To learn about mass spectrometry concepts and pre-processing:
- Data structure described for (py)openMS (https://pyopenms.readthedocs.io/en/latest/datastructures.html)
- XCMS tutorial by Johannes Rainer (https://github.com/jorainer/metabolomics2018)
- Our asari paper https://www.nature.com/articles/s41467-023-39889-1 and documentation https://asari.readthedocs.io/en/latest/?badge=latest
To learn about genome scale metabolic models:
- Review by Gu et al., 2019 (https://link.springer.com/article/10.1186/s13059-019-1730-3)
- Our book chapter explaining metabolic models in the context of metabolomic pathway analysis (https://link.springer.com/protocol/10.1007/978-1-0716-0239-3_19)
### Citation
Mitchell et al, Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline (to come).
Raw data
{
"_id": null,
"home_page": "https://github.com/shuzhao-li/metDataModel",
"name": "metDataModel",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.5",
"maintainer_email": null,
"keywords": "metDataModel",
"author": "Shuzhao Li",
"author_email": "shuzhao.li@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/42/e6/025491f483ffcdb28c72cbaed5746a2652229cfd3c52abcb9e8b1f5e81e6/metdatamodel-0.6.1.tar.gz",
"platform": null,
"description": "# metDataModel, data models for mass spectrometry based metabolomics\n\nOur goal is to define a minimal set of data models to promote interoperability in computational metabolomics.\nThis package contains the basic concepts and data structures,\nwhich can then be imported to other projects, and extended to more specialized classes. \n\n## Core data Structure\n\n![Core data Structure](docs/datastru2024.png)\n\n Metabolic model:\n Compound (metabolite is compound)\n Reaction\n Pathway\n Network\n Enzyme\n Gene\n Metabolomic measurement:\n Spectrum\n Array of Spectra\n Mass track (EIC)\n Elution peak\n Feature\n Empirical compound\n Meta data:\n Study\n Experiment\n Sample\n Method\n\nWe try to keep the core models minimal. \nThe index and query functions can be left to applications.\n\nA mass spectrum is a list of m/z values with corresponding intensity values. The \"Spectrum\" here is generic enough for LC-MS, GC-MS, LC-IMS-MS, etc. It can be used for NMR and other technologies with minor modifications. The \"Array of Spectra\" is composed by linking separation parameters with spectra.\n\nMass track is a concept started in Asari (https://www.nature.com/articles/s41467-023-39889-1), as extracted ion chromatogram spanning the full retention time.\n\nA peak is specific to a sample, while a feature is specific to an experiment. \n\nAn empirical compound (empCpd) is a computational unit for a tentative metabolite, based on the experimental measurement. If the experiment does not resolve isomers, the empirical compound can be any or a mixture of the isomers. The step of assigning features to empirical compound is called pre-annotation (https://pubs.acs.org/doi/10.1021/acs.analchem.2c05810); annotation assigns compound identification to empirical compound, which accommodates different levels or probable scores. 
The structure of empCpd is JSON compatible and chainable (see below).\n\n\"Compound\" is preferred over \"metabolite\", because the experimental data often contain molecules other than metabolites (contaminants, xenobiotics, etc.).\n\nInternal structures of each class are not meant to be final. \nAs long as a workflow is adhered to these core concepts, interoperability is easy to achieve.\n\n**An alternative presentation of these concepts:**\n![Alternative presentation](docs/datastru.png)\n\n\n## Serialized empCpd format (in JSON and can be implemented in any language)\n \n empCpd = {\n \"neutral_formula_mass\": 268.08077, \n \"neutral_formula\": C10H12N4O5,\n \"alternative_formulas\": [],\n \"interim_id\": C10H12N4O5_268.08077,\n \"identity\": [\n {'compounds': ['HMDB0000195'], 'names': ['Inosine'], \n 'score': 0.6, 'probability': null},\n {'compounds': ['HMDB0000195', 'HMDB0000481'], 'names': ['Inosine', 'Allopurinol riboside'], \n 'score': 0.1, 'probability': null},\n {'compounds': ['HMDB0000481'], 'names': ['Allopurinol riboside'], \n 'score': 0.1, 'probability': null},\n {'compounds': ['HMDB0003040''], 'names': ['Arabinosylhypoxanthine'], \n 'score': 0.05, 'probability': null},\n ],\n \"MS1_pseudo_Spectra\": [\n {'feature_id': 'FT1705', 'mz': 269.0878, 'rtime': 99.90, \n 'isotope': 'M0', 'modification': '+H', 'charged_formula': '', 'ion_relation': 'M+H[1+]'},\n {'feature_id': 'FT1876', 'mz': 291.0697, 'rtime': 99.53, \n 'isotope': 'M0', 'modification': '+Na', 'charged_formula': '', 'ion_relation': 'M+Na[1+]'},\n {'feature_id': 'FT1721', 'mz': 270.0912, 'rtime': 99.91, \n 'isotope': '13C', 'modification': '+H', 'charged_formula': '', 'ion_relation': 'M(C13)+H[1+]'},\n {'feature_id': 'FT1993', 'mz': 307.0436, 'rtime': 99.79, \n 'isotope': 'M0', 'modification': '+K', 'charged_formula': '', 'ion_relation': 'M+K[1+]'},\n ],\n \"MS2_Spectra\": [\n 'AZ0000711', 'AZ0002101'\n ],\n \"Database_referred\": [\"Azimuth\", \"HMDB\", \"MONA\"],\n }\n\nAn empCpd can be 
constructed without knowing the formula, by grouping features based on mass differences.\nThe \"identity\" can be a single compound or a mixture of compounds. \nHow to compute the score or probability will be dependent on external algorithms to combine information from different annotation approaches.\nAdditional fields can be added as needed.\n\n\n## Scientific background\nThere's been extensive software development in related areas. \nThe XCMS ecosystem (https://www.bioconductor.org/packages/release/bioc/html/xcms.html) is a leading example of data preprocessing in R language. \nA Python ecosystem is now viable with the Asari package for preprocessing (https://www.nature.com/articles/s41467-023-39889-1).\nThe modeling of metabolism is exemplified by the Escher project (https://github.com/zakandrewking/escher).\n\nThe advancing of science relies on the close interaction of experimental measurements and theoretical modeling, and the two should feed on each other. However, a clear gap exists between the two in metabolomics. E.g., the elemental mass table in Escher (retrieved on version 1.7.3) are of average mass, but mass spectrometers measure isotopic mass. \nMany software programs already have excellent data models and data structures. But the reuse of data models is much easier to start from basics, hence this project, where complexity is an option.\n\nThis metDataModel package is used in:\n* asari: Trackable and scalable metabolomics data preprocessing - https://github.com/shuzhao-li/asari, https://www.nature.com/articles/s41467-023-39889-1\n* khipu: generalized tree structure for pre-annotation - https://github.com/shuzhao-li-lab/khipu, https://pubs.acs.org/doi/10.1021/acs.analchem.2c05810\n* pcpfm: Python Centric Pipeline For Metabolomics - https://github.com/shuzhao-li-lab/PythonCentricPipelineForMetabolomics\n\nOur mass2chem, khipu and JMS packages house the annotation functions.\n\n\n## For developers\n\nThe data structures should be language neutral. 
\n\nWe edit primarily in the Python code, as JSON and YAML can be exported automatically.\nEach Python class has a serialization function to export JSON.\n\nI.e., concise information for users' need is exported, but not all class details.\n\nAdaptation/update/extension is encouraged in other languages. \n\nWe strive for the right level of abstraction.\nFor the core classes, it's more important to have transparent, extensible structure.\nTherefore, it's a design decision not to have getter or setter functions. \nShallow data structures are more portable.\nMetDataModel provides a template, and application projects can extend it to fit their specific needs.\n\nThere are two flavors of the metDataModel provided. The default version uses Python Dataclasses to \nenforce weak static typing. These types are recommended and not enforced at runtime, but should provide\nusers details to implement methods that use them in statically typed languages. The other is \ndictionary based and has no suggested types. This is the minimal model. \n\nNote that while metDataModel objects can have other metDataObjects as data members, this can cause \nissues if there are circular references. 
\n\nPlease feel free to submit issues, and write Wiki pages for discussions.\n\nPypi install: https://pypi.org/project/metDataModel/\n\n\n### Related community resources\nWhile we focus on the application of mass spectrometry data, \nmany mass spectrometry data structures are defined in various software projects that focus on \"pre-processing\", e.g.\n\n- openMS (https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/nightly/html/index.html) \n\n- MSnBase (used by XCMS, https://github.com/lgatto/MSnbase)\n\nTo learn about mass spectrometry concepts and pre-processing:\n\n- Data structure described for (py)openMS (https://pyopenms.readthedocs.io/en/latest/datastructures.html)\n\n- XCMS tutorial by Johannes Rainer (https://github.com/jorainer/metabolomics2018)\n\n- Our asari paper https://www.nature.com/articles/s41467-023-39889-1 and documentation https://asari.readthedocs.io/en/latest/?badge=latest\n\nTo learn about genome scale metabolic models:\n\n- review by Gu et al, 2019 (https://link.springer.com/article/10.1186/s13059-019-1730-3)\n\n- our book chapter to explain metabolic models in the context of metabolomic pathway analysis (https://link.springer.com/protocol/10.1007/978-1-0716-0239-3_19)\n\n### Citation\nMitchell et al, Common data models to streamline metabolomics processing and annotation, and implementation in a Python pipeline (to come).\n",
"bugtrack_url": null,
"license": "BSD license",
"summary": "Data models for metabolomics",
"version": "0.6.1",
"project_urls": {
"Homepage": "https://github.com/shuzhao-li/metDataModel"
},
"split_keywords": [
"metdatamodel"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "44682d34b2ccdfaa52b74d1a4c6ea03c81dd1fa7e62d8a54d469d2f1b4ac9384",
"md5": "97fc8774ab6e859651f7ca77ede75f53",
"sha256": "a107729a76a4106c31dd18aa3301cf6cd7e544ed802c6dcf02da67812172aeab"
},
"downloads": -1,
"filename": "metDataModel-0.6.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "97fc8774ab6e859651f7ca77ede75f53",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 25494,
"upload_time": "2024-04-15T19:46:33",
"upload_time_iso_8601": "2024-04-15T19:46:33.173317Z",
"url": "https://files.pythonhosted.org/packages/44/68/2d34b2ccdfaa52b74d1a4c6ea03c81dd1fa7e62d8a54d469d2f1b4ac9384/metDataModel-0.6.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "42e6025491f483ffcdb28c72cbaed5746a2652229cfd3c52abcb9e8b1f5e81e6",
"md5": "57d25af9887fc886f247368dc7569250",
"sha256": "7561fde67266af51a744f5a1d56a67bc70fec5dce6707eb97922e2c0a9ee2ea5"
},
"downloads": -1,
"filename": "metdatamodel-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "57d25af9887fc886f247368dc7569250",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 21866,
"upload_time": "2024-04-15T19:46:35",
"upload_time_iso_8601": "2024-04-15T19:46:35.313487Z",
"url": "https://files.pythonhosted.org/packages/42/e6/025491f483ffcdb28c72cbaed5746a2652229cfd3c52abcb9e8b1f5e81e6/metdatamodel-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-15 19:46:35",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "shuzhao-li",
"github_project": "metDataModel",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"tox": true,
"lcname": "metdatamodel"
}