mantra-dataset

* Name: mantra-dataset
* Version: 0.0.6
* Summary: A package for working with higher-order datasets like manifold triangulations.
* Upload time: 2024-11-12 16:14:53
* Requires Python: >=3.8
* Keywords: topology, deep learning, tda, tdl, topological data analysis, topological deep learning

License:

Copyright (c) 2024 Ernst Röell, Daniel Bin Schmid and Bastian Rieck

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

# MANTRA: Manifold Triangulations Assembly

[![Maintainability](https://api.codeclimate.com/v1/badges/82f86d7e2f0aae342055/maintainability)](https://codeclimate.com/github/aidos-lab/MANTRA/maintainability) ![GitHub contributors](https://img.shields.io/github/contributors/aidos-lab/MANTRA) ![GitHub](https://img.shields.io/github/license/aidos-lab/MANTRA) 

![image](_static/manifold_triangulation_orbit.gif)

## Getting the Dataset

The raw MANTRA dataset, consisting of the $2$- and $3$-manifolds with up
to $10$ vertices, can be downloaded manually
[here](https://github.com/aidos-lab/mantra/releases/latest).
For machine learning applications and research, we provide a custom
[PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/stable/)
dataset in the form of a Python package, which can be installed via pip
with the following command.

```bash
pip install mantra-dataset
```

After installation, the dataset can be used with the following snippet.

```python
from mantra.datasets import ManifoldTriangulations

dataset = ManifoldTriangulations(root="./data", manifold="2", version="latest")
```

## Data Format

> This section is mostly *information-oriented* and provides a brief
> overview of the data format, followed by a short [example](#example).

Each dataset consists of a list of triangulations, with each
triangulation having the following attributes:

* `id` (required, `str`): This attribute refers to the original ID of
  the triangulation as used by the creator of the dataset (see
  [below](#acknowledgments)). This facilitates comparisons to the
  original dataset if necessary.

* `triangulation` (required, `list` of `list` of `int`): A doubly-nested
  list of the top-level simplices of the triangulation.

* `n_vertices` (required, `int`): The number of vertices in the
  triangulation. This is **not** the number of simplices.

* `name` (required, `str`): A canonical name of the triangulation, such
  as `S^2` for the two-dimensional [sphere](https://en.wikipedia.org/wiki/N-sphere).
  If no canonical name exists, we store an empty string.

* `betti_numbers` (required, `list` of `int`): A list of the [Betti
  numbers](https://en.wikipedia.org/wiki/Betti_number) of the
  triangulation, computed using $\mathbb{Z}$ coefficients. This implies that
  [torsion](https://en.wikipedia.org/wiki/Homology_(mathematics))
  coefficients are stored in another attribute.

* `torsion_coefficients` (required, `list` of `str`): A list of the
  [torsion
  coefficients](https://en.wikipedia.org/wiki/Homology_(mathematics)) of
  the triangulation. An empty string `""` indicates that no torsion
  coefficients are available in that dimension. Otherwise, the original
  spelling of torsion coefficients is retained, so a valid entry might
  be `"Z_2"`. 

* `genus` (optional, `int`): For 2-manifolds, contains the
  [genus](https://en.wikipedia.org/wiki/Genus_(mathematics)) of the
  triangulation.

* `orientable` (optional, `bool`): Specifies whether the triangulation
  is [orientable](https://en.wikipedia.org/wiki/Orientability) or not.
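
The attribute list above can be turned into a simple consistency check.
The following is a minimal sketch, assuming a record has already been
parsed into a Python `dict`; `validate_record`, `REQUIRED`, and
`OPTIONAL` are hypothetical names used for illustration and are not part
of the `mantra` package.

```python
# Hypothetical helper for illustration; not part of the mantra package.
REQUIRED = {
    "id": str,
    "triangulation": list,
    "n_vertices": int,
    "name": str,
    "betti_numbers": list,
    "torsion_coefficients": list,
}
OPTIONAL = {"genus": int, "orientable": bool}


def validate_record(record):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in record:
            problems.append(f"missing required attribute: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    for key, expected in OPTIONAL.items():
        if key in record and not isinstance(record[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    # Every top-level simplex must be a list of integer vertex labels.
    triangulation = record.get("triangulation")
    if isinstance(triangulation, list):
        for simplex in triangulation:
            ok = isinstance(simplex, list) and all(isinstance(v, int) for v in simplex)
            if not ok:
                problems.append(f"malformed simplex: {simplex!r}")
    return problems


record = {
    "id": "manifold_2_4_1",
    "triangulation": [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]],
    "n_vertices": 4,
    "name": "S^2",
    "betti_numbers": [1, 0, 1],
    "torsion_coefficients": ["", "", ""],
    "genus": 0,
    "orientable": True,
}
print(validate_record(record))  # []
```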

### Example

```json
[
  {
    "id": "manifold_2_4_1",
    "triangulation": [
      [1,2,3],
      [1,2,4],
      [1,3,4],
      [2,3,4]
    ],
    "dimension": 2,
    "n_vertices": 4,
    "betti_numbers": [
      1,
      0,
      1
    ],
    "torsion_coefficients": [
      "",
      "",
      ""
    ],
    "name": "S^2",
    "genus": 0,
    "orientable": true
  },
  {
    "id": "manifold_2_5_1",
    "triangulation": [
      [1,2,3],
      [1,2,4],
      [1,3,5],
      [1,4,5],
      [2,3,4],
      [3,4,5]
    ],
    "dimension": 2,
    "n_vertices": 5,
    "betti_numbers": [
      1,
      0,
      1
    ],
    "torsion_coefficients": [
      "",
      "",
      ""
    ],
    "name": "S^2",
    "genus": 0,
    "orientable": true
  }
]
```
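
As a sanity check on such records, the Euler characteristic can be
computed as the alternating sum of the Betti numbers and, for closed
orientable surfaces, compared against $2 - 2g$ for genus $g$. The sketch
below assumes a record shaped like the example above;
`euler_characteristic` is a hypothetical helper name, and the check is
restricted to the orientable case because the genus convention for
non-orientable surfaces is not specified here.

```python
def euler_characteristic(betti_numbers):
    """Alternating sum b_0 - b_1 + b_2 - ... of the Betti numbers."""
    return sum((-1) ** i * b for i, b in enumerate(betti_numbers))


# Fields copied from the first record of the example above.
record = {"betti_numbers": [1, 0, 1], "genus": 0, "orientable": True}

chi = euler_characteristic(record["betti_numbers"])
print(chi)  # 2

# For a closed orientable surface of genus g, chi equals 2 - 2g.
if record.get("orientable") and "genus" in record:
    assert chi == 2 - 2 * record["genus"]
```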

### Design Decisions

> This section is *understanding-oriented* and provides additional
> justifications for our data format.

The datasets are converted from their original (mixed) lexicographical
format. A triangulation in lexicographical format could look like this:

```
manifold_lex_d2_n6_#1=[[1,2,3],[1,2,4],[1,3,4],[2,3,5],[2,4,5],[3,4,6],
  [3,5,6],[4,5,6]]
```

A triangulation in *mixed* lexicographical format could look like this:

```
manifold_2_6_1=[[1,2,3],[1,2,4],[1,3,5],[1,4,6],
  [1,5,6],[2,3,4],[3,4,5],[4,5,6]]
```

This format is **hard to parse**. Moreover, any *additional* information
about the triangulations, including information about homology groups or
orientability, for instance, requires additional files.
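
To illustrate the conversion, here is a minimal sketch of a parser for
entries of the form `name=[[...],...]`, including entries that continue
on an indented line. It is an assumption about how such a file could be
read, not the conversion script used for the dataset; `parse_lex` is a
hypothetical name, and `n_vertices` is derived here by counting distinct
vertex labels.

```python
import json
import re

# An entry is a name, "=", and a nested integer list ending at the first "]]".
ENTRY = re.compile(r"^(\S+?)=(\[\[.*?\]\])", flags=re.MULTILINE | re.DOTALL)


def parse_lex(text):
    """Parse (mixed) lexicographical entries into a list of dicts."""
    records = []
    for match in ENTRY.finditer(text):
        name, body = match.groups()
        # The nested list is valid JSON, even when it contains line breaks.
        simplices = json.loads(body)
        vertices = {v for simplex in simplices for v in simplex}
        records.append(
            {"id": name, "triangulation": simplices, "n_vertices": len(vertices)}
        )
    return records


sample = """manifold_lex_d2_n6_#1=[[1,2,3],[1,2,4],[1,3,4],[2,3,5],[2,4,5],[3,4,6],
  [3,5,6],[4,5,6]]

manifold_2_6_1=[[1,2,3],[1,2,4],[1,3,5],[1,4,6],
  [1,5,6],[2,3,4],[3,4,5],[4,5,6]]
"""

records = parse_lex(sample)
print(records[0]["id"], records[0]["n_vertices"])  # manifold_lex_d2_n6_#1 6
```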

We thus decided to use a format that permits us to keep everything in
one place, including any additional attributes for a specific
triangulation. A desirable data format needs to satisfy the following
properties:

1. It should be easy to parse and modify, ideally in a number of
   programming languages.

2. It should be human-readable and `diff`-able in order to permit
   simplified comparisons.

3. It should scale reasonably well to larger triangulations.

After some consideration, we opted for `gzip`-compressed JSON
files. [JSON](https://www.json.org) is well-specified and supported in
virtually all major programming languages out of the box. While the
compressed file is *not* human-readable on its own, the uncompressed
version can easily be used for additional data analysis tasks. This also
greatly simplifies maintenance operations on the dataset. While it can
be argued that there are formats that scale even better, they are
not well suited to our use case since triangulations typically differ
in their number of top-level simplices. This rules out column-based
formats like [Parquet](https://parquet.apache.org/).
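
Reading and writing such a file needs only the Python standard library.
The following sketch writes two small triangulations with different
numbers of top-level simplices to a `gzip`-compressed JSON file and
reads them back; the file name `example.json.gz` is a placeholder and
not the name of an actual release file.

```python
import gzip
import json
import os
import tempfile

# Two records with a different number of top-level simplices each.
triangulations = [
    {"id": "manifold_2_4_1",
     "triangulation": [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]]},
    {"id": "manifold_2_5_1",
     "triangulation": [[1, 2, 3], [1, 2, 4], [1, 3, 5],
                       [1, 4, 5], [2, 3, 4], [3, 4, 5]]},
]

# Placeholder path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.json.gz")

# Text mode ("wt"/"rt") lets json work directly on the compressed stream.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(triangulations, f)

with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print([len(t["triangulation"]) for t in loaded])  # [4, 6]
```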

We are open to revisiting this decision in the future.

As for the *storage* of the data itself, we decided to keep only the
top-level simplices (as is done in the original format) since this
substantially saves disk space. The drawback is that the client has to
reconstruct the remainder of the triangulation, i.e., all
lower-dimensional faces. Given that the triangulations
in our dataset are not too large, we deem this to be an acceptable
compromise. Moreover, data structures such as [simplex
trees](https://en.wikipedia.org/wiki/Simplex_tree) can be used to
further improve scalability if necessary.
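
Recovering the lower-dimensional faces on the client side is
straightforward for triangulations of this size. The sketch below
enumerates all non-empty faces of each top-level simplex with
`itertools.combinations` and counts them per dimension; `all_simplices`
and `f_vector` are hypothetical helper names, and this brute-force
approach is a stand-in for illustration, not a simplex tree.

```python
from itertools import combinations


def all_simplices(top_level):
    """Return every non-empty face of the given top-level simplices."""
    faces = set()
    for simplex in top_level:
        vertices = sorted(simplex)
        for k in range(1, len(vertices) + 1):
            # Sorted tuples make shared faces collapse into one set entry.
            faces.update(combinations(vertices, k))
    return faces


def f_vector(top_level):
    """Count simplices per dimension: [#vertices, #edges, #triangles, ...]."""
    counts = {}
    for face in all_simplices(top_level):
        dim = len(face) - 1
        counts[dim] = counts.get(dim, 0) + 1
    return [counts[d] for d in sorted(counts)]


# Boundary of a tetrahedron, i.e. the 4-vertex triangulation of S^2.
sphere = [[1, 2, 3], [1, 2, 4], [1, 3, 4], [2, 3, 4]]
print(f_vector(sphere))  # [4, 6, 4]
```

The alternating sum of this vector, $4 - 6 + 4 = 2$, is the Euler
characteristic of the sphere.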

The decision to keep only top-level simplices is **final**.

Finally, our data format includes, whenever possible and available,
additional information about a triangulation, including the [Betti
numbers](https://en.wikipedia.org/wiki/Betti_number) and a *name*,
i.e., a canonical description, of the topological space described
by the triangulation. We opted to minimize any inconvenience that
would arise from having to perform additional parsing operations.

Please use the following citation for our work:

```bibtex
@misc{ballester2024mantramanifoldtriangulationsassemblage,
      title={ {MANTRA}: {T}he {M}anifold {T}riangulations {A}ssemblage}, 
      author={Rub{\'e}n Ballester and Ernst R{\"o}ell and Daniel Bin Schmid and Mathieu Alain and Sergio Escalera and Carles Casacuberta and Bastian Rieck},
      year={2024},
      eprint={2410.02392},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.02392}, 
}
```

## Acknowledgments

This work is dedicated to [Frank H. Lutz](https://www3.math.tu-berlin.de/IfM/Nachrufe/Frank_Lutz/stellar/),
who passed away unexpectedly on November 10, 2023. May his memory be
a blessing.

            
