sylloge


| Field | Value |
|:------|:------|
| Name | sylloge |
| Version | 0.2.1 |
| Home page | https://github.com/dobraczka/sylloge |
| Summary | Small library to simplify collecting and loading of entity alignment benchmark datasets |
| Upload time | 2023-09-08 14:41:25 |
| Author | Daniel Obraczka |
| Requires Python | >=3.8,<4.0 |
| License | MIT |
| Keywords | entity resolution, knowledge graph, datasets, entity alignment |

            <p align="center">
<img src="https://github.com/dobraczka/sylloge/raw/main/docs/logo.png" alt="sylloge logo" width="200"/>
</p>

<h2 align="center">sylloge</h2>

<p align="center">
<a href="https://github.com/dobraczka/sylloge/actions/workflows/main.yml"><img alt="Actions Status" src="https://github.com/dobraczka/sylloge/actions/workflows/main.yml/badge.svg?branch=main"></a>
<a href='https://sylloge.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/sylloge/badge/?version=latest' alt='Documentation Status' /></a>
<a href="https://pypi.org/project/sylloge"><img alt="Stable python versions" src="https://img.shields.io/pypi/pyversions/sylloge"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>

This simple library aims to collect entity-alignment benchmark datasets and make them easily available.

Usage
=====
Load benchmark datasets:
```
>>> from sylloge import OpenEA
>>> ds = OpenEA()
>>> ds
OpenEA(backend=pandas, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
>>> ds.rel_triples_right.head()
                                       head                             relation                                    tail
0   http://www.wikidata.org/entity/Q6176218   http://www.wikidata.org/entity/P27     http://www.wikidata.org/entity/Q145
1   http://www.wikidata.org/entity/Q212675  http://www.wikidata.org/entity/P161  http://www.wikidata.org/entity/Q446064
2   http://www.wikidata.org/entity/Q13512243  http://www.wikidata.org/entity/P840      http://www.wikidata.org/entity/Q84
3   http://www.wikidata.org/entity/Q2268591   http://www.wikidata.org/entity/P31   http://www.wikidata.org/entity/Q11424
4   http://www.wikidata.org/entity/Q11300470  http://www.wikidata.org/entity/P178  http://www.wikidata.org/entity/Q170420
>>> ds.attr_triples_left.head()
                                  head                                          relation                                               tail
0  http://dbpedia.org/resource/E534644                http://dbpedia.org/ontology/imdbId                                            0044475
1  http://dbpedia.org/resource/E340590               http://dbpedia.org/ontology/runtime  6480.0^^<http://www.w3.org/2001/XMLSchema#double>
2  http://dbpedia.org/resource/E840454  http://dbpedia.org/ontology/activeYearsStartYear     1948^^<http://www.w3.org/2001/XMLSchema#gYear>
3  http://dbpedia.org/resource/E971710       http://purl.org/dc/elements/1.1/description                          English singer-songwriter
4  http://dbpedia.org/resource/E022831       http://dbpedia.org/ontology/militaryCommand                     Commandant of the Marine Corps
>>> ds.ent_links.head()
                                  left                                    right
0  http://dbpedia.org/resource/E123186    http://www.wikidata.org/entity/Q21197
1  http://dbpedia.org/resource/E228902  http://www.wikidata.org/entity/Q5909974
2  http://dbpedia.org/resource/E718575   http://www.wikidata.org/entity/Q707008
3  http://dbpedia.org/resource/E469216  http://www.wikidata.org/entity/Q1471945
4  http://dbpedia.org/resource/E649433  http://www.wikidata.org/entity/Q1198381
```
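
The attribute triples shown above keep literal values as strings, and some tails carry a `^^<datatype>` suffix. As a minimal sketch, assuming that suffix convention holds for the values you work with, such a tail could be split into value and datatype like this. `split_typed_literal` is a hypothetical helper for illustration and not part of sylloge:

```python
from typing import Optional, Tuple


def split_typed_literal(tail: str) -> Tuple[str, Optional[str]]:
    """Split 'value^^<datatype>' into (value, datatype).

    Hypothetical helper, not part of sylloge. Returns (tail, None)
    when no datatype suffix is present.
    """
    value, sep, rest = tail.rpartition("^^<")
    if sep and rest.endswith(">"):
        return value, rest[:-1]  # drop the closing '>'
    return tail, None


print(split_typed_literal("6480.0^^<http://www.w3.org/2001/XMLSchema#double>"))
print(split_typed_literal("English singer-songwriter"))
```

With a pandas backend, a helper like this could be applied to the `tail` column via `Series.map`.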

You can get a canonical name for a dataset instance, e.g. to name the folders that store experiment results:

```
>>> ds.canonical_name
'openea_d_w_15k_v1'
```
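
A minimal sketch of the folder use case, using only the standard library. The helper `results_dir` and the temporary base directory are assumptions made for illustration. In practice the name would come from `ds.canonical_name`, but a literal is used here so the snippet runs without downloading a dataset:

```python
import tempfile
from pathlib import Path


def results_dir(base: Path, canonical_name: str) -> Path:
    """Create (if needed) and return a per-dataset results folder."""
    path = base / canonical_name
    path.mkdir(parents=True, exist_ok=True)
    return path


# Hypothetical experiment root; replace with your own location.
base = Path(tempfile.mkdtemp())
# Stand-in for ds.canonical_name from the session above.
out = results_dir(base, "openea_d_w_15k_v1")
print(out.name)
```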

Create id-mapped dataset for embedding-based methods:

```
>>> from sylloge import IdMappedEADataset
>>> id_mapped_ds = IdMappedEADataset.from_ea_dataset(ds)
>>> id_mapped_ds
IdMappedEADataset(rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, entity_mapping=30000, rel_mapping=417, attr_rel_mapping=990, attr_mapping=138836, folds=5)
>>> id_mapped_ds.rel_triples_right
[[26048   330 16880]
 [19094   293 23348]
 [16554   407 29192]
 ...
 [16480   330 15109]
 [18465   254 19956]
 [26040   290 28560]]
```
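
To illustrate the idea behind id mapping, here is a minimal pure-Python sketch that replaces entity and relation labels with consecutive integers. It is not sylloge's implementation: `id_map_triples` is a hypothetical function, the triples are toy data, and the ids the library assigns will differ.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def id_map_triples(
    triples: List[Triple],
) -> Tuple[List[Tuple[int, int, int]], Dict[str, int], Dict[str, int]]:
    """Replace entity and relation labels by consecutive integer ids."""
    entity_ids: Dict[str, int] = {}
    rel_ids: Dict[str, int] = {}
    mapped = []
    for head, rel, tail in triples:
        # setdefault assigns the next free id the first time a label is seen
        h = entity_ids.setdefault(head, len(entity_ids))
        r = rel_ids.setdefault(rel, len(rel_ids))
        t = entity_ids.setdefault(tail, len(entity_ids))
        mapped.append((h, r, t))
    return mapped, entity_ids, rel_ids


# Toy triples, made up for illustration only.
toy = [("a", "r1", "b"), ("b", "r2", "c"), ("a", "r1", "c")]
mapped, entity_ids, rel_ids = id_map_triples(toy)
print(mapped)
```

Integer triples of this kind are what embedding-based methods typically index into their embedding tables, which is why the mappings are kept alongside the mapped triples.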

You can use [dask](https://www.dask.org/) as backend for larger datasets:
```
>>> ds = OpenEA(backend="dask")
>>> ds
OpenEA(backend=dask, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
```
This replaces pandas DataFrames with dask DataFrames.

Datasets can be written to and read from parquet via `to_parquet` and `read_parquet`.
After the initial read, datasets are cached in this format. The `cache_path` can be set explicitly, and caching can be disabled via `use_cache=False` when initializing a dataset.

Some datasets come with pre-determined splits:

```bash
tree ~/.data/sylloge/open_ea/cached/D_W_15K_V1 
├── attr_triples_left_parquet
├── attr_triples_right_parquet
├── dataset_names.txt
├── ent_links_parquet
├── folds
│   ├── 1
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 2
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 3
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   ├── 4
│   │   ├── test_parquet
│   │   ├── train_parquet
│   │   └── val_parquet
│   └── 5
│       ├── test_parquet
│       ├── train_parquet
│       └── val_parquet
├── rel_triples_left_parquet
└── rel_triples_right_parquet
```
Others don't:
```bash
tree ~/.data/sylloge/oaei/cached/starwars_swg
├── attr_triples_left_parquet
│   └── part.0.parquet
├── attr_triples_right_parquet
│   └── part.0.parquet
├── dataset_names.txt
├── ent_links_parquet
│   └── part.0.parquet
├── rel_triples_left_parquet
│   └── part.0.parquet
└── rel_triples_right_parquet
    └── part.0.parquet
```
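
Given the two layouts above, a cached dataset directory can be checked for pre-determined splits by looking for a `folds` subdirectory. The following is a minimal sketch under that assumption. `list_folds` is a hypothetical helper, and the directories are created in a temporary location to mimic the layouts rather than read from a real cache:

```python
import tempfile
from pathlib import Path
from typing import List


def list_folds(cache_dir: Path) -> List[str]:
    """Return fold directory names under <cache_dir>/folds, or [] if absent."""
    folds_dir = cache_dir / "folds"
    if not folds_dir.is_dir():
        return []
    return sorted(p.name for p in folds_dir.iterdir() if p.is_dir())


# Build throwaway directories that mimic the two layouts shown above.
root = Path(tempfile.mkdtemp())
with_folds = root / "D_W_15K_V1"
for fold in ("1", "2"):
    for split in ("test_parquet", "train_parquet", "val_parquet"):
        (with_folds / "folds" / fold / split).mkdir(parents=True)
without_folds = root / "starwars_swg"
(without_folds / "ent_links_parquet").mkdir(parents=True)

print(list_folds(with_folds))
print(list_folds(without_folds))
```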


Installation
============
```bash
pip install sylloge 
```

Datasets
========
| Dataset family name | Year | # of Datasets | Sources | References |
|:--------------------|:----:|:-------------:|:-------:|:----------|
| [OpenEA](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.OpenEA) | 2020 | 16 | DBpedia, Yago, Wikidata |  [Paper](http://www.vldb.org/pvldb/vol13/p2326-sun.pdf), [Repo](https://github.com/nju-websoft/OpenEA#dataset-overview) |
| [MovieGraphBenchmark](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.MovieGraphBenchmark) | 2022 | 3 | IMDB, TMDB, TheTVDB | [Paper](http://ceur-ws.org/Vol-2873/paper8.pdf), [Repo](https://github.com/ScaDS/MovieGraphBenchmark) |
| [OAEI](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.OAEI) | 2022 | 5 | Fandom wikis | [Paper](https://ceur-ws.org/Vol-3324/oaei22_paper0.pdf), [Website](http://oaei.ontologymatching.org/2022/knowledgegraph/index.html) |

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/dobraczka/sylloge",
    "name": "sylloge",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "entity resolution,knowledge graph,datasets,entity alignment",
    "author": "Daniel Obraczka",
    "author_email": "obraczka@informatik.uni-leipzig.de",
    "download_url": "https://files.pythonhosted.org/packages/15/e4/b7e3444826219cad14d828825bfa33ddbacaa9411f009fd9e75e297b6af8/sylloge-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Small library to simplify collecting and loading of entity alignment benchmark datasets",
    "version": "0.2.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/dobraczka/sylloge/issues",
        "Documentation": "https://sylloge.readthedocs.io",
        "Homepage": "https://github.com/dobraczka/sylloge",
        "Repository": "https://github.com/dobraczka/sylloge",
        "Source": "https://github.com/dobraczka/sylloge"
    },
    "split_keywords": [
        "entity resolution",
        "knowledge graph",
        "datasets",
        "entity alignment"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "638f2b18dc70866ede0ea302dea8194810fdbd11b393710d26ce500d21f602ea",
                "md5": "d6d377f48a6a0314d6ab56e24ebc891a",
                "sha256": "c6d65c05579e92f0142bc56ac3517a1ced0cadd22951f87f4dfda820c8c638e9"
            },
            "downloads": -1,
            "filename": "sylloge-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d6d377f48a6a0314d6ab56e24ebc891a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 20757,
            "upload_time": "2023-09-08T14:41:23",
            "upload_time_iso_8601": "2023-09-08T14:41:23.860285Z",
            "url": "https://files.pythonhosted.org/packages/63/8f/2b18dc70866ede0ea302dea8194810fdbd11b393710d26ce500d21f602ea/sylloge-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "15e4b7e3444826219cad14d828825bfa33ddbacaa9411f009fd9e75e297b6af8",
                "md5": "822bd64bba505cbaf97465792f06e895",
                "sha256": "db612d5bfb28e01e174e1ba5d5d84cd441257aa5b4645e2fc587ce0e0dfba0f4"
            },
            "downloads": -1,
            "filename": "sylloge-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "822bd64bba505cbaf97465792f06e895",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 20380,
            "upload_time": "2023-09-08T14:41:25",
            "upload_time_iso_8601": "2023-09-08T14:41:25.872663Z",
            "url": "https://files.pythonhosted.org/packages/15/e4/b7e3444826219cad14d828825bfa33ddbacaa9411f009fd9e75e297b6af8/sylloge-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-08 14:41:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dobraczka",
    "github_project": "sylloge",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "sylloge"
}
        