moviegraphbenchmark


Name: moviegraphbenchmark
Version: 1.0.2
Home page: https://github.com/ScaDS/MovieGraphBenchmark
Summary: Benchmark datasets for Entity Resolution on Knowledge Graphs containing information about movies, TV shows and persons from IMDB, TMDB and TheTVDB
Upload time: 2023-03-24 15:14:47
Author: Daniel Obraczka
Requires Python: >=3.7.1

# Dataset License
Due to licensing restrictions, we are not allowed to distribute the IMDB datasets (more information on their license can be found [here](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=2TNAA9FRS3TJWM3AEQ2X&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1#)).
What we can do is let you build the IMDB side of the entity resolution datasets yourself. Please be aware that the mentioned license applies to the IMDB data you produce.

# Usage
You can simply install the package via pip:
```bash
pip install moviegraphbenchmark
```
and then run
```bash
moviegraphbenchmark
```
which will create the data in the default data path `~/.data/moviegraphbenchmark/data`.

To use a different folder, pass it with `--data-path`:
```bash
moviegraphbenchmark --data-path anotherpath
```
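If a script needs to know which folder will be used before calling the command, the choice between the default path and a custom one can be sketched as follows. This is a minimal illustration using only the standard library; `resolve_data_path` and `data_is_present` are hypothetical helpers, not part of moviegraphbenchmark, and the presence check is only a rough heuristic.

```python
from pathlib import Path
from typing import Optional

# Default location named in the usage text above.
DEFAULT_DATA_PATH = Path("~/.data/moviegraphbenchmark/data")


def resolve_data_path(data_path: Optional[str] = None) -> Path:
    """Return the given folder, or the default one with `~` expanded.

    Hypothetical helper for illustration; not part of moviegraphbenchmark.
    """
    chosen = Path(data_path) if data_path is not None else DEFAULT_DATA_PATH
    return chosen.expanduser()


def data_is_present(data_path: Optional[str] = None) -> bool:
    """Rough check: the folder exists and contains at least one entry."""
    folder = resolve_data_path(data_path)
    return folder.is_dir() and any(folder.iterdir())


print(resolve_data_path())
print(resolve_data_path("anotherpath"))
```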

For ease of use in your project, you can also load the data through this library (this will create the data if it is not present):

```python
from moviegraphbenchmark import load_data
ds = load_data()
# by default this will load `imdb-tmdb`
print(ds.attr_triples_1)

# specify a pair and a specific data path
ds = load_data(pair="imdb-tmdb", data_path="anotherpath")

# the dataclass contains all the files loaded as pandas dataframes
print(ds.attr_triples_2)
print(ds.rel_triples_1)
print(ds.rel_triples_2)
print(ds.ent_links)
for fold in ds.folds:
    print(fold)
```
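Since `ent_links` holds the gold-standard entity pairs, a common next step is to score the output of a matcher against it. The following is a minimal sketch of precision, recall and F1 over sets of pairs; `score_matches` is a hypothetical helper, and the identifiers are toy values made up for illustration, not taken from the datasets.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def score_matches(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) of predicted pairs against gold pairs."""
    predicted_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy identifiers, made up for illustration only.
gold_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted_links = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]

precision, recall, f1 = score_matches(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=0.50 f1=0.57
```

Assuming `ds.ent_links` is a two-column DataFrame with one entity pair per row, the gold pairs could be obtained with `ds.ent_links.itertuples(index=False, name=None)`.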

# Dataset structure
There are 3 entity resolution tasks in this repository: imdb-tmdb, imdb-tvdb and tmdb-tvdb, all contained in the `data` folder.
The data structure follows the structure used in [OpenEA](https://github.com/nju-websoft/OpenEA).
Each folder contains the knowledge graph triples (`attr_triples_*`, `rel_triples_*`) and the gold standard of entity links (`ent_links`). The triple files are suffixed with `1` and `2`; for imdb-tmdb, for example, `1` refers to imdb and `2` to tmdb. The folder `721_5fold` contains pre-split entity link folds with a 70-20-10 ratio for testing, training and validation.
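To work with the files directly rather than through `load_data`, a reader along the following lines may be enough. This is a sketch under two assumptions that the text above does not state: that the files are tab-separated without a header row (the OpenEA convention), and that split sizes are rounded down with the remainder going to the test portion. `read_tsv_rows` and `split_sizes` are hypothetical helpers, and the triples are toy values.

```python
import csv
import io
from typing import List, Tuple


def read_tsv_rows(handle) -> List[Tuple[str, ...]]:
    """Read tab-separated rows (assumed to have no header) into tuples."""
    # QUOTE_NONE keeps quote characters inside attribute values untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [tuple(row) for row in reader if row]


def split_sizes(n_links: int) -> Tuple[int, int, int]:
    """Return (test, train, valid) sizes for a 70-20-10 split of n_links."""
    n_train = n_links * 20 // 100
    n_valid = n_links * 10 // 100
    n_test = n_links - n_train - n_valid  # remainder goes to the test portion
    return n_test, n_train, n_valid


# Toy content standing in for a rel_triples file; the identifiers are made up.
toy_rel_triples = io.StringIO("m1\tdirected_by\tp1\nm1\thas_actor\tp2\n")
rows = read_tsv_rows(toy_rel_triples)
print(rows)
# prints: [('m1', 'directed_by', 'p1'), ('m1', 'has_actor', 'p2')]
print(split_sizes(1000))
# prints: (700, 200, 100)
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")` instead of the in-memory buffer.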

# Citing
This dataset was first presented in this paper:
```
@inproceedings{EAGERKGCW2021,
  author    = {Daniel Obraczka and
               Jonathan Schuchart and
               Erhard Rahm},
  editor    = {David Chaves-Fraga and
               Anastasia Dimou and
               Pieter Heyvaert and
               Freddy Priyatna and
               Juan Sequeda},
  title     = {Embedding-Assisted Entity Resolution for Knowledge Graphs},
  booktitle = {Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference (ESWC 2021), Online, June 5, 2021},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {2873},
  publisher = {CEUR-WS.org},
  year      = {2021},
  url       = {http://ceur-ws.org/Vol-2873/paper8.pdf},
}
```

            
