wikipedia2vec

Name: wikipedia2vec
Version: 1.0.5
Home page: http://wikipedia2vec.github.io/
Summary: A tool for learning vector representations of words and entities from Wikipedia
Author: Studio Ousia
Keywords: wikipedia, embedding, wikipedia2vec
Upload time: 2021-04-03 06:48:01
Requirements: No requirements were recorded.

Wikipedia2Vec
=============

[![Fury badge](https://badge.fury.io/py/wikipedia2vec.png)](http://badge.fury.io/py/wikipedia2vec)
[![CircleCI](https://circleci.com/gh/wikipedia2vec/wikipedia2vec.svg?style=svg)](https://circleci.com/gh/wikipedia2vec/wikipedia2vec)

Wikipedia2Vec is a tool for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia.
It is developed and maintained by [Studio Ousia](http://www.ousia.jp).

This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space.
Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.

This tool implements the [conventional skip-gram model](https://en.wikipedia.org/wiki/Word2vec) to learn the embeddings of words, and its extension proposed in [Yamada et al. (2016)](https://arxiv.org/abs/1601.01343) to learn the embeddings of entities.

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available [here](https://arxiv.org/abs/1812.06280).

Documentation is available online at [http://wikipedia2vec.github.io/](http://wikipedia2vec.github.io/).

## Basic Usage

Wikipedia2Vec can be installed via PyPI:

```bash
% pip install wikipedia2vec
```

With this tool, embeddings can be learned by running a *train* command with a Wikipedia dump as input.
For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

```bash
% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE
```

Then, the learned embeddings are written to *MODEL\_FILE*.
Note that this command can take many optional parameters.
Please refer to [our documentation](https://wikipedia2vec.github.io/wikipedia2vec/commands/) for further details.
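
Once training has finished, the model file can be loaded from Python and queried for word and entity vectors. The following is a minimal sketch based on the Python API described in the documentation; the model file name and the example word and entity strings are placeholders.

```python
from wikipedia2vec import Wikipedia2Vec

# Load the model written by the train command (file name is a placeholder).
wiki2vec = Wikipedia2Vec.load("MODEL_FILE")

# Word and entity vectors live in the same space, so they can be compared directly.
word_vec = wiki2vec.get_word_vector("tokyo")
entity_vec = wiki2vec.get_entity_vector("Tokyo")

# Retrieve the items (words and entities) most similar to a given entity.
for item, score in wiki2vec.most_similar(wiki2vec.get_entity("Tokyo"), 5):
    print(item, score)
```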

## Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from [this page](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/).
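
The pretrained files are distributed both in the tool's own binary format (loadable with `Wikipedia2Vec.load()`) and in a plain-text format. As a rough sketch, a decompressed text-format file can be loaded with Gensim, assuming it follows the standard word2vec text format and that entity titles are prefixed with `ENTITY/` (with spaces replaced by underscores), as described on the pretrained-embeddings page; the file name below is illustrative.

```python
from gensim.models import KeyedVectors

# Load a decompressed text-format pretrained file (word2vec text format).
# The file name is illustrative; download and decompress the actual file first.
vectors = KeyedVectors.load_word2vec_format("enwiki_20180420_300d.txt", binary=False)

# Entity entries are assumed to be prefixed with "ENTITY/" and to use underscores
# instead of spaces; plain lowercase tokens are word entries.
print(vectors.most_similar("ENTITY/Tokyo", topn=5))
print(vectors.most_similar("tokyo", topn=5))
```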

## Use Cases

Wikipedia2Vec has been applied to the following tasks:

* Entity linking: [Yamada et al., 2016](https://arxiv.org/abs/1601.01343), [Eshel et al., 2017](https://arxiv.org/abs/1706.09147), [Chen et al., 2019](https://arxiv.org/abs/1911.03834), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681), [van Hulst et al., 2020](https://arxiv.org/abs/2006.01969).
* Named entity recognition: [Sato et al., 2017](http://www.aclweb.org/anthology/I17-2017), [Lara-Clares and Garcia-Serrano, 2019](http://ceur-ws.org/Vol-2421/eHealth-KD_paper_6.pdf).
* Question answering: [Yamada et al., 2017](https://arxiv.org/abs/1803.08652), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).
* Entity typing: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960).
* Text classification: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960), [Yamada and Shindo, 2019](https://arxiv.org/abs/1909.01259), [Alam et al., 2020](https://link.springer.com/chapter/10.1007/978-3-030-61244-3_9).
* Relation classification: [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).
* Paraphrase detection: [Duong et al., 2018](https://ieeexplore.ieee.org/abstract/document/8606845).
* Knowledge graph completion: [Shah et al., 2019](https://aaai.org/ojs/index.php/AAAI/article/view/4162), [Shah et al., 2020](https://www.aclweb.org/anthology/2020.textgraphs-1.9/).
* Fake news detection: [Singh et al., 2019](https://arxiv.org/abs/1906.11126), [Ghosal et al., 2020](https://arxiv.org/abs/2010.10836).
* Plot analysis of movies: [Papalampidi et al., 2019](https://arxiv.org/abs/1908.10328).
* Novel entity discovery: [Zhang et al., 2020](https://arxiv.org/abs/2002.00206).
* Entity retrieval: [Gerritse et al., 2020](https://link.springer.com/chapter/10.1007%2F978-3-030-45439-5_7).
* Deepfake detection: [Zhong et al., 2020](https://arxiv.org/abs/2010.07475).
* Conversational information seeking: [Rodriguez et al., 2020](https://arxiv.org/abs/2005.00172).
* Query expansion: [Rosin et al., 2020](https://arxiv.org/abs/2012.12065).

## References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, [Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia](https://arxiv.org/abs/1812.06280).

```bibtex
@inproceedings{yamada2020wikipedia2vec,
  title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  pages = {23--30}
}
```

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/abs/1601.01343).

```bibtex
@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}
```

The text classification model implemented in [this example](https://github.com/wikipedia2vec/wikipedia2vec/tree/master/examples/text_classification) was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, [Neural Attentive Bag-of-Entities Model for Text Classification](https://arxiv.org/abs/1909.01259).

```bibtex
@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of The 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages = {563--573}
}
```

## License

[Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)
            
