Wikipedia2Vec
=============
[![Fury badge](https://badge.fury.io/py/wikipedia2vec.png)](http://badge.fury.io/py/wikipedia2vec)
[![CircleCI](https://circleci.com/gh/wikipedia2vec/wikipedia2vec.svg?style=svg)](https://circleci.com/gh/wikipedia2vec/wikipedia2vec)
Wikipedia2Vec is a tool for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia.
It is developed and maintained by [Studio Ousia](http://www.ousia.jp).
This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space.
Embeddings can easily be trained with a single command, using a publicly available Wikipedia dump as input.
This tool implements the [conventional skip-gram model](https://en.wikipedia.org/wiki/Word2vec) to learn the embeddings of words, and its extension proposed in [Yamada et al. (2016)](https://arxiv.org/abs/1601.01343) to learn the embeddings of entities.
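To make the training objective concrete, here is a minimal NumPy sketch of skip-gram with negative sampling. This is an illustration, not the actual Wikipedia2Vec implementation: the toy corpus, the `ENTITY/` token prefix standing in for entity mentions, and all hyperparameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; tokens prefixed with "ENTITY/" stand in for entity mentions
# (an illustrative convention, not Wikipedia2Vec's internal representation).
corpus = ("tokyo is the capital of japan ENTITY/Tokyo "
          "paris is the capital of france ENTITY/Paris").split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

dim, window, lr, negatives = 16, 2, 0.05, 3
W_in = rng.normal(0.0, 0.1, (len(vocab), dim))   # target embeddings
W_out = rng.normal(0.0, 0.1, (len(vocab), dim))  # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(100):  # epochs
    for pos, token in enumerate(corpus):
        t = idx[token]
        for cpos in range(max(0, pos - window), min(len(corpus), pos + window + 1)):
            if cpos == pos:
                continue
            # One observed context pair plus a few random negative samples.
            pairs = [(idx[corpus[cpos]], 1.0)]
            pairs += [(int(rng.integers(len(vocab))), 0.0) for _ in range(negatives)]
            for s, label in pairs:
                grad = lr * (label - sigmoid(W_in[t] @ W_out[s]))
                h = W_in[t].copy()  # use pre-update values for both gradients
                W_in[t] += grad * W_out[s]
                W_out[s] += grad * h
```

Wikipedia2Vec extends this objective with entity-specific contexts derived from Wikipedia's link structure, as described in the paper linked above.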
An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available [here](https://arxiv.org/abs/1812.06280).
Documentation is available online at [http://wikipedia2vec.github.io/](http://wikipedia2vec.github.io/).
## Basic Usage
Wikipedia2Vec can be installed via PyPI:
```bash
% pip install wikipedia2vec
```
With this tool, embeddings can be learned by running a *train* command with a Wikipedia dump as input.
For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:
```bash
% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE
```
Then, the learned embeddings are written to *MODEL\_FILE*.
Note that this command can take many optional parameters.
Please refer to [our documentation](https://wikipedia2vec.github.io/wikipedia2vec/commands/) for further details.
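After training, the model can also be queried from Python (via `Wikipedia2Vec.load`, per the documentation). Conceptually, a most-similar query reduces to cosine nearest-neighbor search over the embedding matrix; the sketch below demonstrates that idea with made-up toy vectors rather than a real trained model.

```python
import numpy as np

# Toy vocabulary and 2-D vectors standing in for a trained model; with a
# real model you would load MODEL_FILE via Wikipedia2Vec.load() instead.
vocab = ["tokyo", "kyoto", "paris", "banana"]
vecs = np.array([[0.9, 0.1], [0.85, 0.2], [0.2, 0.9], [-0.7, 0.3]])

def most_similar(query, k=2):
    """Return the k tokens with the highest cosine similarity to `query`."""
    q = vecs[vocab.index(query)]
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:k]

# "kyoto" ranks above "paris" and "banana" for the query "tokyo".
neighbors = most_similar("tokyo")
```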
## Pretrained Embeddings
Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from [this page](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/).
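The pretrained files are distributed in several formats, including a word2vec-style text format whose first line gives the vocabulary size and dimensionality, with entity tokens carrying an `ENTITY/` prefix. Below is a minimal reader for that layout; the inline sample data, and the simplifying assumption that tokens contain no spaces, are illustrative.

```python
import io
import numpy as np

# Illustrative sample mimicking the word2vec-style .txt layout
# (first line: "<vocab size> <dimension>").
sample = io.StringIO(
    "4 3\n"
    "the 0.1 0.2 0.3\n"
    "of 0.0 0.1 0.0\n"
    "ENTITY/Tokyo 0.9 0.8 0.7\n"
    "ENTITY/Japan 0.8 0.9 0.6\n"
)

def load_text_embeddings(fileobj):
    """Parse a word2vec-style text file into a {token: vector} dict."""
    n, dim = (int(x) for x in fileobj.readline().split())
    vectors = {}
    for line in fileobj:
        token, *values = line.split()
        vectors[token] = np.asarray(values, dtype=np.float32)
    assert len(vectors) == n and all(v.shape == (dim,) for v in vectors.values())
    return vectors

embeddings = load_text_embeddings(sample)
words = {t for t in embeddings if not t.startswith("ENTITY/")}
```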
## Use Cases
Wikipedia2Vec has been applied to the following tasks:
* Entity linking: [Yamada et al., 2016](https://arxiv.org/abs/1601.01343), [Eshel et al., 2017](https://arxiv.org/abs/1706.09147), [Chen et al., 2019](https://arxiv.org/abs/1911.03834), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681), [van Hulst et al., 2020](https://arxiv.org/abs/2006.01969).
* Named entity recognition: [Sato et al., 2017](http://www.aclweb.org/anthology/I17-2017), [Lara-Clares and Garcia-Serrano, 2019](http://ceur-ws.org/Vol-2421/eHealth-KD_paper_6.pdf).
* Question answering: [Yamada et al., 2017](https://arxiv.org/abs/1803.08652), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).
* Entity typing: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960).
* Text classification: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960), [Yamada and Shindo, 2019](https://arxiv.org/abs/1909.01259), [Alam et al., 2020](https://link.springer.com/chapter/10.1007/978-3-030-61244-3_9).
* Relation classification: [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).
* Paraphrase detection: [Duong et al., 2018](https://ieeexplore.ieee.org/abstract/document/8606845).
* Knowledge graph completion: [Shah et al., 2019](https://aaai.org/ojs/index.php/AAAI/article/view/4162), [Shah et al., 2020](https://www.aclweb.org/anthology/2020.textgraphs-1.9/).
* Fake news detection: [Singh et al., 2019](https://arxiv.org/abs/1906.11126), [Ghosal et al., 2020](https://arxiv.org/abs/2010.10836).
* Plot analysis of movies: [Papalampidi et al., 2019](https://arxiv.org/abs/1908.10328).
* Novel entity discovery: [Zhang et al., 2020](https://arxiv.org/abs/2002.00206).
* Entity retrieval: [Gerritse et al., 2020](https://link.springer.com/chapter/10.1007%2F978-3-030-45439-5_7).
* Deepfake detection: [Zhong et al., 2020](https://arxiv.org/abs/2010.07475).
* Conversational information seeking: [Rodriguez et al., 2020](https://arxiv.org/abs/2005.00172).
* Query expansion: [Rosin et al., 2020](https://arxiv.org/abs/2012.12065).
## References
If you use Wikipedia2Vec in a scientific publication, please cite the following paper:
Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, [Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia](https://arxiv.org/abs/1812.06280).
```
@inproceedings{yamada2020wikipedia2vec,
title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
year = {2020},
publisher = {Association for Computational Linguistics},
pages = {23--30}
}
```
The embedding model was originally proposed in the following paper:
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/abs/1601.01343).
```
@inproceedings{yamada2016joint,
title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
booktitle={Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning},
year={2016},
publisher={Association for Computational Linguistics},
pages={250--259}
}
```
The text classification model implemented in [this example](https://github.com/wikipedia2vec/wikipedia2vec/tree/master/examples/text_classification) was proposed in the following paper:
Ikuya Yamada, Hiroyuki Shindo, [Neural Attentive Bag-of-Entities Model for Text Classification](https://arxiv.org/abs/1909.01259).
```
@inproceedings{yamada2019neural,
title={Neural Attentive Bag-of-Entities Model for Text Classification},
author={Yamada, Ikuya and Shindo, Hiroyuki},
booktitle={Proceedings of the 23rd Conference on Computational Natural Language Learning},
year={2019},
publisher={Association for Computational Linguistics},
pages={563--573}
}
```
## License
[Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)