:Name: wikimapper
:Version: 0.1.17
:Home page: https://github.com/jcklie/wikimapper
:Summary: Mapping Wikidata and Wikipedia entities to each other
:Upload time: 2023-01-21 21:51:46
:Author: Jan-Christoph Klie
:Requires Python: >=3.5.0
:License: Apache License 2.0
:Keywords: wikidata, wikipedia, wiki, kb, knowledge-base

wikimapper
==========

.. image:: https://img.shields.io/pypi/l/wikimapper.svg
  :alt: PyPI - License
  :target: https://pypi.org/project/wikimapper/

.. image:: https://img.shields.io/pypi/pyversions/wikimapper.svg
  :alt: PyPI - Python Version
  :target: https://pypi.org/project/wikimapper/

.. image:: https://img.shields.io/pypi/v/wikimapper.svg
  :alt: PyPI
  :target: https://pypi.org/project/wikimapper/

.. image:: https://img.shields.io/badge/code%20style-black-000000.svg
  :target: https://github.com/ambv/black  

This small Python library helps you map Wikipedia page titles to Wikidata IDs (e.g. `Manatee
<https://en.wikipedia.org/wiki/Manatee>`_ to `Q42797 <https://www.wikidata.org/wiki/Q42797>`_)
and vice versa. It does so by building an index of these mappings from a Wikipedia SQL dump.
Precomputed indices can be found under `Precomputed indices`_. Redirects are taken into account.

Installation
------------

This package can be installed via ``pip``, the Python package manager.

.. code:: bash

    pip install wikimapper

If you only need the mapping functionality, you can also download ``wikimapper/mapper.py`` and
add it to your project. It does not have any external dependencies.

Usage
-----

Using the mapping functionality requires a precomputed index. It is created from Wikipedia
SQL dumps (see `Create your own index`_) or can be downloaded for certain languages
(see `Precomputed indices`_). The following examples assume that an index has either
been created or downloaded. Using the command line for batch mapping is not recommended,
as it requires repeatedly opening and closing the database, which incurs a speed penalty.
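
For batch mapping from Python, a single mapper instance can be reused for all lookups.
The following is a minimal sketch and not part of the package: ``map_titles`` is a
hypothetical helper, and ``StubMapper`` is a stand-in that only mimics ``title_to_id`` so
that the example runs without an index. With a real index you would pass
``WikiMapper("index_enwiki-latest.db")`` instead. It assumes that ``title_to_id`` returns
``None`` for titles it cannot map.

.. code:: python

    class StubMapper:
        """Stand-in for WikiMapper (hypothetical), backed by a plain dict."""

        def __init__(self, mapping):
            self._mapping = mapping

        def title_to_id(self, title):
            # Assumption: like the real mapper, unknown titles give None.
            return self._mapping.get(title)


    def map_titles(mapper, titles):
        """Map many page titles with one mapper instance (hypothetical helper)."""
        return {title: mapper.title_to_id(title) for title in titles}


    mapper = StubMapper({"Germany": "Q183", "Manatee": "Q42797"})
    result = map_titles(mapper, ["Germany", "Manatee", "No_such_page"])
    print(result)

Because the mapper is created once and passed in, the index is opened a single time no
matter how many titles are mapped.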

Map Wikipedia page title to Wikidata id
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from wikimapper import WikiMapper

    mapper = WikiMapper("index_enwiki-latest.db")
    wikidata_id = mapper.title_to_id("Python_(programming_language)")
    print(wikidata_id) # Q28865

or from the command line via

.. code:: bash

    $ wikimapper title2id index_enwiki-latest.db Germany
    Q183

Map Wikipedia URL to Wikidata id
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from wikimapper import WikiMapper

    mapper = WikiMapper("index_enwiki-latest.db")
    wikidata_id = mapper.url_to_id("https://en.wikipedia.org/wiki/Python_(programming_language)")
    print(wikidata_id) # Q28865

or from the command line via

.. code:: bash

    $ wikimapper url2id index_enwiki-latest.db https://en.wikipedia.org/wiki/Germany
    Q183

Note that it is not checked whether the URL originates from the same Wiki as the index you created!
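
Since the library does not perform this check, callers may want to guard against it
themselves. The sketch below is one possible way to do so using only the standard
library; ``url_matches_wiki`` is a hypothetical helper and not part of the package, and
it assumes that the index was built for ``<language>.wikipedia.org``.

.. code:: python

    from urllib.parse import urlparse


    def url_matches_wiki(url, language="en"):
        """Return True if the URL's host is <language>.wikipedia.org."""
        host = urlparse(url).hostname or ""
        return host == language + ".wikipedia.org"


    print(url_matches_wiki("https://en.wikipedia.org/wiki/Germany"))
    print(url_matches_wiki("https://de.wikipedia.org/wiki/Deutschland"))

The comparison is deliberately strict, so other hosts such as mobile subdomains are
rejected as well; relax it if your data contains such URLs.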

Map Wikidata id to Wikipedia page title
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

    from wikimapper import WikiMapper

    mapper = WikiMapper("index_enwiki-latest.db")
    titles = mapper.id_to_titles("Q183")
    print(titles) # Germany, Deutschland, ...

or from the command line via

.. code:: bash

    $ wikimapper id2titles data/index_enwiki-latest.db Q183
    Germany
    Bundesrepublik_Deutschland
    Land_der_Dichter_und_Denker
    Jerman
    ...

Mapping an id to titles can yield more than one result, because some pages in Wikipedia are
redirects that all link to the same Wikidata item.

Create your own index
~~~~~~~~~~~~~~~~~~~~~

While some indices are precomputed, it is sometimes useful to create your own. The
following section describes the steps needed. Regarding creation speed: the index creation
code works, but is not optimized. It takes around 10 minutes on my notebook (T480s)
to create the index for English Wikipedia if the data is already downloaded.

**1. Download the data**

The easiest way is to use the command line tool that ships with this package. It
can be invoked, for example, by

.. code:: bash

    $ wikimapper download enwiki-latest --dir data

Use ``wikimapper download --help`` for a full description of the tool.

The abbreviation for the Wiki of your choice can be found on `Wikipedia
<https://en.wikipedia.org/wiki/List_of_Wikipedias>`_. Available SQL dumps can be
found, for example, on `Wikimedia <https://dumps.wikimedia.org/>`_. You need to append
the Wiki name, e.g. ``https://dumps.wikimedia.org/dewiki/`` for the German one.
If possible, use a different mirror than the default in order to spread the resource usage.
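
The dump location can also be put together programmatically. The sketch below is not
part of the package: ``dump_urls`` is a hypothetical helper, and the file naming
(``<wiki>-latest-<table>.sql.gz`` inside a ``latest/`` directory) is an assumption about
the dump layout that should be checked against the listing of the mirror you use. The
three tables are the ones named in the FAQ: page, page props and redirect.

.. code:: python

    def dump_urls(wiki, base="https://dumps.wikimedia.org"):
        """Build the assumed URLs of the three SQL dumps for one Wiki."""
        tables = ("page", "page_props", "redirect")
        pattern = "{0}/{1}/latest/{1}-latest-{2}.sql.gz"
        return [pattern.format(base, wiki, table) for table in tables]


    for url in dump_urls("dewiki"):
        print(url)

Passing a different ``base`` makes it easy to point the same code at another mirror.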

**2. Create the index**

The next step is to create an index from the downloaded dump. The easiest way is to use
the command line tool that ships with this package. It can be invoked, for example, by

.. code:: bash

    $ wikimapper create enwiki-latest --dumpdir data --target data/index_enwiki-latest.db

This creates an index for the previously downloaded dump and saves it in ``data/index_enwiki-latest.db``.
Use ``wikimapper create --help`` for a full description of the tool.

Precomputed indices
-------------------

.. _precomputed:

Several precomputed indices can be found `here <https://public.ukp.informatik.tu-darmstadt.de/wikimapper/>`_.

Command line interface
----------------------

This package comes with a command line interface that is automatically available
when installing via ``pip``. It can be invoked by ``wikimapper`` from the command
line.

::

    $ wikimapper

    usage: wikimapper [-h] [--version]
                      {download,create,title2id,url2id,id2titles} ...

    Map Wikipedia page titles to Wikidata IDs and vice versa.

    positional arguments:
      {download,create,title2id,url2id,id2titles}
                            sub-command help
        download            Download Wikipedia dumps for creating a custom index.
        create              Use a previously downloaded Wikipedia dump to create a
                            custom index.
        title2id            Map a Wikipedia title to a Wikidata ID.
        url2id              Map a Wikipedia URL to a Wikidata ID.
        id2titles           Map a Wikidata ID to one or more Wikipedia titles.

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit

See ``wikimapper ${sub-command} --help`` for more information.

Development
-----------

The required dependencies are managed by **pip**. A virtual environment
containing all needed packages for development and production can be
created and activated by

::

    virtualenv venv --python=python3
    source venv/bin/activate
    pip install -e ".[test, dev, doc]"

The tests can be run in the current environment by invoking

::

    make test

or in a clean environment via

::

    tox

FAQ
---

How does the parsing of the dump work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`jamesmishra <https://github.com/jamesmishra/mysqldump-to-csv>`__ noticed that
SQL dumps from Wikipedia look almost like CSV and provides some basic functions
to parse insert statements into tuples. We use the Wikipedia SQL page
dump to get the mapping between title and internal id, the page props dump to get
the Wikidata ID for a title, and finally the redirect dump to fill in
titles that are only redirects and do not have an entry in the page props table.
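
The idea can be illustrated with a small, self-contained sketch. It is neither the code
from ``mysqldump-to-csv`` nor the code used by this package: ``parse_insert_values`` and
``build_mapping`` are hypothetical names, the parser is a simplified character-level
splitter, and the column positions as well as the ``wikibase_item`` property name are
assumptions about the dump layout. The three input lines are made-up toy data.

.. code:: python

    def parse_insert_values(line):
        """Yield one tuple of raw string fields per row of an INSERT line.

        Simplification: inside quotes, a backslash just keeps the next
        character literally; escape sequences are not decoded.
        """
        start = line.index(" VALUES ") + len(" VALUES ")
        row, field = [], []
        in_row = in_quote = escaped = False
        for ch in line[start:]:
            if in_quote:
                if escaped:
                    field.append(ch)
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == "'":
                    in_quote = False
                else:
                    field.append(ch)
            elif ch == "'":
                in_quote = True
            elif ch == "(" and not in_row:
                in_row, row, field = True, [], []
            elif ch == "," and in_row:
                row.append("".join(field))
                field = []
            elif ch == ")" and in_row:
                row.append("".join(field))
                yield tuple(row)
                in_row = False
            elif in_row:
                field.append(ch)


    def build_mapping(pages, props, redirects):
        """Combine (id, title), (id, wikidata_id) and (source_id, target_title)."""
        id_to_title = dict(pages)
        title_to_qid = {
            id_to_title[page_id]: qid
            for page_id, qid in props
            if page_id in id_to_title
        }
        # Redirect pages without their own entry inherit the ID of their target.
        for source_id, target_title in redirects:
            source_title = id_to_title.get(source_id)
            if (
                source_title is not None
                and source_title not in title_to_qid
                and target_title in title_to_qid
            ):
                title_to_qid[source_title] = title_to_qid[target_title]
        return title_to_qid


    # Toy data; assumed columns: (id, namespace, title), (id, name, value),
    # (source id, namespace, target title).
    page_line = "INSERT INTO `page` VALUES (1,0,'Manatee'),(2,0,'Manatees');"
    props_line = "INSERT INTO `page_props` VALUES (1,'wikibase_item','Q42797');"
    redirect_line = "INSERT INTO `redirect` VALUES (2,0,'Manatee');"

    pages = [(r[0], r[2]) for r in parse_insert_values(page_line)]
    props = [
        (r[0], r[2])
        for r in parse_insert_values(props_line)
        if r[1] == "wikibase_item"
    ]
    redirects = [(r[0], r[2]) for r in parse_insert_values(redirect_line)]
    mapping = build_mapping(pages, props, redirects)
    print(mapping)

In this toy example the redirect page has no page props entry of its own, so it receives
the Wikidata ID of the page it points to.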

Why do you not use the Wikidata SPARQL endpoint for that?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to query the official Wikidata SPARQL endpoint to do the mapping:

.. code:: sparql

    prefix schema: <http://schema.org/>
    SELECT * WHERE {
      <https://en.wikipedia.org/wiki/Manatee> schema:about ?item .
    }

This has several issues. First, it uses the network, which is slow. Second, I try to use
that endpoint as infrequently as possible to save their resources (my use case is to map
data sets that easily have tens of thousands of entries). Third, I had coverage issues due
to redirects in Wikipedia not being resolved (around 20% of the time for some older data sets).
So I created this package to do the mapping offline instead.

Acknowledgements
----------------

I am very thankful to `jamesmishra <https://github.com/jamesmishra>`__ for providing
`mysqldump-to-csv <https://github.com/jamesmishra/mysqldump-to-csv>`__. Also,
`mbugert <https://github.com/mbugert>`__ helped me tremendously in understanding
Wikipedia dumps and gave me the idea of how to do the mapping.
            
