mwsql


:Name: mwsql
:Version: 1.0.4
:Summary: mwsql is a set of utilities for processing MediaWiki SQL dump data
:Upload time: 2024-02-19 08:24:59
:Author: Slavina Stefanova
:Requires Python: >=3.9,<4.0
:License: GPL-3.0-or-later

.. image:: https://badge.fury.io/py/mwsql.svg
   :target: https://badge.fury.io/py/mwsql

.. image:: https://github.com/mediawiki-utilities/python-mwsql/actions/workflows/test.yml/badge.svg
   :target: https://github.com/mediawiki-utilities/python-mwsql/actions/workflows/test.yml

.. image:: https://readthedocs.org/projects/mwsql/badge/?version=latest
   :target: https://mwsql.readthedocs.io/en/latest/


Overview
========

``mwsql`` provides utilities for working with Wikimedia SQL dump files.
It supports Python 3.9 and later versions.

``mwsql`` abstracts the messiness of working with SQL dump files.
Each Wikimedia SQL dump file contains one database table.
The most common use case for ``mwsql`` is to convert this table into an instance of the more user-friendly ``Dump`` class.
This lets you access the table's metadata (database name, field names, data types, etc.) as attributes, and its rows as a generator.
Because generators are evaluated lazily, you can process datasets that are larger than memory.
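
The benefit of consuming rows lazily can be illustrated without ``mwsql`` itself.
The following is a minimal sketch that uses only the standard library; ``fake_rows`` is a hypothetical stand-in for a row generator and is not part of the ``mwsql`` API.

.. code-block:: python

   from itertools import islice


   def fake_rows(n):
       """Hypothetical stand-in for a row generator: yields one row at a time."""
       for i in range(1, n + 1):
           yield [str(i), f"tag-{i}", "0", str(i * 10)]


   # Nothing is produced until the generator is consumed.
   rows = fake_rows(100_000)

   # Peek at the first three rows; only those three are built at this point.
   first_three = list(islice(rows, 3))

   # Aggregate over the remaining rows, holding one row in memory at a time.
   total = sum(int(row[3]) for row in rows)

Note that a generator can only be consumed once: after the peek, the aggregation continues from the fourth row.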

``mwsql`` also provides a method to convert SQL dump files into CSV.
You can find more information on how to use ``mwsql`` in the `usage examples`_.
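
To show the general idea behind a row-by-row CSV export, here is a minimal sketch based on the standard ``csv`` module.
It is not ``mwsql``'s own conversion method (see the usage examples for that); ``write_rows_to_csv`` and the sample rows are assumptions made for illustration.

.. code-block:: python

   import csv
   import io


   def write_rows_to_csv(col_names, rows, out):
       """Write a header line and then each row to an open text file object."""
       writer = csv.writer(out)
       writer.writerow(col_names)
       count = 0
       for row in rows:  # rows may be a generator; it is consumed one row at a time
           writer.writerow(row)
           count += 1
       return count


   col_names = ["ctd_id", "ctd_name", "ctd_user_defined", "ctd_count"]
   rows = iter([["1", "mw-replace", "0", "10453"], ["5", "mobile edit", "0", "234682"]])

   buffer = io.StringIO()  # stands in for open("out.csv", "w", newline="")
   written = write_rows_to_csv(col_names, rows, buffer)
   csv_text = buffer.getvalue()

Because rows are written as they are read, memory use stays flat regardless of the size of the table.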


Installation
------------

You can install ``mwsql`` with ``pip``:

.. code-block:: bash

   $ pip install mwsql


Basic Usage
-----------

.. code-block:: pycon

   >>> from mwsql import Dump
   >>> dump = Dump.from_file('simplewiki-latest-change_tag_def.sql.gz')
   >>> dump.head(5)
   ['ctd_id', 'ctd_name', 'ctd_user_defined', 'ctd_count']
   ['1', 'mw-replace', '0', '10453']
   ['2', 'visualeditor', '0', '309141']
   ['3', 'mw-undo', '0', '59767']
   ['4', 'mw-rollback', '0', '71585']
   ['5', 'mobile edit', '0', '234682']
   >>> dump.dtypes
   {'ctd_id': int, 'ctd_name': str, 'ctd_user_defined': int, 'ctd_count': int}
   >>> rows = dump.rows(convert_dtypes=True)
   >>> next(rows)
   [1, 'mw-replace', 0, 10453]


Known Issues
------------


Encoding errors
~~~~~~~~~~~~~~~

Wikimedia SQL dumps use UTF-8 encoding.
Unfortunately, some fields can contain characters that cannot be decoded as UTF-8, which raises an encoding error when parsing the dump file.
If this happens while reading in the file, try again using a different encoding. ``latin-1`` will sometimes solve the problem; if not, try other encodings.
If the error is raised while iterating over the rows, you don't need to recreate the dump: just set a new encoding via the ``dump.encoding`` attribute.
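
The fallback strategy described above can be sketched with plain ``bytes.decode``.
This is an illustration using only the standard library, not ``mwsql`` code; ``decode_with_fallback`` and the sample bytes are hypothetical.

.. code-block:: python

   def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
       """Return (text, encoding) for the first encoding that decodes raw."""
       for encoding in encodings:
           try:
               return raw.decode(encoding), encoding
           except UnicodeDecodeError:
               continue
       raise ValueError(f"none of {encodings} could decode the input")


   # b"\xe9" is "é" in latin-1 but an invalid byte sequence in UTF-8.
   raw_line = b"caf\xe9"
   text, used = decode_with_fallback(raw_line)

Keep in mind that ``latin-1`` maps every possible byte to a character, so it never raises a decoding error.
A successful decode therefore does not guarantee that the resulting text is what the author intended, and it is worth spot-checking the affected rows.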


Parsing errors
~~~~~~~~~~~~~~

Some Wikimedia SQL dumps contain string-type fields that are sometimes not correctly parsed, resulting in fields being split up into several parts.
This is more likely to happen when parsing dumps containing file names from Wikimedia Commons or containing external links with many query parameters.
If you're parsing any of the other dumps, you're unlikely to run into this issue.

In most cases, this issue affects a very small proportion of the total rows parsed.
For instance, the Wikimedia Commons ``page`` dump contains approximately 99 million entries, of which ~13,000 are incorrectly parsed.
The Wikimedia Commons ``pagelinks`` dump, on the other hand, contains ~760 million records, of which only 20 are wrongly parsed.

This issue is most commonly caused by the parser mistaking a single quote (or apostrophe, as they're identical) within a string for the single quote that marks the end of said string.
There's currently no known workaround other than manually removing the rows that contain more fields than expected or, if they are relatively few, manually merging the split fields.
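
Removing rows with an unexpected number of fields can be done while iterating.
The following is a minimal sketch under the assumption that the expected field count equals the number of column names; ``well_formed`` and the sample rows are hypothetical and not part of ``mwsql``.

.. code-block:: python

   def well_formed(rows, n_fields, rejected):
       """Yield rows with exactly n_fields; append all other rows to rejected."""
       for row in rows:
           if len(row) == n_fields:
               yield row
           else:
               rejected.append(row)


   col_names = ["page_id", "page_title"]
   rows = iter([
       ["1", "Main_Page"],
       ["2", "Rock_", "n", "_Roll"],  # hypothetical row split at an apostrophe
       ["3", "Help"],
   ])

   rejected = []
   clean = list(well_formed(rows, len(col_names), rejected))

Collecting the rejected rows in a separate list keeps them available for manual inspection or merging.
For very large dumps, you may prefer to write them to a file instead of holding them in memory.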

Future versions of ``mwsql`` will improve the parser so that it correctly distinguishes single quotes that delimit a string from those that belong to the string's content. For now, be aware that this problem exists.


Project information
-------------------

``mwsql`` is released under the `GPLv3`_.
You can find the complete documentation at `Read the Docs`_. If you run into bugs, you can file them in our `issue tracker`_.
Have ideas on how to make ``mwsql`` better?
Contributions are most welcome – we have put together a guide on how to `get started`_.


.. _`GPLv3`: https://choosealicense.com/licenses/gpl-3.0/
.. _`Read the Docs`: https://mwsql.readthedocs.io/en/latest/
.. _`usage examples`: https://mwsql.readthedocs.io/en/latest/examples.html
.. _`get started`: https://mwsql.readthedocs.io/en/latest/contributing.html
.. _`issue tracker`: https://github.com/blancadesal/mwsql/issues

            
