:Name: data-extractor
:Version: 1.0.0
:Summary: Combine XPath, CSS Selectors and JSONPath for Web data extracting.
:Author: 林玮 (Jade Lin) <linw1995@icloud.com>
:Upload time: 2024-10-12 07:26:44
:Requires Python: >=3.9
:License: MIT
:Keywords: data-extractor, data-extraction, xpath, css-selectors, jsonpath
:Homepage: https://github.com/linw1995/data_extractor
:Documentation: https://data-extractor.readthedocs.io/en/latest/

==============
Data Extractor
==============

|license| |Pypi Status| |Python version| |Package version| |PyPI - Downloads|
|GitHub last commit| |Code style: black| |Build Status| |codecov|
|Documentation Status| |PDM managed|

Combine **XPath**, **CSS Selectors** and **JSONPath** for Web data extracting.

Quickstarts
<<<<<<<<<<<

Installation
~~~~~~~~~~~~

Install the stable version from PyPI.

.. code-block:: shell

    pip install "data-extractor[jsonpath-extractor]"  # for extracting JSON data
    pip install "data-extractor[lxml]"  # for extracting HTML data

Or install the latest version from GitHub.

.. code-block:: shell

    pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"

Extract JSON data
~~~~~~~~~~~~~~~~~

JSON data can currently be extracted with any of the following optional dependencies:

- jsonpath-extractor_
- jsonpath-rw_
- jsonpath-rw-ext_

.. _jsonpath-extractor: https://github.com/linw1995/jsonpath
.. _jsonpath-rw: https://github.com/kennknowles/python-jsonpath-rw
.. _jsonpath-rw-ext: https://python-jsonpath-rw-ext.readthedocs.org/en/latest/

Install one of these dependencies to extract JSON data.

Extract HTML(XML) data
~~~~~~~~~~~~~~~~~~~~~~

HTML or XML data can currently be extracted with the following optional dependencies:

- lxml_ for using XPath_
- cssselect_ for using CSS-Selectors_

.. _lxml: https://lxml.de/
.. _XPath: https://www.w3.org/TR/xpath-10/
.. _cssselect: https://cssselect.readthedocs.io/en/latest/
.. _CSS-Selectors: https://www.w3.org/TR/selectors-3/
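
To give a feel for the kind of query involved, here is a minimal sketch that
uses only the standard library's ``xml.etree.ElementTree``, which implements a
limited subset of XPath. It is a stand-in for illustration and does not call
data_extractor or lxml; the markup and variable names are made up for the
example. A roughly equivalent CSS selector would be ``li.user > a``.

.. code-block:: python3

    import xml.etree.ElementTree as ET

    # Hypothetical markup, used only for this illustration.
    markup = """
    <ul>
      <li class="user"><a href="/u/john">john</a></li>
      <li class="user"><a href="/u/jack">jack</a></li>
    </ul>
    """

    root = ET.fromstring(markup)

    # Select every <a> that is a direct child of an <li class="user">.
    links = root.findall(".//li[@class='user']/a")

    names = [link.text for link in links]
    hrefs = [link.get("href") for link in links]

    assert names == ["john", "jack"]
    assert hrefs == ["/u/john", "/u/jack"]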

Usage
~~~~~

.. code-block:: python3

    from data_extractor import Field, Item, JSONExtractor


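    class Count(Item):
        # Each field reads one key from the element handed to the item.
        followings = Field(JSONExtractor("countFollowings"))
        fans = Field(JSONExtractor("countFans"))


    class User(Item):
        # ``name=`` sets the key used in the result, so ``name_`` is emitted as "name".
        name_ = Field(JSONExtractor("name"), name="name")
        # ``default`` is returned when the expression matches nothing.
        age = Field(JSONExtractor("age"), default=17)
        # Nested item without its own extractor: it reads from the same user element.
        count = Count()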


    assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
        {
            "data": {
                "users": [
                    {
                        "name": "john",
                        "age": 19,
                        "countFollowings": 14,
                        "countFans": 212,
                    },
                    {
                        "name": "jack",
                        "description": "",
                        "countFollowings": 54,
                        "countFans": 312,
                    },
                ]
            }
        }
    ) == [
        {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
        {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
    ]
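
The ``User`` item above can be read as a declarative form of an ordinary loop.
The sketch below spells out the same mapping in plain Python for this kind of
input. It is a hand-written equivalent for illustration, not the library's
implementation, and the helper names ``extract_count`` and ``extract_users``
are hypothetical.

.. code-block:: python3

    def extract_count(user):
        # Mirrors ``Count``: both fields are read from the same user element.
        return {"followings": user["countFollowings"], "fans": user["countFans"]}


    def extract_users(document):
        results = []
        # Mirrors ``JSONExtractor("data.users[*]")`` combined with ``is_many=True``.
        for user in document["data"]["users"]:
            results.append(
                {
                    "name": user["name"],
                    # Mirrors ``default=17``: used only when "age" is absent.
                    "age": user.get("age", 17),
                    "count": extract_count(user),
                }
            )
        return results


    document = {
        "data": {"users": [{"name": "jack", "countFollowings": 54, "countFans": 312}]}
    }
    assert extract_users(document) == [
        {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}}
    ]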

Changelog
<<<<<<<<<

v1.0.0
~~~~~~~~~~

**Feature**

- Generic extractor with convertor (#83)
- mypy plugin for type annotation of extracting result (#83)


Contributing
<<<<<<<<<<<<


Environment Setup
~~~~~~~~~~~~~~~~~

Clone the source code from GitHub.

.. code-block:: shell

    git clone https://github.com/linw1995/data_extractor.git
    cd data_extractor

Set up the development environment.
Make sure the pdm_, pre-commit_ and nox_ CLIs
are installed in your environment first.

.. code-block:: shell

    make init
    make PYTHON=3.9 init  # for a specific Python version

Linting
~~~~~~~

Use pre-commit_ to install the linters that enforce a consistent code style.

.. code-block:: shell

    make pre-commit

Run the linters. Some of them run through the nox_ CLI, so make sure it is installed.

.. code-block:: shell

    make check-all

Testing
~~~~~~~

Run quick tests.

.. code-block:: shell

    make

Run quick tests with verbose output.

.. code-block:: shell

    make vtest

Run tests with coverage.
Testing across multiple Python environments is powered by the nox_ CLI.

.. code-block:: shell

    make cov

.. _pdm: https://github.com/pdm-project/pdm
.. _pre-commit: https://pre-commit.com/
.. _nox: https://nox.thea.codes/en/stable/

.. |license| image:: https://img.shields.io/github/license/linw1995/data_extractor.svg
    :target: https://github.com/linw1995/data_extractor/blob/master/LICENSE

.. |Pypi Status| image:: https://img.shields.io/pypi/status/data_extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |Python version| image:: https://img.shields.io/pypi/pyversions/data_extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |Package version| image:: https://img.shields.io/pypi/v/data_extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |PyPI - Downloads| image:: https://img.shields.io/pypi/dm/data-extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |GitHub last commit| image:: https://img.shields.io/github/last-commit/linw1995/data_extractor.svg
    :target: https://github.com/linw1995/data_extractor

.. |Code style: black| image:: https://img.shields.io/badge/code%20style-black-000000.svg
    :target: https://github.com/ambv/black

.. |Build Status| image:: https://github.com/linw1995/data_extractor/workflows/Lint&Test/badge.svg
    :target: https://github.com/linw1995/data_extractor/actions?query=workflow%3ALint%26Test

.. |codecov| image:: https://codecov.io/gh/linw1995/data_extractor/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/linw1995/data_extractor

.. |Documentation Status| image:: https://readthedocs.org/projects/data-extractor/badge/?version=latest
    :target: https://data-extractor.readthedocs.io/en/latest/?badge=latest

.. |PDM managed| image:: https://img.shields.io/badge/pdm-managed-blueviolet
    :target: https://pdm.fming.dev


            
