==============
Data Extractor
==============
|license| |Pypi Status| |Python version| |Package version| |PyPI - Downloads|
|GitHub last commit| |Code style: black| |Build Status| |codecov|
|Documentation Status| |PDM managed|
Combine **XPath**, **CSS Selectors** and **JSONPath** to extract data from the Web.
Quickstarts
<<<<<<<<<<<
Installation
~~~~~~~~~~~~
Install the stable version from PyPI.

.. code-block:: shell

    pip install "data-extractor[jsonpath-extractor]"  # for extracting JSON data
    pip install "data-extractor[lxml]"  # for extracting HTML data

Or install the latest version from GitHub.

.. code-block:: shell

    pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"

Extract JSON data
~~~~~~~~~~~~~~~~~
JSON extraction is currently supported through the following optional dependencies:
- jsonpath-extractor_
- jsonpath-rw_
- jsonpath-rw-ext_
.. _jsonpath-extractor: https://github.com/linw1995/jsonpath
.. _jsonpath-rw: https://github.com/kennknowles/python-jsonpath-rw
.. _jsonpath-rw-ext: https://python-jsonpath-rw-ext.readthedocs.org/en/latest/
Install any one of them to extract JSON data.
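
To see which of these backends is already present in an environment, a small check such as the one below may help. It is only a sketch: the module names in ``JSON_BACKENDS`` are assumptions about what each distribution installs, and ``available_backends`` is a hypothetical helper, not part of this library.

.. code-block:: python3

    from importlib.util import find_spec
    from typing import Dict

    # Assumed mapping from PyPI distribution name to importable module name.
    # Verify it against each project's own documentation.
    JSON_BACKENDS: Dict[str, str] = {
        "jsonpath-extractor": "jsonpath",
        "jsonpath-rw": "jsonpath_rw",
        "jsonpath-rw-ext": "jsonpath_rw_ext",
    }


    def available_backends(backends: Dict[str, str] = JSON_BACKENDS) -> Dict[str, bool]:
        """Report, per distribution, whether its module can be found."""
        return {dist: find_spec(module) is not None for dist, module in backends.items()}


    status = available_backends()
    for dist, installed in status.items():
        print(f"{dist}: {'installed' if installed else 'missing'}")
    if not any(status.values()):
        print("No JSON backend found; install one of the packages listed above.")

The check only looks up the module without importing it, so it is safe to run before any backend is installed.
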
Extract HTML(XML) data
~~~~~~~~~~~~~~~~~~~~~~
HTML (XML) extraction is currently supported through the following optional dependencies:
- lxml_ for using XPath_
- cssselect_ for using CSS-Selectors_
.. _lxml: https://lxml.de/
.. _XPath: https://www.w3.org/TR/xpath-10/
.. _cssselect: https://cssselect.readthedocs.io/en/latest/
.. _CSS-Selectors: https://www.w3.org/TR/selectors-3/
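
For a feel of the kind of XPath_ query involved, here is a standard-library illustration. It deliberately does not use this package or lxml_: ``xml.etree.ElementTree`` supports only a limited XPath subset, and the sample markup and variable names are made up for the example.

.. code-block:: python3

    import xml.etree.ElementTree as ET

    # Made-up sample document, kept well-formed so the XML parser accepts it.
    DOCUMENT = """
    <html>
      <body>
        <ul>
          <li class="user"><a href="/john">john</a></li>
          <li class="user"><a href="/jack">jack</a></li>
          <li class="ad"><a href="/promo">promo</a></li>
        </ul>
      </body>
    </html>
    """

    root = ET.fromstring(DOCUMENT)

    # Select the links inside list items whose class attribute is "user".
    links = root.findall(".//li[@class='user']/a")
    names = [link.text for link in links]
    hrefs = [link.get("href") for link in links]

    print(names)
    print(hrefs)

Real-world HTML is rarely well-formed XML, which is one reason a dedicated parser such as lxml_ is the listed dependency.
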
Usage
~~~~~
.. code-block:: python3

    from data_extractor import Field, Item, JSONExtractor


    class Count(Item):
        followings = Field(JSONExtractor("countFollowings"))
        fans = Field(JSONExtractor("countFans"))


    class User(Item):
        name_ = Field(JSONExtractor("name"), name="name")
        age = Field(JSONExtractor("age"), default=17)
        count = Count()


    assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
        {
            "data": {
                "users": [
                    {
                        "name": "john",
                        "age": 19,
                        "countFollowings": 14,
                        "countFans": 212,
                    },
                    {
                        "name": "jack",
                        "description": "",
                        "countFollowings": 54,
                        "countFans": 312,
                    },
                ]
            }
        }
    ) == [
        {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
        {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
    ]

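
For comparison, the same result can be produced by hand in plain Python. The sketch below is not how the library works internally; ``extract_users`` and ``DEFAULT_AGE`` are hypothetical names used only to show what the declarative ``User`` item above amounts to for this input: iterate over ``data.users[*]``, rename keys, nest the counts, and fall back to a default when ``age`` is missing.

.. code-block:: python3

    from typing import Any, Dict, List

    DEFAULT_AGE = 17  # mirrors Field(JSONExtractor("age"), default=17)


    def extract_users(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Hand-written equivalent of the declarative example's output."""
        results = []
        for raw in payload["data"]["users"]:  # corresponds to data.users[*]
            results.append(
                {
                    "name": raw["name"],
                    # A missing "age" key falls back to the default value.
                    "age": raw.get("age", DEFAULT_AGE),
                    "count": {
                        "followings": raw["countFollowings"],
                        "fans": raw["countFans"],
                    },
                }
            )
        return results


    payload = {
        "data": {
            "users": [
                {"name": "john", "age": 19, "countFollowings": 14, "countFans": 212},
                {"name": "jack", "description": "", "countFollowings": 54, "countFans": 312},
            ]
        }
    }
    users = extract_users(payload)
    print(users)

The declarative form keeps the field names, paths and defaults in one place, whereas the hand-written loop has to be edited whenever the payload shape changes.
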
Changelog
<<<<<<<<<
v1.0.1
~~~~~~
**Build**
- Supports Python 3.13
Contributing
<<<<<<<<<<<<
Environment Setup
~~~~~~~~~~~~~~~~~
Clone the source code from GitHub.

.. code-block:: shell

    git clone https://github.com/linw1995/data_extractor.git
    cd data_extractor

Set up the development environment.
Please make sure the pdm_, pre-commit_ and nox_ CLIs are installed
in your environment.

.. code-block:: shell

    make init
    make PYTHON=3.9 init  # for a specific Python version

Linting
~~~~~~~
Use pre-commit_ to install the linters that keep the code style consistent.

.. code-block:: shell

    make pre-commit

Run the linters. Some of them run through the nox_ CLI, so make sure it is installed.

.. code-block:: shell

    make check-all

Testing
~~~~~~~
Run quick tests.

.. code-block:: shell

    make

Run quick tests with verbose output.

.. code-block:: shell

    make vtest

Run tests with coverage.
Testing in multiple Python environments is powered by the nox_ CLI.

.. code-block:: shell

    make cov

.. _pdm: https://github.com/pdm-project/pdm
.. _pre-commit: https://pre-commit.com/
.. _nox: https://nox.thea.codes/en/stable/

.. |license| image:: https://img.shields.io/github/license/linw1995/data_extractor.svg
    :target: https://github.com/linw1995/data_extractor/blob/master/LICENSE

.. |Pypi Status| image:: https://img.shields.io/pypi/status/data_extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |Python version| image:: https://img.shields.io/pypi/pyversions/data_extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |Package version| image:: https://img.shields.io/pypi/v/data_extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |PyPI - Downloads| image:: https://img.shields.io/pypi/dm/data-extractor.svg
    :target: https://pypi.org/project/data_extractor

.. |GitHub last commit| image:: https://img.shields.io/github/last-commit/linw1995/data_extractor.svg
    :target: https://github.com/linw1995/data_extractor

.. |Code style: black| image:: https://img.shields.io/badge/code%20style-black-000000.svg
    :target: https://github.com/ambv/black

.. |Build Status| image:: https://github.com/linw1995/data_extractor/workflows/Lint&Test/badge.svg
    :target: https://github.com/linw1995/data_extractor/actions?query=workflow%3ALint%26Test

.. |codecov| image:: https://codecov.io/gh/linw1995/data_extractor/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/linw1995/data_extractor

.. |Documentation Status| image:: https://readthedocs.org/projects/data-extractor/badge/?version=latest
    :target: https://data-extractor.readthedocs.io/en/latest/?badge=latest

.. |PDM managed| image:: https://img.shields.io/badge/pdm-managed-blueviolet
    :target: https://pdm.fming.dev

Raw data
{
"_id": null,
"home_page": null,
"name": "data-extractor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "data-extractor, data-extraction, xpath, css-selectors, jsonpath",
"author": null,
"author_email": "\u6797\u73ae (Jade Lin) <linw1995@icloud.com>",
"download_url": "https://files.pythonhosted.org/packages/19/ab/4a9fff19fe0fcb15eb83083fa51288bd1f21aa1acbc12fba8dd93c8c6597/data-extractor-1.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Combine XPath, CSS Selectors and JSONPath for Web data extracting.",
"version": "1.0.1",
"project_urls": {
"documentation": "https://data-extractor.readthedocs.io/en/latest/",
"homepage": "https://github.com/linw1995/data_extractor",
"repository": "https://github.com/linw1995/data_extractor"
},
"split_keywords": [
"data-extractor",
" data-extraction",
" xpath",
" css-selectors",
" jsonpath"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1f9e4c7f72bd7e7a0879eb751c8599dda74a8542d736e97c8f673eaff8226b68",
"md5": "a1e87a5b66c2376a1429bd55cd603df8",
"sha256": "2126dc68207b650ae884cac6caf8dec0388c6334ad379069b166d6336e27b1e7"
},
"downloads": -1,
"filename": "data_extractor-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a1e87a5b66c2376a1429bd55cd603df8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 18346,
"upload_time": "2024-10-13T03:23:42",
"upload_time_iso_8601": "2024-10-13T03:23:42.842034Z",
"url": "https://files.pythonhosted.org/packages/1f/9e/4c7f72bd7e7a0879eb751c8599dda74a8542d736e97c8f673eaff8226b68/data_extractor-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "19ab4a9fff19fe0fcb15eb83083fa51288bd1f21aa1acbc12fba8dd93c8c6597",
"md5": "e579f24780210425917e241421abdf42",
"sha256": "3ff9424d4859ecd1a4f3b0f9f5b614117a0efa2747493365169b54c6af51aa90"
},
"downloads": -1,
"filename": "data-extractor-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "e579f24780210425917e241421abdf42",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 28153,
"upload_time": "2024-10-13T03:23:44",
"upload_time_iso_8601": "2024-10-13T03:23:44.520854Z",
"url": "https://files.pythonhosted.org/packages/19/ab/4a9fff19fe0fcb15eb83083fa51288bd1f21aa1acbc12fba8dd93c8c6597/data-extractor-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-13 03:23:44",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "linw1995",
"github_project": "data_extractor",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "data-extractor"
}