python-dataservice


:Name: python-dataservice
:Version: 0.14.0
:Summary: Lightweight async data gathering for Python
:Author: NomadMonad
:License: MIT
:Requires Python: <4.0,>=3.11
:Upload time: 2024-11-18 11:53:03
:Keywords: async, data gathering, scraping, web scraping, web crawling, crawling, data extraction, data scraping, API, data

.. image:: https://img.shields.io/pypi/pyversions/python-dataservice.svg
   :alt: Python Versions

DataService
===========

Lightweight - async - data gathering for Python.
____________________________________________________________________________________
DataService is a lightweight web scraping and general purpose data gathering library for Python.

Designed for simplicity, it's built upon common web scraping and data gathering patterns.

No complex API to learn, just standard Python idioms.

Dual synchronous and asynchronous support.

Installation
------------
Please note that DataService requires Python 3.11 or higher.

You can install DataService via pip:

.. code-block:: bash

    pip install python-dataservice


You can also install the optional ``playwright`` dependency to use the ``PlaywrightClient``:

.. code-block:: bash

    pip install "python-dataservice[playwright]"

To install the browser binaries that Playwright needs, run:

.. code-block:: bash

    python -m playwright install

or simply:

.. code-block:: bash

    playwright install

How to use DataService
----------------------

To start, create a ``DataService`` instance from an ``Iterable`` of ``Request`` objects. This gives you an ``Iterator`` of data objects that you can iterate over or convert to a ``list``, a ``tuple``, a ``pd.DataFrame``, or any other data structure of your choice.

.. code-block:: python

    start_requests = [Request(url="https://books.toscrape.com/index.html", callback=parse_books_page, client=HttpXClient())]
    data_service = DataService(start_requests)
    data = tuple(data_service)
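
Because the result is an ordinary iterator of data objects, it can be passed to any standard-library consumer. The sketch below is only an illustration under stated assumptions: ``fake_results`` is a hypothetical stand-in for iterating a ``DataService`` instance, and the rows it yields are made up. It materialises the items once and writes them out as CSV.

```python
import csv
import io
from typing import Iterator


def fake_results() -> Iterator[dict]:
    # Hypothetical stand-in for iterating a DataService instance; yields plain dicts.
    yield {"url": "https://example.com/a", "articles": 20}
    yield {"url": "https://example.com/b", "articles": 12}


# An iterator can only be consumed once, so materialise it before reusing the items.
rows = list(fake_results())

# Write the collected rows to an in-memory CSV buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "articles"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The same ``rows`` list could equally be handed to ``pd.DataFrame(rows)`` if pandas is installed.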

A ``Request`` is a ``Pydantic`` model that includes the URL to fetch, a reference to the ``client`` callable, and a ``callback`` function for parsing the ``Response`` object.

The client can be any async Python callable that accepts a ``Request`` object and returns a ``Response`` object.
``DataService`` provides an ``HttpXClient`` class by default, which is based on the ``httpx`` library, but you are free to use your own custom async client.
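
The shape of a custom client can be sketched as follows. This is a minimal illustration, not DataService's own code: the ``Request`` and ``Response`` classes here are simplified stand-ins defined with ``dataclasses`` (the real models are Pydantic models whose fields may differ), and ``StaticClient`` is a hypothetical name. The point is only that a client is an async callable taking a request and returning a response.

```python
import asyncio
from dataclasses import dataclass


# Stand-ins for illustration only; these are NOT DataService's real models.
@dataclass
class Request:
    url: str


@dataclass
class Response:
    url: str
    text: str


class StaticClient:
    """Hypothetical async client that serves canned pages instead of using the network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages

    async def __call__(self, request: Request) -> Response:
        # A real client would await an HTTP call here; sleep(0) just yields control.
        await asyncio.sleep(0)
        return Response(url=request.url, text=self.pages.get(request.url, ""))


client = StaticClient({"https://example.com/": "<html><title>Example</title></html>"})
response = asyncio.run(client(Request(url="https://example.com/")))
print(response.text)
```

A client like this can be handy in tests, since callbacks can be exercised without any network access.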

The callback function processes a ``Response`` object and returns either ``data`` or additional ``Request`` objects.

In this trivial example, we request the `Books to Scrape <https://books.toscrape.com/index.html>`_ homepage and count the books on the page.

Example ``parse_books_page`` function:

.. code-block:: python

    def parse_books_page(response: Response):
        articles = response.html.find_all("article", {"class": "product_pod"})
        return {
            "url": response.url,
            "title": response.html.title.get_text(strip=True),
            "articles": len(articles),
        }

This function takes a ``Response`` object, which has an ``html`` attribute (a ``BeautifulSoup`` object of the HTML content). The function parses the HTML content and returns data.

The callback function can ``return`` or ``yield`` either ``data`` (``dict`` or ``pydantic.BaseModel``) or more ``Request`` objects.
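
The pattern of yielding both data and follow-up requests can be sketched with a toy example. Everything here is an assumption made for illustration: ``Request`` and ``Response`` are simplified stand-ins rather than DataService's models, ``fake_fetch`` replaces a real client, and ``crawl`` is a deliberately naive synchronous loop, not how DataService schedules work internally.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Union


# Stand-ins for illustration only; not DataService's real classes.
@dataclass
class Response:
    url: str
    page: int
    last_page: int


@dataclass
class Request:
    url: str
    callback: Callable[[Response], Iterator[Union[dict, "Request"]]]


def parse_page(response: Response) -> Iterator[Union[dict, Request]]:
    # Yield a data item for the current page.
    yield {"url": response.url, "page": response.page}
    # Yield a follow-up request while more pages remain.
    if response.page < response.last_page:
        next_page = response.page + 1
        yield Request(url=f"https://example.com/page-{next_page}.html", callback=parse_page)


def fake_fetch(request: Request) -> Response:
    # Hypothetical fetch: derives the page number from the URL, no network access.
    page = int(request.url.rsplit("-", 1)[1].split(".")[0])
    return Response(url=request.url, page=page, last_page=3)


def crawl(start: list[Request]) -> Iterator[dict]:
    # Toy driver: data items go to the caller, new requests go back on the queue.
    queue = list(start)
    while queue:
        request = queue.pop(0)
        for item in request.callback(fake_fetch(request)):
            if isinstance(item, Request):
                queue.append(item)
            else:
                yield item


results = list(crawl([Request(url="https://example.com/page-1.html", callback=parse_page)]))
```

Here ``results`` ends up holding one ``dict`` per page, because each callback invocation yields its own data item plus, at most, one request for the next page.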

If you have used ``Scrapy`` before, you will find this pattern familiar.

For more examples and advanced usage, check out the `examples <https://dataservice.readthedocs.io/en/latest/examples.html>`_ section.

For a detailed API reference, check out the `API <https://dataservice.readthedocs.io/en/latest/modules.html>`_ section.

            
