python-dataservice


Name: python-dataservice
Version: 0.0.15
Summary: Lightweight async data gathering for Python
Upload time: 2024-09-20 13:39:33
Author: NomadMonad
Requires-Python: <4.0,>=3.12
License: MIT
Keywords: async, data gathering, scraping, web scraping, web crawling, crawling, data extraction, data scraping, API, data
DataService
===========

Lightweight, async data gathering for Python.
DataService is a lightweight data gathering library for Python.

Designed for simplicity, it's built upon common web scraping and data gathering patterns.

No complex API to learn, just standard Python idioms.

Async implementation, sync interface.

Installation
------------

You can install DataService via pip:

.. code-block:: bash

    pip install python-dataservice

Please note that this initial version requires Python 3.12 or higher.
Support for older Python versions is planned for future releases.

How to use DataService
----------------------

To start, create a ``DataService`` instance from an ``Iterable`` of ``Request`` objects. This gives you an ``Iterator`` of data objects that you can iterate over or convert to a ``list``, a ``tuple``, a ``pd.DataFrame``, or any other data structure of your choice.

.. code-block:: python

    from dataservice import DataService, HttpXClient, Request

    start_requests = [Request(url="https://books.toscrape.com/index.html", callback=parse_books_page, client=HttpXClient())]
    data_service = DataService(start_requests)
    data = tuple(data_service)
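
Since the callback in this example returns a plain ``dict`` per page, the resulting tuple can be loaded straight into a ``pd.DataFrame``. A minimal sketch, assuming ``pandas`` is installed alongside the example above:

.. code-block:: python

    import pandas as pd

    # Each item produced by the service is a dict, so the tuple of
    # results maps directly onto DataFrame rows.
    df = pd.DataFrame(data)
    print(df.head())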

A ``Request`` is a ``Pydantic`` model that includes the URL to fetch, a reference to the ``client`` callable, and a ``callback`` function for parsing the ``Response`` object.

The client can be any Python callable that accepts a ``Request`` object and returns a ``Response`` object. ``DataService`` provides an ``HttpXClient`` class, which is based on the ``httpx`` library, but you are free to use your own custom async client.
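
For illustration, a custom async client might look roughly like the sketch below. The ``fetch_with_httpx`` name and the ``Response`` constructor arguments are assumptions made for this example only; consult the API reference for the actual ``Response`` signature.

.. code-block:: python

    import httpx

    # Hypothetical custom client: any async callable that accepts a
    # Request and returns a Response will do.
    async def fetch_with_httpx(request):
        async with httpx.AsyncClient() as client:
            http_response = await client.get(str(request.url))
        # The keyword arguments below are illustrative, not the
        # documented Response constructor.
        return Response(request=request, text=http_response.text)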

The callback function processes a ``Response`` object and returns either ``data`` or additional ``Request`` objects.

In this trivial example, we request the `Books to Scrape <https://books.toscrape.com/index.html>`_ homepage and parse the number of books on the page.

Example ``parse_books_page`` function:

.. code-block:: python

    from dataservice import Response

    def parse_books_page(response: Response):
        # Each book on the page is rendered as an <article class="product_pod"> element.
        articles = response.html.find_all("article", {"class": "product_pod"})
        return {
            "url": response.request.url,
            "title": response.html.title.get_text(strip=True),
            "articles": len(articles),
        }

This function takes a ``Response`` object, which has an ``html`` attribute (a ``BeautifulSoup`` object of the HTML content). The function parses the HTML content and returns data.

The callback function can ``return`` or ``yield`` either ``data`` (``dict`` or ``pydantic.BaseModel``) or more ``Request`` objects.
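
As a sketch of the ``yield`` pattern, a paginated listing could be crawled with a callback that yields both data and a follow-up ``Request``. The ``parse_listing`` name and the selectors for the "next" link are assumptions based on the Books to Scrape markup:

.. code-block:: python

    from urllib.parse import urljoin

    def parse_listing(response: Response):
        # Yield one data dict per book on the current page...
        for article in response.html.find_all("article", {"class": "product_pod"}):
            yield {"title": article.h3.a["title"]}
        # ...then a follow-up Request if the page links to a next page.
        next_li = response.html.find("li", {"class": "next"})
        if next_li is not None:
            yield Request(
                url=urljoin(str(response.request.url), next_li.a["href"]),
                callback=parse_listing,
                client=HttpXClient(),
            )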

If you have used Scrapy before, you will find this pattern familiar.

For more examples and advanced usage, check out the `examples <https://dataservice.readthedocs.io/en/latest/examples.html>`_ section.

For a detailed API reference, check out the `modules <https://dataservice.readthedocs.io/en/latest/modules.html>`_ section.

            
