.. image:: https://img.shields.io/pypi/pyversions/python-dataservice.svg
   :alt: Python Versions

DataService
===========
Lightweight - async - data gathering for Python.
____________________________________________________________________________________
DataService is a lightweight web scraping and general purpose data gathering library for Python.

Designed for simplicity, it's built upon common web scraping and data gathering patterns.

No complex API to learn, just standard Python idioms.

Dual synchronous and asynchronous support.

Installation
------------
Please note that DataService requires Python 3.11 or higher.
You can install DataService via pip:

.. code-block:: bash

    pip install python-dataservice

You can also install the optional ``playwright`` dependency to use the ``PlaywrightClient``:

.. code-block:: bash

    pip install "python-dataservice[playwright]"

Then install the browser binaries that Playwright needs:

.. code-block:: bash

    python -m playwright install

or simply:

.. code-block:: bash

    playwright install

How to use DataService
----------------------
To start, create a ``DataService`` instance from an ``Iterable`` of ``Request`` objects. This gives you an ``Iterator`` of data objects that you can loop over or convert to a ``list``, a ``tuple``, a ``pd.DataFrame`` or any other data structure of your choice.

.. code-block:: python

    start_requests = [Request(url="https://books.toscrape.com/index.html", callback=parse_books_page, client=HttpXClient())]
    data_service = DataService(start_requests)
    data = tuple(data_service)

A ``Request`` is a ``Pydantic`` model that includes the URL to fetch, a reference to the ``client`` callable, and a ``callback`` function for parsing the ``Response`` object.
The client can be any async Python callable that accepts a ``Request`` object and returns a ``Response`` object.
``DataService`` provides an ``HttpXClient`` class by default, which is based on the ``httpx`` library, but you are free to use your own custom async client.
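
As a rough illustration of the client contract described above, the sketch below defines an async callable that accepts a request and returns a response. ``SketchRequest``, ``SketchResponse`` and ``CannedClient`` are hypothetical stand-ins invented for this example so that it runs without network access and without the library installed. They are not DataService's own ``Request`` and ``Response`` models, whose fields may differ.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request: just the URL to fetch."""

    url: str


@dataclass
class SketchResponse:
    """Hypothetical stand-in for a response: the URL and the body text."""

    url: str
    text: str


class CannedClient:
    """Async callable that serves canned pages instead of doing real I/O."""

    def __init__(self, pages: dict[str, str]):
        self.pages = pages

    async def __call__(self, request: SketchRequest) -> SketchResponse:
        # A real client would await an HTTP call at this point.
        await asyncio.sleep(0)
        return SketchResponse(url=request.url, text=self.pages[request.url])


client = CannedClient({"https://example.com": "<title>Example</title>"})
response = asyncio.run(client(SketchRequest(url="https://example.com")))
```

Writing the client as a class with ``__call__`` keeps configuration (here, the canned pages) separate from the per-request call, which is one plausible way to structure a custom client.
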
The callback function processes a ``Response`` object and returns either ``data`` or additional ``Request`` objects.
In this trivial example we request the `Books to Scrape <https://books.toscrape.com/index.html>`_ homepage and count the books listed on the page.
Example ``parse_books_page`` function:

.. code-block:: python

    def parse_books_page(response: Response):
        articles = response.html.find_all("article", {"class": "product_pod"})
        return {
            "url": response.url,
            "title": response.html.title.get_text(strip=True),
            "articles": len(articles),
        }

This function takes a ``Response`` object, which has an ``html`` attribute (a ``BeautifulSoup`` object of the HTML content). It parses the HTML and returns a ``dict`` with the page URL, the page title and the number of book articles found.
The callback function can ``return`` or ``yield`` either ``data`` (``dict`` or ``pydantic.BaseModel``) or more ``Request`` objects.
If you have used ``Scrapy`` before, you will find this pattern familiar.
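
To make the ``return`` / ``yield`` pattern more concrete, here is a minimal, synchronous sketch of how a driver loop might consume such callbacks. It is not DataService's implementation: ``SketchRequest``, ``crawl``, ``parse_index`` and ``parse_page`` are hypothetical names made up for this example, nothing is fetched, and each callback simply receives the URL in place of a real ``Response``.

```python
import inspect
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass
class SketchRequest:
    """Hypothetical stand-in: a URL plus the callback that handles it."""

    url: str
    callback: Callable[[str], Any]


def crawl(start_requests: Iterable[SketchRequest]) -> Iterator[dict]:
    """Drain a queue of requests, yielding data and enqueueing follow-ups."""
    queue = deque(start_requests)
    while queue:
        request = queue.popleft()
        # No fetching here: the callback gets the URL in place of a response.
        result = request.callback(request.url)
        # Treat a plain ``return`` value like a one-item ``yield``.
        items = result if inspect.isgenerator(result) else [result]
        for item in items:
            if isinstance(item, SketchRequest):
                queue.append(item)  # follow-up request: schedule it
            else:
                yield item  # data: hand it to the caller


def parse_page(url: str) -> dict:
    # Callback that returns a single data item.
    return {"url": url}


def parse_index(url: str):
    # Callback that yields one data item, then two follow-up requests.
    yield {"url": url}
    for page in (1, 2):
        yield SketchRequest(url=f"{url}/page-{page}", callback=parse_page)


data = list(crawl([SketchRequest(url="https://example.com", callback=parse_index)]))
```

Because ``crawl`` is itself a generator, the caller can iterate lazily or materialise everything with ``list`` or ``tuple``, mirroring the way ``tuple(data_service)`` is used earlier.
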
For more examples and advanced usage, check out the `examples <https://dataservice.readthedocs.io/en/latest/examples.html>`_ section.
For a detailed API reference, check out the `API <https://dataservice.readthedocs.io/en/latest/modules.html>`_ section.
Raw data
{
"_id": null,
"home_page": null,
"name": "python-dataservice",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.11",
"maintainer_email": null,
"keywords": "async, data gathering, scraping, web scraping, web crawling, crawling, data extraction, data scraping, API, data",
"author": "NomadMonad",
"author_email": "romagnoli.luca@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/0b/4e/199c46f9f45e83f503e3ef169f554a3611d050c61a46e72cad29da2d49a9/python_dataservice-0.14.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Lightweight async data gathering for Python",
"version": "0.14.0",
"project_urls": {
"Documentation": "https://readthedocs.org/projects/dataservice/"
},
"split_keywords": [
"async",
" data gathering",
" scraping",
" web scraping",
" web crawling",
" crawling",
" data extraction",
" data scraping",
" api",
" data"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "33e478816db3e92732938ccb01c28e419b1965ea9be1f8e7a96158510c08b9f5",
"md5": "2347543acc28b88dcc1b25a6b9c3bb7f",
"sha256": "dc9731f3585328e7b349c7e31711ee081392d9f35667c4e31b4afa2b5b31db88"
},
"downloads": -1,
"filename": "python_dataservice-0.14.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2347543acc28b88dcc1b25a6b9c3bb7f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.11",
"size": 27362,
"upload_time": "2024-11-18T11:53:02",
"upload_time_iso_8601": "2024-11-18T11:53:02.184942Z",
"url": "https://files.pythonhosted.org/packages/33/e4/78816db3e92732938ccb01c28e419b1965ea9be1f8e7a96158510c08b9f5/python_dataservice-0.14.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0b4e199c46f9f45e83f503e3ef169f554a3611d050c61a46e72cad29da2d49a9",
"md5": "94770eaafec90d88c66c3564e6d24d1d",
"sha256": "13a02756d1da0388ff48860504b45e841d8bc60fad52bd66ce425bd1df1bd22e"
},
"downloads": -1,
"filename": "python_dataservice-0.14.0.tar.gz",
"has_sig": false,
"md5_digest": "94770eaafec90d88c66c3564e6d24d1d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.11",
"size": 23845,
"upload_time": "2024-11-18T11:53:03",
"upload_time_iso_8601": "2024-11-18T11:53:03.280706Z",
"url": "https://files.pythonhosted.org/packages/0b/4e/199c46f9f45e83f503e3ef169f554a3611d050c61a46e72cad29da2d49a9/python_dataservice-0.14.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-18 11:53:03",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "python-dataservice"
}