spidur 0.2.0

Summary: πŸ•·οΈ A lightweight, generic parallel runner for custom scrapers
Homepage: https://github.com/ra0x3/spidur
Author: ra0x3
License: MIT
Requires Python: <4.0,>=3.9
Keywords: scraping, parallel, async, framework, runner
Uploaded: 2025-10-08 00:11:59
# spidur πŸ•·οΈ

[![PyPI version](https://img.shields.io/pypi/v/spidur.svg)](https://pypi.org/project/spidur/)
[![License](https://img.shields.io/github/license/ra0x3/spidur)](LICENSE)
[![Tests](https://github.com/ra0x3/spidur/actions/workflows/ci.yaml/badge.svg)](https://github.com/ra0x3/spidur/actions)

**spidur** is a lightweight, hackable framework for running **multiple custom scrapers in parallel** β€” even on the same domain.  
It helps you coordinate different scrapers, validate URLs before scraping so no work is wasted, and collect all results in one place.

---

## ✨ Core ideas

- **Multiple scrapers per domain** β€” handle different content types (articles, images, comments, etc.) simultaneously.
- **Parallel execution** β€” utilizes all CPU cores.
- **Async + multiprocessing safe** β€” works across async methods and process pools.
- **No opinions** β€” you control discovery, validation, and scraping logic.
- **Results collected automatically** β€” each scraper contributes to a single aggregated result set.

---

## πŸ“¦ Install

```bash
pip install spidur
```

or with Poetry:

```bash
poetry add spidur
```

---

## ⚑ Example

```python
from spidur.core import Target, Scraper
from spidur.factory import ScraperFactory
from spidur.runner import Runner


# --- define your base scrapers ---

class ArticleScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "article", "url": url, "data": f"Content of {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


class CommentScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


# --- register both scrapers for the same domain ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---

results = Runner.run(targets, seen=set())

for name, items in results.items():
    print(f"Results from {name}:")
    for item in items:
        print("  β†’", item)
```
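
Given the example scrapers above, and assuming `Runner.run` aggregates results into the per-scraper dictionary described in the next section, the loop would print something along these lines (ordering across scrapers may vary, since they run in parallel):

```text
Results from articles:
  β†’ {'type': 'article', 'url': 'https://example.com/articles/1', 'data': 'Content of https://example.com/articles/1'}
  β†’ {'type': 'article', 'url': 'https://example.com/articles/2', 'data': 'Content of https://example.com/articles/2'}
Results from comments:
  β†’ {'type': 'comment', 'url': 'https://example.com/comments/1', 'data': 'Comments from https://example.com/comments/1'}
  β†’ {'type': 'comment', 'url': 'https://example.com/comments/2', 'data': 'Comments from https://example.com/comments/2'}
```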

---

## 🧠 How it works

1. Each `Scraper` subclass defines:
    - `is_valid_url(url)` β€” ensures no invalid or duplicate URLs are processed.
    - `discover_urls()` β€” finds new pages to scrape.
    - `scrape_page()` β€” extracts structured data.
    - `fetch_page()` β€” orchestrates the above.

2. You register scrapers in `ScraperFactory`.

3. The `Runner`:
    - Spawns multiple processes.
    - Executes all scrapers concurrently.
    - Aggregates their results into a single dictionary keyed by scraper name (see the sketch below).

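For intuition, here is a rough approximation of that flow using only the standard library. This is a sketch, not spidur's actual implementation: the helper names (`run_all`, `_run_target`, `registry`) and the assumption that a scraper can be constructed with no arguments are purely illustrative.

```python
# Illustrative only: approximates the Runner flow described above with
# asyncio + a process pool; spidur's real internals may differ.
import asyncio
from concurrent.futures import ProcessPoolExecutor


def _run_target(name, scraper_cls, seen):
    # Each worker process builds one scraper and drives its async
    # fetch_page() to completion on its own event loop.
    scraper = scraper_cls()  # assumes a no-argument constructor
    return name, asyncio.run(scraper.fetch_page(seen))


def run_all(registry, targets, seen):
    # Fan the targets out across processes, then collect the results
    # into a single dictionary keyed by scraper name.
    results = {}
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(_run_target, target.name, registry[target.name], seen)
            for target in targets
        ]
        for future in futures:
            name, items = future.result()
            results[name] = items
    return results
```

Conceptually, `Runner.run(targets, seen=set())` in the example above plays the role of `run_all` here, with `ScraperFactory` supplying the registry of scraper classes.
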
---

## πŸ§ͺ Running tests

```bash
poetry install
poetry run pytest
```

or with plain `pip`:

```bash
pip install -e .
pytest
```

---

## 🧩 Why β€œspidur”?

Because it crawls the web β€” but cleanly, predictably, and in parallel. πŸ•ΈοΈ


            
