# spidur 🕷️
[PyPI](https://pypi.org/project/spidur/)
[License](LICENSE)
[CI](https://github.com/ra0x3/spidur/actions)
**spidur** is a lightweight, hackable framework for running **multiple custom scrapers in parallel**, even on the same domain.
It helps you coordinate different scrapers, ensure valid URLs (no wasted work), and collect all results at once.
---
## ✨ Core ideas
- **Multiple scrapers per domain**: handle different content types (articles, images, comments, etc.) simultaneously.
- **Parallel execution**: utilizes all CPU cores.
- **Async + multiprocessing safe**: works across async methods and process pools.
- **No opinions**: you control discovery, validation, and scraping logic (see the sketch after this list).
- **Results collected automatically**: each scraper contributes to a single aggregated result set.
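
As a rough sketch of the "no wasted work" idea: a scraper can drop anything already in the `known` set that spidur threads through its calls. The class below is illustrative only; it mirrors the `ArticleScraper` from the example further down and omits the other required methods.

```python
from spidur.core import Scraper


class DedupArticleScraper(Scraper):
    async def is_valid_url(self, url):
        # Only accept URLs under the article path.
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page, known):
        # Hypothetical candidates; in practice you would parse `page`.
        candidates = [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]
        # Skip anything already in `known` so no URL is scraped twice.
        return [u for u in candidates if u not in known]
```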
---
## 📦 Install
```bash
pip install spidur
```
or with Poetry:
```bash
poetry add spidur
```
---
## ⚡ Example
```python
from spidur.core import Target, Scraper
from spidur.factory import ScraperFactory
from spidur.runner import Runner
# --- define your base scrapers ---
class ArticleScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "article", "url": url, "data": f"Content of {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


class CommentScraper(Scraper):
    async def is_valid_url(self, url):
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page, known):
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page, url):
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}

    async def fetch_page(self, known):
        urls = await self.discover_urls(None, known)
        urls = [u for u in urls if await self.is_valid_url(u)]
        return [await self.scrape_page(None, u) for u in urls]


# --- register both scrapers for the same domain ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---

results = Runner.run(targets, seen=set())

for name, items in results.items():
    print(f"Results from {name}:")
    for item in items:
        print(" →", item)
```
---
## 🧠 How it works
1. Each `Scraper` subclass defines:
   - `is_valid_url(url)`: ensures no invalid or duplicate URLs are processed.
   - `discover_urls()`: finds new pages to scrape.
   - `scrape_page()`: extracts structured data.
   - `fetch_page()`: orchestrates the above.

2. You register scrapers in `ScraperFactory`.

3. The `Runner`:
   - Spawns multiple processes.
   - Executes all scrapers concurrently.
   - Aggregates their results into a single dictionary keyed by scraper name (a conceptual sketch follows this list).
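
Conceptually, that dispatch-and-collect step looks roughly like the sketch below. This is an illustration of the pattern, not spidur's actual implementation, and the helper names (`run_scrapers`, `_run_one`) are invented for the example.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def _run_one(scraper_cls, seen):
    # Each worker process gets its own event loop and runs one
    # scraper end to end, returning that scraper's results.
    scraper = scraper_cls()
    return asyncio.run(scraper.fetch_page(seen))


def run_scrapers(registry, seen):
    # `registry` maps names to Scraper subclasses, e.g.
    # {"articles": ArticleScraper, "comments": CommentScraper}.
    results = {}
    with ProcessPoolExecutor() as pool:
        futures = {
            name: pool.submit(_run_one, cls, seen)
            for name, cls in registry.items()
        }
        # Collect everything into one dict keyed by scraper name.
        for name, future in futures.items():
            results[name] = future.result()
    return results
```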
---
## 🧪 Running tests
```bash
poetry install
poetry run pytest
```
or with plain `pip`:
```bash
pip install -e .
pytest
```
## 🧩 Why “spidur”?

Because it crawls the web, but cleanly, predictably, and in parallel. 🕸️