# SpiderForce4AI Python Wrapper
A Python package for web content crawling and HTML-to-Markdown conversion, built for seamless integration with the SpiderForce4AI service.
## Features
- HTML to Markdown conversion
- Parallel and async crawling support
- Sitemap processing
- Custom content selection
- Automatic retry mechanism
- Detailed progress tracking
- Webhook notifications
- Customizable reporting
## Installation
```bash
pip install spiderforce4ai
```
## Quick Start
```python
from spiderforce4ai import SpiderForce4AI, CrawlConfig
from pathlib import Path
# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")
# Configure crawling options
config = CrawlConfig(
    target_selector="article",
    remove_selectors=[".ads", ".navigation"],
    max_concurrent_requests=5,
    save_reports=True
)
# Crawl a sitemap
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
```
## Key Features
### 1. Smart Retry Mechanism
- Automatically retries failed URLs
- Monitors failure ratio to prevent server overload
- Detailed retry statistics and progress tracking
- Aborts retries if failure rate exceeds 20%
```python
import asyncio

# Retry behavior is automatic
config = CrawlConfig(
    max_concurrent_requests=5,
    request_delay=1.0  # Delay between retries
)
results = asyncio.run(spider.crawl_urls_async(urls, config))
```
### 2. Custom Webhook Integration
- Flexible payload formatting
- Custom headers support
- Variable substitution in templates
```python
config = CrawlConfig(
    webhook_url="https://your-webhook.com",
    webhook_headers={
        "Authorization": "Bearer token",
        "X-Custom-Header": "value"
    },
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "custom_field": "value"
    }'''
)
```
### 3. Flexible Report Generation
- Optional report saving
- Customizable report location
- Detailed success/failure statistics
```python
config = CrawlConfig(
    save_reports=True,
    report_file=Path("custom_report.json"),
    output_dir=Path("content")
)
```
## Crawling Methods
### 1. Single URL Processing
```python
# Synchronous
result = spider.crawl_url("https://example.com", config)
# Asynchronous
async def crawl():
    result = await spider.crawl_url_async("https://example.com", config)
```
### 2. Multiple URLs
```python
urls = ["https://example.com/page1", "https://example.com/page2"]
# Server-side parallel (recommended)
results = spider.crawl_urls_server_parallel(urls, config)
# Client-side parallel
results = spider.crawl_urls_parallel(urls, config)
# Asynchronous
async def crawl():
    results = await spider.crawl_urls_async(urls, config)
```
### 3. Sitemap Processing
```python
# Server-side parallel (recommended)
results = spider.crawl_sitemap_server_parallel("https://example.com/sitemap.xml", config)
# Client-side parallel
results = spider.crawl_sitemap_parallel("https://example.com/sitemap.xml", config)
# Asynchronous
async def crawl():
    results = await spider.crawl_sitemap_async("https://example.com/sitemap.xml", config)
```
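The `crawl()` coroutines defined above are not executed on their own; outside an existing event loop they can be driven with the standard-library `asyncio.run`:

```python
import asyncio

async def crawl():
    return await spider.crawl_urls_async(urls, config)

results = asyncio.run(crawl())
```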
## Configuration Options
```python
config = CrawlConfig(
    # Content Selection
    target_selector="article",              # Target element to extract
    remove_selectors=[".ads", "#popup"],    # Elements to remove
    remove_selectors_regex=["modal-\\d+"],  # Regex patterns for removal

    # Processing
    max_concurrent_requests=5,              # Parallel processing limit
    request_delay=0.5,                      # Delay between requests
    timeout=30,                             # Request timeout

    # Output
    output_dir=Path("content"),             # Output directory
    save_reports=False,                     # Enable/disable report saving
    report_file=Path("report.json"),        # Report location

    # Webhook
    webhook_url="https://webhook.com",      # Webhook endpoint
    webhook_timeout=10,                     # Webhook timeout
    webhook_headers={                       # Custom headers
        "Authorization": "Bearer token"
    },
    # Custom payload format
    webhook_payload_template='''{
        "url": "{url}",
        "content": "{markdown}",
        "status": "{status}",
        "error": "{error}",
        "time": "{timestamp}"
    }'''
)
```
## Progress Tracking
The package provides detailed progress information:
```
Fetching sitemap from https://example.com/sitemap.xml...
Found 156 URLs in sitemap
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 156/156 URLs
Retrying failed URLs: 18 (11.5% failed)
[━━━━━━━━━━━━━━━━━━━━━━━━━━━━] 100% • 18/18 retries
Crawling Summary:
Total URLs processed: 156
Initial failures: 18 (11.5%)
Final results:
✓ Successful: 150
✗ Failed: 6
Retry success rate: 12/18 (66.7%)
```
## Output Structure
### 1. Directory Layout
```
content/ # Output directory
├── example-com-page1.md # Markdown files
├── example-com-page2.md
└── report.json # Crawl report
```
### 2. Report Format
```json
{
"timestamp": "2025-02-15T10:30:00",
"config": {
"target_selector": "article",
"remove_selectors": [".ads"]
},
"results": {
"successful": [...],
"failed": [...]
},
"summary": {
"total": 156,
"successful": 150,
"failed": 6
}
}
```
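Since the report is plain JSON, it can be inspected with the standard library after a crawl; a minimal sketch, assuming the `content/report.json` layout shown above:

```python
import json
from pathlib import Path

report = json.loads(Path("content/report.json").read_text())
summary = report["summary"]
print(f"Crawled {summary['successful']}/{summary['total']} URLs ({summary['failed']} failed)")
```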
## Performance Optimization
1. Server-side Parallel Processing
   - Recommended for most cases
   - Single HTTP request
   - Reduced network overhead
   - Built-in load balancing
2. Client-side Parallel Processing
   - Better control over processing
   - Customizable concurrency (see the tuning sketch after this list)
   - Progress tracking per URL
   - Automatic retry handling
3. Asynchronous Processing
   - Ideal for async applications
   - Non-blocking operation
   - Real-time progress updates
   - Efficient resource usage
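For the client-side parallel mode, throughput is governed by the configuration options documented above; a brief tuning sketch (values are illustrative):

```python
# Tuning client-side parallel crawling (illustrative values)
config = CrawlConfig(
    max_concurrent_requests=10,  # raise for throughput, lower to reduce server load
    request_delay=0.25,          # pause between requests
    timeout=30                   # per-request timeout
)
results = spider.crawl_urls_parallel(urls, config)
```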
## Error Handling
The package provides comprehensive error handling (a short results-inspection sketch follows this list):
- Automatic retry for failed URLs
- Failure ratio monitoring
- Detailed error reporting
- Webhook error notifications
- Progress tracking during retries
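The exact shape of a result object is not documented here; a hedged sketch for inspecting failures, assuming each result exposes `url`, `status`, and `error` fields matching the webhook template variables:

```python
# Hypothetical attribute names (url, status, error), mirroring the webhook variables above
failed = [r for r in results if r.status != "success"]  # "success" value is an assumption
for r in failed:
    print(f"Failed: {r.url} -> {r.error}")
```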
## Requirements
- Python 3.11+
- A running SpiderForce4AI service
- Internet connection
## Dependencies
- aiohttp
- asyncio
- rich
- aiofiles
- httpx
## License
MIT License
## Credits
Created by [Peter Tam](https://petertam.pro)